NiFi for Data Ingestion on HDInsight
When huge volume of data from various sources are to be imported and transformations are applied, it is highly recommended to go in for proven tools to support the solutions. There are many tools available and Apache NiFi is one popular tool that helps in moving data between systems and processing them. In this blog, we are going to discuss using NiFi as part of bigdata tool for Azure HDInsight.
HDInsight provides Hadoop-as-a-Service which means it helps to create and manage Hadoop cluster with less effort on top of the Azure platform. It uses Azure Blob storage as the default storage system. NiFi can be setup to work with Azure HDInsight and it takes advantage of using other services that HDInsight provides.
Let us demonstrate with a simple use case of moving data from SQL database to Hadoop cluster with Blob storage and Hive table on top of it. The FlowFile (Data object) output will be Avro format and will get stored in Blob storage container named mystore9.
|Azure HDInsight Cluster
Create HDInsight using the URL “portal.azure.com”, navigate to Data + Analytics and click HDInsight. Here we select Hadoop cluster type on Linux environment. Choose the storage account to be used as part of HDInsight cluster. The default container will be used as Hadoop related files/logs.
SQL Database Setup
Create a SQL database with options under “SQL servers” and “SQL Databases” services in Azure. Choose the option to select source ‘sample’. This creates sample tables automatically with data which will be used for our data flow test. Under ‘Firewall/Virtual Networks’, set the virtual network to be same as HDInsight cluster.
Note the below parameters of the database which will be used in NiFi processors.
Server name: milesserver.database.windows.net
Connection strings: JDBC/ODBC/ADO.NET/PHP url and corresponding drivers
Also note down the SQL server login and password to be used in NiFi processor.
Connect to headnode using the hostname of the HDInsight cluster. Download NiFi from the url https://nifi.apache.org/download.html. Untar the file, tar -xvf nifi-x.tar.gz.
Modify the following NiFi properties in nifi.properties file under nifi-x/conf directory.
Other properties for security, cluster-based authentication, etc. may be modified depending on the requirement.
Start NiFi to run in the background with command bin/nifi.sh start. (To stop NiFi, use bin/nifi.sh stop and to check status use bin/nifi.sh status.)
Data Flow Design
NiFi web interface can be accessed with URL http://hostname:9090/nifi. The data ingestion from SQL database into Azure Blob storage is designed using NiFi interface. The NiFi UI has a tool bar from where the user can drag components, drop on the canvas and construct the data flow.
Processor component is the key component that does the actual data transfer, manipulation and many more functions. For our sample use case, we add two processors one for querying the SQL database and another to store in Blob storage.
Processor 1: Choose “QueryDatabaseTable” from the processor list. Specify the database connection and other properties through DBCPConnectionPool. Choose “SalesLT.Customer” as the source table. (Refer the screen below.)
The incoming data arrives into this processor from the SQL database which we have created earlier. The data will be fetched as incremental load of field CustomerID as shown in View State screen below.
Processor 2: Choose “PutAzureBlobStorage” from the processor list. Specify the Azure account name and container name. The FlowFile get stored into Blob storage location in Avro format.
Connect Processor 1 to Processor 2 by clicking the connection icon in the first processor, drag it and drop on the second processor.
There are many processors available for doing various operations which can be used depending on different requirements. Please refer https://nifi.apache.org/docs.html for the list of processor components.
Verify Data Flow
Right click on both processor components and choose to start them running. The FlowFile movement can be seen on the canvas through the changes to the number of bytes for “In”, “Out” and “Tasks/Time”. The sample records from the source table starts to flow into the Blob storage.
Hive comes as part of HDInsight and is useful to query data by running Map and Reduce jobs on the cluster. We can test if the records are ingested by issuing HQL. Following is the procedure.
Connect to Hive using /usr/bin/hive through command line and create an external table pointing to mystore9 storage. Since the Blob storage contains Avro files, the hive table should be created using Avro format with below command.
create external table demo_avr row format serde 'org.apache.hadoop.hive.serde2.Avro.AvroSerDe'
stored as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.Avro.AvroContainerInputFormat'
Once the hive table is created, we can hit any query and analyse the results.
We saw a simple example with an input and output processor to show how NiFi is used on Azure platform with HDInsight. However, Apache NiFi is a much powerful solution to handle complex flows, processing and monitoring. The user must spend time to design, setup the appropriate processors’ properties and NiFi can manage any large flows with good performance. Documentation is aplenty online and is time to get started.
Author Credits: Natarajan Krishnamurthy is the Technical Lead, Bigdata at 8K Miles and you can reach him here.