Screen Shot 2016-08-13 at 8.18.54 PM

HAProxy load balancing in front of Apache NiFi

There is a lot of situations where load balancing is necessary even more so when you are in a clustered architecture. In the context of NiFi you may want to consider two situations:

  • You have a NiFi cluster and you don’t want to give users the IP address of your NiFi nodes to access the UI (remember: every node of the cluster can be used for the UI or to make REST API calls) and besides you want to load balance users on every node of the cluster. In addition, in case you scale up/down your cluster, you don’t want to inform all the users of such a change.
  • You have Listen[X] processors (HTTP, TCP, UDP, Syslog, etc) in your workflow that are not running on the primary node only and you want to have all the nodes receiving the data (to increase performances and have full scalability). Again you don’t want to configure/inform the data senders with all the IPs of your NiFi nodes, and in case of a cluster change you don’t want to make some changes on client side.

In such situations you need a load balancer that will stand in front of your NiFi cluster and will provide a VIP (Virtual IP). For this purpose, you have hardware options, software options and also cloud options. Keep in mind that, if you want a production setup, you’ll need to have your load balancer installed in a high availability fashion otherwise it’ll become a single point of failure.

In this article, I’ll use HAProxy which is the most widely used open source software based load balancing solution. And I’ll launch my HAProxy instance in a Docker container (I’ll also use Portainer as a UI helping me to manage my Docker environment).

Some useful links that can help you with the different tools I’m going to use:

I assume I already have my secured NiFi 3-nodes cluster up and running. And also, I’ve already setup my Docker and Portainer environment: I’m ready to pull HAProxy Docker image and to create my container.

Let’s start with my HAProxy instance. First of all, in Portainer, I just need to pull the image I want (in this case I want the latest version and I just need to enter haproxy):

Screen Shot 2017-02-08 at 12.23.09 PM.png

I now have the image available:

screen-shot-2017-02-08-at-12-24-14-pm

Let’s build a container. In order to ease the modification of the configuration file of HAProxy, I’ll define a bind mount to bind the configuration file inside the container to a file I have on my computer. I also need to define all the ports I want to expose in my container. Since my container will run on one of the nodes of my cluster I don’t want to use the same ports (but in practice, your HAProxy would probably be on its own host).

  • Open my container on 9443, and I’ll configure HAProxy to listen on 9443 and to redirect requests made on 9443 to my NiFi nodes on port 8443 (port of my NiFi UI)
  • Open my container on 9999, and I’ll configure HAProxy to listen on 9999 and to redirect requests made on 9999 to my NiFi nodes on port 8888 (port that I’ll use for my ListenHTTP processors)
  • Open my container on 1936 and map on port 1936 inside my container (that’s the port used by the HAProxy management UI)

In Portainer, I add a container and define the following:

Screen Shot 2017-02-08 at 2.42.02 PM.png

When I create my container, it will immediately stop because I didn’t configure my configuration file so far and the container won’t start if the configuration file is invalid or does not exist. You can check the logs of your container in Portainer in case of issue.

In our case we’ll have HTTP access to send data to our ListenHTTP processors and HTTPS to access the UI. In case of HTTPS, I don’t want to have my load balancer taking care of certificates and I just want to have my requests going through (pure TCP load balancing)… this will raise an issue to users accessing the UI because the certificate presented by the NiFi node won’t match the address the client is requesting (load balancer address), this could be solved by adding a SAN (Subject Alternative Name) in the certificate of the nodes but this will be discussed in another article (in the mean time you can have a look at NIFI-3331). In this case (and just because this is a demo!), I’ll accept the SSL exceptions in my browser.

OK so let’s see my HAProxy configuration file:

global

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000

frontend nifi-ui
    bind *:9443
    mode tcp
    default_backend nodes-ui

backend nodes-ui
    mode tcp
    balance roundrobin
    stick-table type ip size 200k expire 30m
    stick on src
    option httpchk HEAD / HTTP/1.1\r\nHost:localhost
    server nifi01 node-1:8443 check check-ssl verify none
    server nifi02 node-2:8443 check check-ssl verify none
    server nifi03 node-3:8443 check check-ssl verify none

frontend nifi-listen-http
    bind *:9999
    mode http
    default_backend nodes-http

backend nodes-http
    mode http
    balance roundrobin
    option forwardfor
    http-request set-header X-Forwarded-Port %[dst_port]
    option httpchk HEAD / HTTP/1.1\r\nHost:localhost
    server nifi01 node-1:8888 check
    server nifi02 node-2:8888 check
    server nifi03 node-3:8888 check

listen stats
    bind *:1936
    mode http
    stats enable
    stats uri /
    stats hide-version
    stats auth admin:password

Let’s see the different parts:

  • global – I put nothing in here
  • defaults – Just some timeouts and log configuration, nothing special
  • frontend nifi-ui – here I define a front end called “nifi-ui”. This is where I define on what port HAProxy should listen (in this case on port 9443, for TCP mode) and where I should redirect the requests (in this case to my back end called nodes-ui).
  • backend nodes-ui – here I define my back end where will be redirected my requests received by my front end. I define TCP mode, round robin load balancing, stickiness (to ensure that a connected user, based on its IP, will remain on the same node over multiple requests), and the nodes available. I also define the health check operation performed by HAProxy to confirm if nodes are up or down. In this case it’s a HEAD HTTP request and I don’t check the SSL certificates.
  • frontend nifi-listen-http – here I define the front end that will be used to send data to my ListenHTTP processors on port 9999.
  • backend nodes-ui – here I define the back end with my servers, the HTTP mode, the health check, and also the addition of two HTTP headers (x-forwarded-port and x-forwarded-for) to keep track of IP and port of my client (otherwise I’d only know about the IP and port of my load balancer when receiving data in NiFi).
  • listen stats – here are some parameters about the HAProxy management UI, like the port, the login and password, etc.

I restart my container to take into account the configuration, and I can have a look to the management UI:

screen-shot-2017-02-08-at-4-37-45-pm

All is green meaning that our health checks are OK.

Assuming my load balancer has the following host name “my-nifi-vip”, I can now access the UI through https://my-nifi-vip:9443/nifi :

screen-shot-2017-02-08-at-4-41-53-pm

On my UI, I can configure a ListenHTTP processor as below:

Screen Shot 2017-02-08 at 4.43.14 PM.png

screen-shot-2017-02-08-at-4-45-05-pm

And send data to my cluster using the virtual IP provided by my load balancer:

while true; do curl -X POST http://my-nifi-vip:9999/test; done;

In the HAProxy management UI, I can confirm that my requests are correctly load balanced on all the nodes of my cluster:

screen-shot-2017-02-08-at-5-17-01-pm

It is now really easy to add or remove new nodes in your NiFi cluster without impacting the clients sending data into NiFi since they only need to know about the virtual IP exposed by the load balancer. You just need to update the configuration file when you are adding nodes, and restart your HAProxy container.

Keep in mind that even if the data is correctly load balanced over the nodes, all the requests are going through a single point, the load balancer. Consequently, on a performance standpoint, your load balancer may become a bottleneck in case you need to handle a very large number of connections per second. However load balancers are designed to be as efficient as possible and you should be OK in most cases.

As always questions/comments are welcomed.

Screen Shot 2016-08-13 at 8.18.54 PM

Using counters in Apache NiFi

You may not know it but you have the availability to define and play with counters in NiFi. If policies are correctly configured (if your NiFi is secured), you should be able to access the existing counters using the menu:

Screen Shot 2017-02-07 at 5.33.59 PM.png

Counters are just values that you can increase or decrease of a given delta. This is useful if you want to monitor particular values along your workflow. At the moment, unless you use a processor that explicitly uses counters or provides a way to define counters, there is nothing available out of the box.

The best way to define and update counters is to use ExecuteScript processor with the following piece of Groovy code:

def flowFile = session.get()
if(!flowFile) return
session.adjustCounter("my-counter", 1, true)
session.transfer(flowFile, REL_SUCCESS)

With this example, the ExecuteScript processor will just transmit the flow file without any modification to the success relationship but will also increment the counter “my-counter” of 1. If this counter does not exist it will be initialized with the delta value given as argument.

Here is the documentation of this method:

    /**
     * Adjusts counter data for the given counter name and takes care of
     * registering the counter if not already present. The adjustment occurs
     * only if and when the ProcessSession is committed.
     *
     * @param name the name of the counter
     * @param delta the delta by which to modify the counter (+ or -)
     * @param immediate if true, the counter will be updated immediately,
     *            without regard to whether the ProcessSession is commit or rolled back;
     *            otherwise, the counter will be incremented only if and when the
     *            ProcessSession is committed.
     */
    void adjustCounter(String name, long delta, boolean immediate);

Let’s see an example: I want to confirm the correct behavior of the GetHDFS processor when I have multiple instances of this processor looking into the same directory but getting different flow files based on a regular expression.

Here is the first part of my flow:

screen-shot-2017-02-07-at-6-20-52-pm

Basically, I am generating flow files every 1ms with GenerateFlowFile. The generated flow files will be named after the generation date timestamp (without any extension). I am sending the files into HDFS and then I’m using a RouteOnAttribute where I check the filename according to a regular expression to split files according to even and uneven names. This way I can increment counters tracking the number of files I sent to HDFS with even names and with uneven names.

Here is the second part of my flow:

Screen Shot 2017-02-07 at 6.27.28 PM.png

I have two instances of GetHDFS processor configured to look into the same input directory but one with a regular expression to look for files with an even name, and one to look for files with an uneven name, this way there is no concurrent access. Besides, the processor is configured to delete the file on HDFS once the file is retrieved in NiFi. Then I update two different counters to track the number of files with an even name that I retrieved from HDFS, and one for files with an uneven name.

If everything is working correctly, I should be able to let run my workflow a bit, then stop the generation of flow files, wait for all the flow files to be processed and confirm that:

  • even-producer counter is equal to even-consumer counter
  • unenven-producer counter is equal to uneven-consumer counter

Let’s have a look into our counters table:

screen-shot-2017-02-07-at-6-47-02-pm

It looks like we are all good 😉

As a remark, if you have multiple processors updating the same counter, then you will have the global value of the counter but also the value at each processor level. For example, if I have:

screen-shot-2017-02-07-at-6-53-09-pm

With both ExecuteScript incrementing the same “test” counter, then, I’ll have:

screen-shot-2017-02-07-at-6-55-48-pm

Also, as a last remark, you can notice that it’s possible to reset a counter to 0 from the counters table with the button in the last column (assuming you have write access to the counters based on the defined policies). It can be useful when doing some tests.

As always, questions/comments are welcomed!

Screen Shot 2016-08-13 at 8.18.54 PM

NiFi and OAuth 2.0 to request WordPress API

A lot of famous websites are allowing you to develop custom applications to interact with their API. In a previous example, we saw how to use NiFi to perform OAuth 1.0A authentication against Flickr API. However a lot of websites are using OAuth 2.0 mechanism to authenticate your applications. You can find more details here, and check the differences between the two versions here.

Since this blog is hosted and powered by WordPress, and since WordPress is allowing you to develop applications and is using Oauth 2.0 as authentication mechanism, let’s try to get the statistics of my blog using NiFi.

Before going into the details, let’s recap the behavior in play with the example of WordPress: a user A develops an application X, this application X is running on the Internet. Then a user B is accessing to the application X. This application X is asking B to grant a set of permissions to access AS user B to WordPress. If user B accepts, then application X can interact with WordPress as user B.

This is something you must have experienced with some applications like Facebook, Google, Instagram, LinkedIn, etc… All asking your permissions to post some content in your name on other websites/applications.

Now let’s understand what is going on from an OAuth 2.0 point of view.

When user B accesses application X, the application X is issuing a request to WordPress saying that the application X is requesting access to WordPress resources. The user B will be asked to authenticate with its WordPress credentials and to approve the request of the application X to grant the application a set of permissions on the resources belonging to B. Once done, the application X will get from WordPress a short time limited code. Then, the application is going to issue another request to WordPress using this code and telling which resource the application wants to access. WordPress will then return an access token and the ressource ID the application is allowed to use in API calls. At this point, the application is able to request all the API endpoints to get all the data of the given resource (a WordPress blog in this example).

OK… So now, let’s build our application using NiFi!

I’ll demonstrate something “simple”: a web service exposed by NiFi that gives users access to the stats of their blog. (in the example, it will be my blog since I’ll be connecting to my application using my credentials, but that could be any WordPress user)

Let’s define my application in WordPress so that WordPress is aware of this application and generates me some secret tokens to identify my application. I go here and I create an application that I call NiFi. Notice that the redirect URL is http://localhost:9999/ because this is where the web service created in NiFi will be listening. This could be something online but my NiFi would need to be opened on the Internet.

screen-shot-2017-02-01-at-12-04-19-am

The redirect URL will be the endpoint where the user will be redirected once the user has granted access to the application to WordPress resources belonging to the user. In this case we want to send back the user to our listening web service. It might be easier to understand later with the example, don’t worry 😉

Once my application is created, WordPress gives me some information that will be particularly useful:

screen-shot-2017-02-01-at-12-07-42-am

That’s all we need on WordPress side. Let’s start building our NiFi workflow!

In the end the workflow will be:

Screen Shot 2017-02-01 at 12.24.29 AM.png

We start with a HandleHttpRequest that is listening to requests performed by the user. We specify the processor to listen on localhost:9999.

Then I use an UpdateAttribute processor to add all the “common properties” I want to access in all my processors through expression language:

Screen Shot 2017-02-01 at 12.13.51 AM.png

Then I use a RouteOnAttribute to route the request based on the URL. Indeed, I am expecting users to access my web service with the URL http://localhost:9999/getCode but WordPress will also send requests to my service when redirecting users on URL like http://localhost:9999/code=…&state.

Here is my RouteOnAttribute:

Screen Shot 2017-02-01 at 12.16.07 AM.png

When this is a request sent by a user (containing “getCode” in the URL), then I use a InvokeHTTP processor to send a request to WordPress. This will give me the page where I need to send my user so that the user can authenticate and grant my application all permissions.

Based on WordPress documentation, the URL to request with a GET is:

screen-shot-2017-02-01-at-12-18-07-am

And what I received from WordPress (basically the page to let the user authenticate himself) is what I return to the user through a HandleHttpResponse processor. This way the user will access the page to authenticate on WordPress and grant my application all permissions, then the user will be redirected back to my application with a URL containing the code I need to get a token (thanks to the redirect URL we defined).

When the redirection is performed, I am back to my HandleHttpRequest, but, this time, at the RouteOnAttribute, I’ll go in unmatched relationship (no “getCode” in the URL since, this time, this is the redirect URL). At this point, I use an UpdateAttribute to extract the code from the callback URL used by WordPress:

screen-shot-2017-02-01-at-12-20-34-am

I am now able to create the content that will be sent to WordPress in the next request using a ReplaceText processor (indeed, since it will be a POST request, I need to update the content of my FlowFile because this will be used as the body of my next HTTP request):

Screen Shot 2017-02-01 at 12.25.38 AM.png

And I can perform my POST request using a InvokeHTTP processor in which I specify the content type to “application/x-www-form-urlencoded”.

This request will give me back a JSON looking like:

Screen Shot 2017-02-01 at 12.27.00 AM.png

So I use an EvaluateJsonPath processor to extract the blog ID and the access token:

Screen Shot 2017-02-01 at 12.27.53 AM.png

And I am now able to perform my last request with a InvokeHttp to request the API endpoint of WordPress to get statistics associated to the blog ID:

Screen Shot 2017-02-01 at 12.28.58 AM.png

And I add a property to specify my access token as a header property:

Screen Shot 2017-02-01 at 12.29.33 AM.png

Then I send back the result to a HandleHttpResponse to display the result of the request to the user. Obviously at this point we could do something nicer with the statistics and display some charts for example… but that’s outside the purpose of this blog: I just return the JSON containing the statistics 🙂

That’s all! Let’s now see what it looks like when connecting to the web service while the full flow is running:

When I go to http://localhost:9999/getCode

I get to this page:

Screen Shot 2017-02-01 at 12.32.23 AM.png

I enter my credentials, and since I’ve a two-steps authentication, I get on a web page asking for another access code that I received on my smartphone. Once the code is entered, I am asking to grant permissions to the application:

Screen Shot 2017-02-01 at 12.35.44 AM.png

Then I approve, and I finally get the statistics in a JSON:

Screen Shot 2017-02-01 at 12.36.55 AM.png

That’s pretty cool, isn’t it? The template is available here.

Now I’m sure you can imagine a lot of great applications using OAuth 2.0 mechanism to interact with various existing APIs!

As always, comments and questions are welcomed! I hope you enjoyed this blog!

Screen Shot 2016-08-13 at 8.18.54 PM

Integration of NiFi with LDAP

Once your cluster is secured, you probably want to start allowing users to access the cluster and you may not want to issue individual certificates for each user. In this case, one of the option is to use LDAP as the authentication provider of NiFi. This is quite simple, and we’ll see in this post how to easily setup a local LDAP server and integrate NiFi with it.

In terms of configuration, everything is done with two files:

  • ./conf/nifi.properties
  • ./conf/login-identity-providers.xml

In nifi.properties, we are interested by two properties:

nifi.login.identity.provider.configuration.file
nifi.security.user.login.identity.provider

The first one is used to give the path to the login-identity-providers.xml and the second one is used to define the name of the identity provider to use from the XML file (in case you configured multiple providers).

A quick quote from the documentation:

NiFi supports user authentication via client certificates or via username/password. Username/password authentication is performed by a Login Identity Provider. The Login Identity Provider is a pluggable mechanism for authenticating users via their username/password. Which Login Identity Provider to use is configured in two properties in the nifi.properties file.

The nifi.login.identity.provider.configuration.file property specifies the configuration file for Login Identity Providers. The nifi.security.user.login.identity.provider property indicates which of the configured Login Identity Provider should be used. If this property is not configured, NiFi will not support username/password authentication and will require client certificates for authenticating users over HTTPS. By default, this property is not configured meaning that username/password must be explicitly enabled.

NiFi does not perform user authentication over HTTP. Using HTTP all users will be granted all roles.

In other words, if you want login/password authentication, your cluster needs to be secured first!

OK, so I set the following values in nifi.properties:

nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
nifi.security.user.login.identity.provider=ldap-provider

And then I just need to configure my XML files and to restart NiFi. Here are the LDAP parameters (and we can notice that the identifier is matching the value set in nifi.properties):

<provider>
        <identifier>ldap-provider</identifier>
        <class>org.apache.nifi.ldap.LdapProvider</class>
        <property name="Authentication Strategy">START_TLS</property>
        <property name="Manager DN"></property>
        <property name="Manager Password"></property>
        <property name="TLS - Keystore"></property>
        <property name="TLS - Keystore Password"></property>
        <property name="TLS - Keystore Type"></property>
        <property name="TLS - Truststore"></property>
        <property name="TLS - Truststore Password"></property>
        <property name="TLS - Truststore Type"></property>
        <property name="TLS - Client Auth"></property>
        <property name="TLS - Protocol"></property>
        <property name="TLS - Shutdown Gracefully"></property>
        <property name="Referral Strategy">FOLLOW</property>
        <property name="Connect Timeout">10 secs</property>
        <property name="Read Timeout">10 secs</property>
        <property name="Url"></property>
        <property name="User Search Base"></property>
        <property name="User Search Filter"></property>
        <property name="Identity Strategy">USE_DN</property>
        <property name="Authentication Expiration">12 hours</property>
    </provider>

And here is the associated documentation:

Identity Provider for users logging in with username/password against an LDAP server.

‘Authentication Strategy’ – How the connection to the LDAP server is authenticated. Possible values are ANONYMOUS, SIMPLE, LDAPS, or START_TLS.

‘Manager DN’ – The DN of the manager that is used to bind to the LDAP server to search for users.
‘Manager Password’ – The password of the manager that is used to bind to the LDAP server to search for users.

‘TLS – Keystore’ – Path to the Keystore that is used when connecting to LDAP using LDAPS or START_TLS.
‘TLS – Keystore Password’ – Password for the Keystore that is used when connecting to LDAP using LDAPS or START_TLS.
‘TLS – Keystore Type’ – Type of the Keystore that is used when connecting to LDAP using LDAPS or START_TLS (i.e. JKS or PKCS12).
‘TLS – Truststore’ – Path to the Truststore that is used when connecting to LDAP using LDAPS or START_TLS.
‘TLS – Truststore Password’ – Password for the Truststore that is used when connecting to LDAP using LDAPS or START_TLS.
‘TLS – Truststore Type’ – Type of the Truststore that is used when connecting to LDAP using LDAPS or START_TLS (i.e. JKS or PKCS12).
‘TLS – Client Auth’ – Client authentication policy when connecting to LDAP using LDAPS or START_TLS. Possible values are REQUIRED, WANT, NONE.
‘TLS – Protocol’ – Protocol to use when connecting to LDAP using LDAPS or START_TLS. (i.e. TLS, TLSv1.1, TLSv1.2, etc).
‘TLS – Shutdown Gracefully’ – Specifies whether the TLS should be shut down gracefully before the target context is closed. Defaults to false.

‘Referral Strategy’ – Strategy for handling referrals. Possible values are FOLLOW, IGNORE, THROW.
‘Connect Timeout’ – Duration of connect timeout. (i.e. 10 secs).
‘Read Timeout’ – Duration of read timeout. (i.e. 10 secs).

‘Url’ – Url of the LDAP server (i.e. ldap://<hostname>:<port>).
User Search Base’ – Base DN for searching for users (i.e. CN=Users,DC=example,DC=com).
User Search Filter’ – Filter for searching for users against the ‘User Search Base’. (i.e. sAMAccountName={0}). The user specified name is inserted into ‘{0}’.

‘Identity Strategy’ – Strategy to identify users. Possible values are USE_DN and USE_USERNAME. The default functionality if this property is missing is USE_DN in order to retain backward compatibility. USE_DN will use the full DN of the user entry if possible. USE_USERNAME will use the username the user logged in with.
‘Authentication Expiration’ – The duration of how long the user authentication is valid for. If the user never logs out, they will be required to log back in following this duration.

OK, enough theory, let’s install a LDAP server using Apache Directory Studio. This project provides an easy way to setup a LDAP server but is also providing a great GUI to manage/administrate existing LDAP servers.I’ll go quick because it’s quite simple to setup and if needed the documentation of the official website is very useful.

Once downloaded and installed, just launch it. On the workbench, we are going to create a new server. Click on the ‘+’ symbol in the “LDAP Servers” tab:

screen-shot-2017-01-24-at-10-04-01-pm

Then, select Apache DS and give it a name:

Screen Shot 2017-01-24 at 10.04.16 PM.png

Create a connection: right click on your server / create a connection. And start your server to access it. You should be able to access the Overview tab of your server. We are going to create a partition/branch for NiFi users:

Screen Shot 2017-01-24 at 10.04.52 PM.png

Click on Advanced Partitions configuration and then Add a new partition. Here I decided to call my partition “dc=nifi,dc=com”:

Screen Shot 2017-01-24 at 10.05.14 PM.png

At this point, you need to restart your server (right click / stop, right click / start).

Now we are going to create an organizational unit for groups and an organizational unit for people. In the ou=groups, we will define two groups, one for normal users and one for administrators. And we are going to create one user in each group, a user “test” in the group “users”, and a user “admin” in the group “admins”. This can be done through the GUI but in this case, I’ll do it by importing the below LDIF file:

dn: ou=people,dc=nifi,dc=com
objectclass: organizationalUnit
objectClass: extensibleObject
objectclass: top
ou: people

dn: ou=groups,dc=nifi,dc=com
objectclass: organizationalUnit
objectClass: extensibleObject
objectclass: top
ou: groups

dn: cn=users,ou=groups,dc=nifi,dc=com
objectClass: groupOfUniqueNames
objectClass: top
cn: users
uniqueMember: cn=test,ou=people,dc=nifi,dc=com

dn: cn=admins,ou=groups,dc=nifi,dc=com
objectClass: groupOfUniqueNames
objectClass: top
cn: admins
uniqueMember: cn=admin,ou=people,dc=nifi,dc=com

dn: cn=test,ou=people,dc=nifi,dc=com
objectclass: inetOrgPerson
objectclass: organizationalPerson
objectclass: person
objectclass: top
cn: test
description: A test user
sn: test
uid: test
mail: test@nifi.com
userpassword: password

dn: cn=admin,ou=people,dc=nifi,dc=com
objectclass: inetOrgPerson
objectclass: organizationalPerson
objectclass: person
objectclass: top
cn: admin
description: A admin user
sn: admin
uid: admin
mail: admin@nifi.com
userpassword: password

To import it, right click on dc=nifi,dc=com, then Import, then LDIF import and select your file.

This will give you the following structure:

Screen Shot 2017-01-24 at 10.27.40 PM.png

Now we want to configure NiFi to connect to our LDAP server. For that you have to note that, by default, the manager of the server (for an Apache DS LDAP server) has “uid=admin,ou=system” as DN and “secret” as password. Then the XML file is configured as below (no LDAPS/TLS in this example):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<loginIdentityProviders>
    <provider>
        <identifier>ldap-provider</identifier>
        <class>org.apache.nifi.ldap.LdapProvider</class>
        <property name="Authentication Strategy">SIMPLE</property>

        <property name="Manager DN">uid=admin,ou=system</property>
        <property name="Manager Password">secret</property>

        <property name="Referral Strategy">FOLLOW</property>
        <property name="Connect Timeout">10 secs</property>
        <property name="Read Timeout">10 secs</property>

        <property name="Url">ldap://localhost:10389</property>
        <property name="User Search Base">ou=people,dc=nifi,dc=com</property>
        <property name="User Search Filter">uid={0}</property>

        <property name="Identity Strategy">USE_USERNAME</property>
        <property name="Authentication Expiration">12 hours</property>
    </provider>
</loginIdentityProviders>

We need to restart NiFi to take into account the modifications. Note: if NiFi is clustered, configuration files must be the same on all nodes.

Now… if you try to connect as test or admin, you will get the following error:

Unknown user with identity ‘admin’. Contact the system administrator.

This is because you first need to add this user in the list of users through NiFi UI using the initial admin account (see Apache NiFi 1.1.0 – Secured cluster setup). At there is no syncing mechanism to automatically add LDAP users/groups into NiFi.

When connected with your initial admin account (using your individual certificate), go into users to add your users, and then into policies to grant access and rights to the users:

Screen Shot 2017-01-24 at 10.45.35 PM.png

Screen Shot 2017-01-24 at 10.45.46 PM.png

You have now a NiFi instance integrated with a LDAP server and you can connect as different users defined in your LDAP. It gives you the opportunity to add users and play with the policy model implemented in NiFi.

Important note: NiFi has a large and active community, new features regarding LDAP integration could be provided very soon (for example: NIFI-3115).

As always, comments/remarks are welcomed!

Screen Shot 2016-08-13 at 8.18.54 PM

Scaling up/down a NiFi cluster

Scaling up – add a new node in the cluster

You have a NiFi cluster and you are willing to increase the throughput by adding a new node? Here is a way to do it without restarting the cluster. This post starts with a 3-nodes secured cluster with embedded ZooKeeper and we are going to add a new node to this cluster (and this new node won’t have an embedded ZK).

The node I’m going to add is a CentOS 7 virtual machine running an Apache NiFi 1.1.0 instance. All prerequisites are met (network, firewall, java, etc) and I uncompressed the NiFi and NiFi toolkit binaries archives.

First thing to do is, if you don’t have DNS resolution, to update the /etc/hosts file on all the nodes so that each node can resolve the others.

Then, I want to issue a certificate signed by my Certificate Authority for my new node. For this purpose, I use the NiFi toolkit and issue a CSR against my running CA server (see secured cluster setup). I created a directory /etc/nifi, and from this directory I ran the command:

.../bin/tls-toolkit.sh client -c node-3 -t myTokenToUseToPreventMITM -p 9999

node-3 being the node where my CA server is running.

It generates:

  • config.json
  • keystore.jks
  • nifi-cert.pem
  • truststore.jks

Then I can configure my new node with the following properties in ./conf/nifi.properties (to match the configuration used on the running cluster):

 

nifi.cluster.protocol.is.secure=true
nifi.cluster.is.node=true
nifi.cluster.node.address=node-4
nifi.cluster.node.protocol.port=9998
nifi.remote.input.host=node-4
nifi.remote.input.secure=true
nifi.remote.input.socket.port=9997
nifi.web.https.host=node-4
nifi.web.https.port=8443
nifi.zookeeper.connect.string=node-1:2181,node-2:2181,node-3:2181

And the following properties based on the information from the generated config.json file:

nifi.security.keystore=/etc/nifi/keystore.jks
nifi.security.keystoreType=jks
nifi.security.keystorePasswd=JBJUu5MGhGdBj6pTZkZaBR2jRijtcjEMlwqnqc9WkMk
nifi.security.keyPasswd=JBJUu5MGhGdBj6pTZkZaBR2jRijtcjEMlwqnqc9WkMk
nifi.security.truststore=/etc/nifi/truststore.jks
nifi.security.truststoreType=jks
nifi.security.truststorePasswd=gQ7Bw9kbkbr/HLZMxyIvq3jtMeTkoZ/0anD1ygbtSt0
nifi.security.needClientAuth=true

Then we need to configure the ./conf/authorizers.xml file to specify the initial admin identity and the nodes identities:

<authorizers>
    <authorizer>
        <identifier>file-provider</identifier>
        <class>org.apache.nifi.authorization.FileAuthorizer</class>
        <property name="Authorizations File">./conf/authorizations.xml</property>
        <property name="Users File">./conf/users.xml</property>
        <property name="Initial Admin Identity">CN=pvillard, OU=NIFI</property>
        <property name="Legacy Authorized Users File"></property>

        <property name="Node Identity 1">CN=node-1, OU=NIFI</property>
        <property name="Node Identity 2">CN=node-2, OU=NIFI</property>
        <property name="Node Identity 3">CN=node-3, OU=NIFI</property>
    </authorizer>
</authorizers>

IMPORTANT: the authorizers.xml file must match the file that is on the other nodes. We do not specify the Node Identity of the new node, this will be handled when the node is joining the cluster. Also, before starting your new node, delete the ./conf/flow.xml.gz file so that this new node will pick up the current flow definition from the running cluster.

At this point, you can start the new node:

./bin/nifi.sh start && tail -f ./logs/nifi-app.log

And you will be able to see in the cluster view in the NiFi UI that your new node has joined the cluster:

screen-shot-2016-11-29-at-9-19-37-pm

You now need to add the new user corresponding to the new node:

screen-shot-2016-11-29-at-9-20-44-pm

screen-shot-2016-11-29-at-9-21-20-pm

And then update the policies to grant proxy “user requests” write access to the new node.

screen-shot-2016-11-29-at-9-21-37-pm

Now I want to confirm that my new node is used in the workflow processing. I am going to create a simple workflow that generates flow files on primary node, distribute the load on all the nodes of the cluster and store the generated files in /tmp. To do that, I first need to grant me the rights to modify the root process group content:

Click on the key symbol on the canvas and grant yourself all the access.

screen-shot-2016-11-29-at-9-26-59-pm

Then in order to simplify the next tasks (grant permissions to nodes for site-to-site communications), I recommend you to create a group called “nodes” and to add all the nodes user inside this group:

screen-shot-2016-11-29-at-9-42-19-pm

Then I create my workflow with a Remote Process Group and an Input Port to ensure the load balancing:

screen-shot-2016-11-29-at-9-43-48-pm

And I also specify that my GenerateFlowFile processor is running on primary node only:

screen-shot-2016-11-29-at-9-44-05-pm

Besides, I add specific rights to the Input Port to allow site-to-site communication between nodes. To do that, select the Input Port, then click on the key symbol in the Operate Panel and grant “receive site-to-site data” to the “nodes” group.

I can now start my workflow and confirm that data is stored in /tmp on each one of my nodes. I now have a new node participating in the execution of the workflow by the NiFi cluster!

Scaling down – decommission a node of the cluster

OK… now we want to check how a node can be decommissioned. There is a lot of reasons to be in this situation but the most common ones are:

  • upgrade the cluster in a rolling fashion (decommission a node, upgrade the node, move back the node in the cluster and move on to the next one)
  • add a new processor NAR to the cluster (decommission a node, add NAR in the node’s library, restart NiFi on this node, move back the node in the cluster and move on to the next one)

To do that, you just need to go in the cluster view, disconnect your node, stop your node, perform your modifications and restart the node to get it back in the cluster. Coordinator and primary roles will be automatically assigned to a new node if needed.

But let’s say we are in more a “destructive” situation (removing a node for good) and we want to ensure that the data currently processed by the node we are decommissioning is correctly processed before shutting down NiFi on this node.For this purpose, I am going to change the end point of my workflow with a PutHDFS instead of a PutFile and send the data to an external cluster. The objective is to check that all the generated data is correctly going into HDFS without any issue.Here is my workflow:screen-shot-2016-11-30-at-9-16-54-amAnd here is the current status of my cluster:screen-shot-2016-11-30-at-9-18-29-amMy GenerateFlowFile is running on the primary node only (node-1) and I’ll generate a file every 100ms. I’ll decommission the node-1 to also demonstrate the change of roles.

As a note: when you define a remote process group, you need to specify one of the nodes of the remote cluster you want to send data to. This specific node is only used when you are starting the remote process group to connect to the remote cluster. Once done, the RPG will be aware of all the nodes of the remote cluster and will be able to send data to all the nodes even though the specified remote node is disconnected. In a short future, it’ll be possible to specify multiple remote nodes in order to be more resilient (in case the RPG is stopped/restarted and the specified remote node is no more available).

To decommission a node, I only need to click on the “disconnect” button on the cluster view for the node I want to disconnect.Before disconnecting my node, I can check that my workflow is running correctly and I have flow files queued in my relationships:screen-shot-2016-11-30-at-9-31-46-amI can now disconnect my node-1 and check what is going on. Roles have been reassigned and the workflow is still running:screen-shot-2016-11-30-at-9-33-43-amscreen-shot-2016-11-30-at-9-33-55-amNote that, as I explained, the remote process group is still working as expected even though it was using node-1 as remote node (see note above).I can now access the UI of the node I just disconnected and check the current status of this node (https://node-1:8443/nifi). When I access the node, I’ve the following warning:screen-shot-2016-11-30-at-9-35-15-amAnd I can see that the workflow is still running on this node. This way, I can wait until all flow files currently on this node are processed:screen-shot-2016-11-30-at-9-35-27-amMy GenerateFlowFile is not running anymore since my node lost its role of Primary Node, and we can wait for the end of the processing of all flow files on this node. Once all flow files are processed, we have the guarantee that there will be no data loss and we can now shutdown the node if needed.If you want to reconnect it once you have performed your updates, you just need to go on the UI (from a node in the cluster), go in the cluster view and reconnect the node:screen-shot-2016-11-30-at-9-33-43-amThe node is now reconnected and used in the workflow processing:screen-shot-2016-11-30-at-9-43-24-amI can check that all the files I’ve generated have been transferred, as expected, on my HDFS cluster.screen-shot-2016-11-30-at-9-45-25-amThat’s all for this blog! As you can see, scaling up and down your NiFi cluster is really easy and it gives you the opportunity to ensure no service interruption.As always, comments and suggestions are welcomed.

 

Screen Shot 2016-08-13 at 8.18.54 PM

Apache NiFi 1.1.0 – Secured cluster setup

Apache NiFi 1.1.0 is now out, and I want to discuss a specific subject in a couple of posts: how to scale up and down a NiFi cluster without loosing data? Before going into this subject, I want to setup a 3-nodes secured cluster using the NiFi toolkit. It will be my starting point to scale up my cluster with an additional node, and then scale down my cluster.

There are already great posts describing how to setup a secured cluster using embedded ZK and taking advantage of the NiFi toolkit for the certificates, so I won’t go in too much details. For reference, here are some great articles I recommend:

OK… let’s move on. My initial setup is the following:

  • My OS X laptop
  • 2 CentOS 7 VM

All nodes have required prerequisites (network, java, etc), have the NiFi 1.1.0 binary archive uncompressed available, and have the NiFi toolkit 1.1.0 binary archive uncompressed available.

On my OS X laptop, I will use the NiFi TLS toolkit in server mode so that it acts as a Certificate Authority that can be used by clients to get Certificates.

Here is the description of how to use the TLS toolkit in server mode:

./bin/tls-toolkit.sh server

usage: org.apache.nifi.toolkit.tls.TlsToolkitMain [-a <arg>] [-c <arg>] [--configJsonIn <arg>] [-d <arg>] [-D <arg>] [-f <arg>] [-F] [-g] [-h] [-k <arg>] [-p
       <arg>] [-s <arg>] [-T <arg>] [-t <arg>]

Acts as a Certificate Authority that can be used by clients to get Certificates

 -a,--keyAlgorithm <arg>                   Algorithm to use for generated keys. (default: RSA)
 -c,--certificateAuthorityHostname <arg>   Hostname of NiFi Certificate Authority (default: localhost)
    --configJsonIn <arg>                   The place to read configuration info from (defaults to the value of configJson), implies useConfigJson if set.
                                           (default: configJson value)
 -d,--days <arg>                           Number of days issued certificate should be valid for. (default: 1095)
 -D,--dn <arg>                             The dn to use for the CA certificate (default: CN=YOUR_CA_HOSTNAME,OU=NIFI)
 -f,--configJson <arg>                     The place to write configuration info (default: config.json)
 -F,--useConfigJson                        Flag specifying that all configuration is read from configJson to facilitate automated use (otherwise configJson will
                                           only be written to.
 -g,--differentKeyAndKeystorePasswords     Use different generated password for the key and the keyStore.
 -h,--help                                 Print help and exit.
 -k,--keySize <arg>                        Number of bits for generated keys. (default: 2048)
 -p,--PORT <arg>                           The port for the Certificate Authority to listen on (default: 8443)
 -s,--signingAlgorithm <arg>               Algorithm to use for signing certificates. (default: SHA256WITHRSA)
 -T,--keyStoreType <arg>                   The type of keyStores to generate. (default: jks)
 -t,--token <arg>                          The token to use to prevent MITM (required and must be same as one used by clients)

In my case, I run the TLS toolkit on my node-3, and I run the following command:

./bin/tls-toolkit.sh server -c node-3 -t myTokenToUseToPreventMITM -p 9999

On each of my node, I created a nifi directory in /etc and I ran a command using the toolkit to get my signed certificates generated into my current directory:

.../bin/tls-toolkit.sh client -c node-3 -t myTokenToUseToPreventMITM -p 9999

And I have the following generated files on each of my nodes:

  • config.json
  • keystore.jks
  • nifi-cert.pem
  • truststore.jks

I now configure ZooKeeper as described here. Here is a short list of the tasks (in bold what slightly changed in comparison with my previous post):

  • configure conf/zookeeper.properties to list the ZK nodes
  • configure ZK state ID file
  • set nifi.state.management.embedded.zookeeper.start
  • set nifi.zookeeper.connect.string
  • set nifi.cluster.protocol.is.secure=true
  • set nifi.cluster.is.node=true
  • set nifi.cluster.node.address=node-<1-3>
  • set nifi.cluster.node.protocol.port=9998
  • set nifi.remote.input.host=node-<1-3>
  • set nifi.remote.input.secure=true
  • set nifi.remote.input.socket.port=9997
  • set nifi.web.https.host=node-<1-3>
  • set nifi.web.https.port=8443

(ports are changed to ensure there is no conflict with the TLS toolkit running the CA server)

I now configure the following properties with what has been generated by the toolkit:

nifi.security.keystore=
nifi.security.keystoreType=
nifi.security.keystorePasswd=
nifi.security.keyPasswd=
nifi.security.truststore=
nifi.security.truststoreType=
nifi.security.truststorePasswd=
nifi.security.needClientAuth=

For one of my node, it gives:

nifi.security.keystore=/etc/nifi/keystore.jks
nifi.security.keystoreType=jks
nifi.security.keystorePasswd=4nVzULZ+CBiPCePvFSCw4LDAvzNoumAqu+TcyeDQ1ac
nifi.security.keyPasswd=4nVzULZ+CBiPCePvFSCw4LDAvzNoumAqu+TcyeDQ1ac
nifi.security.truststore=/etc/nifi/truststore.jks
nifi.security.truststoreType=jks
nifi.security.truststorePasswd=0fFJ+pd4qkso0jC0jh7w7tLPPRSINYI6of+KnRBRVSw
nifi.security.needClientAuth=true

Then I generate a certificate for myself as a client to be able to authenticate against NiFi UI:

.../bin/tls-toolkit.sh client -c node-3 -t myTokenToUseToPreventMITM -p 9999 -D "CN=pvillard,OU=NIFI" -T PKCS12

Don’t forget the -T option to get your client certificate in a format that is easy to import in your browser (PKCS12). This command also generates a nifi-cert.pem file that corresponds to the CA certificate, you will need to import it in your browser as well (and you might need to manually update the trust level on this certificate to ensure you have access to the UI).

At this point I’m able to fill the authorizers.xml file. I need to specify myself as initial admin identity (to access the UI with full administration rights), and specify each nodes of my cluster (using the DN provided with the generated certificates). It gives:

<authorizers>
    <authorizer>
        <identifier>file-provider</identifier>
        <class>org.apache.nifi.authorization.FileAuthorizer</class>
        <property name="Authorizations File">./conf/authorizations.xml</property>
        <property name="Users File">./conf/users.xml</property>
        <property name="Initial Admin Identity">CN=pvillard, OU=NIFI</property>
        <property name="Legacy Authorized Users File"></property>

        <property name="Node Identity 1">CN=node-1, OU=NIFI</property>
        <property name="Node Identity 2">CN=node-2, OU=NIFI</property>
        <property name="Node Identity 3">CN=node-3, OU=NIFI</property>
    </authorizer>
</authorizers>

WARNING – Please be careful when updating this file because identities are case-sensitive and blank-sensitive. For example, even though I specified

-D "CN=pvillard,OU=NIFI"

when executing the command to generate the certificates, it introduced a white space after the comma. The correct string to use in the configuration file is given in the output of the TLS toolkit when executing the command.

Once I’ve updated this file on each node, I’m now ready to start each node of the cluster.

./bin/nifi.sh start && tail -f ./logs/nifi-app.log

Once all nodes are correctly started, I am now able to access the NiFi UI using any of the nodes in the cluster:

NiFi UI / 3-nodes secured cluster
NiFi UI / 3-nodes secured cluster

That’s all! Next post will use the current environment as a starting point to demonstrate how to scale up/down a NiFi cluster.

Screen Shot 2016-08-13 at 8.18.54 PM

Apache NiFi 1.0.0 – Cluster setup

As you may know a version 1.0.0-BETA of Apache NiFi has been released few days ago. The upcoming 1.0.0 release will be a great moment for the community as it it will mark a lot of work over the last few months with many new features being added.

The objective of the Beta release is to give people a chance to try this new version and to give a feedback before the official major release which will come shortly. If you want to preview this new version with a completely new look, you can download the binaries here, unzip it, and run it (‘./bin/nifi.sh start‘ or ‘./bin/run-nifi.bat‘ for Windows), then you just have to access http://localhost:8080/nifi/.

The objective of this post is to briefly explain how to setup an unsecured NiFi cluster with this new version (a post for setting up a secured cluster will come shortly with explanations on how to use a new tool that will be shipped with NiFi to ease the installation of a secured cluster).

One really important change with this new version is the new paradigm around cluster installation. From the NiFi documentation, we can read:

Starting with the NiFi 1.0 release, NiFi employs a Zero-Master Clustering paradigm. Each of the nodes in a NiFi cluster performs the same tasks on the data but each operates on a different set of data. Apache ZooKeeper elects one of the nodes as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator. The Cluster Coordinator is responsible for disconnecting and connecting nodes. As a DataFlow manager, you can interact with the NiFi cluster through the UI of any node in the cluster. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points to the cluster.

zero-master-cluster

OK, let’s start with the installation. As you may know it is greatly recommended to use an odd number of ZooKeeper instances with at least 3 nodes (to maintain a majority also called quorum). NiFi comes with an embedded instance of ZooKeeper, but you are free to use an existing cluster of ZooKeeper instances if you want. In this article, we will use the embedded ZooKeeper option.

I will use my computer as the first instance. I also launched two virtual machines (with a minimal Centos 7). All my 3 instances are able to communicate to each other on requested ports. On each machine, I configure my /etc/hosts file with:

192.168.1.17 node-3
192.168.56.101 node-2
192.168.56.102 node-1

I deploy the binaries file on my three instances and unzip it. I now have a NiFi directory on each one of my nodes.

The first thing is to configure the list of the ZK (ZooKeeper) instances in the configuration file ‘./conf/zookeep.properties‘. Since our three NiFi instances will run the embedded ZK instance, I just have to complete the file with the following properties:

server.1=node-1:2888:3888
server.2=node-2:2888:3888
server.3=node-3:2888:3888

Then, everything happens in the ‘./conf/nifi.properties‘. First, I specify that NiFi must run an embedded ZK instance, with the following property:

nifi.state.management.embedded.zookeeper.start=true

I also specify the ZK connect string:

nifi.zookeeper.connect.string=node-1:2181,node-2:2181,node-3:2181

As you can notice, the ./conf/zookeeper.properties file has a property named dataDir. By default, this value is set to ./state/zookeeper. If more than one NiFi node is running an embedded ZK, it is important to tell the server which one it is.

To do that, you need to create a file name myid and placing it in ZK’s data directory. The content of this file should be the index of the server as previously specify by the server. property.

On node-1, I’ll do:

mkdir ./state
mkdir ./state/zookeeper
echo 1 > ./state/zookeeper/myid

The same operation needs to be done on each node (don’t forget to change the ID).

If you don’t do this, you may see the following kind of exceptions in the logs:

Caused by: java.lang.IllegalArgumentException: ./state/zookeeper/myid file is missing

Then we go to clustering properties. For this article, we are setting up an unsecured cluster, so we must keep:

nifi.cluster.protocol.is.secure=false

Then, we have the following properties:

nifi.cluster.is.node=true
nifi.cluster.node.address=node-1
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

I set the FQDN of the node I am configuring, and I choose the arbitrary 9999 port for the communication with the elected cluster coordinator. I apply the same configuration on my other nodes:

nifi.cluster.is.node=true
nifi.cluster.node.address=node-2
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

and

nifi.cluster.is.node=true
nifi.cluster.node.address=node-3
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

We have configured the exchanges between the nodes and the cluster coordinator, now let’s move to the exchanges between the nodes (to balance the data of the flows). We have the following properties:

nifi.remote.input.host=node-1
nifi.remote.input.secure=false
nifi.remote.input.socket.port=9998
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec

Again, I set the FQDN of the node I am configuring and I choose the arbitrary 9998 port for the Site-to-Site (S2S) exchanges between the nodes of my cluster. The same applies for all the nodes (just change the host property with the correct FQDN).

It is also important to set the FQDN for the web server property, otherwise we may get strange behaviors with all nodes identified as ‘localhost’ in the UI. Consequently, for each node, set the following property with the correct FQDN:

nifi.web.http.host=node-1

And that’s all! Easy, isn’t it?

OK, let’s start our nodes and let’s tail the logs to see what’s going on there!

./bin/nifi.sh start && tail -f ./logs/nifi-app.log

If you look at the logs, you should see that one of the node gets elected as the cluster coordinator and then you should see heartbeats created by the three nodes and sent to the cluster coordinator every 5 seconds.

You can connect to the UI using the node you want (you can have multiple users connected to different nodes, modifications will be applied on each node). Let’s go to:

http://node-2:8080/nifi

Here is what it looks like:

Screen Shot 2016-08-13 at 7.33.08 PM

As you can see in the top-left corner, there are 3 nodes in our cluster. Besides, if we go in the menu (button in the top-right corner) and select the cluster page, we have details on our three nodes:

Screen Shot 2016-08-13 at 7.35.28 PM

We see that my node-2 has been elected as cluster coordinator, and that my node-3 is my primary node. This distinction is important because some processors must run on a unique node (for data consistency) and in this case we will want it to run “On primary node” (example below).

We can display details on a specific node (“information” icon on the left):

Screen Shot 2016-08-13 at 7.35.48 PM

OK, let’s add a processor like GetTwitter. Since the flow will run on all nodes (with balanced data between the nodes), this processor must run on a unique processor if we don’t want to duplicate data. Then, in the scheduling strategy, we will choose the strategy “On primary node”. This way, we don’t duplicate data, and if the primary node changes (because my node dies or gets disconnected), we won’t loose data, the workflow will still be executed.

Screen Shot 2016-08-13 at 7.45.19 PM

Then I can connect my processor to a PutFile processor to save the tweets in JSON by setting a local directory (/tmp/twitter):

Screen Shot 2016-08-13 at 7.52.25 PM

If I run this flow, all my JSON tweets will be stored on the primary node, the data won’t be balanced. To balance the data, I need to use a RPG (Remote Process Group), the RPG will connect to the coordinator to evaluate the load of each node and balance the data over the nodes. It gives us the following flow:

Screen Shot 2016-08-13 at 8.00.26 PM

I have added an input port called “RPG”, then I have added a Remote Process Group that I connected to ” http://node-2:8080/nifi ” and I enabled transmission so that the Remote Process Group was aware of the existing input ports on my cluster. Then in the Remote Process Group configuration, I enabled the RPG input port. I then connected my GetTwitter to the Remote Process Group and selected the RPG input port. Finally, I connected my RPG input port to my PutFile processor.

When running the flow, I now have balanced data all over my nodes (I can check in the local directory ‘/tmp/twitter‘ on each node).

That’s all for this post. I hope you enjoyed it and that it will be helpful for you if setting up a NiFi cluster. All comments/remarks are very welcomed and I kindly encourage you to download Apache NiFi, to try it and to give a feedback to the community if you have any.