NiFi 1.8+ – Revolutionizing the List/Fetch pattern and more…

If you read my post about List/Fetch pattern and if you’re using this approach for some of your workflows, this new feature coming with NiFi 1.8.0 is going to be a revolution.

A quick recap about the context: in NiFi, unless you specifically do something to make the nodes of a NiFi cluster exchange data, a flow file will remain on the same node from the beginning to the end of the workflow. For some use cases, it is necessary to load-balance the flow files among the nodes of the cluster to take advantage of all the nodes (like the List/Fetch pattern).

The only way to load balance data in a NiFi cluster before NiFi 1.8 is to use the Site-to-Site (S2S) protocol in NiFi with a Remote Process Group (RPG) connected to the cluster itself with an input/output port. In addition to that the S2S forces you to have the input/output port defined at the root level of the canvas.  In a multi-tenant environment this can be a bit annoying and can make the workflows a little bit more complex.

What’s the revolution with NiFi 1.8+? For intra-cluster load balancing of the flow files, you now can do it directly by configuring it on the relationship between two processors. It could sound like a minor thing, but that’s HUGE! In addition to that, you have few options to configure the load-balancing which opens up new possibilities for new use cases!

The List/Fetch pattern is described in my previous post: in short… it’s the action to use a first processor (ListX) only running on the primary node to list the available data on X and generate one flow file per object to retrieve on X (the flow file does not have any content but contains the metadata to be used to fetch the object), then flow files are distributed among the NiFi nodes and then the FetchX processor running on all nodes will take care of actually retrieving the data. This way you ensure there is no concurrent access to the same object and you distribute the work in your cluster.

List/Fetch pattern before NiFi 1.8.0

If we have a project A retrieving data from a FTP server using the List/Fetch pattern to push the data into HDFS, it’d look like this:

  • Root Process Group level

Screen Shot 2018-10-18 at 10.03.23 PM.png

  • Inside the Process Group dedicated to Project A

Screen Shot 2018-10-18 at 10.04.01 PM.png

The ListFTP is running on the primary node and sends the data to the RPG which load balances the flow files among the nodes. Flow files are pushed to the input port at the root level and the data can then be moved back down to the process group of the project. Then the FetchFTP actually retrieves the data and the data is sent to HDFS.

List/Fetch pattern with NiFi 1.8+

Now… it looks like this:

  • Root Process Group level

Screen Shot 2018-10-18 at 10.09.29 PM

  • Inside the Process Group dedicated to Project A

Screen Shot 2018-10-18 at 10.10.30 PM

It’s crazy, no? You don’t have anything outside of the process group anymore, the workflow is cleaner/simpler, and authorizations are much easier to manage.

Where is the trick? Did you notice the small icon on the relationship between the ListFTP and the FetchFTP? It looks small but it’s HUGE :).

Load balancing strategies in NiFi 1.8+

Let’s have a look at the configuration of a connection:

Screen Shot 2018-10-18 at 10.18.24 PM.png

There is a new parameter available: the load balance strategy. By default it defaults to “do not load balance” and, unless you need to, you won’t change that parameter (you don’t want to move data between your nodes at each step of the workflow if there is no reason to do so).

Here are the available strategies:

Screen Shot 2018-10-18 at 10.20.24 PM

The Round robin strategy is the one you would probably use in a List/Fetch use case. It will ensure your data is evenly balanced between your nodes.

The Single node strategy allows you to send the data back to one single node. You can see it a bit like a reducer in a MapReduce job: you process the data on all the nodes and then you want to perform a step on a single node. One example could be: I have a zip file containing hundreds of files, I unzip the file on one node, load balance all the flow files (using Round Robin strategy for example) among the nodes, process the files on all the nodes and then send back the flow files to a single node to compress back the data into a single file. It could look like this:

Screen Shot 2018-10-18 at 10.31.08 PM.png

Then you have the Partition by attribute strategy allowing you to have all the flow files sharing the same value for an attribute to be sent on the same node. For example, let’s say you receive data on NiFi behind a load balancer, you might want to have all the data coming from a given group of sources on the same node.

I won’t go into much more details, but feel free to have a look at the documentation. Besides, I’m sure other members of the Apache NiFi community will publish blog posts on this subject… such as this excellent one!

Let’s just give it a try with each strategy… I’m using a GenerateFlowFile (GFF) connected to an UpdateAttribute and we will list the flow files queuing in the relationship to check where the flow files are located. Besides, the GenerateFlowFile is configured to set a random integer between 2 and 4 for the attribute ‘filename’.

Let’s start with a GFF running on the primary node only with no load balancing strategy:

Screen Shot 2018-10-18 at 10.44.06 PM.png

My primary node being my node2 (I have node2, node3 and node4 in my cluster):

Screen Shot 2018-10-18 at 10.45.05 PM.png

I can confirm all the flow files are on the primary node:

Screen Shot 2018-10-18 at 10.46.37 PM.png

Let’s change the Load Balance strategy to Round Robin. We can confirm the data is evenly distributed:

Screen Shot 2018-10-18 at 10.50.02 PM

Let’s change the strategy to One single node. We can confirm the data is now back to a single node (not necessarily the primary node):

Screen Shot 2018-10-18 at 10.53.22 PM.png

And now let’s try the partitioning by attribute using the ‘filename’ attribute. We can confirm that all the flow files sharing the same value in the Filename column are on the same node:

Screen Shot 2018-10-18 at 10.55.21 PM.png

Screen Shot 2018-10-18 at 10.56.12 PM.png

Again, I expect to see additional blog posts on this subject with more technical insights on how it actually works and what you should consider in case your cluster is scaling up and down.

Also… this new feature comes with another one which is as much exciting: the offloading of nodes in a cluster (look at the Admin guide in the documentation). In case you want to decommission a node in a cluster, this will take care of getting back the data on the other nodes before you actually remove the node for good. This will be particularly useful when deploying NiFi on Kubernetes and scaling up and down your cluster!

As always, feel free to comment and/or ask questions! Thanks for reading!

Monitoring Driven Development with NiFi 1.7+

I already discussed the Monitoring Driven Development (MDD) approach in this post and already covered some monitoring topics around NiFi in few posts (reading the posts would probably help you for reading this one – or at least refresh your memory on some concepts). In this post I want to present once again the MDD approach with some new features of NiFi 1.7 making things a bit easier and cleaner.

What do I mean when talking about MDD in NiFi? When I say MDD, I refer to the fact that, when a team is developing a “core workflow” for a use case, the team should also develop a “monitoring workflow” that will be in charge of monitoring and reporting the health of the “core workflow” based on their business requirements.

How does this work in NiFi? This approach completely relies on two things:

  • A proper naming convention for the components used in the workflows
  • The use of the Site-to-Site Reporting Tasks (see here)

Basically, the idea is to setup the S2S reporting tasks to send data to an input port dedicated to monitoring, then basic routing and filtering is performed assuming projects are following the naming convention so that all the data related to a given “core workflow” is automatically routed to the associated “monitoring workflow”. This way, it’s up to each project to implement the monitoring they want for their use case based on the requirements and SLA they have.

What’s new in NiFi that’s going to help me with MDD?

There are quite few interesting JIRAs but the two most important ones are:

  • [NIFI-4814] – Add distinctive attribute to S2S reporting tasks (NiFi 1.6.0)
  • [NIFI-5122] – Add record writer to S2S Reporting Tasks (NiFi 1.7.0)

Example

For my example, the naming convention is the following: any component that should be monitored should be named with a prefix ABC_ where ABC is the three letters abbreviation representing the use case/project.

I have two projects, one named PJA and one named PJB.

Let’s say the project A is using HandleHttpRequest/Response to receive XML data over HTTP and has two monitoring requirements:

  • Send an email to the team in case a bulletin is generated
  • Monitor in Grafana the amount of valid/invalid data being processed

Let’s say the project B is using ListenTCPRecord for a use case around logs ingestion and has one monitoring requirement:

  • Send an email to the team in case back pressure is enabled

The development of the workflows is outside the scope of this post. I’ll keep things simple as the objective is to demonstrate the MDD approach, not how to implement the workflows of the specific use cases I chose.

A *very basic* implementation of the “core workflow” of PJA could be:

Screen Shot 2018-08-29 at 3.55.19 PM.png

A *very basic* implementation of the “core workflow” of PJB could be:

Screen Shot 2018-08-29 at 4.36.29 PM

Now I’m going to configure the S2S Reporting Tasks – one for Controller Status data and one for Bulletins data. Note that there is a new feature allowing users to select a Record Writer for the generated data. I’m going to use the Avro one for the reporting tasks so that I don’t have to care about the schemas at all. For that I need to create a Controller Service (in Controller Settings menu since this controller service is exclusively used by reporting tasks).

Screen Shot 2018-08-29 at 5.02.26 PM.png

After enabling the CS, I can configure my reporting tasks.

The one for bulletins:

Screen Shot 2018-08-29 at 5.03.59 PM

And the one for controller status:

Screen Shot 2018-08-29 at 5.05.21 PM

Note that you have additional S2S Reporting Tasks if needed. The one about provenance events can be particularly useful for some monitoring requirements such as latency.

At the root level of my canvas, I’ve something looking like:

Screen Shot 2018-08-29 at 5.45.38 PM.png

I’m receiving all the monitoring data sent by the reporting tasks in the ‘monitoring’ input port. Then in my ‘Monitoring’ process group, I have something looking like:

Screen Shot 2018-08-29 at 5.46.48 PM.png

You can notice I’m sending all the data to a ‘Technical Monitoring’. That’s because, the team in charge of administrating/monitoring the platform probably wants to also have a look at the monitoring data to ensure the full system is performing OK (common examples: send the bulletins to an external tool like ES/Kibana, send email alerts in case back pressure is enabled on a connection, etc).

On the upper part of the workflow, that’s where I’m processing the monitoring data to split it and route the monitoring data by project. First I’m using a RouteOnAttribute to route the data based on the type (bulletins, controller status, etc):

Screen Shot 2018-08-29 at 5.50.48 PM

Then I’m using an UpdateRecord processor to add a ‘project’ field that I’m computing based on the component name.

For bulletins:

Screen Shot 2018-08-29 at 5.55.34 PM

For controller status:

Screen Shot 2018-08-29 at 5.52.43 PM.png

Since the data has been sent in Avro format by the reporting task, I’m using an Avro reader Controller Service with default settings. And, from now on, I want the data to be in JSON (as it’ll be easier in case I want to send the data via an email or via HTTP). I’m configuring few controller services: one Avro Schema Registry containing the schemas (that you can retrieve in the additional details of the reporting task documentation, see below), one JSON Reader and one JSON Writer.

For instance, to get the schema of the S2S Bulletin Reporting Task, go on the reporting task list:

Screen Shot 2018-08-29 at 5.59.40 PM.png

Click on the documentation icon on the left:

Screen Shot 2018-08-29 at 6.00.21 PM.png

And then click Additional Details:

Screen Shot 2018-08-29 at 6.00.58 PM.png

When adding the schemas in the Avro Schema Registry, use the type of the reporting task as the name of the schema and don’t forget to add a ‘project’ field in the schemas:

Screen Shot 2018-08-29 at 6.02.28 PM.png

Now you can easily reference the schemas to use in the configuration of the JSON Reader/Writer using the expression language because the flow files generated by the reporting tasks have an attribute with the type of the source reporting task.

Example of attributes on a flow file generated by a RT:

Screen Shot 2018-08-29 at 6.28.56 PM.png

JSON Reader:

Screen Shot 2018-08-29 at 6.03.44 PM.png

JSON Writer:

Screen Shot 2018-08-29 at 6.04.13 PM.png

Now I’ve a ‘project’ field in my data with the project name (if the naming convention has been correctly used), then you can use a QueryRecord processor to route the data for each project:

Screen Shot 2018-08-29 at 6.07.17 PM.png

And we can confirm that only the data for project A is routed to the ‘Monitoring Workflow’ for project A. If looking at the data in the relationship routing data for PJA:

Screen Shot 2018-08-29 at 6.08.53 PM.png

Now, when deploying a new project for the first time (ie importing the process group of the “core workflow”), you just need to update the QueryRecord processor to add a dynamic property for the new project as well as deploying the process group for the “monitoring workflow” of the project. In combination with the NiFi Registry it makes things really smooth when you need to update the workflows with a new version.

I’m not going to develop the monitoring workflows for PJA and PJB but it’d only be a matter of few record processors to filter the data (bulletin? back pressure enabled?), and then use a processor to send the data (ElasticSearch, PutEmail, InvokeHttp, etc):

  • “send an email when a bulletin is generated” – RouteOnAttribute to get the bulletins data and PutEmail to send an email
  • “monitor the number of valid/invalid XML files being processed” – filter the controller status statistics on the ‘Connection’ elements and keep only the statistics of the connections after the ValidateRecord processor, this can be easily done with a QueryRecord processor. The numbers can then be sent to an external system such as ElasticSearch for a dashboard in Kibana/Grafana
  • “send an email when back pressure is enabled” – use a QueryRecord processor on the controller status data by querying the field ‘isBackPressureEnabled’ and then use PutEmail processor.

The full workflow of this example is available as a gist here. Note that, since it’s a template, it does not contain the Reporting Tasks configuration. But you should be up and running quite quickly with the above screenshots.

As always, thanks for reading this post, feel free to comment and/or ask questions. In case you have suggestions to make things even better, just let me know and, maybe, I’ll have to publish a new version of this post to present the new features of a next release 😉

Discussion around Ranger policies for HDFS

OK… so… Apache Ranger is the piece you want to use to define authorizations in your cluster (and, most importantly, get all the audits coming with all the policies you define). Not only for HDFS but for all the components you’re using. In this post I just want to discuss about policies to secure how HDFS is used.

More specifically, here is my scenario:

  • I have a project called ‘myProject’ consisting of a Spark job accessing and processing sensitive data sitting in HDFS and the result of this processing is written back into HDFS as well. This data is so sensitive that users should not be able to access it unless specific permissions are granted. The job is running with the user ‘myProjectUser’.

Now… let’s say the team working on the project does not have permissions to access the production data and has mock datasets for development and staging environments. OK great, that’s usual situation. BUT… we have a rogue user, John Doe, in the development team and, somehow, he is able to introduce some lines of code so that the Spark job is writing the result in two places: the expected location where results should be written, and another unexpected place where John Doe has access on the production cluster. How can you prevent that?

Well… the immediate answer is a proper development process with pull requests and peer reviewing on any commit pushed to the code repository you’re using. But… let’s say that’s not enough, or, that two rogue users agreed to commit that code through a pull request. How can you set the Ranger policies to prevent this specific scenario?

Note – for this post, when I’m saying a folder has hdfs:hadoop:755, it means that chmod 755 and chown hdfs:hadoop have been set on that folder.

Let’s talk quickly about how Ranger works and, more specifically, how it works with HDFS. By default, Ranger allows you to define “allow” policies. Meaning that you grant specific access (read, write, execute) to a specific resource (file, folder) in HDFS. But something you need to recall is that, by default, HDFS ACLs will apply in combination with Ranger policies (both are used to check whether a user can access a resource). It means that, if you have:

  • a folder /myFolder with hdfs:hadoop:755 in HDFS
  • an “allow” policy saying that people belonging to group “sales” have RWX access to /myFolder

In reality, this “allow” policy does not prevent any access from people not belonging to the “sales” group. Since chmod is 755 anyone having access to HDFS can read the data inside that folder.

However, if you set chmod 000 to that folder, then, only the “allow” policy will apply and only people belonging to the “sales” group will have access to the folder.

Note – that’s not entirely true since there is one exception: the HDFS superusers you defined. Any HDFS superuser can access any folder. When sending a request to the Name Node (NN), the NN will check the identity of the requester and if it’s a superuser, it’ll not go through the authorization parts of the code. If you want to prevent HDFS superusers to access some data, then you probably want to look at Ranger KMS that will help ensuring segregation of duties by encrypting the data in HDFS using encrypting keys only available to specific users.

Note – it’s possible to prevent Ranger fallbacking on HDFS ACLs by setting
xasecure.add-hadoop-authorization=false
but it means you have to set Ranger policies for everything. That’s doable but can require some efforts to do it properly.

Now that you’re aware of that behavior, you probably want to look at this post with some best practices about Ranger and HDFS. The first thing I’d recommend is to change the default umask from 022 to 077 to ensure that, by default, a newly created file/directory can only be read by the owner unless a Ranger policy states otherwise.

Then, you want to take care of some specific folders in HDFS such as the ones under /apps (folders used by Hive, HBase, Zeppelin, etc). The idea is to set chmod 000 on the folders so that HDFS ACLs do not apply and only Ranger policies are enforced. Let’s say you have /apps/hive/warehouse where all your Hive databases are stored. If you set chmod 000 on /apps/hive, then no one can go inside a subfolder of /apps/hive unless there is an “allow” policy in Ranger for that user.

OK, that’s great… but let’s go back to our scenario where we have a rogue user… Our John Doe can SSH to an edge node on the production cluster where he has his home directory on HDFS (/user/jdoe with jdoe:jdoe:700). We could say that we are safe: the job running as myProjectUser does not have permissions to write in /user/jdoe. But there is nothing preventing jdoe to change the permissions on his home directory with a chmod 777 (however, only an HDFS superuser can execute a chown command to change ownership on a directory). Once John Doe changed the permissions, the job can write into that folder and then John can do whatever he wants with the data… Even though you’re defining “allow” policies in Ranger.

Let’s have a look at the folders potentially exposing this issue. We have:

  • /user
  • /tmp
  • /app-logs
  • /mr-history
  • /spark-history
  • /spark2-history

The above folders (except /user) have, by default, a chmod 777 technically allowing anyone to create a folder there and use that folder for an unexpected purpose. Let’s have a look and let’s try to ensure no one can create something we don’t want.

Example – /spark-history & /spark2-history (part 1)

That’s where the Spark History Server(s) are going to store data about ongoing Spark jobs, and completed jobs based on retention policies you defined. If we look at the structure:

/spark-history is spark:hadoop:777

and then we have folders such as:

/spark-history/<folder> with <user running the job>:hadoop:770

Folders are complying with a naming convention but we don’t really care here.

The objective is to get rid of the chmod 777 at the top level BUT… the issue here is that a user running a job needs the right to create a directory. So, basically, anyone needs the right to create a directory in /spark-history. I can’t just set chmod 000 and define “allow” policies… And that’s where Ranger introduced a very nice feature: the “deny” policies.

The “deny” policies in Ranger

If you want all the details, have a look here and here. The main idea is to add the option, when defining a policy, to:

  • exclude groups or users from the allow policy
  • deny access to users/groups on the resource
  • exclude groups or users from the deny policy

Then, here is how the policy is evaluated in the authorizer:

Ranger-Policy-Evaluation-Flow.png

Basically, if there is no “deny” policy, the most permissive access between Ranger policies and HDFS ACLs will be used to grant access to the resource. In other words, unless you specifically defined a “deny” policy that applies for the accessed resource, the HDFS ACLs will always be considered for authorization.

Enabling the “deny” policies

By default, the “deny” policies are not available in the Ranger UI. It’s mainly because the concept can be hard to understand and things can quickly become a mess when using the “deny” policies. Also, for most of the users, this feature will never be used. One needs to be careful and rigorous when using that feature.

To enable it, retrieve the HDFS service definition from the Ranger Admin server:

$ curl -k -u <user> https://<ranger_admin_server>:6182/service/public/v2/api/servicedef/1 > hdfs.json

Then, have a look at the content:

$ cat hdfs.json
{...,"isEnabled":true,"name":"hdfs","options":{"enableDenyAndExceptionsInPolicies":"false"}...}

In the “options” field, you can add (or modify it if it’s already there):

"options":{"enableDenyAndExceptionsInPolicies":"true"}

Then you just have to update the service definition:

$ curl -k -u <user> -X PUT -H "Accept: application/json" -H "Content-Type: application/json" -d @hdfs.json https://<ranger_admin_server>:6182/service/public/v2/api/servicedef/1

Variables in Ranger

In addition to the “deny” policies (introduced in Ranger 0.6), we’re going to use the variables (introduced in Ranger 0.7). We can use:

  • {OWNER} – owner of the resource
  • {USER} – user accessing the resource

It’s a very convenient way to define policies without specifically specifying users.

Example – /spark-history & /spark2-history (part 2)

We now have the tools to secure our /spark-history server by creating the following rules:

  • on /spark-history
    • Allow policy for group “public”, RWX permissions, non-recursive. This rule allows us to set chmod 000 in HDFS to keep things clean.

policy_spark_folder.png

  • on /spark-history/*
    • Allow policy for group “hadoop” and user “{OWNER}”, RWX permissions.
    • Deny policy for group “public”, RWX permissions. For any resource in /spark-history, no one has access.
    • Exclude deny policy for group “hadoop” and user “{OWNER}”, RWX permissions. For any resource in /spark-history, only users of the “hadoop” group (such as spark), and owner of the resources have RWX permissions.

policy_spark_content.png

Note – “public” group is a convention meaning anyone.
Note – same approach applies for /spark2-history

With the policies we defined, anyone can create a folder, but only the owner of that folder can access the data inside, or write data inside. We have now secured our /spark-history and /spark2-history folders.

Example – /mr-history

Let’s have a look at the folders structure for /mr-history

/mr-history with mapred:hadoop:777

/mr-history/done with mapred:hadoop:777
/mr-history/tmp with mapred:hadoop:777

/mr-history/tmp/<user> with <user>:hadoop:770

/mr-history/done/<year> with mapred:hadoop:770
/mr-history/done/<year>/<month> with mapred:hadoop:770
/mr-history/done/<year>/<month>/<day> with mapred:hadoop:770
/mr-history/done/<year>/<month>/<day>/<folderID> with mapred:hadoop:770

And in that last folder, we only have files (usually an XML file and a JHIST file for each job) with <user running the job>:hadoop:770.

When a job is launched by myUser, a folder (if not already existing) named myUser is created in /mr-history/tmp and this folder will be used to store metadata files about the running job. Once the job is completed, the user mapred will move the file from that folder into the corresponding /mr-history/done subfolder.

In conclusion, we can set chmod 000 on /mr-history and define the following rules:

  • on /mr-history/done
    • grant RWX permissions to group “hadoop” and user “activity_analyzer” (if you’re using SmartSense to provide statistics on your cluster)
  • on /mr-history/tmp, non-recursive
    • grant RWX permissions to group “public” so that anyone can create a folder
  • on /mr-history/tmp/*
    • grant RWX permissions to group “hadoop”, users “activity_analyzer” and “{OWNER}”
    • deny RWX permissions to group “public”
    • exclude deny RWX permissions to group “hadoop”, users “activity_analyzer” and “{OWNER}”

Note – it’s quite similar to the /spark-history approach.

Example – /user

For that specific folder, we know that no one except HDFS superusers should be able to create a folder in /user, and every folder in /user should be something like /user/<user> with permissions <user>:hdfs:700 (just like usual home directories).

Then, we just have to set chmod 000 on the /user folder and add an “allow” policy:

  • on /user/{USER}
    • allow policy granting RWX permissions to user “{USER}”

But we also have a particular situation to manage: the share lib folder of Oozie, by default, is located at /user/oozie/share/lib. We need to allow RX access to that folder to anyone launching Oozie jobs:

  • on /user/oozie/share/lib
    • allow policy granting RX access to group “public”
    • allow policy granting RWX access to group “hadoop”

Since /user is set with chmod 000 we are sure no one can go inside that folder unless an allow policy is created.

Example – /app-logs

Structure of /app-logs is very similar to what we’ve seen so far:

/app-logs with yarn:hadoop:777
/app-logs/<user> with <user>:hadoop:770

We’re adding the following rules:

  • on /app-logs, non-recursive
    • grant RWX permissions to group “public” so that anyone can create a folder (when a user is launching a job for the first time on the cluster, the user needs permission to create the folder)
  • on /app-logs/*
    • grant RWX permissions to group “hadoop” and user “{OWNER}”
    • deny RWX permissions to group “public”
    • exclude deny RWX permissions to group “hadoop” and user “{OWNER}”

We can now set chmod 000 on /app-logs to secure that folder.

Example – /tmp

The /tmp folder is similar to what we did so far, but with an exception: there is a /tmp/hive folder used by Hive to store temporary data when Hive queries are executed. Because of HIVE-18287, this folder needs to have chmod 733 or above (at HDFS ACLs level).

In /tmp/hive, we have folders like /tmp/hive/<user> with <user>:hdfs:700.

To summarize, we need to:

  • Allow users creating folders in /tmp
  • Keep chmod 733 on /tmp/hive
  • Allow users creating folders in /tmp/hive/
  • Only allow owners to modify data in /tmp/*

As you can see, it’s going to be difficult to manage because of the embedded folder used by Hive. The best approach is to change the Hive’s configuration to change the location of the scratch directory. What I suggest is:

hive.exec.scratchdir=/apps/hive/tmp

This way you can easily define policies on that particular folder by following the same approach we did so far on the other folders. Once this is done, it’s also easy to have the same approach to secure the /tmp folder:

  • on /apps/hive/tmp, non-recursive
    • grant RWX permissions to group “public” so that anyone can create a folder (when a user is launching a job for the first time on the cluster, the user needs permission to create the folder)
  • on /apps/hive/tmp/*
    • grant RWX permissions to group “hadoop” and user “{OWNER}”
    • deny RWX permissions to group “public”
    • exclude deny RWX permissions to group “hadoop” and user “{OWNER}”
  • on /tmp, non-recursive
    • grant RWX permissions to group “public” so that anyone can create a folder
  • on /tmp/*
    • grant RWX permissions to group “hadoop” and user “{OWNER}”
    • deny RWX permissions to group “public”
    • exclude deny RWX permissions to group “hadoop” and user “{OWNER}”

Note – if you are using Spark/Spark2, don’t forget that it relies on Hive’s configuration files. In case you’re using an Ambari managed Hortonworks cluster, you would have to add custom properties in “Custom spark-hive-site-override” for both Spark and Spark2 services.

Note – if you are using Oozie and have workflows using Spark actions, then you’d have to leverage the action configuration feature in Oozie. You can have a look here. In the case of the Hortonworks distribution, by default, you have

oozie.service.HadoopAccessorService.action.configurations=*=action-conf

And you have the following files/folders:

/etc/oozie/conf/action-conf/hive.xml
/etc/oozie/conf/action-conf/hive/hive-site.xml
/etc/oozie/conf/action-conf/hive/tez-site.xml
/etc/oozie/conf/action-conf/hive/atlas-application.properties

The hive-site.xml is automatically copied from /etc/hive/conf folder when Oozie server is restarted. But you need to do a change to have this data loaded for Spark actions. You can create spark and spark2 folders and create a symbolic link to the hive-site.xml file.

Conclusion

Objective of this post was to give a quick overview of how Ranger is working for HDFS and what you need to consider if you want to secure your cluster. As you saw in this post, if you also want to prevent rogue users accessing data by changing the behavior of an application, you need to set few rules in order to secure the folders having a default chmod 777 at the HDFS ACLs level.

This post is certainly not exhaustive but should give you an idea of what you can do. As usual, feel free to comment / ask questions, and thanks for reading thus far.

Hue/Oozie causing CPU overload

Quick post about an issue I faced today on one of the clusters: I received an alert about abnormal high CPU use on one of the master nodes. A quick htop gave me the culprit: the Oozie server hosted on this node.

I looked at the logs and didn’t see anything unusual in the oozie.log file. But by looking at the oozie-audit.log file, I noticed a very large number of requests being issued by Hue and proxifying users:

# sed 's/.* DoAs user \[\(.*\)\] Request .*/\1/g' oozie-audit.log | sort | uniq -c | sort -nr | head
279616               jdoe
27902                zoaks
16018                mparisien
14025                gkass
12211                lzastrow
9730                 sleaf
7460                 sladwig
6048                 vespinoza
5815                 lkonen
2862                 lrayburn

It appeared my John Doe was issuing more than 5 requests per second to the Oozie server using Hue causing the high CPU consumption.

When being on the Oozie dashboards in Hue with your browser, there is an auto-refresh feature issuing requests to the Oozie server every 5 seconds to get the latest statuses. Problem is if a user is opening multiple tabs in the browser, it can lead to a lot of requests. Now… if the user forgets to close the browser and remains connected, you have a nice DDoS-like situation.

By looking at the Hue documentation, I thought I found a solution with the below:

[desktop]
[[auth]]
# Users will automatically be logged out after ‘n’ seconds of inactivity.
# A negative number means that idle sessions will not be timed out.
idle_session_timeout=-1

I tried setting this value to 600 seconds (10 minutes) to get inactive people automatically logged out. It works fine when you’re staying on a static page in Hue but not if you’re staying on the Oozie pages… the auto-refresh is keeping you “active” even though you’re not.

The only option I found is to use the ttl (time-to-live) parameter to define when the cookie will expire and force the user to authenticate again. The issue with this parameter is that it’ll log out the user even though the user is active and actually using Hue.

To avoid any unpleasant user experience, you can set this parameter to something like 28800 (8 hours):

[desktop]
[[auth]]
ttl=28800

It does not solve the original issue because you’ll keep your Oozie server receiving a lot of requests for 8 hours but, at least, you limit how long this situation can last.

The best solution, assuming you have installed multiple Oozie instances for high availability behind a load balancer, is to configure the LB to extract the user name from the requested URL (&doAs=<user>) and to throttle the number of requests issued by a single user. That will provide the best protection without impacting the user experience. Look at your LB’s documentation to configure such a solution.

Using the SmartSense Activity Explorer for cluster reporting

In the Hortonworks Data Platform, there is SmartSense, a service that analyzes cluster diagnostic data, identifies potential issues, and recommends specific solutions and actions.

SmartSense is made of multiple components and one of the component is the Activity Explorer which is a customized Zeppelin notebook used to access and display the data collected by the Activity Analyzer instances stored in an HBase instance and accessed using Phoenix.

The Activity Explorer gives access to a lot of very useful data when administrating a cluster. For an exhaustive list, have a look to the documentation here.

By default, this Activity Explorer / Zeppelin is configured with the Phoenix interpreter only. The idea of this post is to describe how we can add the JDBC interpreter (or any other interpreter) to allow administration teams using this specific Zeppelin instance as a more general tool for cluster reporting.

One might wonder why I’m not using the Zeppelin service available in the HDP stack. The reason is quite simple: usually, the Zeppelin instances would be deployed on the edge nodes (to be used by the project teams / users of the cluster) while the Activity Explorer would be deployed on an administration node and only accessed by the administrators of the cluster. The idea is to keep Zeppelin instances separated based on the purpose.

First step is to package the JDBC interpreter. Go on the node where you installed the Zeppelin service (where all the interpreters are installed) – not the node where you installed the Activity Explorer component.

cd /usr/hdp/current/zeppelin-server/interpreter/
zip -r jdbc.zip jdbc/

And deploy this ZIP file on the node where is installed the Activity Explorer:

cd /usr/hdp/share/hst/activity-explorer/interpreter/
unzip jdbc.zip

Restart the Activity Explorer component so that the interpreter is available for configuration.

Go to the interpreter configuration page and add a new one, selecting the JDBC type. Configure the interpreter as needed based on your cluster (you can check the configuration you set for this interpreter in the Zeppelin service). In particular, you’ll need:

zeppelin.jdbc.auth.type=KERBEROS
zeppelin.jdbc.principal=<principal of the activity explorer>
zeppelin.jdbc.keytab.location=<keytab of the activity explorer>
hive.proxy.user.property=hive.server2.proxy.user

Note: do not use _HOST in the principal name, use the host FQDN instead.

I also strongly recommend you to configure SSL on the Activity Explorer as well as configuring proper authentication/authorization mechanisms. You can do all that through Ambari as you’d do for the Zeppelin service (have a look at the documentation here).

Since the Activity Explorer account is going to proxy your requests to Hive through the JDBC interpreter, you need to add the proper proxy rules:

hadoop.proxyuser.activity_explorer.groups=<administrator group>
hadoop.proxyuser.activity_explorer.hosts=<activity explorer host>

And you’ll have to restart the appropriate services.

If you stop here and restart the Activity Explorer component, you’ll loose your JDBC interpreter configuration because all of the interpreter configuration of the Activity Explorer is managed by Ambari and reset at each component restart. To prevent the loss of your configuration, you need to copy the content of the file:

/etc/smartsense-activity/conf/interpreter.json

(content of this file has been updated by the Activity Explorer after you added the JDBC interpreter)

And paste this content in Ambari / SmartSense / Advanced / Advanced activity-zeppelin-interpreter. This way, your configuration will remain the same.

Note: keep in mind that all this procedure might have to be done again after a SmartSense upgrade since it’s not the default deployment.

You’re now all set! If you’re wondering what can be done with the JDBC interpreter to enhance the cluster administration tasks… the first thing I can recommend is to create Hive tables on top of the Ranger audits stored in HDFS so that you can create long term reports based on all the cluster audits (if you’re using Solr for the Ranger audits, this data is only stored for a short period of time, default is 90 days). Creating Hive tables on top of the data sitting in HDFS can really be useful if you have compliance/security teams looking for audits reporting.

You could also use the JDBC interpreter to directly access the data in the database backend used for some services like Ambari, Ranger, Hive, etc. It can provide interesting data to build useful reports.

As always, thanks for reading, and feel free to ask questions / leave a comment.

NiFi 1.7+ – Terminate threads

One of the new features coming with NiFi 1.7.0 is the possibility to terminate stuck threads directly from the UI. Before this release, when you had a processor getting stuck (like a custom processor with a deadlock) you had no option but to restart NiFi… and that’s not really great in a multi-tenant setup.

Let’s see this new feature with an example: I’m using a GenerateFlowFile and, using the debug mode of my IDE, I’m going to simulate an issue with the thread by adding a break point in the onTrigger() method of the processor.

Screen Shot 2018-07-02 at 2.56.57 PM.png

When the processor is running, we can see the number of threads currently used by the processor (top right). If the number of tasks is not increasing (and equal to 0 after 5 minutes) and the number of threads is constant, it probably means you have a stuck thread. In this case, it is always recommended to do a thread dump of NiFi to see what is going on. To do that:

./bin/nifi.sh dump /tmp/thread-dump.txt

If we look at the content of the generated file and looking for my GenerateFlowFile processor, we can see something like:

"Timer-Driven Process Thread-5" Id=57 RUNNABLE (suspended)
 at org.apache.nifi.processors.standard.GenerateFlowFile.onTrigger(GenerateFlowFile.java:210)
 at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
 at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
 at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
 at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Number of Locked Synchronizers: 1
 - java.util.concurrent.ThreadPoolExecutor$Worker@2388307

In this case, the thread is shown as “suspended” at the line where I put my break point in my IDE.

Note that having a stuck thread is usually a symptom of a badly designed processor or an underlying issue with the code dependencies of the processor. Having the thread dump can help locating and fixing the issue.

If trying to stop the processor:

Screen Shot 2018-07-02 at 3.02.23 PM.png

I’ll now see two threads being used by the processor:

Screen Shot 2018-07-02 at 3.03.09 PM

The processor is now in the process of being stopped but you won’t be able to update or restart the processor until it is actually stopped. However, if the initial thread is stuck, you’ll remain in this state forever (and, eventually, you’ll have to restart NiFi if you’re running a version below 1.7.0).

With NiFi 1.7.0, you now have the possibility to terminate the thread (meaning you don’t have to wait for the thread to be completely stopped):

Screen Shot 2018-07-02 at 3.06.58 PM

If I terminate the thread, here is what I’ll see:

Screen Shot 2018-07-02 at 3.08.07 PM

Meaning there is no more active thread and there is one thread being terminated.

Under the hood, the framework is issuing an “interrupt” for the thread and will perform a “reload” of the processor meaning there is a new instance of the processor class being created. This allows the old class to eventually shut down gracefully, close connections, etc. However this needs to be used with care: if the class is maintaining some values in internal variables, this information will be lost in the process (I’m not talking about state information that can be saved at framework level by the processors).

An interrupt is an indication to a thread that it should stop what it is doing and do something else. It’s up to the programmer to decide exactly how a thread responds to an interrupt, but it is very common for the thread to terminate. However if, for some reasons, the thread does not respond to interrupt, then you will keep the thread as being terminated forever (in my above example, I can see the thread staying in the state of being terminated because of my break point).

You can now update the configuration and start again the processor (even if a thread is being terminated). Once the thread is terminated, you’ll get back to a nominal situation:

Screen Shot 2018-07-02 at 4.54.51 PM.png

If you don’t want to setup a debug environment, you can simulate a stuck thread with the following ExecuteScript processor:

Screen Shot 2018-07-02 at 4.56.49 PM.png

If starting/stopping the processor:

Screen Shot 2018-07-02 at 6.02.37 PM.png

And I’ll get the following message when terminating the thread:

Screen Shot 2018-07-02 at 6.02.58 PM.png

You also can use the DebugFlow processor with “@OnStopped Pause Time” set to something like “5 min” and “Ignore Interrupts When Paused” set to “true”:

Screen Shot 2018-07-02 at 6.49.29 PM.png

That’s it for this post! This new feature is really nice in a multi-tenant environment where you can’t afford a restart of the service. If you’re in a situation where you need to use this feature, remember to take a thread dump before actually terminating the thread. This will be really useful to investigate the issue. Also, remember that terminating a thread is not a “normal” operation, it means there is something wrong somewhere and it could very likely happen again.

Thanks for reading this post and, as always, feel free to comment / ask questions!

NiFi 1.7+ – XML Reader/Writer and ForkRecord processor

Starting with NiFi 1.7.0 and thanks to the work done by Johannes Peter on NIFI-4185 and NIFI-5113, it’s now possible to use an XML reader and writer in the Record processors to help you processing XML data. Before that, you had few options requiring a bit of additional work to get things working (see here).

I won’t go into the details because the reader/writer are really well documented (have a look at the additional details for examples):

Also, with NIFI-4227, a ForkRecord processor is now available to “explode” your data when you have one or multiple embedded arrays in your data set and you need to normalize your data. Have a look at the ForkRecord documentation (additional details) for examples.

Example – Introduction

Let’s go through an example using the XML Reader and the ForkRecord processor. I assume I’m receiving XML data with the following schema:

Screen Shot 2018-06-27 at 10.39.20 AM.png

And here is a dummy file I’m receiving that I’ll use for this example:

Screen Shot 2018-06-27 at 2.15.28 PM.png

The corresponding Avro schema can be found here. Defining the Avro schema corresponding to your data is the most “difficult” part but once it’s done, everything else is really easy with the Record processors. You can get some help using the InferAvroSchema processor but this should only be used to get an initial version of your Avro schema when you start developing your workflows (it should not be used as a processor used in your production workflows).

I’m going to quickly explain the following workflow (template available here):

Screen Shot 2018-06-27 at 2.24.43 PM.png

Example – XML to JSON conversion

The GenerateFlowFile is used to generate my XML data and to send the content to the Record processors I’m using. The upper part of the workflow is just a ConvertRecord processor to perform the XML to JSON conversion thanks to the schema. The other parts of the workflows are used to extract each “object” of my schema (customer, address, account and transaction).

Here are the configuration details for the XML to JSON conversion. The GenerateFlowFile is very simple:

Screen Shot 2018-06-28 at 2.25.23 PM

Note the attribute ‘schema.name’ set with the value of the schema I added in my Avro schema registry controller service:

Screen Shot 2018-06-28 at 2.26.24 PM.png

The XML Reader is configured to get the schema name from the ‘schema.name’ attribute of the processed flow files:

Screen Shot 2018-06-28 at 2.27.21 PM.png

The JSON writer does not contain any specific configuration, it’ll write the data using the schema inherited at the reader level:

Screen Shot 2018-06-28 at 2.28.22 PM

The ConvertRecord processor can now be configured for the XML to JSON conversion:

Screen Shot 2018-06-28 at 2.29.20 PM.png

That’s it! You now have a very easy and powerful way to perform your XML to JSON conversion. Here the flow file content before the processor:

Screen Shot 2018-06-27 at 2.15.28 PM

Here is the JSON generated by the processor:

Screen Shot 2018-06-28 at 2.31.23 PM

If you need to manipulate your records to go from one schema to another, you can use the UpdateRecord processor.

Note – using the XML Reader/Writer and Record processors, you can expect performances as good as the best options presented in my previous post about XML data processing.

Example – ForkRecord processor

The goal here is to extract the data contained in the arrays into separated CSV files. From one XML file, I want to generate 4 CSV files that I could use, for instance, to send data into database tables.

Let’s start with the first one regarding customer data. I’m adding a ‘customer’ Avro schema in the Avro schema registry:

Screen Shot 2018-06-28 at 2.42.58 PM.png

I’m using an UpdateAttribute, to set the name of the ‘output.schema’ that I want to be used by the CSV Record Writer. Then, I’m just using a ConvertRecord processor to process from XML to JSON and I only keep the root level fields (customer data). Here is the configuration of my CSV writer:

Screen Shot 2018-06-28 at 2.45.23 PM.png

Here is the output when processing my XML data:

Screen Shot 2018-06-28 at 2.46.32 PM.png

Now, I want to extract data contained in arrays and that’s where the ForkRecord is useful. To use it, I need to specify a custom property containing the Record path to the array that the processor should process, then you have two modes for the processor:

  • Split – that will preserve the input schema but will create one flow file per element contained in the array.
  • Extract – that will generate flow files using the element of the arrays with the possibility to include the parent fields up to the root level in case you want to keep some fields as keys/identifiers of your arrays elements. Note that you can define which parent fields you want to preserve by setting the appropriate schema with only the fields you want.

In our case we’re using the Extract mode and we are including the parent fields: for the ‘address’ data for instance, we want to keep the customer_id field in the output (that field would be used as a key between the customer data and the address data in the database world). Here is the schema I define in my Avro schema registry for the ‘address’ object:

Screen Shot 2018-06-28 at 2.52.03 PM

I’m using an UpdateAttribute to set the attributes I’m using in my reader and the ForkRecord processor:

Screen Shot 2018-06-28 at 2.55.04 PM

And I’m configuring the ForkRecord processor to use the ‘fork.path’ attribute defining the array of elements to be processed:

Screen Shot 2018-06-28 at 2.55.59 PM.png

Note – if you have multiple Record paths pointing to multiple arrays with elements using the same output schema, you can define as many custom properties as you want.

And here is the output after the ForkRecord processor:

Screen Shot 2018-06-28 at 2.56.40 PM

Here are the Record paths I’m setting for the other type of objects I want to extract:

  • Account – fork.path = /accounts/account
  • Transaction – fork.path = /accounts/account[*]/transactions/transaction
    In this case, I need to specify account[*] because I want to access all the transactions of all the elements in the account array. But I could use the Record path features to only access some specific accounts based on some predicates.

And here the output I get (after defining the appropriate schema in my schema registry):

  • Account data (I only keep the customer_id from the parent fields):

Screen Shot 2018-06-28 at 3.02.14 PM.png

  • Transaction data (I only keep the customer_id and account_id from the parent fields):

Screen Shot 2018-06-28 at 3.03.55 PM

Conclusion

The XML Reader & Writer are a really nice addition to NiFi if you need to process XML data and it will make your existing workflows much simpler! You really have to look at the Record processors and concepts as soon as you’ve a use case manipulating data complying to a schema.

The ForkRecord processor is a new processor that you can use if you need to manipulate schema oriented data containing a lot of arrays. That can be useful in case you need to normalize data before ingestion into databases.

Thanks for reading this post and, as always, feel free to comment and/or ask questions!