Note: for the new XML processing features available in NiFi 1.7+, have a look at this post.
I recently had to work on a NiFi workflow to process millions of XML documents per day. One of the steps is the conversion of the XML data into JSON. This raises the question of performance, and I will briefly share my observations in this post.
The two most natural approaches to convert XML data with Apache NiFi are:
- Use the TransformXML processor with an XSLT file
- Use a scripted processor or use a custom Java processor relying on a library
There are a few XSLT files available on the internet providing a generic way to transform any XML into a JSON document. That's really convenient and easy to use. However, depending on your use case, you might need specific features.
In my case, I'm processing a lot of XML files based on the same input schema (XSD) and I want the output to be compliant with a single Avro schema (in order to use the record-oriented processors in NiFi). The main issue is forcing the generation of an array when there is only one element in your input.
XSLT approach
Example #1:
<MyDocument>
  <MyList>
    <MyElement>
      <Text>Some text...</Text>
      <RecordID>1</RecordID>
    </MyElement>
    <MyElement>
      <Text>Some text...</Text>
      <RecordID>2</RecordID>
    </MyElement>
  </MyList>
</MyDocument>
This XML document will be converted into the following JSON:
{ "MyDocument" : { "MyList" : { "MyElement" : [ { "Text" : "Some text...", "RecordID" : 1 }, { "Text" : "Some text...", "RecordID" : 2 } ] } } }
Example #2:
However, if you have the following XML document:
<MyDocument>
  <MyList>
    <MyElement>
      <Text>Some text...</Text>
      <RecordID>1</RecordID>
    </MyElement>
  </MyList>
</MyDocument>
The document will be converted into:
{ "MyDocument" : { "MyList" : { "MyElement" : { "Text" : "Some text...", "RecordID" : 1 } } } }
Force array
And here the problems start... because we no longer have the same Avro schema. That is why I recommend using the XSLT file provided by Bram Stein here on GitHub. It provides a way to force the creation of an array. To do that, you need to insert an attribute into your XML input file. The attribute to insert is
json:force-array="true"
But for this attribute to be correctly interpreted, you also need to declare the corresponding namespace:
xmlns:json="http://json.org/"
In the end, using ReplaceText processors with regular expressions, you need to produce the following input (for example #2):
<MyDocument xmlns:json="http://json.org/">
  <MyList>
    <MyElement json:force-array="true">
      <Text>Some text...</Text>
      <RecordID>1</RecordID>
    </MyElement>
  </MyList>
</MyDocument>
And this will give you:
{ "MyDocument" : { "MyList" : { "MyElement" : [ { "Text" : "Some text...", "RecordID" : 1 } ] } } }
And now I do have the same schema describing my JSON documents. Conclusion: you need to use regular expressions to add a namespace declaration to the first tag of your document, and to add the JSON array attribute to every tag wrapping elements that should be part of an array.
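As an illustration, the two ReplaceText processors for example #2 could be configured along these lines (the search values are specific to this sample; MyDocument and MyElement would obviously need to be adapted to your own tag names):

ReplaceText #1 - declare the JSON namespace on the root tag
Search Value: <MyDocument>
Replacement Value: <MyDocument xmlns:json="http://json.org/">
Replacement Strategy: Literal Replace
Evaluation Mode: Entire text

ReplaceText #2 - force the array on the wrapping tag
Search Value: <MyElement>
Replacement Value: <MyElement json:force-array="true">
Replacement Strategy: Literal Replace
Evaluation Mode: Entire text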
Java approach
Now, let's assume you're not afraid of using scripted processors or developing your own custom processor. Then it's really easy to have a processor doing the same thing using a Java library like org.json (note that this library is *NOT* Apache friendly in terms of licensing, which is why the following code cannot be released with Apache NiFi). Here is an example of a custom processor doing the conversion. And here is a Groovy version for the ExecuteScript processor.
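To give an idea of what these do, here is a minimal sketch of the conversion at the core of both versions (the NiFi flow file handling is omitted, and the org.json library must be on the classpath):

import org.json.JSONObject;
import org.json.XML;

public class XmlToJsonSketch {

    public static void main(String[] args) {
        String xml = "<MyDocument><MyList><MyElement>"
                + "<Text>Some text...</Text><RecordID>1</RecordID>"
                + "</MyElement></MyList></MyDocument>";
        // XML.toJSONObject() parses the whole document in memory
        // and converts it into a JSONObject.
        JSONObject json = XML.toJSONObject(xml);
        // Pretty-print the result with an indentation of 2 spaces.
        System.out.println(json.toString(2));
    }
}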
What about arrays with this solution? Guess what... it's kind of similar: you have to use a ReplaceText processor before and after the conversion to ensure that arrays are arrays in the JSON output, whatever the number of elements in your input. You might also have to do some other transformations, like removing the namespaces or replacing empty strings ("") with null values (by default, everything will be converted to an empty string although you might want a null record instead).
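For instance, the empty string replacement can be done with a ReplaceText processor configured along these lines (this assumes pretty-printed JSON and no legitimate empty string values in your data):

Search Value: : ""
Replacement Value: : null
Replacement Strategy: Literal Replace
Evaluation Mode: Entire text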
To force arrays, the easiest approach is to double every tag that should be converted into an array. With example #2, I transform my input to have:
<MyDocument>
<MyList>
<MyElement /><MyElement>
<Text>Some text...</Text>
<RecordID>1</RecordID>
</MyElement>
</MyList>
</MyDocument>
It’ll give me the following JSON:
{
"MyDocument" : {
"MyList" : {
"MyElement" : [ "", {
"Text" : "Some text...",
"RecordID" : 1
} ]
}
}
}
And then I can use another ReplaceText processor to remove the unwanted empty strings created by the conversion.
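Again as an illustration, the pre- and post-processing could look like this (the tag name is the one from the example; the second expression assumes the empty string never appears as a legitimate value):

Before the conversion - duplicate the wrapping tag
Search Value: <MyElement>
Replacement Value: <MyElement /><MyElement>
Replacement Strategy: Literal Replace
Evaluation Mode: Entire text

After the conversion - remove the empty strings added by the doubled tags
Search Value: "",\s*
Replacement Value: (empty string)
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text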
Conclusion: with both approaches you'll need to be a bit intrusive with your data to get the expected results. What about performance now?
Benchmark
I removed the ReplaceText processors from the equation as I roughly need the same amount of regular expression work in both cases. I only want to focus on:
- the TransformXML processor using the XSLT file provided above
- the custom Java processor I provided above
- the Groovy version that can be used with the ExecuteScript processor
I'll compare the performance of each option using inputs of different sizes (data generated using a GenerateFlowFile processor) with the default configuration (one thread, no change to run duration, etc.) on my laptop.
Method: I'm generating as much data as possible (it's always the same file during a single run) using the GenerateFlowFile processor. I wait at least 5 minutes to reach a constant processing rate, and I take the mean over a 5-minute window of constant processing.
For each run, I'm only running the GenerateFlowFile processor, one of the three processors I'm benchmarking, and an UpdateAttribute processor (used only to drop the data).
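For reference, a GenerateFlowFile configuration along these lines reproduces this setup (one configuration per input size, with the sample document pasted in):

Custom Text: <the sample XML document for the targeted size>
Data Format: Text
Unique FlowFiles: false
Batch Size: 1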
The input data used for the benchmark is a fairly complex XML document with arrays of arrays, lots of elements in the arrays, deeply nested records, etc. To reduce the input size, I'm not changing the structure but only removing elements from the arrays. In other words: the schema describing the output data remains the same for each run.
Note that the custom Java/Groovy option loads the full XML document in memory. To process very large XML documents, a streaming approach with another library would certainly be better suited.
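As a pointer, Java's built-in StAX API is one option for such a streaming approach: it pulls parsing events one at a time, so memory use stays roughly constant regardless of the document size. A minimal sketch (building the JSON output from the events is left out):

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingSketch {

    public static void main(String[] args) throws Exception {
        String xml = "<MyList><MyElement><RecordID>1</RecordID></MyElement></MyList>";
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        // The document is never fully loaded: events are consumed
        // one at a time as the parser moves through the input.
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.println("start element: " + reader.getLocalName());
            }
        }
        reader.close();
    }
}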
Here are the results with input data of 5KB, 10KB, 100KB, 500KB and 1000KB. The graph below gives the number of XML files processed per second based on the input size for each solution.
It's clear that the custom Java processor is the most efficient one. The XSLT option is really nice when you want to do very specific transformations, but it can quickly get slow. Using a generic XSLT file for XML-to-JSON transformation is easy and convenient but won't be the most efficient option.
We can also notice that the Groovy option is a little less efficient than the Java one, but that's expected. Nevertheless, the Groovy option provides pretty good performance and does not require building and compiling a custom processor: everything can be done directly from the NiFi UI.
To improve performance, it's then possible to play with the "run duration" parameter and to increase the number of concurrent tasks. It's actually quite easy to reach the I/O limits of the disks. Using a NiFi cluster and multiple disks for the content repository, it's really easy to process hundreds of millions of XML documents per day.
If we display the performance ratio between the XSLT solution and the Java-based solution as a function of the file size, we have:
We can see that with very small files, processing with the Java-based processor is about 13x more efficient than the XSLT approach, while with files over 100KB, the Java solution is about 26x more efficient. That's because the NiFi framework does a few things before and after a flow file is processed. When processing thousands of flow files per second, this creates a small overhead that explains the difference.
XML Record Reader
For a few versions now, Apache NiFi has provided record-oriented processors. They offer very powerful means to process record-oriented data. In particular, they allow users to process batches of data instead of doing "per-file" processing, which provides very robust and high-rate processing. While I'm writing this post, there is no reader for XML data yet. However, there is a JIRA for it and it would provide a few interesting features:
- By using a schema describing the XML data, it’d remove the need to use ReplaceText processors to handle the “array problem”.
- It'd give the possibility to merge XML documents together to process much more data at once, providing even better performance.
This effort can be tracked under NIFI-4366.
As usual, feel free to post any comment/question/feedback.
https://gist.github.com/pvillard31/408c6ba3a9b53880c751a35cffa9ccea
Hi, we are also trying to convert XML data into Avro using XSDs. We get multiple XML message types based on different XSD definitions. Is there a way we can transform XML data using XSDs saved in the schema registry, without writing an XSLT for each message type? The TransformXML processor needs custom XSLTs and EvaluateXPath needs manual definition of attributes. Why can't we apply XSDs to transform XML data in NiFi?
Hi Srini. With the upcoming version of NiFi 1.7.0 (it should be released soon), there will be an XML reader & writer allowing you to use the *Record processors with XML data, assuming you can provide the Avro schema corresponding to your data. That will be much easier and more efficient (you can already use it if you build the master branch). There is no way, at the moment, to use XSDs, but that would be a nice improvement. Feel free to file a JIRA on the NiFi project (https://issues.apache.org/jira/projects/NIFI).
Hi, I am working on transforming XML and loading the values into a database table. How can I do that? Any idea/suggestion/articles will be greatly appreciated. Thank you.
Hi, the easiest way is to use NiFi 1.7.0 (to be released tomorrow), which will contain an XML reader/writer allowing you to use the Record processors. In particular, you'll be able to use the PutDatabaseRecord processor in combination with the XML reader to read the data and send the values into a database. Obviously, if you have complex structures, you might need to use additional record processors to transform your data first.
Hi @pvillard31, thanks for your reply. It means NiFi 1.7.0 is coming with lots of processors which will make complex tasks easy. Really eagerly waiting for the new release 🙂 Thank you.
Thanks, Pierre, for the update. It is critical for our project and we are eagerly waiting to try it. It is a much-needed thing for real-time processing of XML event message data.
Best Regards,
Srini Alavala
Just to be clear – there is no issue at all with processing XML data at the moment using the approaches described in this post. I've successfully implemented workflows processing millions of XML files per day and it's working completely fine in production environments. My comment regarding NiFi 1.7.0 and the XML reader/writer just means that things will be much easier and workflows will require fewer processors to achieve the same goal. Nevertheless, everything can already be done with versions below 1.7.0.
Hi pvillard,
I am using NiFi 1.9.2 in a Docker environment and cannot find the XML Reader/Writer in the processor list. But from your comments, they should be available since NiFi 1.7.x. Do you know the reason? Are they not available anymore?
The XML Reader/Writer are not processors but controller services that can be referenced in processors such as ConvertRecord. Hope this helps.
Hi
I decided to create a generic solution for converting XML files into tables based on an XSD file.
My code has limitations and does not handle all XSD styles, but if you use generic styling everything works well.
The concept is simple: I process the XSD to identify all branches that require their own table; if a branch does not, its elements and attributes become part of the parent table.
You can read my article series and access the code on GitHub:
http://max.bback.se/index.php/2018/06/30/xml-to-tables-csv-with-nifi-and-groovy-part-2-of-2/
https://github.com/maxbback/nifi-xml
/Max
Hi Max,
Nice job, thanks for sharing!
Hello,
My NiFi version is 1.7.0.3.2.0.0-520.
I have some problem with the TransformXML processor. I have a sample XML like this:
<root>
  <set>
    <record>one</record>
  </set>
  <set>
    <record>one</record>
    <record>two</record>
    <record>three</record>
  </set>
</root>
When I used the ListFile processor to look for new files in my directory and sent the XML file from this processor to the TransformXML processor, I got an error. I used the library and settings from https://github.com/bramstein/xsltjson
2018-11-02 12:17:54,958 ERROR [Timer-Driven Process Thread-20] o.a.n.processors.standard.TransformXml TransformXml[id=9ae1b1ab-0166-1000-ffff-ffffabc6df1a] Unable to transform StandardFlowFileRecord[uuid=a60d9978-068f-4025-8ffd-97aea1bf9165,claim=,offset=0,name=n1.xml,size=0] due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from TransformXml[id=9ae1b1ab-0166-1000-ffff-ffffabc6df1a]: java.io.IOException: net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.: org.apache.nifi.processor.exception.ProcessException: IOException thrown from TransformXml[id=9ae1b1ab-0166-1000-ffff-ffffabc6df1a]: java.io.IOException: net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
org.apache.nifi.processor.exception.ProcessException: IOException thrown from TransformXml[id=9ae1b1ab-0166-1000-ffff-ffffabc6df1a]: java.io.IOException: net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2906)
at org.apache.nifi.processors.standard.TransformXml.onTrigger(TransformXml.java:236)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
at org.apache.nifi.processors.standard.TransformXml$2.process(TransformXml.java:263)
at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2885)
… 12 common frames omitted
Caused by: net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:460)
at net.sf.saxon.event.Sender.send(Sender.java:171)
at net.sf.saxon.Controller.transform(Controller.java:1692)
at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:547)
at net.sf.saxon.jaxp.TransformerImpl.transform(TransformerImpl.java:179)
at org.apache.nifi.processors.standard.TransformXml$2.process(TransformXml.java:261)
… 13 common frames omitted
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1014)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
… 18 common frames omitted
My TransformXML processor has the following setting:
XSLT file name: /home/nifi/xsltjson/conf/xml-to-json.xsl
What is wrong?
I have tried other XML processors like ValidateXml and I have the same problem:
Caused by: net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
Hi, I'd double-check the XML and XSLT data, it sounds like badly formatted data.
If I run it from a shell, everything is OK:
[nifi@4gt-nifi-m1 test]$ /usr/jdk64/jdk1.8.0_112/bin/java -jar ../xsltjson/lib/saxon/saxon9.jar n1.xml ../xsltjson/conf/xml-to-json.xsl
{"root":{"set":[{"record":"one"},{"record":["one","two","three"]}]}}
Can you share the files you're using in a gist? (http://gist.github.com/) – I'll try on my end.
OK, I finally resolved my problem. I used a FetchFile processor between the ListFile and TransformXML processors, and now on exit from TransformXML I get JSON 🙂
I think the FetchFile processor delivers the file content to TransformXML, while ListFile only delivers a reference to the file, which cannot be read by the XSLT.
Thanks for the reply. Now I will continue designing my project.
You're correct, you can have a look at https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/ for more explanations about the list/fetch pattern. You could also consider GetFile depending on your needs.