Last week we held a jam-packed webinar on Hortonworks DataFlow with over 700 registrants, so we were unable to get back to everyone with answers to their questions. We’ve grouped the questions (and answers) below into categories. If you have more questions at any time, we encourage you to check out the Data Ingestion & Streaming track of Hortonworks Community Connection, where an entire community of folks is monitoring and responding to questions.
For those who may have missed the session, you can check out the on-demand webinar and slideshare, and still sign up to attend the remaining webinars in the 7-part Data-In-Motion webinar series.
HDF Use Case Questions
Our plan with NiFi is to use it to ingest data from traditional data sources into our data lake – is that an appropriate use of the technology?
Yes. This is one of the primary use cases. NiFi is very good at “matching impedances” between disparate data sources: getting data into the format that different systems need, and then feeding it to various consumers such as a data lake or streaming applications. Velocity and volume matter, and so does mapping data into the systems that need it, in the format each consumer needs.
Is it possible to transfer data from HIVE to AWS redshift or Azure MSSQL Data warehouse?
There is not a single processor for direct connection to Redshift just yet, but there are a series of processors designed to work with a lot of AWS technologies that feed Redshift, such as Kinesis streams or S3.
Can you schedule the flow to auto run like one would with Coordinator?
By default, processors run continuously; it is not a job-oriented model. Once you start a processor, it keeps running unless you choose to schedule it, for example to run only on an hourly basis (also possible). A source processor, for example, runs continuously once started. Scheduling options are there should you choose to use them, but by default a processor is constantly running. It is also fine for processors to be ‘always running’. The better way to think about it is that they’re always ‘eligible’ to execute.
Does NiFi Support Real Time processing or Streaming? e.g. JMS Queue as source?
Yes. There is an out-of-the-box processor for pulling information out of JMS topics. A lot of the power of NiFi is its ability to tap into the existing Java ecosystem. Many existing protocols, technologies, and languages can be leveraged and wrapped with the power of NiFi, so that you can use all the functions and features the platform provides to build extensions that meet your organization’s formats and needs. Extensibility is key because it’s impossible to know all the proprietary formats an organization may have, so NiFi provides a toolbox that works for common formats but can also be extended to the various needs you may have.
Can NiFi parse Apache log files?
Yes, there are a number of processors that handle various formats for data wrangling and data mapping. We’re happy to hear more ideas on needs to continually make this even easier.
Can we integrate the existing MR/SPARK process with HDF (data Ingestion via HDF)?
NiFi can certainly be used to drive data into and out of systems like HDFS, Spark and many others. In the case of systems like Spark, Storm, and several others, Kafka or HDFS are likely great intermediaries, and NiFi integrates with both very well. A common pattern for streaming apps is to use Kafka. For block-oriented apps, consider integrating by exchanging datasets via HDFS, Hive, HBase, etc.
Can we use NiFi to process / push PCAP (Wireshark) content to Kafka?
You can certainly use NiFi to capture, route, transform, and deliver PCAP data. There are PCAP libraries which can be leveraged by wrapping as a NiFi processor.
How do you decide between NiFi versus Flume and Sqoop?
NiFi supports all Flume use cases, and has a Flume processor out of the box. NiFi supports some capabilities similar to Sqoop’s: check out the GenerateTableFetch processor, which does incremental fetch and parallel fetch against source table partitions. Ultimately what you want to look at is whether you’re solving a specific or singular use case. If so, then any one of the tools that handles it will do. NiFi’s benefits really shine when you consider multiple use cases being handled at once, and when critical flow management features like interactive and live command and control with full data provenance are applicable.
“How Does NiFi Work” Questions
Each processor metric with “5 min” next to it. Is that how often NiFi will refresh those metrics from the processor?
This is configurable. Out of the box the refresh is driven by the UI, which is powered by the same REST API that is available to external systems and developers who want to interface with it. The default refresh interval is 30 seconds, or you can right-click on the canvas to refresh manually. There are also some longer-running metrics that show how the processors have been behaving over the past day, such as the amount of data processed or the number of files processed. In general, what is shown live on the UI is a rolling ‘last five minutes’ view so you understand recent behavior, whereas the metrics show you historical five-minute windows.
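To make the rolling ‘last five minutes’ idea concrete, here is a minimal Python sketch of a sliding-window counter. It is an illustration only, not NiFi’s implementation; the class name RollingWindowCounter, its methods, and the sample timestamps are all hypothetical.

```python
from collections import deque


class RollingWindowCounter:
    """Hypothetical helper: counts files and bytes seen in a sliding time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, byte_count) pairs, oldest first

    def record(self, timestamp, byte_count):
        # Timestamps are assumed to arrive in non-decreasing order.
        self._events.append((timestamp, byte_count))

    def snapshot(self, now):
        # Drop events that have aged out of the window, then summarize the rest.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        files = len(self._events)
        total_bytes = sum(size for _, size in self._events)
        return files, total_bytes


counter = RollingWindowCounter(window_seconds=300)
counter.record(timestamp=0, byte_count=1000)
counter.record(timestamp=200, byte_count=500)
counter.record(timestamp=400, byte_count=250)
print(counter.snapshot(now=450))  # the event at t=0 has aged out of the window
```

A UI that polls snapshot() every 30 seconds would show recent behavior only, which matches the distinction above between the live view and the historical five-minute windows.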
How is the Performance of NIFI Processors? Is there any comparison matrix of NIFI Performance with Other Processing tools?
Performance depends on what the flow is doing end-to-end. There are provisions in the NiFi architecture to optimize physical data movement by moving only pointers to the blobs of data stored in NiFi’s content repository. NiFi also records all of the metadata, history and content as it changes, so any comparison won’t be apples-to-apples. That said, NiFi’s performance is often said to be quite good, particularly as flows can easily be set up to fully utilize the capabilities of the system in terms of network, CPU, disk, and memory.
How scalable are the internal NiFi repositories – what is the storage/persistence engine? Assume there is a mechanism to prune old or irrelevant audit data?
All of these repositories are pluggable implementations. Out of the box there are some custom-built repositories configured for key use cases. Everything is on disk right now, and the repositories can be spread across multiple volumes to help with throughput and IO.
There are three kinds of repositories: the Content repository, the FlowFile repository, and the Provenance repository. All of them can be spread across multiple volumes. They can also be configured through their properties: how long you want to hold the data, and how much volume you are willing to allocate. Remember, NiFi is not a long-term storage solution, but we do want to provide users with enough context to see how things are working, so it becomes a capital expenditure decision: how much to spend to provide disks and storage.
The Apache NiFi community has a valuable and detailed document covering the internals here https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html.
Does NiFi get all the dependencies (associated processing libraries etc.) for all the 172 out-of-box processors?
Yes, NiFi comes with everything you need. Each release uses a packaging structure called the NAR format that bundles all the dependencies needed and provides class loader isolation. For example, if you have a set of processors that work against a certain set of libraries, you can use them in isolation from another set that may use another library. One case where this comes in handy is XML processing, which may involve many different libraries and dependency versions. This sort of isolation allows you to extend and adapt as needed.
Does all processing happen in memory? In case of failure how does NIFI guarantee data recoverability if its in memory?
Recall that the webinar covered copy on write and pass by reference. The key concept is the notion of passing objects through. When data is being mutated (e.g. encrypted or compressed), the initial data is never altered; instead NiFi creates a new content claim and starts writing to it. Therefore, if anything happens, like the power going out or the processing failing, that processing thread is orphaned and is cleaned up by the internal framework. Meanwhile, given the unit-of-work semantics of how NiFi transactionally processes things, NiFi doesn’t change the original data.
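The copy-on-write and pass-by-reference idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NiFi’s content repository: ContentRepository, FlowFile, and transform are hypothetical names, and the in-memory dictionary stands in for storage that NiFi keeps on disk.

```python
import zlib


class ContentRepository:
    """Toy store: every write creates a new claim; existing claims are never modified."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data):
        claim_id = self._next_id
        self._next_id += 1
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


class FlowFile:
    """Carries a reference (claim id) to the content, not the content itself."""

    def __init__(self, claim_id):
        self.claim_id = claim_id


def transform(repo, flowfile, func):
    """Write func(content) to a new claim, then repoint the flowfile.

    If func raises, the flowfile still references the untouched original claim.
    """
    original = repo.read(flowfile.claim_id)
    new_claim = repo.write(func(original))  # original claim is left as-is
    flowfile.claim_id = new_claim           # pointer swap happens only on success
    return flowfile


repo = ContentRepository()
flowfile = FlowFile(repo.write(b"hello hello hello"))
first_claim = flowfile.claim_id
transform(repo, flowfile, zlib.compress)
print(repo.read(first_claim))                     # original bytes are still intact
print(zlib.decompress(repo.read(flowfile.claim_id)))  # new claim holds the compressed copy
```

Because a failed transformation never touches the original claim, recovery in this sketch amounts to discarding the half-written claim and retrying from the old reference.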
NiFi also supports guaranteed delivery (depending on the protocol and the semantics associated with it), with at-least-once or at-most-once delivery. One of the key principles of how NiFi was designed was to be robust in the face of diversity, and the repositories are designed to support this principle.
NiFi has many processors that interact with SQL databases. It boils down to your specific use case and/or preferred processors, and it is up to the developer whether a particular processor handles both BLOB and CLOB fields.
How is scheduling integrated?
NiFi internally schedules either on:
Data availability in the incoming connection
Timer for periodic tasks
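The two scheduling triggers listed above can be sketched as an eligibility check. This is a simplified illustration, not NiFi’s scheduler; the Processor class, its fields, and the processor names are hypothetical.

```python
from collections import deque


class Processor:
    """Hypothetical stand-in for a processor; not NiFi's actual scheduler API."""

    def __init__(self, name, incoming=None, run_every=None):
        self.name = name
        self.incoming = incoming    # deque modelling the incoming connection, if any
        self.run_every = run_every  # timer period in seconds, if timer driven
        self.last_run = None

    def is_eligible(self, now):
        if self.incoming is not None:
            # Data availability: eligible only when something is queued.
            return len(self.incoming) > 0
        if self.run_every is not None:
            # Timer: eligible on the first check, then once per period.
            return self.last_run is None or now - self.last_run >= self.run_every
        return True  # no constraint configured: always eligible

    def mark_ran(self, now):
        self.last_run = now


queue = deque()
source = Processor("tail-log", run_every=60)         # timer driven
splitter = Processor("split-lines", incoming=queue)  # driven by queued data

print(source.is_eligible(now=0), splitter.is_eligible(now=0))
source.mark_ran(now=0)
queue.append("line 1")  # the source hands data to the downstream connection
print(source.is_eligible(now=30), splitter.is_eligible(now=30))
```

In this sketch a running processor is never “finished”; it is simply eligible or not eligible at a given moment, which is the same framing used earlier for continuously running processors.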
Questions about how HDF relates to HDP
Is it available only from HDP 2.4 onwards?
No. HDF is available independently of HDP release. Downloads available here: http://hortonworks.com/downloads/#dataflow
Why are you not recommending Spark streaming instead of Storm?
Does NiFi require an HDP stack to run? Where does the HDF processes actually run on an HDP cluster? Or is it separate?
No, NiFi does not require an HDP stack to run; it is designed to run standalone. It does easily provide connectivity to HDP. Out of the box it’s one package you open up and one command you execute to get up and running, so dependencies are minimal.
Could I connect the HDF with HDP?
Can NiFi talk to Kerberized HBase in new version?
Yes, NiFi supports communicating with Kerberized HBase via the GetHBase, PutHBaseCell, and PutHBaseJSON processors.
Questions about the live demo shown during HDF webinar
For that small demo where you tailed a log file and you put them to Kafka and then you put them to HDFS – suppose there is an additional step where a big Spark job or Hive job (that cannot run in real time) is supposed to run against that data in HDFS. Do you recommend orchestrating that step still in NiFi? Or do you recommend delegating the scheduling of the Spark or Hive job to Falcon/Oozie?
Can you use NiFi to schedule Spark jobs? Yes, absolutely. But it also depends on what you expect from a job scheduler. Please refer to the following HCC Q&A for more details: https://community.hortonworks.com/questions/60308/is-it-a-good-idea-generally-to-use-nifi-as-a-sched.html
I had a couple of questions on the data-flow live demo:
At the 2nd processor, the lines were split. I might have missed the step on how the lines were split; is it based on some kind of delimiter?
The SplitText processor simply splits the input file into individual lines based on line boundaries. The SplitContent processor, on the other hand, allows you to split on arbitrary patterns that you provide to it, which would be a better fit for more complex use cases.
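As a rough analogy in plain Python (not the processors’ actual code or configuration), the difference between the two kinds of split looks like this. The function names split_lines and split_on_sequence, the "||" delimiter, and the sample records are hypothetical.

```python
def split_lines(content):
    """Split on line boundaries, similar in spirit to splitting a file into lines."""
    return content.splitlines()


def split_on_sequence(content, delimiter):
    """Split on a caller-supplied byte sequence, dropping empty fragments."""
    return [part for part in content.split(delimiter) if part]


log = b"GET /index.html\nGET /about.html\r\nPOST /login"
print(split_lines(log))

records = b"rec1||rec2||rec3"
print(split_on_sequence(records, b"||"))
```

The first function needs no configuration because line endings are well defined; the second only works once you tell it what marks a record boundary, which is why it suits more complex formats.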
In place of moving the files to Kafka queue, can the processor directly place the files on hdfs file server?
Yes, you certainly can place the files directly in HDFS, skipping Kafka. Kafka becomes critical when you have very dynamic updates on the consumer side. With NiFi, you need to add or remove processors as consumers change; with Kafka, you don’t need to do anything to your broker when the consumer side changes.
In the past, advanced middleware used to have orchestration (flow of data between systems), ESB, messaging (Kafka), etc. capabilities. Now we are splitting each of the capabilities into their own offerings/solutions. Understand that these solutions are Open Source and built for high volume, variety, velocity of data. Any comments?
Each of the components has some unique capabilities; I am curious about an example of a single project that addresses them all. In general, by separating these components, you get fine-grained control of each one and can scale them in and out in a very flexible manner.
Slide 17/18 – Would you then have separate NiFi clusters installed on the edge (I guess MiNiFi agent for small devices like IoT devices), the regional infrastructure and also on the core infrastructure?
Yes, correct – you would have separate NiFi clusters, or MiNiFi agents, installed on the edge, then one or more NiFi clusters in your regional centers, and then another one in the core infrastructure.