In this post I want to share how we (my customer and I) succeeded in streaming events directly into CDH 5.8.3 Hive ORC tables using the PutHiveStreaming processor in Apache Nifi 1.0.0. This is one of our goals in a larger project covering the ingestion of streaming data company-wide, and selecting the toolset to work with.
When we started that project we were convinced we should stream data into a NoSQL database like HBase, as that would let us minimize or even eliminate transformations of the incoming event data (JSON). This would be fine for future processing with Spark (Streaming) and performant for data access over JDBC.
But for analytical purposes we were also asked to load the data into a more SQL-friendly, relational environment.
As we already had our Apache Nifi flow in place to ingest into HBase, it seemed logical to branch off our stream of JSON objects into a PutHiveStreaming processor.
But PutHiveStreaming needs AVRO input data…
Didn’t we fix that already?
As we had asked ourselves before in which format we should store data in HBase, we had already experimented with AVRO conversion. Nifi has a cool processor to facilitate exactly that: InferAvroSchema. But the output was a little disappointing, as it was cluttered with documentation entries… The AVRO schema was no longer generic for all our events, but specific to each one. That isn't great, as I planned to use these inferred schemas later on to deal with data drift and schema changes…
With an ExecuteScript processor (Groovy) we cleaned the AVRO schema in the flowfile attribute inferred.avro.schema.
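The cleanup itself boils down to walking the schema and dropping the generated documentation entries. Our flow does this in a Groovy ExecuteScript; the Python sketch below (with an illustrative, made-up schema) shows the idea:

```python
import json

def strip_docs(node):
    """Recursively remove "doc" entries from an Avro schema structure."""
    if isinstance(node, dict):
        return {k: strip_docs(v) for k, v in node.items() if k != "doc"}
    if isinstance(node, list):
        return [strip_docs(item) for item in node]
    return node

# Illustrative example of an inferred schema cluttered with "doc" entries,
# similar to what InferAvroSchema produces (field names are made up).
inferred = {
    "type": "record",
    "name": "event",
    "doc": "Schema generated by schema inference",
    "fields": [
        {"name": "id", "type": "long", "doc": "Type inferred from '42'"},
        {"name": "msg", "type": "string", "doc": "Type inferred from 'hello'"},
    ],
}

cleaned = strip_docs(inferred)
print(json.dumps(cleaned))
```

Without the per-value "doc" entries, events with the same structure infer to the same schema again, which keeps the schema generic across events.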
With that schema we converted our JSON into AVRO using ConvertJSONToAvro…
Putting it together…
When we put the pieces together on Nifi's canvas, errors were raised.
A "required client_protocol is unset" error (which typically indicates a version mismatch between the Hive client libraries and the server) triggered an internet search that led us to believe CDH 5.8.3 was incompatible with Nifi… and moved us away from Nifi for some tests.
As we were also evaluating StreamSets, our best option was to try streaming our objects into Hive with StreamSets. Our pipeline below just works…
After StreamSets we wanted to build and test this pipeline in Flume too. The Hive Streaming sink in Flume consumes only flat JSON input or delimited text. For that we provided a flattened JSON over Kafka as the source. Our Flume pipeline also works.
Aha, we had overlooked flattening our input for PutHiveStreaming. We put some flattening magic (an ExecuteScript processor) in place on our Nifi canvas and things started to work. Yihaa! Now we've got a working dataflow to Hive ORC with Nifi; below, our canvas.
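The flattening step simply collapses nested JSON objects into a single level of fields. Again, the flow does this in a Groovy ExecuteScript; a minimal Python sketch (event structure and field names are illustrative) of the transformation:

```python
import json

def flatten(obj, prefix=""):
    """Collapse nested JSON objects into a flat dict.

    Keys are joined with "_" rather than ".", since Avro field names
    (and Hive column names) may not contain dots. Arrays are left as-is.
    """
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Illustrative nested event as it might arrive over Kafka.
event = {"id": 1, "payload": {"temp": 21.5, "meta": {"unit": "C"}}}
print(json.dumps(flatten(event)))
# -> {"id": 1, "payload_temp": 21.5, "payload_meta_unit": "C"}
```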
In the upper-left corner you see our streaming data source (ConsumeKafka), in the upper-right corner the PutHBaseCell processor, and in the lower-right corner the PutHiveStreaming processor.
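One prerequisite on the Hive side worth noting: the Hive Streaming API only writes into tables that are bucketed, stored as ORC, and marked transactional. A hedged example DDL (table and column names are illustrative, not our actual schema):

```sql
-- Hive Streaming targets must be bucketed, ORC-backed and transactional.
CREATE TABLE events (
  id   BIGINT,
  msg  STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```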
Lessons learned / ToDo
- Better start with AVRO
- Then, use ConvertAvroSchema to flatten the structure and map/convert attributes -> no need to infer an AVRO schema and clean it afterwards.
- Request a feature for the PutHiveStreaming processor to support expression language on parameters like schema and table name, as PutHBaseCell does.