Fwd: Nifi partition data by date

Fwd: Nifi partition data by date

Santiago Ciciliani
I'm trying to split a stream of data into multiple files based on the date in the content.

So imagine that you are receiving streams of logs and you want to save them as a Hive partitioned table, so that, for example, all records with date 2016-01-01 go into directory dt=2016-01-01.

Is this even possible?

Thanks

Re: Nifi partition data by date

James Wing
This is absolutely possible.  A sample sequence of processors might include:

1. ExtractText - to extract a record date from the flowfile content into an attribute, 'recordgroup' for example (UpdateAttribute cannot read flowfile content)
2. MergeContent - to group related records together, setting the Correlation Attribute Name property to use 'recordgroup'
3. UpdateAttribute - (optional) to apply the 'recordgroup' attribute to the 'path' and/or 'filename' attributes, depending on how you do #4.  May be useful to get customized filenames with extensions.
4. Put* - to write the grouped file to storage (PutFile, PutHDFS, PutS3Object, etc.).  With PutHDFS for example, use Expression Language in the Directory property to apply your grouping - like '/tmp/hive/records/${recordgroup}' to get '/tmp/hive/records/2016-01-01'.

In concept, it's that simple.  The #2 MergeContent step can be more complicated as you consider how many files should be output from the stream, how big they should be, how frequently, and how many bins are likely to be open collecting files at any one time.  You might also consider compressing the files.
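
The steps above can be illustrated outside NiFi with a minimal Python sketch. It is not NiFi code; it only mimics the logic of steps 2 and 4: records are binned by a 'recordgroup' value, a bin is flushed once it reaches a size threshold, and each flushed bin is written under a directory derived from the group. The names DateBinner, max_records and part-NNNNN.log are hypothetical, the sketch assumes each line starts with an ISO date, and the dt= prefix follows the Hive layout from the original question.

```python
import os
import re
from collections import defaultdict

# Assumption: each log line starts with an ISO date such as 2016-01-01.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})")


class DateBinner:
    """Toy stand-in for the MergeContent idea: one open bin per group."""

    def __init__(self, base_dir, max_records=1000):
        self.base_dir = base_dir
        self.max_records = max_records        # flush threshold per bin
        self.bins = defaultdict(list)         # recordgroup -> pending lines
        self.file_counts = defaultdict(int)   # recordgroup -> files written

    def add(self, line):
        """Put a line into the bin for its date; return False if no date."""
        match = DATE_RE.match(line)
        if match is None:
            return False                      # unmatched lines are skipped here
        group = match.group(1)
        self.bins[group].append(line)
        if len(self.bins[group]) >= self.max_records:
            self.flush(group)
        return True

    def flush(self, group):
        """Write one bin to <base_dir>/dt=<group>/ and return the file path."""
        lines = self.bins.pop(group, [])
        if not lines:
            return None
        # Same idea as a Directory of '/tmp/hive/records/dt=${recordgroup}'.
        directory = os.path.join(self.base_dir, "dt=" + group)
        os.makedirs(directory, exist_ok=True)
        name = "part-%05d.log" % self.file_counts[group]
        self.file_counts[group] += 1
        path = os.path.join(directory, name)
        with open(path, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        return path

    def flush_all(self):
        """Flush every bin that is still open (e.g. on a timer or shutdown)."""
        return [self.flush(group) for group in list(self.bins)]
```

A small max_records gives many small files per partition, while a large one keeps more bins open at once; that is the same trade-off described for MergeContent.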

Thanks,

James

On Wed, Nov 2, 2016 at 5:34 PM, Santiago Ciciliani <[hidden email]> wrote:
> [...]

Re: Nifi partition data by date

Joe Witt
I agree with James.  The general pattern here is:

Split with Grouping:
  Take a look at RouteText.  This allows you to efficiently split up
line-oriented data into groups based on matching values, rather than
SplitText, which does a line-for-line split.

Merge Grouped Data:
  The MergeContent processor will do the trick, and you can use its
correlation feature to merge only those flowfiles that are from the same
group/pattern.

Write to destination:
  You can write directly to HDFS using PutHDFS or you can prepare the
data and write to Hive.
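
As a rough illustration of the "split with grouping" idea, here is a small Python sketch. It is only an analogy, not how RouteText is implemented: group_lines and its regular expression are hypothetical, and the sketch assumes the group is the value of the first capturing group of a pattern matched from the start of each line.

```python
import re


def group_lines(text, grouping_regex):
    """Return {group: text} with one entry per distinct captured value."""
    pattern = re.compile(grouping_regex)
    groups = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match is None:
            continue  # assumption: lines that do not match are left out
        groups.setdefault(match.group(1), []).append(line)
    return {name: "\n".join(lines) for name, lines in groups.items()}


log = "\n".join([
    "2016-01-01 10:00:01 start",
    "2016-01-02 00:00:03 rollover",
    "2016-01-01 23:59:59 stop",
])

# The capturing group is the date, so lines sharing a date end up together.
grouped = group_lines(log, r"(\d{4}-\d{2}-\d{2}) .*")
for name, body in grouped.items():
    print("dt=" + name, "->", len(body.splitlines()), "line(s)")
```

A line-for-line split of the same input would yield three outputs; grouping yields one per distinct date, which leaves far less work for the merge step.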

Thanks
Joe

On Wed, Nov 2, 2016 at 9:01 PM, James Wing <[hidden email]> wrote:
> [...]

Re: Nifi partition data by date

Santiago Ciciliani

Hello, thanks everyone for the prompt response.

With your aid I was able to figure it out.

My main problem was understanding the difference between the Grouping Regular Expression and the expression that extracts the date, which in my case are pretty much the same.

I also have to admit that the RouteText.Group attribute was not easy to find, even in the documentation.

I feel that reading logs from a TCP connection and storing them partitioned directly into a Hive table should be a fairly common use case, so I'm attaching the template as a small contribution.

Thanks again



2016-11-02 22:10 GMT-03:00 Joe Witt <[hidden email]>:
> [...]

Attachment: RecordTextToPartition.xml (24K)