stream one large file, only once

Raf Huys
I would like to read in a large (several GB) log file and route every line to a (potentially different) Kafka topic.

- I don't want this file to be in memory
- I want it to be read once, not more

Using `GetFile` takes the whole file into memory. Same with `FetchFile`, as far as I can see.

I also tried an `ExecuteProcess` processor that `cat`s the file and splits off a FlowFile every millisecond. This looked like a somewhat streaming approach to the problem, but the processor runs continuously (or cron-based), so the log file is re-injected all the time.

What's the typical NiFi approach for this? Thanks.

Raf Huys

Re: stream one large file, only once

Andrew Grande
Neither GetFile nor FetchFile reads the file into memory; they only deal with the file handle and pass the contents via a stream into the content repository (NiFi writes data in and reads it back as a stream).

What you will face, however, is an issue with SplitText when you try to split the file in a single transaction. This can fail depending on the allocated JVM heap and the file size. A recommended best practice in this case is to introduce a series of two SplitText processors: the first pass splits into chunks of, e.g., 10,000 rows; the second splits those chunks into individual lines. Adjust for your expected file sizes and available memory.
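For illustration, a minimal sketch of that two-stage flow (the `Line Split Count` values are example settings to tune, not prescriptions):

GetFile
  -> SplitText #1  (Line Split Count = 10000)   first pass: 10,000-line chunks
  -> SplitText #2  (Line Split Count = 1)       second pass: individual lines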

HTH,
Andrew

Re: stream one large file, only once

Joe Witt
The pattern you want for this is:

1) GetFile or (ListFile + FetchFile)
2) RouteText
3) PublishKafka

As Andrew points out, GetFile and FetchFile do *not* read the file
contents into memory.  The whole point of NiFi's design in general is
to take advantage of the content repository rather than forcing
components to hold things in memory.  While components can elect to
hold things in memory, they don't have to: the repository allows
reading from and writing to streams, all within a unit-of-work
transactional model.  There is a lot more to say on that topic, but
you can see a good bit about it in the docs.

RouteText is the way to avoid the SplitText memory scenario where
there are so many lines that even holding the pointers/metadata about
those lines becomes problematic.  You can also split in chunks as
Andrew points out, which works well too, but RouteText will likely
yield higher overall performance if it works for your case.
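
For intuition only, the streaming behavior you get from this flow is
roughly what this plain-Java sketch does by hand: one pass over the
file, one line in memory at a time, each line routed to a topic.  The
file path, broker address, and routing rule below are placeholders,
not anything NiFi-specific:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StreamLogToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources: the file is read once, line by line,
        // and is never fully held in memory
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader =
                     Files.newBufferedReader(Paths.get("/var/log/app.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // hypothetical per-line routing rule, analogous to a RouteText property
                String topic = line.contains("ERROR") ? "app-errors" : "app-events";
                producer.send(new ProducerRecord<>(topic, line));
            }
        }
    }
}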

Thanks
Joe

Re: stream one large file, only once

Raf Huys
Thanks for making this clear!

I was thrown off because I do get a `java.lang.OutOfMemoryError` on the GetFile processor itself (and a matching `bytes read` spike corresponding to the file size).

--
Kind regards,

Raf Huys

Re: stream one large file, only once

Joe Witt
OOM errors can often show you the symptom more readily than the cause.

If you have SplitText after it, then what Andrew mentioned is almost
certainly the cause.  If RouteText meets the need, I think you'll
find it yields far better behavior.  The way I'd do what it sounds
like you're doing is:

ListFile
FetchFile
RouteText
PublishKafka (using a message demarcator matching whatever your
end-of-line bytes are)

This will be very efficient and use very little memory.
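
A rough sketch of that flow with example settings (the RouteText rule
and topic below are placeholders; property names are from the stock
processors):

ListFile      (Input Directory = /var/log)          emits lightweight listings, tracks state
FetchFile                                           streams content into the content repository
RouteText     (Matching Strategy = Contains;
               dynamic property "errors" = ERROR)   groups matching lines per relationship
PublishKafka  (Topic Name = errors;
               Message Demarcator = newline)        one Kafka message per line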

Thanks
Joe
