Duplicate Attribute Values in Extract Text Processor Output


muhyid72
Dear all,
I need some information about the flow file attributes produced by the Extract Text processor.
My flow is as follows:

1. Get IIS log files from Azure Blob Storage.
2. Split each IIS log file line by line with the Split Text processor.
2.1. Line Split Count: 1
2.2. Maximum Fragment Size: No value set
2.3. Header Line Count: 0
2.4. Header Line Marker Characters: No value set
2.5. Remove Trailing Newlines: True
3. Transfer the new flow files produced by the Split Text processor to the
Extract Text processor.
3.1. All properties are default.
3.2. I added one regex to the properties, because I would like to carry the
flow file attribute on to Syslog.
3.2.1. Property Name: msg
3.2.2. Value: (.*).
4. Transfer all flow files coming from Extract Text to the Put Syslog
processor.
4.1. All properties are default or configured as required (such as the IP
address and port of the Syslog server).
4.2. Message Body: IISHttp${msg}

When I check the flow file attributes in Data Provenance for the Extract Text
processor, I see three attributes with identical values:
Msg: 2020-06-24 13:33:49 XXXX GET /Test/Service/test.css
YYYY 200 0 0 852 7005 921
Msg.1: 2020-06-24 13:33:49 XXXX GET /Test/Service/test.css
YYYY 200 0 0 852 7005 921
Msg.2: 2020-06-24 13:33:49 XXXX GET /Test/Service/test.css
YYYY 200 0 0 852 7005 921

How can I remove the duplicate attributes from the Extract Text output? Or do
I need to use another approach?
Do you have any comments or suggestions?

My environment details are below:
Apache NiFi 1.11.3
Windows Server 2016
Java JRE 1.8.0_241 (64 Bit)



--
Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/

Re: Duplicate Attribute Values in Extract Text Processor Output

Andy LoPresto
The regex you’re using contains a capture group, and so the entire string is captured as one attribute, and then the contained capture groups are also extracted as attributes. You can set the property “Include Capture Group 0” to false to remove one of them. The others are provided as expected. 
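
To make the distinction concrete, here is a minimal sketch of how the whole match and the capture group differ for the pattern from the original post. It uses Python's `re` module rather than the Java regex engine that NiFi runs on; for a pattern this simple the two should behave the same, but that equivalence is an assumption, and the sample line is simply the one quoted earlier in the thread.

```python
import re

# The pattern from the thread: one capture group followed by one more character.
PATTERN = re.compile(r"(.*).")

line = "2020-06-24 13:33:49 XXXX GET /Test/Service/test.css YYYY 200 0 0 852 7005 921"
match = PATTERN.search(line)

whole_match = match.group(0)  # everything the expression matched (capture group 0)
first_group = match.group(1)  # only what the parentheses captured

print(repr(whole_match))
print(repr(first_group))
```

Group 0 is the full line, while group 1 stops one character short, because the trailing `.` outside the parentheses consumes the last character. If the full line is wanted in the group, `(.*)` without the trailing dot would capture it.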


Re: Duplicate Attribute Values in Extract Text Processor Output

muhyid72
Hi Andy,

Thank you for your quick answer and interest.

Actually, I tried that, but there were still two attributes on the flow file.
As far as I understand, this is by design: I can't get just one attribute;
there are always at least two. Am I right?

Can I use the Route Text processor instead of Extract Text (I gave my
Extract Text configuration above)? Do you have any comments?




Re: Duplicate Attribute Values in Extract Text Processor Output

Andy LoPresto
The resulting flowfile will always have at least two attributes because the whole match is extracted as an attribute and every capture group is extracted as an attribute, and the expression must contain at least one capture group. 

What is the objective you are trying to accomplish? If you want to route flowfiles based on their text contents, you can use RouteText. If you want to extract text content to attributes, use ExtractText. 

The use case you described above basically retrieves a log file from blob storage, splits each file to individual lines, extracts the content of each line (minus the final character) into an attribute, and then sends the values to Syslog. 

You may want to look at the record processors to improve the performance and simplicity of the flow substantially. 



Re: Duplicate Attribute Values in Extract Text Processor Output

muhyid72
Hi Andy,
Thank you for your great support.
My aim is to transfer all IIS logs to syslog line by line. That is why I am
using Split Text to break each file into lines. I tried Route Text yesterday,
but I could not get it to transfer the logs line by line to syslog.
Extract Text puts each split line into an attribute, so I can configure the
syslog processor with "Message Body: IISHttp${msg}".
My actual problem is a bottleneck at Extract Text. I have to transfer the IIS
logs in near real time because of a cyber security process, but the processor
does not drain the queued messages fast enough. I tried increasing the number
of threads, changing the Run Duration, and increasing and reducing the queue
size, but I could not reach my target. The queue between Split Text and
Extract Text is always full, and I have a log gap of about 12 hours. I am
trying to find a way to fix that.




Re: Duplicate Attribute Values in Extract Text Processor Output

Mark Payne
If performance is the problem, then you definitely want to get rid of any SplitText / Split* processors.
These processors are great when they are necessary but they should be avoided if at all possible, because splitting the data apart results in huge overhead for NiFi and will harm performance [1] (plus it just makes the flow a lot more complex).

Better options would be to use ReplaceText in order to convert the existing IIS Log message into a Syslog-formatted message or to use UpdateRecord with a CSV Reader and a Syslog Writer, adding in fields to the UpdateRecord processor like /hostname = localhost, /priority = 4, /message = CONCAT('IISHttp', .), and so on.

If you’re sending data over TCP, there is no need to split the data up at all. You can just send the entire text, newline delimited, over TCP using PutTCP instead of PutSyslog.
If you want to send over UDP, you may end up needing to use a SplitText just before PutUDP, but at least that would offer better performance because you only have a single processor operating on tiny FlowFiles.
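
As a rough illustration of that difference, the sketch below contrasts the two transports in plain Python. It is not how PutTCP or PutUDP are implemented; the function names, the host and the port are hypothetical, and UTF-8 encoding is an assumption. Over TCP the whole log travels as one newline-delimited stream, while UDP needs one datagram per line, which is where the per-line split comes from.

```python
import socket
from typing import List


def tcp_payload(log_text: str) -> bytes:
    """Return the whole log as a single newline-delimited byte stream."""
    lines = [line for line in log_text.splitlines() if line]
    if not lines:
        return b""
    return ("\n".join(lines) + "\n").encode("utf-8")


def udp_datagrams(log_text: str) -> List[bytes]:
    """Return one datagram per log line; this is the split that UDP requires."""
    return [line.encode("utf-8") for line in log_text.splitlines() if line]


def send_over_tcp(log_text: str, host: str, port: int) -> None:
    """Send every line over one TCP connection (host and port are placeholders)."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(tcp_payload(log_text))
```

With TCP the receiver separates messages at the newline, so the sender never has to create one object per line.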

Thanks
-Mark






Re: Duplicate Attribute Values in Extract Text Processor Output

muhyid72
Hi Mark,

Thank you so much for the valuable advice.

I tried PutTCP and it seems to be working. I would like to summarize your
explanation and ask a question.

If I understand correctly:

1. Get the IIS log files from Azure Blob Storage, same as before.
1.1. List Azure Blob Storage processor
1.2. Route on Attribute processor (I have a date filter regex on it)
1.3. Fetch Azure Blob Storage processor

2. I will not use the Split Text processor, as you explained.
3. I will not use the Extract Text processor, as you explained.
4. I will not use the Put Syslog processor, as you explained.

5. The Fetch Azure Blob Storage processor will connect directly to the Put
TCP processor.
6. Put TCP processor
6.1. Hostname: Syslog server
6.2. Port: Syslog server port (TCP)
6.3. Outgoing Message Delimiter: \n (to separate each line of the IIS log
file, so that one line is transferred to syslog at a time)
6.4. SSL Context Service: StandardRestrictedSSLContextService (configured
for mutual authentication)
6.5. The rest of the properties will be default.

I need your help from this point on, because I have not used the PutTCP
processor before today.

7. I need to add some prefixes for the Syslog server to each line delimited
by \n. How do I do this?
7.1. Each line should begin with these prefixes:
7.1.1. Message Timestamp: ${now():format('MMM d HH:mm:ss')}
7.1.2. Message Hostname: ${hostname(true)}
7.2. After these two prefixes, the message body should include the word
IISHttp (Message Body: IISHttp ${msg}).

Thanks in advance for your help.




Re: Duplicate Attribute Values in Extract Text Processor Output

Mark Payne
You’ll want to connect FetchAzureBlob -> ReplaceText -> PutTCP.

ReplaceText would use the Evaluation Mode of Line-by-Line to update the text. Or, alternatively, you could use UpdateRecord.
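
The effect of a line-by-line replacement can be sketched outside NiFi as follows. This is plain Python, not ReplaceText itself: the function name `prefix_lines` is hypothetical, and the prefix layout (timestamp, hostname, then IISHttp) is taken from the requirements stated earlier in the thread. In ReplaceText terms this would presumably correspond to prepending a replacement value built with Expression Language to each line; that mapping is an assumption and should be checked against the processor documentation.

```python
from datetime import datetime


def prefix_lines(text: str, hostname: str, now: datetime, tag: str = "IISHttp") -> str:
    """Prepend 'MMM d HH:mm:ss <hostname> <tag> ' to every non-blank line."""
    # Day of month is not zero-padded, mirroring the 'MMM d HH:mm:ss' format.
    stamp = f"{now:%b} {now.day} {now:%H:%M:%S}"
    prefix = f"{stamp} {hostname} {tag} "
    # keepends=True preserves each line's original newline.
    return "".join(
        prefix + line
        for line in text.splitlines(keepends=True)
        if line.strip()
    )
```

Calling `prefix_lines(content, "SERVER1", datetime.now())` on the content of one log file returns the same lines, each starting with the syslog-style prefix, so the file never has to be split into one flow file per line.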

Thanks
-Mark



Re: Duplicate Attribute Values in Extract Text Processor Output

muhyid72
Hi Mark,

Thanks for your answer.

Actually, I don't have much experience with NiFi, and I don't think I
understood your explanation correctly.

I want to prepend extra words to the beginning of each line.

For example, a line in my IIS log file looks like this:
2020-03-13 13:59:19 XXX-YYY  GET /Maintenance/Status.svc
X-ARR-LOG-ID=267ed22c-f1b 200 0 0 1005 1086 46

The line should look like this:
*Jun 26 23:29:09 SERVER1 IISHttp *2020-03-13 13:59:19 XXX-YYY  GET
/Maintenance/Status.svc X-ARR-LOG-ID=267ed22c-f1b 200 0 0 1005 1086 46

When I looked into the Replace Text and Update Record processors, I could not
find how to do that.

I have attached my current flow to this message:

SyslogTransferFlow2.jpg
<http://apache-nifi-users-list.2361937.n4.nabble.com/file/t893/SyslogTransferFlow2.jpg>  




Re: Duplicate Attribute Values in Extract Text Processor Output

muhyid72
Hi Mark,
I would like to thank you for your advice. I implemented the method you
described. It is working and gives better performance.


