Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Grok expressions in Beats #5790

Closed
wants to merge 12 commits into from

Conversation

ramon-garcia
Copy link

Hello, I have developed Grok expression support in Beats.

I think that has already been discussed. But I need this feature.
First of all, to save a lot of network traffic in transmission of events to Elasticsearch. Many events are frequent and redundant, like web hits from monitoring system or firewall logs produced by frequent broadcast traffic. To drop these events, one needs from to classify them.

It was argued that it would make the development complex. But the grok.go code is just 232 lines .

Please consider accepting this request

@elasticmachine
Copy link
Collaborator

Can one of the admins verify this patch?

@ramon-garcia
Copy link
Author

Will look into Travis build.

@karmi
Copy link

karmi commented Dec 1, 2017

Hi @ramon-garcia, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@ramon-garcia
Copy link
Author

I think that all the e-mail addresses are in my profile. Am I missing something?

@ramon-garcia
Copy link
Author

I am trying to fix the travis build issues. It seems to complain about formatting issues. Unfortunately I am unable to do a make check in my Windows machine.

@ramon-garcia
Copy link
Author

ramon-garcia commented Dec 2, 2017

Finally, I managed to do "make check" under a Linux machine. make check fails with beat/info.go

$ make check
warning: "github.com/elastic/beats/libbeat/..." matched no packages
./testing/console.go
./processors/actions/grok_test.go
./cmd/export/config.go
./cmd/instance/beat.go
./cmd/instance/beat_test.go
./beat/info.go
Code differs from goimports' style ^

"goimports -d" shows no difference in files grok.go and grok_test.go.

This branch is merged with the master branch.

If there is anything that I can help with, here I am.

@karmi
Copy link

karmi commented Dec 3, 2017

@ramon-garcia, some of the commits have an invalid address, and the CLA check cannot process those, but you have probably force pushed this branch with amended commits, since the CLA check is green now?

@karmi
Copy link

karmi commented Dec 3, 2017

Hmm, something is weird here. When I manually check the PR, it's red, but the status is green here. Can you please make sure you have configured Git with your email, amend the commits with git commit --amend --verbose --reset-author and force push?

@ruflin ruflin added discuss Issue needs further discussion. Filebeat Filebeat labels Dec 4, 2017
@ramon-garcia
Copy link
Author

I have tried to rebase but prehaps I screwed up my branch. I will continue tomorrow.

@ramon-garcia
Copy link
Author

Rebased with author changed. I hope now it is ok.

@tsg
Copy link
Contributor

tsg commented Dec 6, 2017

Hi @ramon-garcia, apologies for the delayed answer. There are two reasons for which we hesitate to add Grok support in Beats:

  • We've got Grok implementations in Logstash and Ingest Node, and we're worried that by adding Grok in Beats as well we're creating confusion around the roles of each component. Also, since the implementations are different, there can be subtle differences between them that create more confusion.

  • Grok, being compiled to regular expressions, tends to require significant CPU time, which is not ideal for a shipper that usually shares the hardware with the application. The Go regexp implementation is not the fastest, and it's going to dominate the CPU profile, so I'd expect to see similar CPU consumption in Filebeat as we see in Logstash or even larger.

For the two reasons above, we were thinking of adding a dissect filter, like Logstash has, before Grok, and see if that is enough for what people usually need. Dissect doesn't use regular expressions, and is faster on most work loads. Would you be interested in contributing a Dissect implementation?

Apologies again for not giving you timely feedback, we wanted to discuss it internally first.

@ramon-garcia
Copy link
Author

ramon-garcia commented Dec 8, 2017

My primary reason for developing this patch is that I want to store clean events, without noise. Storing millions of useless events that reflect repetitive activity is a waste of CPU, storage and obscures the important ones.

With regard to CPU time, in my case, it is usually more scarce in the Elasticsearch server than in other servers. Perhaps this is not the case for everyone, but offering this option helps a lot of use cases.

My experience with pipelines is that they are difficult to debug (there is no way to send a test event to the pipeline and see the output) and one must watch log files in Elasticsearch from time to time. By contrast, any test with filebeat can be done without touching an Elasticsearch server. For similar reasons, my experience with painless scripts was not good.

Finally, there are legal and ethical reasons for avoiding shipping irrelevant logs. This is a general comment with the philosophy of making beats as little processing as possible. Shipping logs from a coworker's workstation that are not relevant for objective security reasons or other demonstrable reasons, that show personal activity, is unethical and probably illegal. Or, for instance, consider the case of the logs of a proxy server that shows web sites visited by coworkers. One could learn, for instance, political affiliation from them. Shipping or storing more that what can be objectively justified must be avoided.

Anyway, in case of disagreement, anyone can take this version of beats and use themselves.

@ramon-garcia
Copy link
Author

I have been quite surprised by the very bad performance of regular expressions under Go.
I hope to help to contribute a fix. It seems that the RE2 regular expression library (on which the the Go regular expression implementation is based) is much faster. I got a 10x factor of performance with a simple benchmark.

Porting optimizations from RE2 to Go regexp library cannot be that complicated.

@ruflin
Copy link
Member

ruflin commented Dec 19, 2017

@ramon-garcia Also have a look at the optimizations to regexp that @urso made like #2433 (there are more PR's related to Matcher).

What do you think of introducing dissect instead of going with regexp?

@ramon-garcia
Copy link
Author

ramon-garcia commented Dec 19, 2017

@ruflin Regular expressions are difficult to replace. For instance, a log file contains many uninteresting things, and also ssh login successes and failures. One wants to extract the user, the IP address and the result of the authentication, and nothing else.

Exchange message tracking: since it is a CSV file, it looks like one can parse it by splitting by commas ... until one sees that the subject lines of message can also contain commas. So commas must be allowed in fields, if they are surrounded by double quotes. For instance, the line
[email protected],"Hello, how are you"
contains two fields, not three.

@ruflin
Copy link
Member

ruflin commented Dec 20, 2017

regexp is definitively more powerful the dissect. I'm kind of hoping dissect could already address a large portion of the problem, it will definitively not be able to solve all of them.

There are also 2 problems from my perspective here:

  • taking a message apart to do filtering on certain "entries"
  • taking a string, structure it to put it into json format

The two problems have a big overlap, but I think for the first one not necessarly all the features are needed that are needed for 2, as with option 2 an exact match of fields is required and in 1 often splitting up in junks is enough and then do on the specific field a regexp.

I think dissect can do more then just tokenize a string, but your problem above is interesting. @ph Do you know if dissect can handle the above "challenge"?

@ph
Copy link
Contributor

ph commented Dec 20, 2017

@ramon-garcia @ruflin regular expression are indeed more powerful than dissect at the expanse of speed. If we look at the dissect test suite we still support quite a few things though.

Concerning logstash we were also suggesting to use conditions with regexp to decided if a dissect filter could be applied or chain multiple dissect together.

%{sender},"%{subject}" I think this would be valid dissect syntax for the example above.

Dissect is more like a tokenizer.

@ph
Copy link
Contributor

ph commented Jan 17, 2020

Sorry for taking a long time to reply.

We have discussed this internally and we have decided to close this issue, even if been able to use grok expression in a beats processor would be a nice enhancement. We do think at the moment that the Grok processors in both the Logstash the Elasticsearch ingest pipeline are good enough for most cases. Using regular expressions on every event will slow thing down quite a bit which is a bit against the idea of getting event out of the machine as fast as possible. We also recommend using dissect instead if possible.

There is also the possibility to create a plugin that beats will load to add this feature, ping me if you want to go that route.

@ph ph closed this Jan 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issue needs further discussion. Filebeat Filebeat
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants