Introduce the dissect library #32297

jakelandis · 2018-07-23T19:44:52Z

The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.

Note to reviewer(s) -

The the ingest node processor that consumes this will be submitted soon.
I have not performed performance testing/optimization on this yet, however, I am pretty confident that as-is it will be faster then Grok, but will get some numbers to back that claim.
Functionality is based on the Logstash dissect plugin, however (mostly) due to tight coupling with Logstash specific code this implementation does not share any code with Logstash.
The Logstash unit tests have been manually ported here as 'testLogstashSpecs'. The tests most generally passed without any modification, however there are some edge case differences between the implementations that may require further discussion. The test has comments to the differences.

cc: @guyboertje @ph

The dissect library will be used for the ingest node as an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Note - Regular expressions are used during construction of the objects, but not in the hot path. A dissect pattern takes the form of: '%{a} %{b},%{c}' which is composed of 3 keys (a,b,c) and two delimiters (space and comma). This dissect pattern will match a string of the form: 'foo bar,baz' and will result a key/value pairing of 'a=foo, b=bar, and c=baz'. See the comments in DissectParser for a full explanation. This commit does not include the ingest node processor that will consume it. However, the consumption should be a trivial mapping between the key/value pairing returned by the parser and the key/value pairing needed for the IngestDocument.

elasticmachine · 2018-07-23T19:45:41Z

Pinging @elastic/es-core-infra

jakelandis · 2018-07-24T18:20:11Z

Changing this PR to a WIP. Running some micro benchmarks shows some potential performance issue in the code, in particular with the match validation (intersection of two lists) and post processing (grouping and sorting).

jakelandis · 2018-07-26T19:27:30Z

I have addressed the performance bottleneck and updated this PR with the 7.0 Dissect behavior (Beats and Logstash will make corresponding changes to support the same specification for 7.0). I will need to create a small fork of this for the 6.x version to better match Beats and Logstash 6.x behavior.

I ran some micro benchmarks comparing this implementation vs. (ingest node) grok. For parsing a basic syslog and apache log line, this is ~ 3x-5x faster. The score is number of times parsed per second (higher is better). Full details here: https://gist.github.com/jakelandis/663306337e370df9c8c0cc6d5a1f9104

# Run complete. Total time: 00:22:02

Benchmark                           Mode  Cnt       Score      Error  Units
DissectBenchmark.dissectApacheLog  thrpt   10  423663.010 ± 2727.341  ops/s
DissectBenchmark.dissectSysLog     thrpt   10  862173.761 ± 8636.887  ops/s
DissectBenchmark.grokApacheLog     thrpt   10   91178.469 ± 2172.043  ops/s
DissectBenchmark.grokSysLog        thrpt   10  300408.741 ± 3586.303  ops/s

I am not sure if this a meaningful difference in the context of the ingest node processing, and will run similar macro benchmarks against ingest node pipelines (on the Dissect Processor PR).

jakelandis · 2018-07-26T19:30:51Z

Removed the WIP. This is ready for review.

@guyboertje - can you please validate the main logic (there is a test that mirrors logstash specs)
@rjernst - could you also review

jakelandis · 2018-07-31T17:07:22Z

I removed the 6.5.0 label from this PR, since the 6.5.0 will require a fork of this PR that implements slightly different rules. After speaking with @ph (beats) and @guyboertje (logstash) the 7.x goal is for stricter compliance to a newly formalized dissect specification. The 6.x goal is for approximate like for like behavior between the 3 products.

This means that the 7.x will be a breaking change to the 6.x, hence the breaking tag.

Specifically, the breaking change will be:

disambiguate the meaning of ? - 6.x it can be either a named skip key or the key of the reference (indirect) key/value pair (? paired with an &)
- ? has been replaced by * for key/value reference pairing (formerly ? and & pairs, now * and & pairs).
- key/value reference pairings * and & (formerly ? and &) are required to come in pairs.

EDIT: After further discussion, this will NOT be a breaking change and will be backported as-is to 6.x. IGNORE THE ABOVE COMMENT.

rjernst

This looks ok from the build side. One general comment is I see a lot of package private methods. If they need to be package private for tests that is fine, but please note this with a comment like // pkg private for tests or something like that. Otherwise, please use the least visibility possible.

jakelandis · 2018-08-08T13:17:27Z

@rjernst - thanks , will add package private comments (or reduce scope if possible)

@guyboertje - could you please review ?

guyboertje

LGTM

jakelandis · 2018-08-14T16:33:31Z

@guyboertje @rjernst thanks!

reduced scope where possible in 9e43ba8

will merge when CI is green.

jakelandis · 2018-08-14T16:37:16Z

Jenkins test this please

* elastic/master: Revert "cluster formation DSL - Gradle integration - part 2 (#32028)" (#32876) cluster formation DSL - Gradle integration - part 2 (#32028) Introduce global checkpoint listeners (#32696) Move connection profile into connection manager (#32858) [ML] Temporarily disabling rolling-upgrade tests Use generic AcknowledgedResponse instead of extended classes (#32859) [ML] Removing old per-partition normalization code (#32816) Use JDK 10 for 6.4 BWC builds (#32866) Removed flaky test. Looks like randomisation makes these assertions unreliable. [test] mute IndexShardTests.testDocStats Introduce the dissect library (#32297) Security: remove password hash bootstrap check (#32440) Move validation to server for put user requests (#32471) [ML] Add high level REST client docs for ML put job endpoint (#32843) Test: Fix forbidden uses in test framework (#32824) Painless: Change fqn_only to no_import (#32817) [test] mute testSearchWithSignificantTermsAgg Watcher: Remove unused hipchat render method (#32211) Watcher: Remove extraneous auth classes (#32300) Watcher: migrate PagerDuty v1 events API to v2 API (#32285)

jakelandis · 2018-08-21T13:38:24Z

Removing the breaking label ... after further discussion it preferred to have a minor difference between ingest/Logstash/Beats in the 6.x instead of a breaking change in ES.

I will backport this PR as-is to 6.x (ignore any comments above about breaking changes).

The dissect library will be used for the ingest node as an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Note - Regular expressions are used during construction of the objects, but not in the hot path. A dissect pattern takes the form of: '%{a} %{b},%{c}' which is composed of 3 keys (a,b,c) and two delimiters (space and comma). This dissect pattern will match a string of the form: 'foo bar,baz' and will result a key/value pairing of 'a=foo, b=bar, and c=baz'. See the comments in DissectParser for a full explanation. This commit does not include the ingest node processor that will consume it. However, the consumption should be a trivial mapping between the key/value pairing returned by the parser and the key/value pairing needed for the IngestDocument.

* Introduce the dissect library (#32297) The dissect library will be used for the ingest node as an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Note - Regular expressions are used during construction of the objects, but not in the hot path. A dissect pattern takes the form of: '%{a} %{b},%{c}' which is composed of 3 keys (a,b,c) and two delimiters (space and comma). This dissect pattern will match a string of the form: 'foo bar,baz' and will result a key/value pairing of 'a=foo, b=bar, and c=baz'. See the comments in DissectParser for a full explanation. This commit does not include the ingest node processor that will consume it. However, the consumption should be a trivial mapping between the key/value pairing returned by the parser and the key/value pairing needed for the IngestDocument. * ingest: Introduce the dissect processor (#32884) The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work. * ingest: minor - update test to include dissect (#33211) This change also includes placing the bytes processor in the correct order (helps to avoid merge conflict when back patching processors)

jakelandis added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 v6.5.0 labels Jul 23, 2018

minor cosmetic fixes

79c8843

jakelandis changed the title ~~Introduce the dissect library~~ WIP: Introduce the dissect library Jul 24, 2018

jakelandis added 5 commits July 25, 2018 17:29

Rework the validation and post processing for better performance

923ce5d

add a couple unit tests that match grok's tests

bb57f2f

change reference operator from ? to * (7.0 behavior)

6862333

Add support for named skiped fields (7.0 support)

f3bd24a

Merge branch 'master' into dissect_lib

b68120a

jakelandis changed the title ~~WIP: Introduce the dissect library~~ Introduce the dissect library Jul 26, 2018

updates to better match and test against specification

ce6bb75

jakelandis added >breaking and removed v6.5.0 labels Jul 31, 2018

rjernst approved these changes Aug 3, 2018

View reviewed changes

jakelandis requested a review from guyboertje August 8, 2018 13:17

jakelandis added 2 commits August 13, 2018 09:49

fix TODO in doc

c7a54a3

Merge branch 'master' into dissect_lib

dba3b1c

guyboertje approved these changes Aug 14, 2018

View reviewed changes

jakelandis added 2 commits August 14, 2018 11:30

reduce scope where possible

9e43ba8

Merge branch 'master' into dissect_lib

4eb3392

jakelandis merged commit be62092 into elastic:master Aug 15, 2018

jakelandis mentioned this pull request Aug 15, 2018

ingest: Introduce the dissect processor #32884

Merged

jakelandis added backport pending v6.5.0 and removed >breaking labels Aug 21, 2018

jakelandis mentioned this pull request Sep 5, 2018

ingest: Dissect processor and lib for 6x #33422

Merged

jakelandis removed the backport pending label Sep 5, 2018

Mpdreamz mentioned this pull request Dec 13, 2018

[meta] 6.5.0 Release elastic/elasticsearch-net#3457

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce the dissect library #32297

Introduce the dissect library #32297

jakelandis commented Jul 23, 2018

elasticmachine commented Jul 23, 2018

jakelandis commented Jul 24, 2018

jakelandis commented Jul 26, 2018

jakelandis commented Jul 26, 2018

jakelandis commented Jul 31, 2018 •

edited

Loading

rjernst left a comment

jakelandis commented Aug 8, 2018

guyboertje left a comment

jakelandis commented Aug 14, 2018

jakelandis commented Aug 14, 2018

jakelandis commented Aug 21, 2018

Introduce the dissect library #32297

Introduce the dissect library #32297

Conversation

jakelandis commented Jul 23, 2018

elasticmachine commented Jul 23, 2018

jakelandis commented Jul 24, 2018

jakelandis commented Jul 26, 2018

jakelandis commented Jul 26, 2018

jakelandis commented Jul 31, 2018 • edited Loading

rjernst left a comment

Choose a reason for hiding this comment

jakelandis commented Aug 8, 2018

guyboertje left a comment

Choose a reason for hiding this comment

jakelandis commented Aug 14, 2018

jakelandis commented Aug 14, 2018

jakelandis commented Aug 21, 2018

jakelandis commented Jul 31, 2018 •

edited

Loading