Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INGEST: Extend KV Processor (#31789) #32232

Merged
merged 2 commits into from
Jul 20, 2018

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Jul 20, 2018

Added more capabilities supported by LS to the KV processor:

  • Stripping of brackets and quotes from values (include_brackets in corresponding LS filter)
  • Adding key prefixes
  • Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

relates to #31786

@original-brownbear original-brownbear added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 v6.4.0 labels Jul 20, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Closes elastic#31786
@original-brownbear
Copy link
Member Author

@rjernst @talevy this should be good to review.
Added all options that I figured would be relevant to help with problems like #31786 that LS has but Ingest doesn't.
Left out some other options, that are kind of pointless in Ingest (like stripping arbitrary chars from any position in keys and values) and can better be handled by grok or gsub.

@jakelandis
Copy link
Contributor

@original-brownbear I think all the changes here are good ... but I don't think it resolves #31786

If I understand correctly the #31786 is to treat anything in quotes for the value as a single entity. For example:

POST _ingest/pipeline/_simulate
{
  "pipeline" : {
    "processors" : [
      { "kv": 
        { "field": "message",
          "field_split": " ",
          "value_split": "="
        }
      }
    ]
  },
  "docs" : [
     {"_source": {
       "message": "a=foo b='bar baz' c=lol"
     }}
    ]
}

should result in b=bar baz , but since " " (space) is the split value, the code doesn't take into account that bar baz is inside a quote and results in an error does not contain value_split [=]. You can also reproduce this with any other field_split such as "field_split": "," with "message": "a=foo,b='bar,baz',c=lol"

@original-brownbear
Copy link
Member Author

@jakelandis yea, the example you're making is still not fixed, but that won't work in LS either (unless you're doing a bunch of Regex trickery).
But at least with this change we get some parity with LS for the case of e.g. all args quoted because we can neatly strip the delimiters now:

e.g.

POST _ingest/pipeline/_simulate
{
  "pipeline" : {
    "processors" : [
      { "kv": 
        { "field": "message",
          "field_split": "' ",
          "strip_brackets": true,
          "value_split": "="
        }
      }
    ]
  },
  "docs" : [
     {"_source": {
       "message": "a='foo' b='bar baz' c='lol'"
     }}
    ]
}
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_type": "_type",
        "_id": "_id",
        "_source": {
          "a": "foo",
          "b": "bar baz",
          "c": "lol",
          "message": "a='foo' b='bar baz' c='lol'"
        },
        "_ingest": {
          "timestamp": "2018-07-20T15:55:56.635799Z"
        }
      }
    }
  ]
}

I think that's what this comment #31786 (comment) is basically asking for after we suggest that the tricky mix of quotes and no quotes is better handled by Grok.

@@ -1732,6 +1732,10 @@ For example, if you have a log message which contains `ip=1.2.3.4 error=REFUSED`
| `include_keys` | no | `null` | List of keys to filter and insert into document. Defaults to including all keys
| `exclude_keys` | no | `null` | List of keys to exclude from document
| `ignore_missing` | no | `false` | If `true` and `field` does not exist or is `null`, the processor quietly exits without modifying the document
| `prefix` | no | `null` | Prefix to be added to extracted keys
| `trim_key` | no | `null` | String of characters to trim from extracted keys
Copy link
Contributor

@jakelandis jakelandis Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you want to call out these are regex character classes ?

edit: clarified question

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakelandis I'd rather not talk about regex patterns here I think. If I do that instead of simply saying I'm trimming all the chars in the given string, then that's probably not really true if you look at the code (I'm building a pattern from the string, arbitrary patterns won't work here).

@jakelandis
Copy link
Contributor

@original-brownbear - fair points.

I think we could change the split on the field to a sequence of grouped regex pattern matches to enable support for mixed quotes. I took a quick swing but came up a bit short, and unfortunately you can't stream pattern matches (out of the box) till java9 (which would be useful for your functional pattern used here).

I did some light testing and the code looks good. So LGTM, but may want to wait for one more given my experience on this code base.

Also, can you leave #31786 open ? This change helps, but does not solve the original intent of that issue, and I think (given a bit more time) that a solution to the original issue is feasible.

@original-brownbear
Copy link
Member Author

@jakelandis thanks, sure I can leave it open, would be neat if we found an elegant solution here that isn't just a tricky regex :)

@original-brownbear
Copy link
Member Author

@rjernst could you take a look here? :)

@rjernst
Copy link
Member

rjernst commented Jul 20, 2018

I defer to @talevy on this one, as he is much more familiar with the logstash side this PR intends to mimic.

@talevy
Copy link
Contributor

talevy commented Jul 20, 2018

Cool! I think it is great to go through the shortcomings of some of our existing processors.

This comment (#32232 (comment)) causes me to wonder a few things. If this isn't fixing the issue, why are we claiming it closes the issue?

I have been away from KV use-cases in Logstash for a while, but I remember its specific behavior was built for very specific application log-lines in mind. Documenting this would be helpful. The reason it lacked parity in the first place was because the case for doing this extra regex magic did not feel right in a processor that simply is called "Key-Value". It felt like it was doing a lot more. I think it is totally reasonable to re-evaluate this and bring in the parity, but I think it would be helpful to bring in more examples as to why it does all this. If it is doing these extra things because a handful of specific log lines do these things, it may be best to provide examples of how to use grok+kv together to achieve proper parsing, rather than to bake in heuristics into this processor.

Let me know your thoughts!

@original-brownbear
Copy link
Member Author

@talevy I think I would agree with @jakelandis here that there should be an easy and clean way to achieve this kind of parsing in KV.
My gut feeling is, that it would be ok to simply handle <([ and '" as special cases here and enable their use as delimiters for matching values somehow (before the actual splitting happens). It seems from looking at the KV filter that the stripping of these delimiters is pretty standard (in fact it's LS default behaviour, here in Ingest I had to make it the optional behaviour to not change existing behaviour) => may be ok to simply identify delimited groups first, then split pairs and then extract the KVs? (just a thought for how to achieve this, I'm also fine with simply deferring to Grok here).

Probably better to have this discussion in the related issue though :), this PR is really more about getting feature parity with LS. I changed the PR description accordingly to "relates" instead of "fixes" :)

@talevy
Copy link
Contributor

talevy commented Jul 20, 2018

thanks @original-brownbear. that helps. I think that is a good plan

@original-brownbear
Copy link
Member Author

original-brownbear commented Jul 20, 2018

@talevy sweet :) Would you still review this one as a step 1 towards that?

@talevy
Copy link
Contributor

talevy commented Jul 20, 2018

yup! @original-brownbear

Copy link
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

In addition to successful tests of trimming keys/values, and strip_brackets. Should we have tests for those when set and do not match the input keys/values? So, when their patterns do not match

| `prefix` | no | `null` | Prefix to be added to extracted keys
| `trim_key` | no | `null` | String of characters to trim from extracted keys
| `trim_value` | no | `null` | String of characters to trim from extracted values
| `strip_brackets` | no | `false` | If `true` strip brackets `()`, `<>`, `[]` as well as quotes `'` and `"` from extracted values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be renamed to be more general than just brackets. maybe strip_surrounding

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talevy yea I agree, brackets isn't a good term here, I went with it because that's what LS uses (see https://github.com/logstash-plugins/logstash-filter-kv/blob/master/lib/logstash/filters/kv.rb#L290). surrounding sounds kinda strange too, not that I have a better suggestion ... maybe strip_delimiters but then that's somewhat weird too since we're already splitting twice elsewhere :P

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is part of the hidden magic that I didn't love in that filter. But I agree, I cannot think of a better word, and having a separate property just for quotes feels like too much variation

if (prefix != null) {
config.put("prefix", prefix);
}
return FACTORY.create(null, randomAlphaOfLength(10), config);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should have additional explicit tests of the factory as we do in KeyValueProcessorFactoryTests for these new config options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talevy but that means adding more getters and fields to the processor that will just be test only code? Testing external behaviour here seems fine. @rjernst wasn't so happy about that approach in another PR and I tend to agree tbh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@original-brownbear
Copy link
Member Author

@talevy when patterns don't match nothing happens, I think that's kinda redundant to test isn't it? (Especially since this already happens implicitly in those tests that mix a few quote/bracket kinds)

@talevy
Copy link
Contributor

talevy commented Jul 20, 2018

@original-brownbear true. I guess I was looking for something more explicit just in case there was different behavior when none match.

Copy link
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Member Author

@talevy thanks!

@original-brownbear original-brownbear merged commit 7aa8a0a into elastic:master Jul 20, 2018
@original-brownbear original-brownbear deleted the 31786 branch July 20, 2018 20:32
@jakelandis
Copy link
Contributor

I have POC [1] for supporting quotes (lifted some code from my dissect implementation). Reg-ex's got real ugly real quick. I will work on getting it cleaned up and better tested and submit a PR.

[1]
jakelandis@0653ae1

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 21, 2018
* INGEST: Extend KV Processor (elastic#31789)

Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Relates elastic#31786
* Add documentation
martijnvg added a commit that referenced this pull request Jul 21, 2018
* es/master: (23 commits)
  Switch full-cluster-restart to new style Requests (#32140)
  [DOCS] Clarified that you must remove X-Pack plugin when upgrading from pre-6.3. (#32016)
  Remove BouncyCastle dependency from runtime (#32193)
  INGEST: Extend KV Processor (#31789) (#32232)
  INGEST: Make a few Processors callable by Painless (#32170)
  Add region ISO code to GeoIP Ingest plugin (#31669)
  [Tests] Remove QueryStringQueryBuilderTests#toQuery class assertions (#32236)
  Make sure that field aliases count towards the total fields limit. (#32222)
  Switch rolling restart to new style Requests (#32147)
  muting failing test for internal auto date histogram to avoid failure before fix is merged
  MINOR: Remove unused `IndexDynamicSettings` (#32237)
  Fix multi level nested sort (#32204)
  Enhance Parent circuit breaker error message (#32056)
  [ML] Use default request durability for .ml-state index (#32233)
  Remove indices stats timeout from monitoring docs
  Rename ranking evaluation response section (#32166)
  Dependencies: Upgrade to joda time 2.10 (#32160)
  Remove aliases resolution limitations when security is enabled (#31952)
  Ensure that field aliases cannot be used in multi-fields. (#32219)
  TESTS: Check for Netty resource leaks (#31861)
  ...
original-brownbear added a commit that referenced this pull request Jul 21, 2018
* INGEST: Extend KV Processor (#31789)

Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Relates #31786
* Add documentation
dnhatn added a commit that referenced this pull request Jul 22, 2018
* 6.x:
  Improve message when JAVA_HOME not set (#32022)
  INGEST: Make a few Processors callable by Painless (#32170) (#32261)
  INGEST: Extend KV Processor (#31789) (#32232) (#32262)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants