Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow range types to be used for enrich matching #76110

Merged
merged 26 commits into from
Sep 7, 2021

Conversation

mjmbischoff
Copy link
Contributor

@mjmbischoff mjmbischoff commented Aug 4, 2021

Currently we hardcode that the data type for the match field on the enrich index, when the MATCH policy is used, this is (always) keyword. This breaks look ups when source indicies use a rangetype for lookup. For example when a ip_range is used the literal value from the source is indexed as a keyword and therefor doesn't match any of the ip's.

This PR improves the code by checking for the field type on the source indicies and then proceeds to use the range type in the enrich index if this is the type for all source indicies. If the types are different across source indices, we fallback to keyword. For non-range types we also fallback to keyword, emulating the current behavior.


  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Aug 4, 2021
@mjmbischoff mjmbischoff requested a review from martijnvg August 4, 2021 16:28
@mjmbischoff mjmbischoff requested a review from jbaiera August 6, 2021 10:22
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks pretty good here, though I can't seem to remember if we had plans to separate ranges out from the match enrich policy type. Searching for data using indexed ranges feels closer to how geo_match works, but the underlying query DSL is indeed closer to how match would write it.

@martijnvg do you remember if that was the plan or not?

@martijnvg
Copy link
Member

Apologies for responding late to this PR.
Like James, I think the code in this PR looks good.

@martijnvg do you remember if that was the plan or not?

Yes, the plan was to introduce a new enrich policy type for ranges.
This has the added benefit that actual range can be queried in addition to literal values.
Przemko started with a pr for this a while ago: #65781 (but this got stalled due to inactivity).
However maybe this enhancement to the match type can exist next with the future range policy type?
What do you think?

@jbaiera
Copy link
Member

jbaiera commented Aug 19, 2021

Since we have two separate policy types (match and geo_match) for different query types, I think I'd lean more toward keeping design parity and introducing a third policy type. While the underlying query DSL still uses a match query, we always intended the match policy type to enrich based on singular values. Adding a range policy type gives us the flexibility to support future growth, while adding range support in the match policy type may muddy the waters going forward should the functionality be duplicated.

Does the policy type pertain more to the query DSL created, or is it more concerned with the field type of the enrich key? I'll throw the team discuss label on here to see if we can reach consensus.

@jbaiera jbaiera added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP team-discuss labels Aug 19, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Aug 19, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@martijnvg
Copy link
Member

Since we have two separate policy types (match and geo_match) for different query types, I think I'd lean more toward keeping design parity and introducing a third policy type. While the underlying query DSL still uses a match query, we always intended the match policy type to enrich based on singular values. Adding a range policy type gives us the flexibility to support future growth, while adding range support in the match policy type may muddy the waters going forward should the functionality be duplicated.

Muddying the waters when it comes to policy types is a valid concern if this change would get merged. Thinking from that perspective, maybe the match processor should not be changed to use more field types for the match field.

Does the policy type pertain more to the query DSL created, or is it more concerned with the field type of the enrich key?

I think it is both. It controls what field type to use the match field (during policy execution) and the query that is executed against this field (in the processor implementation).

Adding tests for multi level + fix.
@mjmbischoff
Copy link
Contributor Author

Apologies for responding late to this PR.

No worries, you should enjoy your PTO.

Przemko started with a pr for this a while ago: #65781 (but this got stalled due to inactivity).

Missed this, will have a closer look at it to see what I've missed.

Added the changes to make it a separate policy but realized I forgot about the configuring part. I'll see if I can pick this up later.

@mjmbischoff
Copy link
Contributor Author

Looking into #65781 , it seems a format field is introduced for date but no such parameter is described for https://www.elastic.co/guide/en/elasticsearch/reference/7.14/query-dsl-term-query.html in our docs (which seems incomplete as it doesn't describe the range matching either) it is described in the type description for the mapping https://www.elastic.co/guide/en/elasticsearch/reference/7.14/range.html

It is described for the range query but that is for taking a range as the input to match on a value (or possible range), while I'm sure some user can find a reason to match like that I think that would be a different policy (and PR) it does make me wonder if the naming for the policy we introduce here is the right one. And potentially re-evaluate the split?

I should add some tests for dates including exotic date formats that it won't magically pick to check if we need the format configured and if it needs to be user specified or copied from the mapping of the source / enrich index.

@martijnvg
Copy link
Member

@mjmbischoff Looks like the format option in #65781 is used for format option of the date field in the enrich index. However perhaps this isn't needed and can the format option be extracted from the mapping of the source index during enrich policy execution.

It is described for the range query but that is for taking a range as the input to match on a value (or possible range), while I'm sure some user can find a reason to match like that I think that would be a different policy (and PR) it does make me wonder if the naming for the policy we introduce here is the right one. And potentially re-evaluate the split?

Right, perhaps ranges should be handled by a different kind of policy, like the one proposed in #65781.

@mjmbischoff
Copy link
Contributor Author

Going with RANGE_TYPE, allowing RANGE_MATCH_TYPE or better name to be used for matching ranges against values, ranges.

Added tests for date_range which indeed triggered on the format being missing. Similar to types, requiring a singular distinct format specification or otherwise we fail and inform the user. If someone feels they need can argue that they need multiple source indicies with different format fields feel free to open an ER arguing your case. Merging these is non-trivial, because one format might mask the other etc.

@mjmbischoff
Copy link
Contributor Author

mjmbischoff commented Aug 30, 2021

@martijnvg @jbaiera think it's ready, can you give it a check.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @mjmbischoff! I left a few more comments around simplification.

@martijnvg
Copy link
Member

@elasticmachine update branch

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick update! I left a few more comments.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this PR @mjmbischoff.
I think this PR looks good now.
@jbaiera do you think this PR is good to get merged?

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the changes!

@mjmbischoff mjmbischoff merged commit c6d4dd8 into elastic:master Sep 7, 2021
@mjmbischoff mjmbischoff deleted the range_enrich branch September 7, 2021 15:38
@mjmbischoff
Copy link
Contributor Author

@martijnvg I did the honors, is it ok if I leave backporting (if applicable) to you?

@martijnvg
Copy link
Member

I did the honors

🎉

is it ok if I leave backporting (if applicable) to you?

Yes, I will do the back porting to the 7.x branch.

martijnvg pushed a commit to martijnvg/elasticsearch that referenced this pull request Sep 8, 2021
martijnvg added a commit that referenced this pull request Sep 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team release highlight Team:Data Management Meta label for data/management team v7.16.0 v8.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants