[ML] [AIOps] Uses standard analyzer in log pattern analysis to ensure filter in Discover matches correct documents #172188
Pinging @elastic/ml-ui (:ml)
Tested and LGTM
LGTM
const categorizationAnalyzer: AggregationsCustomCategorizeTextAnalyzer = {
  char_filter: ['first_line_with_letters'],
  tokenizer: 'standard',
It would be good to add a comment here saying:
- This is basically the default categorization analyzer but with the `standard` tokenizer instead of `ml_standard`.
- The `ml_standard` tokenizer splits tokens in a way that was observed to give better categories in testing many years ago. The downside of these better categories is potential failures to find the original documents when using the category tokens to search for them.
- Ideally we'd use the tokenizer from the mappings of the field being categorized, but that's too hard, so using `standard` is a quick compromise.
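A minimal sketch of how the requested comment could sit on the analyzer definition. The `CategorizeTextAnalyzerSketch` interface is a local stand-in (an assumption, so the snippet is self-contained) for the client's `AggregationsCustomCategorizeTextAnalyzer` type, and only the two properties visible in the diff are shown; the actual definition may contain more.

```typescript
// Local stand-in for AggregationsCustomCategorizeTextAnalyzer (assumption:
// declared here only so this sketch compiles without the Elasticsearch client).
interface CategorizeTextAnalyzerSketch {
  char_filter?: string[];
  tokenizer?: string;
}

// This is basically the default categorization analyzer, but with the
// `standard` tokenizer instead of `ml_standard`.
// `ml_standard` was observed to give better categories in testing, but its
// tokens can fail to find the original documents when used in a search.
// Ideally the tokenizer would come from the mappings of the field being
// categorized; using `standard` is a quick compromise.
const categorizationAnalyzer: CategorizeTextAnalyzerSketch = {
  char_filter: ['first_line_with_letters'],
  tokenizer: 'standard',
};
```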
Comment added in 287e868
💚 Build Succeeded
## Summary

Fixes #176387. The `standard` analyzer for log pattern analysis introduced in #172188 might return patterns that interfere with identifying significant patterns across time ranges, for example if a pattern matches different parts of a date or time. This update allows the analyzer for log rate analysis to be set to `ml_standard` while keeping `standard` for log pattern analysis.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [x] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
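One way to picture the split described in this summary is a small helper that picks the tokenizer per feature. This is a sketch only: the function name `getCategorizationAnalyzer`, its boolean parameter, and the local interface are hypothetical and are not taken from the Kibana code.

```typescript
type CategorizationTokenizer = 'standard' | 'ml_standard';

// Hypothetical local type; stands in for the client's analyzer type.
interface CategorizeTextAnalyzerSketch {
  char_filter: string[];
  tokenizer: CategorizationTokenizer;
}

// Hypothetical helper: returns the same analyzer shape for both features and
// only switches the tokenizer.
function getCategorizationAnalyzer(useStandardTokenizer: boolean): CategorizeTextAnalyzerSketch {
  return {
    char_filter: ['first_line_with_letters'],
    tokenizer: useStandardTokenizer ? 'standard' : 'ml_standard',
  };
}

// Log pattern analysis keeps `standard` so that category keys match in Discover.
const logPatternAnalyzer = getCategorizationAnalyzer(true);

// Log rate analysis uses `ml_standard` to avoid patterns that split dates or times.
const logRateAnalyzer = getCategorizationAnalyzer(false);
```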
Fixes #169523
The `categorize_text` agg uses the `ml_standard` tokenizer by default, which produces slightly different tokens compared to the `standard` tokenizer, which is the default used for search.

This means the category key (which is composed of these tokens) will occasionally not match any documents when it is used as a filter in Discover to find docs in a category.
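To make the mismatch concrete, here is a sketch of how a category key might be turned into a search filter. The helper name `buildCategoryFilter` and the exact query options are assumptions for illustration, not the query Kibana is confirmed to send. A `match` query analyzes its query text with the field's search analyzer, so with `operator: 'and'` every token that analyzer derives from the key has to exist in the indexed document. If the key was built from `ml_standard` tokens that the `standard` tokenizer would have split differently, the filter can return nothing.

```typescript
interface Category {
  key: string; // space-separated tokens returned by categorize_text
}

// Hypothetical helper: one `match` clause per category, any of which may match.
function buildCategoryFilter(field: string, categories: Category[]) {
  return {
    bool: {
      should: categories.map(({ key }) => ({
        match: {
          [field]: {
            query: key,
            operator: 'and' as const, // every token of the key must be present
            fuzziness: 0,
            auto_generate_synonyms_phrase_query: false,
          },
        },
      })),
      minimum_should_match: 1,
    },
  };
}

const categoryFilter = buildCategoryFilter('message', [{ key: 'Failed login for user' }]);
```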
This PR ensures the `standard` tokenizer is always used in the pattern analysis query.

A future enhancement would be to check which analyzer is specified in the mappings for the source field and to use that instead of unconditionally using `standard`. However, for an initial fix, using the `standard` analyzer will be more likely to match the results from the majority of searches.
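As a rough illustration of where the analyzer ends up, the sketch below builds a `categorize_text` aggregation that overrides the default analyzer through `categorization_analyzer`. The helper name `buildPatternAnalysisAggs`, the aggregation name `categories`, and the default `size` are hypothetical choices for this example.

```typescript
// Hypothetical helper: builds the aggs section of a pattern analysis search.
function buildPatternAnalysisAggs(field: string, size: number = 1000) {
  return {
    categories: {
      categorize_text: {
        field,
        size,
        // Override the default (`ml_standard`) so category keys are made of
        // tokens that a default text search can find again.
        categorization_analyzer: {
          char_filter: ['first_line_with_letters'],
          tokenizer: 'standard',
        },
      },
    },
  };
}

const patternAnalysisAggs = buildPatternAnalysisAggs('message');
```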