[ML] Replace the implementation of the categorize_text aggregation #85872

Merged
merged 18 commits into from May 23, 2022
Changes from 5 commits
5 changes: 5 additions & 0 deletions docs/changelog/85872.yaml
@@ -0,0 +1,5 @@
pr: 85872
summary: Replace the implementation of the `categorize_text` aggregation
area: Machine Learning
type: enhancement
issues: []
@@ -20,53 +20,6 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
[[bucket-categorize-text-agg-syntax]]
==== Parameters

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_unique_tokens`::
(Optional, integer, default: `50`)
The maximum number of unique tokens at any position up to `max_matched_tokens`.
Must be larger than 1. Smaller values use less memory and create fewer categories.
Larger values will use more memory and create narrower categories.
Max allowed value is `100`.

`max_matched_tokens`::
(Optional, integer, default: `5`)
The maximum number of token positions to match on before attempting to merge categories.
Larger values will use more memory and create narrower categories.
Max allowed value is `100`.

Example:
`max_matched_tokens` of 2 would disallow merging of the categories
[`foo` `bar` `baz`]
[`foo` `baz` `bozo`]
As the first 2 tokens are required to match for the category.

NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
added and all new tokens at that position are matched by the `*` token.

`similarity_threshold`::
(Optional, integer, default: `50`)
The minimum percentage of tokens that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters.

`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +48,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====

`shard_size`::
`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters (see the sketch after this parameter list).

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_matched_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
implementation.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
`max_unique_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
implementation.

`min_doc_count`::
(Optional, integer)
@@ -113,8 +85,23 @@ The minimum number of documents for a bucket to be returned to the results.
The minimum number of documents for a bucket to be returned from the shard before
merging.

==== Basic use
`shard_size`::
(Optional, integer)
The number of categorization buckets to return from each shard before merging
all the results.

`similarity_threshold`::
(Optional, integer, default: `70`)
The minimum percentage of token weight that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
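
As an editor's sketch (not part of this diff), here is how these parameters can
fit together, assuming the `log-messages` index used in the examples below. It
shows the `categorization_analyzer` form of filtering, with the
`categorization_filters` expression from the later example rewritten as a
`pattern_replace` character filter on top of the default `ml_standard`
tokenizer, plus illustrative values for the sizing parameters:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "similarity_threshold": 70,
        "size": 10,
        "shard_size": 100,
        "min_doc_count": 1,
        "categorization_analyzer": {
          "char_filter": [
            {
              "type": "pattern_replace",
              "pattern": "\\w+\\_\\d{3}",
              "replacement": ""
            }
          ],
          "tokenizer": "ml_standard"
        }
      }
    }
  }
}
--------------------------------------------------

The no-op `max_unique_tokens` and `max_matched_tokens` parameters could also
appear in such a request; after this change they are accepted for compatibility
but have no effect.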

==== Basic use

WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
@@ -149,19 +136,23 @@ Response:
"buckets" : [
{
"doc_count" : 3,
"key" : "Node shutting down"
"key" : "Node shutting down",
"max_matching_length" : 49
},
{
"doc_count" : 1,
"key" : "Node starting up"
"key" : "Node starting up",
"max_matching_length" : 47
},
{
"doc_count" : 1,
"key" : "User foo_325 logging on"
"key" : "User foo_325 logging on",
"max_matching_length" : 52
},
{
"doc_count" : 1,
"key" : "User foo_864 logged off"
"key" : "User foo_864 logged off",
"max_matching_length" : 52
}
]
}
@@ -202,19 +193,23 @@ category results
"buckets" : [
{
"doc_count" : 3,
"key" : "Node shutting down"
"key" : "Node shutting down",
"max_matching_length" : 49
},
{
"doc_count" : 1,
"key" : "Node starting up"
"key" : "Node starting up",
"max_matching_length" : 47
},
{
"doc_count" : 1,
"key" : "User logged off"
"key" : "User logged off",
"max_matching_length" : 52
},
{
"doc_count" : 1,
"key" : "User logging on"
"key" : "User logging on",
"max_matching_length" : 52
}
]
}
@@ -223,11 +218,15 @@ category results
--------------------------------------------------

Here is an example using `categorization_filters`.
The default analyzer is a whitespace analyzer with a custom token filter
which filters out tokens that start with any number.
The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
also uses the `first_line_with_letters` character filter, so that only the first meaningful line
of multi-line messages is considered.
But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
tokens are held in memory for the categories.
custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
happen.)

[source,console]
--------------------------------------------------
@@ -238,8 +237,7 @@ POST log-messages/_search?filter_path=aggregations
"categorize_text": {
"field": "message",
"categorization_filters": ["\\w+\\_\\d{3}"], <1>
"max_matched_tokens": 2, <2>
"similarity_threshold": 30 <3>
"similarity_threshold": 11 <2>
}
}
}
@@ -248,12 +246,12 @@ POST log-messages/_search?filter_path=aggregations
// TEST[setup:categorize_text]
<1> The filters to apply to the analyzed tokens. It filters
out tokens like `bar_123`.
<2> Require at least 2 tokens before the log categories attempt to merge together
<3> Require 30% of the tokens to match before expanding a log categories
to add a new log entry
<2> Require 11% of token weight to match before adding a message to an
existing category rather than creating a new one.

The resulting categories are now broad, matching the first token
and merging the log groups.
The resulting categories are now very broad, merging the log groups.
(A `similarity_threshold` of 11% is generally too low. Settings over
50% are usually better.)

[source,console-result]
--------------------------------------------------
@@ -263,11 +261,13 @@ and merging the log groups.
"buckets" : [
{
"doc_count" : 4,
"key" : "Node *"
"key" : "Node",
"max_matching_length" : 49
},
{
"doc_count" : 2,
"key" : "User *"
"key" : "User",
"max_matching_length" : 52
}
]
}
@@ -326,6 +326,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 2,
"key" : "Node shutting down",
"max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -352,6 +353,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node starting up",
"max_matching_length" : 47,
"hit" : {
"hits" : {
"total" : {
@@ -387,6 +389,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node shutting down",
"max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -413,6 +416,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logged off",
"max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
@@ -439,6 +443,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logging on",
"max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
20 changes: 16 additions & 4 deletions x-pack/plugin/build.gradle
@@ -118,16 +118,28 @@ tasks.named("yamlRestTestV7CompatTransform").configure { task ->
"ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed",
"behaviour change #44752 - not allowing to update datafeed job_id"
)
task.skipTest(
"ml/trained_model_cat_apis/Test cat trained models",
"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
)
task.skipTest(
"ml/categorization_agg/Test categorization agg simple",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest(
"ml/categorization_agg/Test categorization aggregation against unsupported field",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest(
"ml/categorization_agg/Test categorization aggregation with poor settings",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest("rollup/delete_job/Test basic delete_job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/delete_job/Test delete job twice", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/delete_job/Test delete running job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/get_jobs/Test basic get_jobs", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/put_job/Test basic put_job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/start_job/Test start job twice", "rollup was an experimental feature, also see #41227")
task.skipTest(
"ml/trained_model_cat_apis/Test cat trained models",
"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
)
task.skipTest("indices.freeze/30_usage/Usage stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
task.skipTest("indices.freeze/20_stats/Translog stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
task.skipTest("indices.freeze/10_basic/Basic", "#70192 -- the freeze index API is removed from 8.0")
@@ -27,7 +27,7 @@
import static org.hamcrest.Matchers.not;
import static org.hamcrest.Matchers.notANumber;

public class CategorizationAggregationIT extends BaseMlIntegTestCase {
public class CategorizeTextAggregationIT extends BaseMlIntegTestCase {

private static final String DATA_INDEX = "categorization-agg-data";

@@ -77,17 +77,17 @@ public void testAggregationWithBroadCategories() {
.setSize(0)
.setTrackTotalHits(false)
.addAggregation(
// Overriding the similarity threshold to just 11% (default is 70%) results in the
// "Node started" and "Node stopped" messages being grouped in the same category
new CategorizeTextAggregationBuilder("categorize", "msg").setSimilarityThreshold(11)
.setMaxUniqueTokens(2)
.setMaxMatchedTokens(1)
.subAggregation(AggregationBuilders.max("max").field("time"))
.subAggregation(AggregationBuilders.min("min").field("time"))
)
.get();
InternalCategorizationAggregation agg = response.getAggregations().get("categorize");
assertThat(agg.getBuckets(), hasSize(2));

assertCategorizationBucket(agg.getBuckets().get(0), "Node *", 4);
assertCategorizationBucket(agg.getBuckets().get(0), "Node", 4);
assertCategorizationBucket(agg.getBuckets().get(1), "Failed to shutdown error org.aaaa.bbbb.Cccc line caused by foo exception", 2);
}

@@ -16,8 +16,8 @@
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder;
import org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation;
import org.elasticsearch.xpack.ml.aggs.categorization.CategorizeTextAggregationBuilder;
import org.elasticsearch.xpack.ml.aggs.categorization.InternalCategorizationAggregation;
import org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase;

import java.util.Arrays;
@@ -1417,16 +1417,7 @@ public List<AggregationSpec> getAggregations() {
CategorizeTextAggregationBuilder::new,
CategorizeTextAggregationBuilder.PARSER
).addResultReader(InternalCategorizationAggregation::new)
.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME)),
// TODO: in the long term only keep one or other of these categorization aggregations
new AggregationSpec(
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME,
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder::new,
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.PARSER
).addResultReader(org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation::new)
.setAggregatorRegistrar(
s -> s.registerUsage(org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME)
)
.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME))
);
}
