
Commit

[ML] Replace the implementation of the categorize_text aggregation (#85872)

This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs
(and now includes the fixes of elastic/ml-cpp#2277).

The docs are updated to reflect the workings of the new implementation.
droberts195 authored May 23, 2022
1 parent 79990fa commit 93bc2e3
Showing 45 changed files with 667 additions and 3,461 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/85872.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 85872
summary: Replace the implementation of the `categorize_text` aggregation
area: Machine Learning
type: enhancement
issues: []
@@ -17,56 +17,14 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>>, or
<<search-aggregations-random-sampler-aggregation,random sampler>> to explore the created categories.

NOTE: The algorithm used for categorization was completely changed in version 8.3.0. As a result, this aggregation
will not work in a mixed-version cluster where some nodes are on version 8.3.0 or higher and others are
on a version older than 8.3.0. Upgrade all nodes in your cluster to the same version if you experience
an error related to this change.

[[bucket-categorize-text-agg-syntax]]
==== Parameters

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_unique_tokens`::
(Optional, integer, default: `50`)
The maximum number of unique tokens at any position up to `max_matched_tokens`.
Must be larger than 1. Smaller values use less memory and create fewer categories.
Larger values will use more memory and create narrower categories.
Max allowed value is `100`.

`max_matched_tokens`::
(Optional, integer, default: `5`)
The maximum number of token positions to match on before attempting to merge categories.
Larger values will use more memory and create narrower categories.
Max allowed value is `100`.

Example:
A `max_matched_tokens` of 2 would disallow merging of the categories
[`foo` `bar` `baz`] and
[`foo` `baz` `bozo`],
since the first 2 tokens are required to match for the category.

NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
added and all new tokens at that position are matched by the `*` token.

`similarity_threshold`::
(Optional, integer, default: `50`)
The minimum percentage of tokens that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters.

`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +53,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====

`shard_size`::
`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters.

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_matched_tokens`::
(Optional, integer)
The number of categorization buckets to return from each shard before merging
all the results.
This parameter does nothing now, but is permitted for compatibility with the original
pre-8.3.0 implementation.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
`max_unique_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
pre-8.3.0 implementation.

`min_doc_count`::
(Optional, integer)
@@ -113,8 +90,23 @@ The minimum number of documents for a bucket to be returned to the results.
The minimum number of documents for a bucket to be returned from the shard before
merging.

==== Basic use
`shard_size`::
(Optional, integer)
The number of categorization buckets to return from each shard before merging
all the results.

`similarity_threshold`::
(Optional, integer, default: `70`)
The minimum percentage of token weight that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
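
To make the role of `similarity_threshold` concrete, here is a simplified sketch in Python that assigns messages to categories by the percentage of matching tokens. It assumes uniform token weights and naive whitespace tokenization; the actual implementation (shared with the ML C++ categorization code) weights tokens internally, so treat this only as an illustration, and all function names here are hypothetical.

```python
def tokenize(message):
    """Naive whitespace tokenization; the real analyzer is configurable."""
    return message.split()

def match_percentage(category_tokens, message_tokens):
    """Percentage of the category's tokens present in the message.

    Assumes every token has equal weight; the real aggregation
    weights tokens individually.
    """
    matched = sum(1 for token in category_tokens if token in message_tokens)
    return 100 * matched / max(len(category_tokens), 1)

def assign(message, categories, similarity_threshold=70):
    """Add the message to the first sufficiently similar category,
    otherwise start a new category from its tokens."""
    tokens = tokenize(message)
    for category in categories:
        if match_percentage(category, tokens) >= similarity_threshold:
            return category
    categories.append(tokens)
    return tokens

categories = []
assign("Node node-1 shutting down", categories)
assign("Node node-2 shutting down", categories)  # 3 of 4 tokens match (75%), so it joins
```

With the default threshold of 70, only closely matching messages share a category; dropping it to 11, as in the broad-categories example later on this page, lets almost any message that shares a single token join an existing category.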

==== Basic use

WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
@@ -149,27 +141,30 @@ Response:
"buckets" : [
{
"doc_count" : 3,
"key" : "Node shutting down"
"key" : "Node shutting down",
"max_matching_length" : 49
},
{
"doc_count" : 1,
"key" : "Node starting up"
"key" : "Node starting up",
"max_matching_length" : 47
},
{
"doc_count" : 1,
"key" : "User foo_325 logging on"
"key" : "User foo_325 logging on",
"max_matching_length" : 52
},
{
"doc_count" : 1,
"key" : "User foo_864 logged off"
"key" : "User foo_864 logged off",
"max_matching_length" : 52
}
]
}
}
}
--------------------------------------------------
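
The `max_matching_length` value that now appears in each bucket reports the longest message length observed for that category, which is useful as a hint when deriving match patterns from categories. As a rough sketch of that bookkeeping (an assumption about what the statistic represents, not the actual internals; names are illustrative):

```python
from collections import defaultdict

# Hypothetical bookkeeping for the `max_matching_length` bucket field:
# track the longest raw message that matched each category.
category_max_len = defaultdict(int)

def record(category_key, message):
    category_max_len[category_key] = max(category_max_len[category_key],
                                         len(message))

record("Node shutting down", "Node node-1 is shutting down with exit code 0")
record("Node shutting down", "Node node-2 shutting down")
```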


Here is an example using `categorization_filters`.

[source,console]
@@ -202,19 +197,23 @@ category results
"buckets" : [
{
"doc_count" : 3,
"key" : "Node shutting down"
"key" : "Node shutting down",
"max_matching_length" : 49
},
{
"doc_count" : 1,
"key" : "Node starting up"
"key" : "Node starting up",
"max_matching_length" : 47
},
{
"doc_count" : 1,
"key" : "User logged off"
"key" : "User logged off",
"max_matching_length" : 52
},
{
"doc_count" : 1,
"key" : "User logging on"
"key" : "User logging on",
"max_matching_length" : 52
}
]
}
@@ -223,11 +222,15 @@ category results
--------------------------------------------------

Here is an example using `categorization_filters`.
The default analyzer is a whitespace analyzer with a custom token filter
which filters out tokens that start with any number.
The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
also uses the `first_line_with_letters` character filter, so that only the first meaningful line
of multi-line messages is considered.
But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
tokens are held in memory for the categories.
custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
happen.)
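
The pre-tokenization effect of a filter pattern such as the `\\w+\\_\\d{3}` used in the request below can be sketched as a plain regex substitution, much like a `pattern_replace` character filter would perform (sketch only; `apply_filters` is an illustrative name, not part of any API):

```python
import re

# Patterns supplied as `categorization_filters`; matching sequences are
# removed from the field value before tokenization. In the JSON request
# the pattern is written "\\w+\\_\\d{3}", i.e. the regex \w+_\d{3}.
CATEGORIZATION_FILTERS = [r"\w+_\d{3}"]  # variable tokens such as foo_325

def apply_filters(message, filters=CATEGORIZATION_FILTERS):
    """Strip every substring matching any filter, then collapse whitespace."""
    for pattern in filters:
        message = re.sub(pattern, "", message)
    return " ".join(message.split())

print(apply_filters("User foo_325 logging on"))  # User logging on
```

Both `User foo_325 logging on` and `User foo_864 logging on` then normalize to the same text, so they fall into one category regardless of the username.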

[source,console]
--------------------------------------------------
@@ -238,8 +241,7 @@ POST log-messages/_search?filter_path=aggregations
"categorize_text": {
"field": "message",
"categorization_filters": ["\\w+\\_\\d{3}"], <1>
"max_matched_tokens": 2, <2>
"similarity_threshold": 30 <3>
"similarity_threshold": 11 <2>
}
}
}
@@ -248,12 +250,12 @@ POST log-messages/_search?filter_path=aggregations
// TEST[setup:categorize_text]
<1> The filters to apply to the analyzed tokens. It filters
out tokens like `bar_123`.
<2> Require at least 2 tokens before the log categories attempt to merge together
<3> Require 30% of the tokens to match before expanding a log categories
to add a new log entry
<2> Require 11% of token weight to match before adding a message to an
existing category rather than creating a new one.

The resulting categories are now broad, matching the first token
and merging the log groups.
The resulting categories are now very broad, merging the log groups.
(A `similarity_threshold` of 11% is generally too low. Settings over
50% are usually better.)

[source,console-result]
--------------------------------------------------
@@ -263,11 +265,13 @@ and merging the log groups.
"buckets" : [
{
"doc_count" : 4,
"key" : "Node *"
"key" : "Node",
"max_matching_length" : 49
},
{
"doc_count" : 2,
"key" : "User *"
"key" : "User",
"max_matching_length" : 52
}
]
}
@@ -326,6 +330,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 2,
"key" : "Node shutting down",
"max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -352,6 +357,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node starting up",
"max_matching_length" : 47,
"hit" : {
"hits" : {
"total" : {
@@ -387,6 +393,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node shutting down",
"max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -413,6 +420,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logged off",
"max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
@@ -439,6 +447,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logging on",
"max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
20 changes: 16 additions & 4 deletions x-pack/plugin/build.gradle
@@ -118,16 +118,28 @@ tasks.named("yamlRestTestV7CompatTransform").configure { task ->
"ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed",
"behaviour change #44752 - not allowing to update datafeed job_id"
)
task.skipTest(
"ml/trained_model_cat_apis/Test cat trained models",
"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
)
task.skipTest(
"ml/categorization_agg/Test categorization agg simple",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest(
"ml/categorization_agg/Test categorization aggregation against unsupported field",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest(
"ml/categorization_agg/Test categorization aggregation with poor settings",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest("rollup/delete_job/Test basic delete_job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/delete_job/Test delete job twice", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/delete_job/Test delete running job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/get_jobs/Test basic get_jobs", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/put_job/Test basic put_job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/start_job/Test start job twice", "rollup was an experimental feature, also see #41227")
task.skipTest(
"ml/trained_model_cat_apis/Test cat trained models",
"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
)
task.skipTest("indices.freeze/30_usage/Usage stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
task.skipTest("indices.freeze/20_stats/Translog stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
task.skipTest("indices.freeze/10_basic/Basic", "#70192 -- the freeze index API is removed from 8.0")
@@ -27,7 +27,7 @@
import static org.hamcrest.Matchers.not;
import static org.hamcrest.Matchers.notANumber;

public class CategorizationAggregationIT extends BaseMlIntegTestCase {
public class CategorizeTextAggregationIT extends BaseMlIntegTestCase {

private static final String DATA_INDEX = "categorization-agg-data";

@@ -77,17 +77,17 @@ public void testAggregationWithBroadCategories() {
.setSize(0)
.setTrackTotalHits(false)
.addAggregation(
// Overriding the similarity threshold to just 11% (default is 70%) results in the
// "Node started" and "Node stopped" messages being grouped in the same category
new CategorizeTextAggregationBuilder("categorize", "msg").setSimilarityThreshold(11)
.setMaxUniqueTokens(2)
.setMaxMatchedTokens(1)
.subAggregation(AggregationBuilders.max("max").field("time"))
.subAggregation(AggregationBuilders.min("min").field("time"))
)
.get();
InternalCategorizationAggregation agg = response.getAggregations().get("categorize");
assertThat(agg.getBuckets(), hasSize(2));

assertCategorizationBucket(agg.getBuckets().get(0), "Node *", 4);
assertCategorizationBucket(agg.getBuckets().get(0), "Node", 4);
assertCategorizationBucket(agg.getBuckets().get(1), "Failed to shutdown error org.aaaa.bbbb.Cccc line caused by foo exception", 2);
}

@@ -16,8 +16,8 @@
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder;
import org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation;
import org.elasticsearch.xpack.ml.aggs.categorization.CategorizeTextAggregationBuilder;
import org.elasticsearch.xpack.ml.aggs.categorization.InternalCategorizationAggregation;
import org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase;

import java.util.Arrays;
@@ -1431,16 +1431,7 @@ public List<AggregationSpec> getAggregations() {
CategorizeTextAggregationBuilder::new,
CategorizeTextAggregationBuilder.PARSER
).addResultReader(InternalCategorizationAggregation::new)
.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME)),
// TODO: in the long term only keep one or other of these categorization aggregations
new AggregationSpec(
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME,
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder::new,
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.PARSER
).addResultReader(org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation::new)
.setAggregatorRegistrar(
s -> s.registerUsage(org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME)
)
.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME))
);
}
