[ML] Replace the implementation of the categorize_text aggregation #85872

Merged
merged 18 commits into from May 23, 2022
Changes from 5 commits
5 changes: 5 additions & 0 deletions docs/changelog/85872.yaml
@@ -0,0 +1,5 @@
pr: 85872
summary: Replace the implementation of the `categorize_text` aggregation
area: Machine Learning
type: enhancement
issues: []
@@ -20,53 +20,6 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
[[bucket-categorize-text-agg-syntax]]
==== Parameters

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_unique_tokens`::
(Optional, integer, default: `50`)
The maximum number of unique tokens at any position up to `max_matched_tokens`.
Must be larger than 1. Smaller values use less memory and create fewer categories.
Larger values will use more memory and create narrower categories.
Max allowed value is `100`.

`max_matched_tokens`::
(Optional, integer, default: `5`)
The maximum number of token positions to match on before attempting to merge categories.
Larger values will use more memory and create narrower categories.
Max allowed value is `100`.

Example:
`max_matched_tokens` of 2 would disallow merging of the categories
[`foo` `bar` `baz`]
[`foo` `baz` `bozo`]
As the first 2 tokens are required to match for the category.

NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
added and all new tokens at that position are matched by the `*` token.

`similarity_threshold`::
(Optional, integer, default: `50`)
The minimum percentage of tokens that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters.

`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +48,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====

`shard_size`::
`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters (see the sketch after this parameter list).

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_matched_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
implementation.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
`max_unique_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
implementation.

`min_doc_count`::
(Optional, integer)
@@ -113,8 +85,23 @@ The minimum number of documents for a bucket to be returned to the results.
The minimum number of documents for a bucket to be returned from the shard before
merging.

==== Basic use
`shard_size`::
(Optional, integer)
The number of categorization buckets to return from each shard before merging
all the results.

`similarity_threshold`::
(Optional, integer, default: `70`)
The minimum percentage of token weight that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
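
As an editor's sketch (not part of this diff), here is how these parameters can
fit together, assuming the `log-messages` index used in the examples below. It
shows the `categorization_analyzer` form of filtering, with the
`categorization_filters` expression from the later example rewritten as a
`pattern_replace` character filter on top of the default `ml_standard`
tokenizer, plus illustrative values for the sizing parameters:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "similarity_threshold": 70,
        "size": 10,
        "shard_size": 100,
        "min_doc_count": 1,
        "categorization_analyzer": {
          "char_filter": [
            {
              "type": "pattern_replace",
              "pattern": "\\w+\\_\\d{3}",
              "replacement": ""
            }
          ],
          "tokenizer": "ml_standard"
        }
      }
    }
  }
}
--------------------------------------------------

The no-op `max_unique_tokens` and `max_matched_tokens` parameters could also
appear in such a request; after this change they are accepted for compatibility
but have no effect.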

==== Basic use

WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
@@ -149,19 +136,23 @@ Response:
"buckets" : [
{
"doc_count" : 3,
"key" : "Node shutting down"
"key" : "Node shutting down",
"max_matching_length" : 49
},
{
"doc_count" : 1,
"key" : "Node starting up"
"key" : "Node starting up",
"max_matching_length" : 47
},
{
"doc_count" : 1,
"key" : "User foo_325 logging on"
"key" : "User foo_325 logging on",
"max_matching_length" : 52
},
{
"doc_count" : 1,
"key" : "User foo_864 logged off"
"key" : "User foo_864 logged off",
"max_matching_length" : 52
}
]
}
@@ -202,19 +193,23 @@ category results
"buckets" : [
{
"doc_count" : 3,
"key" : "Node shutting down"
"key" : "Node shutting down",
"max_matching_length" : 49
},
{
"doc_count" : 1,
"key" : "Node starting up"
"key" : "Node starting up",
"max_matching_length" : 47
},
{
"doc_count" : 1,
"key" : "User logged off"
"key" : "User logged off",
"max_matching_length" : 52
},
{
"doc_count" : 1,
"key" : "User logging on"
"key" : "User logging on",
"max_matching_length" : 52
}
]
}
@@ -223,11 +218,15 @@ category results
--------------------------------------------------

Here is an example using `categorization_filters`.
The default analyzer is a whitespace analyzer with a custom token filter
which filters out tokens that start with any number.
The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
also uses the `first_line_with_letters` character filter, so that only the first meaningful line
of multi-line messages is considered.
But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
tokens are held in memory for the categories.
custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
happen.)

[source,console]
--------------------------------------------------
@@ -238,8 +237,7 @@ POST log-messages/_search?filter_path=aggregations
"categorize_text": {
"field": "message",
"categorization_filters": ["\\w+\\_\\d{3}"], <1>
"max_matched_tokens": 2, <2>
"similarity_threshold": 30 <3>
"similarity_threshold": 11 <2>
}
}
}
@@ -248,12 +246,12 @@ POST log-messages/_search?filter_path=aggregations
// TEST[setup:categorize_text]
<1> The filters to apply to the analyzed tokens. It filters
out tokens like `bar_123`.
<2> Require at least 2 tokens before the log categories attempt to merge together
<3> Require 30% of the tokens to match before expanding a log categories
to add a new log entry
<2> Require 11% of token weight to match before adding a message to an
existing category rather than creating a new one.

The resulting categories are now broad, matching the first token
and merging the log groups.
The resulting categories are now very broad, merging the log groups.
(A `similarity_threshold` of 11% is generally too low. Settings over
50% are usually better.)

[source,console-result]
--------------------------------------------------
@@ -263,11 +261,13 @@ and merging the log groups.
"buckets" : [
{
"doc_count" : 4,
"key" : "Node *"
"key" : "Node",
"max_matching_length" : 49
},
{
"doc_count" : 2,
"key" : "User *"
"key" : "User",
"max_matching_length" : 52
}
]
}
@@ -326,6 +326,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 2,
"key" : "Node shutting down",
"max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -352,6 +353,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node starting up",
"max_matching_length" : 47,
"hit" : {
"hits" : {
"total" : {
@@ -387,6 +389,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node shutting down",
"max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -413,6 +416,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logged off",
"max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
@@ -439,6 +443,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logging on",
"max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
20 changes: 16 additions & 4 deletions x-pack/plugin/build.gradle
@@ -118,16 +118,28 @@ tasks.named("yamlRestTestV7CompatTransform").configure { task ->
"ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed",
"behaviour change #44752 - not allowing to update datafeed job_id"
)
task.skipTest(
"ml/trained_model_cat_apis/Test cat trained models",
"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
)
task.skipTest(
"ml/categorization_agg/Test categorization agg simple",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest(
"ml/categorization_agg/Test categorization aggregation against unsupported field",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest(
"ml/categorization_agg/Test categorization aggregation with poor settings",
"categorize_text was changed in 8.3, but experimental prior to the change"
)
task.skipTest("rollup/delete_job/Test basic delete_job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/delete_job/Test delete job twice", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/delete_job/Test delete running job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/get_jobs/Test basic get_jobs", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/put_job/Test basic put_job", "rollup was an experimental feature, also see #41227")
task.skipTest("rollup/start_job/Test start job twice", "rollup was an experimental feature, also see #41227")
task.skipTest(
"ml/trained_model_cat_apis/Test cat trained models",
"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
)
task.skipTest("indices.freeze/30_usage/Usage stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
task.skipTest("indices.freeze/20_stats/Translog stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
task.skipTest("indices.freeze/10_basic/Basic", "#70192 -- the freeze index API is removed from 8.0")
@@ -27,7 +27,7 @@
import static org.hamcrest.Matchers.not;
import static org.hamcrest.Matchers.notANumber;

public class CategorizationAggregationIT extends BaseMlIntegTestCase {
public class CategorizeTextAggregationIT extends BaseMlIntegTestCase {

private static final String DATA_INDEX = "categorization-agg-data";

@@ -77,17 +77,17 @@ public void testAggregationWithBroadCategories() {
.setSize(0)
.setTrackTotalHits(false)
.addAggregation(
// Overriding the similarity threshold to just 11% (default is 70%) results in the
// "Node started" and "Node stopped" messages being grouped in the same category
new CategorizeTextAggregationBuilder("categorize", "msg").setSimilarityThreshold(11)
.setMaxUniqueTokens(2)
.setMaxMatchedTokens(1)
.subAggregation(AggregationBuilders.max("max").field("time"))
.subAggregation(AggregationBuilders.min("min").field("time"))
)
.get();
InternalCategorizationAggregation agg = response.getAggregations().get("categorize");
assertThat(agg.getBuckets(), hasSize(2));

assertCategorizationBucket(agg.getBuckets().get(0), "Node *", 4);
assertCategorizationBucket(agg.getBuckets().get(0), "Node", 4);
assertCategorizationBucket(agg.getBuckets().get(1), "Failed to shutdown error org.aaaa.bbbb.Cccc line caused by foo exception", 2);
}

@@ -16,8 +16,8 @@
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder;
import org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation;
import org.elasticsearch.xpack.ml.aggs.categorization.CategorizeTextAggregationBuilder;
import org.elasticsearch.xpack.ml.aggs.categorization.InternalCategorizationAggregation;
import org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase;

import java.util.Arrays;
@@ -1417,16 +1417,7 @@ public List<AggregationSpec> getAggregations() {
CategorizeTextAggregationBuilder::new,
CategorizeTextAggregationBuilder.PARSER
).addResultReader(InternalCategorizationAggregation::new)
.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME)),
// TODO: in the long term only keep one or other of these categorization aggregations
new AggregationSpec(
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME,
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder::new,
org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.PARSER
).addResultReader(org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation::new)
.setAggregatorRegistrar(
s -> s.registerUsage(org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME)
)
.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME))
);
}
