
[ML] A text categorization aggregation that works like ML categorization #80867

Merged (24 commits into elastic:master, Apr 13, 2022)

Conversation

droberts195 (Contributor)

This PR adds a text categorization aggregation that uses the same
approaches as the categorization feature of ML anomaly detection
jobs.

droberts195 (Contributor, Author) commented Nov 19, 2021

At this time this is still very much a work-in-progress:

  • It's implemented as a new aggregation, categorize_text2, that exists in parallel to categorize_text. In the long run we clearly don't want both. We need to compare them and choose one to keep. If that's categorize_text2 then it will be renamed to categorize_text before release.
  • Memory accounting/circuit breaking is not properly implemented at present.
  • The code is clearly far more complex than the Drain algorithm used by the existing categorize_text algorithm, and transfers far more data between nodes per category in the reduce phase. Whether this is unacceptable in terms of resource usage needs to be determined.
  • There are some unit tests but they're very basic. Currently there are almost certainly bugs in the implementation. Testing on bigger data is required to smoke these out.

@droberts195 droberts195 marked this pull request as ready for review April 7, 2022 15:28
@elasticmachine elasticmachine added the Team:ML (Meta label for the ML team) label Apr 7, 2022
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

droberts195 (Contributor, Author)

Some more notes on this after another round of improvements:

  • Performance is now very close to that of the existing Drain-based categorize_text aggregation
  • There are now more tests, including an internal cluster test that proves that merging of results generated on different nodes works correctly
  • There are currently no docs
  • A key decision is whether we replace the existing experimental categorize_text aggregation with this, have an intermediate release where we ship both, or ship both indefinitely
  • What to do about docs and API specs obviously depends heavily on that decision
  • One option would be to merge this PR as-is, so that we have both aggregations available in parallel internally for a few weeks, then make a decision closer to 8.3.0 feature freeze and adjust docs and specs accordingly

One more note to reviewers: this PR is not really 80000 lines. Over 95% of these lines are in the categorization dictionary, which is an exact copy of the one we've been shipping for the C++ categorization code for many years.

@droberts195 droberts195 removed the WIP label Apr 11, 2022
elasticsearchmachine (Collaborator)

Hi @droberts195, I've created a changelog YAML for you.

@benwtrent benwtrent self-requested a review April 11, 2022 12:02
benwtrent (Member) left a comment

I will give it a second pass if/when we replace the current categorize_text with it.

Comment on lines +145 to +158
    public static CategorizationPartOfSpeechDictionary getInstance() throws IOException {
        if (instance != null) {
            return instance;
        }
        synchronized (INIT_LOCK) {
            if (instance == null) {
                try (InputStream is = CategorizationPartOfSpeechDictionary.class.getResourceAsStream(DICTIONARY_FILE_PATH)) {
                    instance = new CategorizationPartOfSpeechDictionary(is);
                }
            }
            return instance;
        }
    }
}
Member

++

I am not sure whether we need to add bytes to the circuit breaker for this. I would say that if it is near a MB we may want to.

Basically, getInstance could take the circuit breaker and add bytes when it first loads the dictionary, skipping the accounting when the dictionary is already loaded (since the bytes would already have been added). Those bytes then stay reserved for the lifetime of the node.
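As a rough illustration of that suggestion (not the PR's actual code): the CircuitBreaker interface below is a minimal stand-in for the real org.elasticsearch.common.breaker.CircuitBreaker, Dictionary stands in for CategorizationPartOfSpeechDictionary, and the byte count is a placeholder.

```java
class DictionarySingletonSketch {
    // Minimal stand-in for the real circuit breaker interface, for illustration only.
    interface CircuitBreaker {
        void addEstimateBytesAndMaybeBreak(long bytes, String label);
    }

    // Stand-in for the dictionary; the size is a placeholder, not a measurement.
    static final class Dictionary {
        final long sizeInBytes;
        Dictionary(long sizeInBytes) { this.sizeInBytes = sizeInBytes; }
    }

    private static final Object INIT_LOCK = new Object();
    private static volatile Dictionary instance;

    static Dictionary getInstance(CircuitBreaker breaker) {
        Dictionary local = instance;
        if (local != null) {
            return local; // already loaded: bytes were accounted for on first load
        }
        synchronized (INIT_LOCK) {
            if (instance == null) {
                Dictionary loaded = new Dictionary(1_000_000L); // pretend we parsed the file
                // Reserve the bytes exactly once; they stay for the lifetime of the node.
                breaker.addEstimateBytesAndMaybeBreak(loaded.sizeInBytes, "categorization-dictionary");
                instance = loaded;
            }
            return instance;
        }
    }
}
```

Because the accounting happens inside the synchronized block that performs the load, concurrent first callers cannot double-count the bytes.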

Contributor Author

I think it's best not to add it to the same circuit breaker used by the rest of the aggregation.

Although it's large, it's effectively static data, so it would make most sense to account for it with the "Accounting requests circuit breaker" rather than the "Request circuit breaker". But if indices.breaker.total.use_real_memory is set to true, which it is by default, then the "memory usage of things held in memory that are not released when a request is completed" is taken into account automatically.

I guess we could try to explicitly add it into the "Accounting requests circuit breaker" for the case where real memory circuit breaking is disabled. But this will be messy within the code as the code is written on the basis that what the docs refer to as "memory usage of things held in memory that are not released when a request is completed" is actually field data related to Lucene indices.

The docs also say about the total memory all circuit breakers can use: "Defaults to 70% of JVM heap if indices.breaker.total.use_real_memory is false. If indices.breaker.total.use_real_memory is true, defaults to 95% of the JVM heap." So that implies that if you don't use the real memory circuit breaker to measure fixed overheads then you have to allow some space for unmeasured fixed overheads. So I think this dictionary can be treated as one of those fixed overheads that either gets captured by the real memory circuit breaker or by implicitly reserving a percentage of memory.

/**
 * Matches the value used in <a href="https://github.com/elastic/ml-cpp/blob/main/lib/model/CTokenListReverseSearchCreator.cc">
 * <code>CTokenListReverseSearchCreator</code></a> in the C++ code.
 */
public static final int KEY_BUDGET = 10000;
Member

This could be configurable in the future (with probably this as the sensible default).

Comment on lines 265 to 284
while (commonIndex < commonUniqueTokenIds.size()) {
    TokenAndWeight commonTokenAndWeight = commonUniqueTokenIds.get(commonIndex);
    if (newIndex >= newUniqueTokenIds.size() || commonTokenAndWeight.getTokenId() < newUniqueTokenIds.get(newIndex).getTokenId()) {
        commonUniqueTokenWeight -= commonTokenAndWeight.getWeight();
        commonUniqueTokenIds.remove(commonIndex);
        changed = true;
    } else {
        TokenAndWeight newTokenAndWeight = newUniqueTokenIds.get(newIndex);
        if (commonTokenAndWeight.getTokenId() == newTokenAndWeight.getTokenId()) {
            if (commonTokenAndWeight.getWeight() == newTokenAndWeight.getWeight()) {
                ++commonIndex;
            } else {
                commonUniqueTokenWeight -= commonTokenAndWeight.getWeight();
                commonUniqueTokenIds.remove(commonIndex);
                changed = true;
            }
        }
        ++newIndex;
    }
}
Member

This may be a good place for a future optimization: https://stackoverflow.com/a/6103075/1818849

I am not sure how common remove would be, but iterating into a new array list may be much faster.

Member

Actually, we could call set with a null or static sentinel value, and then, if changed, iterate through creating a new array list.

This way we amortize the runtime to be O(N) instead of something worse due to shifting the indices multiple times when there are multiple tokens being removed.

Contributor Author

In that StackOverflow answer it looks like the "shift down" method is comparably fast. It has the benefit that the member variable can still be final, which is one less thing to worry about when writing other methods. I'll change it to use this method.

However, I doubt it will make much difference to the overall timings because what tends to happen is that very quickly the unique tokens get whittled down to the ones that will eventually define the category and then we don't make any further changes. So the first few merges result in removals but after that there aren't any more.
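For reference, the "shift down" approach from the linked StackOverflow answer can be sketched generically as below; the helper name removeInPlace is illustrative, not from the PR.

```java
import java.util.List;
import java.util.function.Predicate;

class ShiftDownSketch {
    // "Shift down" in-place removal: copy survivors toward the front in a single
    // pass, then trim the unused tail. Each element is touched a constant number
    // of times, so the whole operation is O(n) even when many elements are
    // removed, and the list object itself is reused, so the field holding it
    // can remain final.
    static <T> void removeInPlace(List<T> list, Predicate<T> shouldRemove) {
        int writeIndex = 0;
        for (int readIndex = 0; readIndex < list.size(); readIndex++) {
            T element = list.get(readIndex);
            if (shouldRemove.test(element) == false) {
                list.set(writeIndex++, element);
            }
        }
        // Remove leftover slots from the end, which is O(1) per removal for ArrayList.
        while (list.size() > writeIndex) {
            list.remove(list.size() - 1);
        }
    }
}
```

Compared with calling remove at arbitrary indices, this avoids repeatedly shifting the elements after each removal point.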

Comment on lines 465 to 467
public List<TokenAndWeight> getKeyTokenIds() {
    return baseWeightedTokenIds.stream().filter(this::isTokenIdCommon).collect(Collectors.toList());
}
Member

This seems wasteful. I wonder if we could use a better filter predicate, or something that takes a provided stateful predicate (one that keeps track of the budget).

I suppose USUALLY, this is not a big issue (as our budget is never exceeded), but in the rare case that it is, we allocate a fairly large list for no reason.

Member

Since this is only ever used with SerializableTokenListCategory, it might be good to make it smarter so it does the limitation due to budget.

Contributor Author

Yes, good point. I combined it all in SerializableTokenListCategory and moved the comment with the history there too.
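The budget-aware filtering discussed in this thread might look roughly like the following sketch. The names TokenAndWeight and keyTokens, and the weight-based budget semantics, are illustrative approximations only, not the PR's actual implementation (which lives in SerializableTokenListCategory and uses KEY_BUDGET).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class BudgetedKeyTokensSketch {
    record TokenAndWeight(int tokenId, int weight) {}

    // Collect common tokens only until the budget is spent, instead of
    // materialising the full filtered list first and truncating it afterwards.
    static List<TokenAndWeight> keyTokens(List<TokenAndWeight> tokens, IntPredicate isCommon, int budget) {
        List<TokenAndWeight> result = new ArrayList<>();
        int spent = 0;
        for (TokenAndWeight token : tokens) {
            if (isCommon.test(token.tokenId()) == false) {
                continue; // not a common token, so never a key token
            }
            if (spent + token.weight() > budget) {
                break; // budget exhausted: stop early rather than allocate more
            }
            spent += token.weight();
            result.add(token);
        }
        return result;
    }
}
```

In the usual case where the budget is never exceeded this behaves like the plain stream filter, but in the rare over-budget case it stops allocating as soon as the limit is hit.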

@droberts195 droberts195 requested a review from benwtrent April 12, 2022 16:13
benwtrent (Member) left a comment

I think it looks good as is. Would be nice to see the churn when replacing the current categorizer and do a final pass.

droberts195 (Contributor, Author)

> Would be nice to see the churn when replacing the current categorizer and do a final pass.

How about:

  1. Merge this now.
  2. Open a followup PR that renames it and adjusts the docs but don't merge that one yet. That one will show the code churn in the copied and pasted classes better.
  3. While there are two parallel options in the master branch, do some more comparisons and document the differences.
  4. Shortly before 8.3 feature freeze either merge the rename/docs adjustment PR or revert this one, so that there is only one categorization aggregation in the public 8.3 release.

@droberts195 droberts195 merged commit fede927 into elastic:master Apr 13, 2022
@droberts195 droberts195 deleted the categorize_text2 branch April 13, 2022 13:27
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Apr 13, 2022
This replaces the implementation of the `categorize_text` aggregation
with the new algorithm that was added in elastic#80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs.

The docs are updated to reflect the workings of the new implementation.
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Apr 13, 2022
…n/elasticsearch into datastream-reuse-pipeline-source

* 'datastream-reuse-pipeline-source' of github.com:weizijun/elasticsearch: (28 commits)
  Add JDK 19 to Java testing matrix
  [ML] add nlp config update serialization tests (elastic#85867)
  [ML] A text categorization aggregation that works like ML categorization (elastic#80867)
  [ML] Fix serialisation of text embedding updates (elastic#85863)
  TSDB: fix wrong initial value of tsidOrd in TimeSeriesIndexSearcher (elastic#85713)
  Enforce external id uniqueness during DesiredNode construction (elastic#84227)
  Fix Intellij integration (elastic#85866)
  Upgrade Azure SDK to version 12.14.4 (elastic#83884)
  [discovery-gce] Fix initialisation of transport in FIPS mode (elastic#85817)
  Remove unnecessary docs/changelog/85534.yaml
  Prevent ThreadContext header leak when sending response (elastic#68649)
  Add support for impact_areas to health impacts  (elastic#85830)
  Reduce port range re-use in tests (elastic#85777)
  Fix TranslogTests#testStats (elastic#85828)
  Remove hppc from cat allocation api (elastic#85842)
  Fix BuildTests serialization (elastic#85827)
  Use urgent priority for node shutdown cluster state update (elastic#85838)
  Remove Task classes from HLRC (elastic#85835)
  Remove unused migration classes (elastic#85834)
  Remove uses of Charset name parsing (elastic#85795)
  ...
droberts195 added a commit that referenced this pull request May 23, 2022
…85872)

This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs
(and now includes the fixes of elastic/ml-cpp#2277).

The docs are updated to reflect the workings of the new implementation.
Labels: >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.3.0
6 participants