[ML] Replace the implementation of the categorize_text aggregation #85872

droberts195 · 2022-04-13T15:05:21Z

This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs.

The docs are updated to reflect the workings of the new implementation.

elasticsearchmachine · 2022-04-13T15:05:45Z

Hi @droberts195, I've created a changelog YAML for you.

elasticmachine · 2022-04-20T10:44:37Z

Pinging @elastic/ml-core (Team:ML)

benwtrent · 2022-04-20T16:09:41Z

...in/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizeTextAggregationBuilder.java

@@ -362,6 +315,6 @@ public String getType() {

    @Override
    public Version getMinimalSupportedVersion() {
-        return Version.V_7_16_0;
+        return Version.V_8_3_0;


How does this work when somebody does categorize_text from a 7.16 node and it attempts to serialize to an 8.3 node?

I know this minimal supported version prevents us from WRITING to older nodes, but what if an older node is writing to it?

We may need to add a clause at the top of stream input parsing to make sure the input stream is at least 8.3?

This is a good point.

While I was making changes I realised that a newer node can actually use the new algorithm and respond to an older coordinating node in the format that the older node wants. So I have implemented that. The merging of categories on the coordinating node might not work brilliantly if different nodes have created their local categories in different ways, but it should be better than nothing.

With the changes of b3fe740 I think the communications between old and new nodes should at least be safe, even though the results might be a bit strange sometimes.
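
The read-side clause suggested in the question above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `VersionedInput`, `ReadGuardSketch`, `readCategoryCount`, the one-int payload, and the integer encoding of the version are all hypothetical stand-ins for the real Elasticsearch stream classes.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a versioned input stream; this is NOT the Elasticsearch StreamInput API.
final class VersionedInput {
    final int version;        // wire version of the sending node (assumed encoding: 80300 means 8.3.0)
    final ByteBuffer payload; // serialized bytes received from that node

    VersionedInput(int version, byte[] bytes) {
        this.version = version;
        this.payload = ByteBuffer.wrap(bytes);
    }
}

final class ReadGuardSketch {
    static final int V_8_3_0 = 80300; // assumed integer encoding of Version.V_8_3_0

    // Reads an illustrative one-int payload, rejecting streams written by pre-8.3 nodes.
    static int readCategoryCount(VersionedInput in) {
        if (in.version < V_8_3_0) {
            throw new IllegalStateException(
                "cannot read categorize_text results written by a node before 8.3.0");
        }
        return in.payload.getInt();
    }
}
```

Failing at the top of the read path means a stream in the old layout is never interpreted with the new layout, so a version mismatch shows up as a clear error rather than as corrupted categories.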

benwtrent · 2022-04-20T16:14:35Z

...n/java/org/elasticsearch/xpack/ml/aggs/categorization/InternalCategorizationAggregation.java

 import java.util.stream.Collectors;

-import static org.elasticsearch.xpack.ml.aggs.categorization.CategorizationBytesRefHash.WILD_CARD_REF;
-
 public class InternalCategorizationAggregation extends InternalMultiBucketAggregation<


My main concern is that, for some weird reason, older nodes did the aggregating and the coordinator is trying to read in old InternalCategorizationAggregation objects.

I would hope the getMinimalSupportedVersion in the builder would prevent this, as the builder implies the factory is created on that node. That means only nodes on 8.3.0 or later would receive the new builder and reply back.

Good point. I've made the stream input and output safe for older nodes.

In the case of a coordinating node on an older version it's possible for the newer nodes to use the new algorithm and then serialize their results in the old format. So that scenario can work to some extent - probably not brilliantly as the merging might not work well for categories created in completely different ways, but better than nothing.

Like you say, the other way around should not be possible because of the minimal supported version, but I've made it safe if there's some unforeseen loophole.

1. If the coordinating node is 8.3 or higher then it won't allow the aggregation to be used if there are pre-8.3 nodes in the cluster.
2. If the coordinating node is pre-8.3 then the 8.3 or higher nodes will use the new algorithm but serialize the categories local to the node in the old format. The merging might not be great, but this should mean some categories can still be returned to the user.
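
The two scenarios, as described at this point in the review, can be sketched as a small decision function. This is only an illustration under stated assumptions: `MixedClusterSketch`, `Outcome`, `outcome`, and the integer version encoding are hypothetical names, and the real check is driven by `getMinimalSupportedVersion` rather than by a function like this.

```java
import java.util.List;

final class MixedClusterSketch {
    static final int V_8_3_0 = 80300; // assumed integer encoding of 8.3.0

    enum Outcome {
        REJECTED,   // scenario 1: 8.3+ coordinator with at least one pre-8.3 node in the cluster
        NEW_FORMAT, // 8.3+ coordinator and every node on 8.3 or higher
        OLD_FORMAT  // scenario 2: pre-8.3 coordinator, newer nodes reply in the old format
    }

    // Decides how a categorize_text request is handled for the given coordinator and node versions.
    static Outcome outcome(int coordinatorVersion, List<Integer> nodeVersions) {
        if (coordinatorVersion < V_8_3_0) {
            return Outcome.OLD_FORMAT;
        }
        for (int nodeVersion : nodeVersions) {
            if (nodeVersion < V_8_3_0) {
                return Outcome.REJECTED;
            }
        }
        return Outcome.NEW_FORMAT;
    }
}
```

For example, an 8.3.0 coordinator in a cluster that still contains a 7.16 node yields `REJECTED`, while a 7.16 coordinator yields `OLD_FORMAT` regardless of the other nodes.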

benwtrent · 2022-04-21T11:33:43Z

...in/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizeTextAggregationBuilder.java

+        // If the coordinating node is an older version then we might still receive messages from older
+        // nodes. In this case we can send back results for this node created using the new algorithm.
+        // They won't necessarily merge well with results from other nodes, but are better than nothing.


Since it's experimental, I think we should throw. This is not critical functionality, and nothing systemic is using this aggregation. We should throw, especially since running in a mixed cluster environment is a temporary situation.

The new internal agg implementation serializing back to the old coordinator will throw anyway (due to things not being read off the wire well). It would be cheaper for the users (and make the cause of the issue more obvious) if we just throw.

> The new internal agg implementation serializing back to the old coordinator will throw anyway (due to things not being read off the wire well).

It won't throw on the old coordinator with the changes of b3fe740 - I changed the new nodes to serialize in the format the old nodes will expect.

But still, you're probably right that a simple "not supported in mixed cluster" exception is better as it's less likely to cause people to waste time debugging strange results. I'll change the 4 methods to throw exceptions instead.

benwtrent · 2022-04-21T12:14:51Z

...in/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizeTextAggregationBuilder.java

+        // Disallow this aggregation in mixed version clusters that cross the algorithm change boundary.
+        if (out.getVersion().before(ALGORITHM_CHANGED_VERSION)) {


we MAY get this for free with the versioned named writeable stuff. But, keeping this here is cool with me.
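
A write-side guard of the kind shown in the diff fragment above might look roughly like this in isolation. It is a hedged sketch, not the merged code: `WriteGuardSketch`, `ensureWritable`, the exception type, the message text, and the integer version encoding are assumptions made for illustration, and only the name `ALGORITHM_CHANGED_VERSION` comes from the diff.

```java
final class WriteGuardSketch {
    // Assumed integer encoding of 8.3.0; the constant name is taken from the diff above.
    static final int ALGORITHM_CHANGED_VERSION = 80300;

    // Throws if the receiving node predates the algorithm change, instead of writing to it.
    static void ensureWritable(int receiverVersion) {
        if (receiverVersion < ALGORITHM_CHANGED_VERSION) {
            throw new UnsupportedOperationException(
                "categorize_text is not supported in a mixed version cluster "
                    + "that crosses the 8.3.0 algorithm change");
        }
    }
}
```

Calling `ensureWritable(71600)` throws, while `ensureWritable(80300)` returns normally, so the user sees one explicit error rather than oddly merged categories.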

droberts195 · 2022-05-05T08:13:04Z

@elasticmachine update branch

droberts195 · 2022-05-17T16:16:25Z

@elasticmachine update branch

droberts195 · 2022-05-19T09:36:58Z

@elasticmachine update branch

droberts195 · 2022-05-20T13:21:44Z

@elasticmachine update branch

droberts195 · 2022-05-23T15:52:04Z

d5b0113 and 7589387 implement the changes of elastic/ml-cpp#2277 on the Java side.

droberts195 added the >enhancement, :ml (Machine learning), and v8.3.0 labels Apr 13, 2022

droberts195 added 4 commits April 13, 2022 16:05

Update docs/changelog/85872.yaml

96b6350

Some fixes

fe710a7

Merge remote-tracking branch 'origin/master' into rename_new_categorization_agg

baa1a3f

Skip REST compatibility tests for categorize_text

73d49a1

Although similar, the results have changed a little. This is acceptable as the functionality was experimental. The REST layer _is_ actually compatible; it's the functionality that's changed slightly.

droberts195 marked this pull request as ready for review April 20, 2022 10:44

elasticmachine added the Team:ML (Meta label for the ML team) label Apr 20, 2022

benwtrent self-requested a review April 20, 2022 11:27

benwtrent reviewed Apr 20, 2022

benwtrent reviewed Apr 21, 2022

droberts195 added 2 commits April 21, 2022 13:09

Throw exception in mixed version cluster instead

d6706a4

Merge remote-tracking branch 'origin/master' into rename_new_categorization_agg

7d36f67

benwtrent approved these changes Apr 21, 2022

droberts195 added 2 commits April 21, 2022 16:05

Adjust docs

2f53ce5

Merge remote-tracking branch 'origin/master' into rename_new_categorization_agg

2aa1746

Merge branch 'master' into rename_new_categorization_agg

40f6d4d

Merge branch 'master' into rename_new_categorization_agg

899305e

Merge branch 'master' into rename_new_categorization_agg

754198f

elasticmachine and others added 2 commits May 20, 2022 23:21

Merge branch 'master' into rename_new_categorization_agg

0b77573

Fixes and improvements

acc59f4

droberts195 added the cloud-deploy (Publish cloud docker image for Cloud-First-Testing) label May 22, 2022

droberts195 added 3 commits May 23, 2022 08:12

Don't discard tokens where IDs match but weights differ

d5b0113

Merge branch 'master' into rename_new_categorization_agg

fecd32c

Further weighting changes

7589387

More changes to sync with elastic/ml-cpp#2277

droberts195 mentioned this pull request May 23, 2022

[ML] Adjacency weighting fixes in categorization elastic/ml-cpp#2277

Merged

droberts195 merged commit 93bc2e3 into elastic:master May 23, 2022

droberts195 deleted the rename_new_categorization_agg branch May 23, 2022 17:46
