Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Replace the implementation of the categorize_text aggregation #85872

Merged
merged 18 commits into from
May 23, 2022

Conversation

droberts195
Copy link
Contributor

This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs.

The docs are updated to reflect the workings of the new implementation.

This replaces the implementation of the `categorize_text` aggregation
with the new algorithm that was added in elastic#80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs.

The docs are updated to reflect the workings of the new implementation.
@elasticsearchmachine
Copy link
Collaborator

Hi @droberts195, I've created a changelog YAML for you.

Although similar the results have changed a little. This is
acceptable as the functionality was experimental.

The REST layer _is_ actually compatible, it's the functionality
that's changed slightly.
@droberts195 droberts195 marked this pull request as ready for review April 20, 2022 10:44
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Apr 20, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent self-requested a review April 20, 2022 11:27
@@ -362,6 +315,6 @@ public String getType() {

@Override
public Version getMinimalSupportedVersion() {
return Version.V_7_16_0;
return Version.V_8_3_0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work when somebody does categorize_text from a 7.16 node and it attempts to serialize to a 8.3 node?

I know this minimal support version prevents us from WRITING to older nodes, but what if an older node is writing to it?

We may need to add a clause at the top of stream input parsing to make sure the input stream is at least 8.3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point.

While I was making changes I realised that a newer node can actually use the new algorithm and respond to an older coordinating node in the format that the older node wants. So I have implemented that. The merging of categories on the coordinating node might not work brilliantly if different nodes have created their local categories in different ways, but it should be better than nothing.

With the changes of b3fe740 I think the communications between old and new nodes should at least be safe, even though the results might be a bit strange sometimes.

import java.util.stream.Collectors;

import static org.elasticsearch.xpack.ml.aggs.categorization.CategorizationBytesRefHash.WILD_CARD_REF;

public class InternalCategorizationAggregation extends InternalMultiBucketAggregation<
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is that, for some weird reason, older nodes did the aggregating and the coordinator is trying to read in old InternalCategorizationAggregation objects.

I would hope the getMinimalSupportedVersion in the builder would prevent this as the builder implies the factory is created on that node. Which means that only nodes after 8.3.0 would receive the new builder and reply back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've made the stream input and output safe for older nodes.

In the case of a coordinating node on an older version it's possible for the newer nodes to use the new algorithm and then serialize their results in the old format. So that scenario can work to some extent - probably not brilliantly as the merging might not work well for categories created in completely different ways, but better than nothing.

Like you say, the other way around should not be possible because of the minimal supported version, but I've made it safe if there's some unforeseen loophole.

1. If the coordinating node is 8.3 or higher then it won't allow
   the aggregation to be used if there are pre-8.3 nodes in the
   cluster.
2. If the coordinating node is pre-8.3 then the 8.3 or higher
   nodes will use the new algorithm but serialize the categories
   local to the node in the old format. The merging might not
   be great, but this should mean some categories can still be
   returned to the user.
Comment on lines 121 to 123
// If the coordinating node is an older version then we might still receive messages from older
// nodes. In this case we can send back results for this node created using the new algorithm.
// They won't necessarily merge well with results from other nodes, but are better than nothing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since its experimental, I think we should throw. This is not a critical functionality, nothing systemic is using this aggregation. We should throw, especially since running in a mixed cluster environment is a temporary situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new internal agg implementation serializing back to the old coordinator will throw anyways (due to things not being read of the wire well). It would be cheaper for the users (and more obvious the cause of the issue) if we just throw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new internal agg implementation serializing back to the old coordinator will throw anyways (due to things not being read of the wire well).

It won't throw on the old coordinator with the changes of b3fe740 - I changed the new nodes to serialize in the format the old nodes will expect.

But still, you're probably right that a simple "not supported in mixed cluster" exception is better as it's less likely to cause people to waste time debugging strange results. I'll change the 4 methods to throw exceptions instead.

Comment on lines +283 to +284
// Disallow this aggregation in mixed version clusters that cross the algorithm change boundary.
if (out.getVersion().before(ALGORITHM_CHANGED_VERSION)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we MAY get this for free with the versioned named writeable stuff. But, keeping this here is cool with me.

@droberts195
Copy link
Contributor Author

@elasticmachine update branch

@droberts195
Copy link
Contributor Author

@elasticmachine update branch

@droberts195
Copy link
Contributor Author

@elasticmachine update branch

@droberts195
Copy link
Contributor Author

@elasticmachine update branch

@droberts195 droberts195 added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label May 22, 2022
@droberts195
Copy link
Contributor Author

d5b0113 and 7589387 implement the changes of elastic/ml-cpp#2277 on the Java side.

@droberts195 droberts195 merged commit 93bc2e3 into elastic:master May 23, 2022
@droberts195 droberts195 deleted the rename_new_categorization_agg branch May 23, 2022 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants