-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Replace the implementation of the categorize_text aggregation #85872
[ML] Replace the implementation of the categorize_text aggregation #85872
Conversation
This replaces the implementation of the `categorize_text` aggregation with the new algorithm that was added in elastic#80867. The new algorithm works in the same way as the ML C++ code used for categorization jobs. The docs are updated to reflect the workings of the new implementation.
Hi @droberts195, I've created a changelog YAML for you. |
Although similar the results have changed a little. This is acceptable as the functionality was experimental. The REST layer _is_ actually compatible, it's the functionality that's changed slightly.
Pinging @elastic/ml-core (Team:ML) |
@@ -362,6 +315,6 @@ public String getType() { | |||
|
|||
@Override | |||
public Version getMinimalSupportedVersion() { | |||
return Version.V_7_16_0; | |||
return Version.V_8_3_0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work when somebody does categorize_text
from a 7.16 node and it attempts to serialize to a 8.3 node?
I know this minimal support version prevents us from WRITING to older nodes, but what if an older node is writing to it?
We may need to add a clause at the top of stream input parsing to make sure the input stream is at least 8.3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point.
While I was making changes I realised that a newer node can actually use the new algorithm and respond to an older coordinating node in the format that the older node wants. So I have implemented that. The merging of categories on the coordinating node might not work brilliantly if different nodes have created their local categories in different ways, but it should be better than nothing.
With the changes of b3fe740 I think the communications between old and new nodes should at least be safe, even though the results might be a bit strange sometimes.
import java.util.stream.Collectors; | ||
|
||
import static org.elasticsearch.xpack.ml.aggs.categorization.CategorizationBytesRefHash.WILD_CARD_REF; | ||
|
||
public class InternalCategorizationAggregation extends InternalMultiBucketAggregation< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern is that, for some weird reason, older nodes did the aggregating and the coordinator is trying to read in old InternalCategorizationAggregation objects.
I would hope the getMinimalSupportedVersion
in the builder would prevent this as the builder implies the factory is created on that node. Which means that only nodes after 8.3.0 would receive the new builder and reply back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I've made the stream input and output safe for older nodes.
In the case of a coordinating node on an older version it's possible for the newer nodes to use the new algorithm and then serialize their results in the old format. So that scenario can work to some extent - probably not brilliantly as the merging might not work well for categories created in completely different ways, but better than nothing.
Like you say, the other way around should not be possible because of the minimal supported version, but I've made it safe if there's some unforeseen loophole.
1. If the coordinating node is 8.3 or higher then it won't allow the aggregation to be used if there are pre-8.3 nodes in the cluster. 2. If the coordinating node is pre-8.3 then the 8.3 or higher nodes will use the new algorithm but serialize the categories local to the node in the old format. The merging might not be great, but this should mean some categories can still be returned to the user.
// If the coordinating node is an older version then we might still receive messages from older | ||
// nodes. In this case we can send back results for this node created using the new algorithm. | ||
// They won't necessarily merge well with results from other nodes, but are better than nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since its experimental, I think we should throw. This is not a critical functionality, nothing systemic is using this aggregation. We should throw, especially since running in a mixed cluster environment is a temporary situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new internal agg implementation serializing back to the old coordinator will throw anyways (due to things not being read of the wire well). It would be cheaper for the users (and more obvious the cause of the issue) if we just throw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new internal agg implementation serializing back to the old coordinator will throw anyways (due to things not being read of the wire well).
It won't throw on the old coordinator with the changes of b3fe740 - I changed the new nodes to serialize in the format the old nodes will expect.
But still, you're probably right that a simple "not supported in mixed cluster" exception is better as it's less likely to cause people to waste time debugging strange results. I'll change the 4 methods to throw exceptions instead.
// Disallow this aggregation in mixed version clusters that cross the algorithm change boundary. | ||
if (out.getVersion().before(ALGORITHM_CHANGED_VERSION)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we MAY get this for free with the versioned named writeable stuff. But, keeping this here is cool with me.
@elasticmachine update branch |
@elasticmachine update branch |
@elasticmachine update branch |
@elasticmachine update branch |
d5b0113 and 7589387 implement the changes of elastic/ml-cpp#2277 on the Java side. |
This replaces the implementation of the
categorize_text
aggregationwith the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs.
The docs are updated to reflect the workings of the new implementation.