-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] A text categorization aggregation that works like ML categorization #80867
Conversation
At this time this is still very much a work-in-progress:
|
This PR adds a text categorization aggregation that uses the same approaches as the categorization feature of ML anomaly detection jobs.
f7491d4
to
c79c1b7
Compare
(In C++ it also worked on strings)
By far the slowest thing was the RAM usage estimation. This is now cached in the two classes where performance matters most.
Pinging @elastic/ml-core (Team:ML) |
Some more notes on this after another round of improvements:
One more note to reviewers: this PR is not really 80000 lines. Over 95% of these lines are in the categorization dictionary, which is an exact copy of the one we've been shipping for the C++ categorization code for many years. |
Hi @droberts195, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will give it a second pass if/when we replace the current categorize_text with it.
public static CategorizationPartOfSpeechDictionary getInstance() throws IOException { | ||
if (instance != null) { | ||
return instance; | ||
} | ||
synchronized (INIT_LOCK) { | ||
if (instance == null) { | ||
try (InputStream is = CategorizationPartOfSpeechDictionary.class.getResourceAsStream(DICTIONARY_FILE_PATH)) { | ||
instance = new CategorizationPartOfSpeechDictionary(is); | ||
} | ||
} | ||
return instance; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
I am not sure we need to add bytes to the circuit breaker for this or not. I would say if it is near a MB we may want to.
Basically, getInstance
could take the circuit breaker and add bytes if it is loaded, ignoring it if not (since it would have already added bytes). And those bytes just stay for the lifetime of the node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's best not to add it to the same circuit breaker used by the rest of the aggregation.
Although it's large it's effectively static data, so it would make most sense to include it with the "Accounting requests circuit breaker" rather than the "Request circuit breaker". But if indices.breaker.total.use_real_memory
is set to true
, which it is by default, then that "memory usage of things held in memory that are not released when a request is completed" will take it into account automatically.
I guess we could try to explicitly add it into the "Accounting requests circuit breaker" for the case where real memory circuit breaking is disabled. But this will be messy within the code as the code is written on the basis that what the docs refer to as "memory usage of things held in memory that are not released when a request is completed" is actually field data related to Lucene indices.
The docs also say about the total memory all circuit breakers can use: "Defaults to 70% of JVM heap if indices.breaker.total.use_real_memory
is false. If indices.breaker.total.use_real_memory
is true, defaults to 95% of the JVM heap." So that implies that if you don't use the real memory circuit breaker to measure fixed overheads then you have to allow some space for unmeasured fixed overheads. So I think this dictionary can be treated as one of those fixed overheads that either gets captured by the real memory circuit breaker or by implicitly reserving a percentage of memory.
...n/java/org/elasticsearch/xpack/ml/aggs/categorization2/CategorizeTextAggregationBuilder.java
Show resolved
Hide resolved
...n/java/org/elasticsearch/xpack/ml/aggs/categorization2/CategorizeTextAggregationBuilder.java
Show resolved
Hide resolved
.../java/org/elasticsearch/xpack/ml/aggs/categorization2/InternalCategorizationAggregation.java
Show resolved
Hide resolved
* Matches the value used in <a href="https://github.com/elastic/ml-cpp/blob/main/lib/model/CTokenListReverseSearchCreator.cc"> | ||
* <code>CTokenListReverseSearchCreator</code></a> in the C++ code. | ||
*/ | ||
public static final int KEY_BUDGET = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be configurable in the future (with probably this as the sensible default).
while (commonIndex < commonUniqueTokenIds.size()) { | ||
TokenAndWeight commonTokenAndWeight = commonUniqueTokenIds.get(commonIndex); | ||
if (newIndex >= newUniqueTokenIds.size() || commonTokenAndWeight.getTokenId() < newUniqueTokenIds.get(newIndex).getTokenId()) { | ||
commonUniqueTokenWeight -= commonTokenAndWeight.getWeight(); | ||
commonUniqueTokenIds.remove(commonIndex); | ||
changed = true; | ||
} else { | ||
TokenAndWeight newTokenAndWeight = newUniqueTokenIds.get(newIndex); | ||
if (commonTokenAndWeight.getTokenId() == newTokenAndWeight.getTokenId()) { | ||
if (commonTokenAndWeight.getWeight() == newTokenAndWeight.getWeight()) { | ||
++commonIndex; | ||
} else { | ||
commonUniqueTokenWeight -= commonTokenAndWeight.getWeight(); | ||
commonUniqueTokenIds.remove(commonIndex); | ||
changed = true; | ||
} | ||
} | ||
++newIndex; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a good place for a future optimization: https://stackoverflow.com/a/6103075/1818849
I am not sure how common remove
would be, but iterating into a new array list may be much faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we could call set
with a NULL or static value, then if changed, iterate through creating a new array list.
This way we amortize the runtime to be O(N)
instead of something worse due to shifting the indices multiple times when there are multiple tokens being removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that StackOverflow answer it looks like the "shift down" method is comparably fast. It has the benefit that the member variable can still be final
, which is one less thing to worry about when writing other methods. I'll change it to use this method.
However, I doubt it will make much difference to the overall timings because what tends to happen is that very quickly the unique tokens get whittled down to the ones that will eventually define the category and then we don't make any further changes. So the first few merges result in removals but after that there aren't any more.
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization2/TokenListCategory.java
Outdated
Show resolved
Hide resolved
public List<TokenAndWeight> getKeyTokenIds() { | ||
return baseWeightedTokenIds.stream().filter(this::isTokenIdCommon).collect(Collectors.toList()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wasteful. I wonder if we could do a better filter predicate or something that adjusts given a provided stateful predicate (keeping track of the budget).
I suppose USUALLY, this is not a big issue (as our budget is never exceeded), but in the rare case that it is, we allocate a fairly large list for no reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is only ever used with SerializableTokenListCategory
, it might be good to make it smarter so it does the limitation due to budget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I combined it all in SerializableTokenListCategory
and moved the comment with the history there too.
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization2/TokenListCategory.java
Outdated
Show resolved
Hide resolved
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization2/TokenListCategory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good as is. Would be nice to see the churn when replacing the current categorizer and do a final pass.
How about:
|
This replaces the implementation of the `categorize_text` aggregation with the new algorithm that was added in elastic#80867. The new algorithm works in the same way as the ML C++ code used for categorization jobs. The docs are updated to reflect the workings of the new implementation.
…n/elasticsearch into datastream-reuse-pipeline-source * 'datastream-reuse-pipeline-source' of github.com:weizijun/elasticsearch: (28 commits) Add JDK 19 to Java testing matrix [ML] add nlp config update serialization tests (elastic#85867) [ML] A text categorization aggregation that works like ML categorization (elastic#80867) [ML] Fix serialisation of text embedding updates (elastic#85863) TSDB: fix wrong initial value of tsidOrd in TimeSeriesIndexSearcher (elastic#85713) Enforce external id uniqueness during DesiredNode construction (elastic#84227) Fix Intellij integration (elastic#85866) Upgrade Azure SDK to version 12.14.4 (elastic#83884) [discovery-gce] Fix initialisation of transport in FIPS mode (elastic#85817) Remove unnecessary docs/changelog/85534.yaml Prevent ThreadContext header leak when sending response (elastic#68649) Add support for impact_areas to health impacts (elastic#85830) Reduce port range re-use in tests (elastic#85777) Fix TranslogTests#testStats (elastic#85828) Remove hppc from cat allocation api (elastic#85842) Fix BuildTests serialization (elastic#85827) Use urgent priority for node shutdown cluster state update (elastic#85838) Remove Task classes from HLRC (elastic#85835) Remove unused migration classes (elastic#85834) Remove uses of Charset name parsing (elastic#85795) ...
…85872) This replaces the implementation of the categorize_text aggregation with the new algorithm that was added in #80867. The new algorithm works in the same way as the ML C++ code used for categorization jobs (and now includes the fixes of elastic/ml-cpp#2277). The docs are updated to reflect the workings of the new implementation.
This PR adds a text categorization aggregation that uses the same
approaches as the categorization feature of ML anomaly detection
jobs.