
[ML] A text categorization aggregation that works like ML categorization #80867

Merged (24 commits into elastic:master, Apr 13, 2022)

Conversation

droberts195 (Contributor)

This PR adds a text categorization aggregation that uses the same
approaches as the categorization feature of ML anomaly detection
jobs.

droberts195 (Contributor, Author) commented Nov 19, 2021

At this time this is still very much a work-in-progress:

  • It's implemented as a new aggregation, categorize_text2, that exists in parallel to categorize_text. In the long run we clearly don't want both. We need to compare them and choose one to keep. If that's categorize_text2 then it will be renamed to categorize_text before release.
  • Memory accounting/circuit breaking is not properly implemented at present.
  • The code is clearly far more complex than the Drain algorithm used by the existing categorize_text algorithm, and transfers far more data between nodes per category in the reduce phase. Whether this is unacceptable in terms of resource usage needs to be determined.
  • There are some unit tests but they're very basic. Currently there are almost certainly bugs in the implementation. Testing on bigger data is required to smoke these out.

@droberts195 droberts195 marked this pull request as ready for review April 7, 2022 15:28
@elasticmachine elasticmachine added the Team:ML (Meta label for the ML team) label Apr 7, 2022
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

droberts195 (Contributor, Author)

Some more notes on this after another round of improvements:

  • Performance is now very close to that of the existing Drain-based categorize_text aggregation
  • There are now more tests, including an internal cluster test that proves that merging of results generated on different nodes works correctly
  • There are currently no docs
  • A key decision is whether we replace the existing experimental categorize_text aggregation with this, have an intermediate release where we ship both, or ship both indefinitely
  • What to do about docs and API specs obviously depends heavily on that decision
  • One option would be to merge this PR as-is, so that we have both aggregations available in parallel internally for a few weeks, then make a decision closer to 8.3.0 feature freeze and adjust docs and specs accordingly

One more note to reviewers: this PR is not really 80000 lines. Over 95% of these lines are in the categorization dictionary, which is an exact copy of the one we've been shipping for the C++ categorization code for many years.

@droberts195 droberts195 removed the WIP label Apr 11, 2022
elasticsearchmachine (Collaborator)

Hi @droberts195, I've created a changelog YAML for you.

@benwtrent benwtrent self-requested a review April 11, 2022 12:02
benwtrent (Member) left a comment

I will give it a second pass if/when we replace the current categorize_text with it.

Comment on lines +145 to +158
    public static CategorizationPartOfSpeechDictionary getInstance() throws IOException {
        if (instance != null) {
            return instance;
        }
        synchronized (INIT_LOCK) {
            if (instance == null) {
                try (InputStream is = CategorizationPartOfSpeechDictionary.class.getResourceAsStream(DICTIONARY_FILE_PATH)) {
                    instance = new CategorizationPartOfSpeechDictionary(is);
                }
            }
            return instance;
        }
    }
}
Member

++

I am not sure whether we need to add bytes to the circuit breaker for this. I would say that if it is near a MB we may want to.

Basically, getInstance could take the circuit breaker and add bytes when it first loads the dictionary, skipping the accounting when the dictionary is already loaded (since the bytes would already have been added). Those bytes then stay reserved for the lifetime of the node.
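As a rough illustration of that suggestion (not the PR's actual code): the CircuitBreaker interface below is a minimal stand-in for the real org.elasticsearch.common.breaker.CircuitBreaker, Dictionary stands in for CategorizationPartOfSpeechDictionary, and the byte count is a placeholder.

```java
class DictionarySingletonSketch {
    // Minimal stand-in for the real circuit breaker interface, for illustration only.
    interface CircuitBreaker {
        void addEstimateBytesAndMaybeBreak(long bytes, String label);
    }

    // Stand-in for the dictionary; the size is a placeholder, not a measurement.
    static final class Dictionary {
        final long sizeInBytes;
        Dictionary(long sizeInBytes) { this.sizeInBytes = sizeInBytes; }
    }

    private static final Object INIT_LOCK = new Object();
    private static volatile Dictionary instance;

    static Dictionary getInstance(CircuitBreaker breaker) {
        Dictionary local = instance;
        if (local != null) {
            return local; // already loaded: bytes were accounted for on first load
        }
        synchronized (INIT_LOCK) {
            if (instance == null) {
                Dictionary loaded = new Dictionary(1_000_000L); // pretend we parsed the file
                // Reserve the bytes exactly once; they stay for the lifetime of the node.
                breaker.addEstimateBytesAndMaybeBreak(loaded.sizeInBytes, "categorization-dictionary");
                instance = loaded;
            }
            return instance;
        }
    }
}
```

Because the accounting happens inside the synchronized block that performs the load, concurrent first callers cannot double-count the bytes.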

Contributor Author

I think it's best not to add it to the same circuit breaker used by the rest of the aggregation.

Although it's large, it's effectively static data, so it would make most sense to account for it with the "Accounting requests circuit breaker" rather than the "Request circuit breaker". But if indices.breaker.total.use_real_memory is set to true, which it is by default, then the "memory usage of things held in memory that are not released when a request is completed" is taken into account automatically.

I guess we could try to explicitly add it into the "Accounting requests circuit breaker" for the case where real memory circuit breaking is disabled. But this will be messy within the code as the code is written on the basis that what the docs refer to as "memory usage of things held in memory that are not released when a request is completed" is actually field data related to Lucene indices.

The docs also say about the total memory all circuit breakers can use: "Defaults to 70% of JVM heap if indices.breaker.total.use_real_memory is false. If indices.breaker.total.use_real_memory is true, defaults to 95% of the JVM heap." So that implies that if you don't use the real memory circuit breaker to measure fixed overheads then you have to allow some space for unmeasured fixed overheads. So I think this dictionary can be treated as one of those fixed overheads that either gets captured by the real memory circuit breaker or by implicitly reserving a percentage of memory.

/**
 * Matches the value used in <a href="https://github.com/elastic/ml-cpp/blob/main/lib/model/CTokenListReverseSearchCreator.cc">
 * <code>CTokenListReverseSearchCreator</code></a> in the C++ code.
 */
public static final int KEY_BUDGET = 10000;
Member

This could be configurable in the future (with probably this as the sensible default).

Comment on lines 265 to 284
while (commonIndex < commonUniqueTokenIds.size()) {
    TokenAndWeight commonTokenAndWeight = commonUniqueTokenIds.get(commonIndex);
    if (newIndex >= newUniqueTokenIds.size() || commonTokenAndWeight.getTokenId() < newUniqueTokenIds.get(newIndex).getTokenId()) {
        commonUniqueTokenWeight -= commonTokenAndWeight.getWeight();
        commonUniqueTokenIds.remove(commonIndex);
        changed = true;
    } else {
        TokenAndWeight newTokenAndWeight = newUniqueTokenIds.get(newIndex);
        if (commonTokenAndWeight.getTokenId() == newTokenAndWeight.getTokenId()) {
            if (commonTokenAndWeight.getWeight() == newTokenAndWeight.getWeight()) {
                ++commonIndex;
            } else {
                commonUniqueTokenWeight -= commonTokenAndWeight.getWeight();
                commonUniqueTokenIds.remove(commonIndex);
                changed = true;
            }
        }
        ++newIndex;
    }
}
Member

This may be a good place for a future optimization: https://stackoverflow.com/a/6103075/1818849

I am not sure how common remove would be, but iterating into a new array list may be much faster.

Member

Actually, we could call set with a null or static sentinel value, and then, if changed, iterate through creating a new array list.

This way we amortize the runtime to be O(N) instead of something worse due to shifting the indices multiple times when there are multiple tokens being removed.

Contributor Author

In that StackOverflow answer it looks like the "shift down" method is comparably fast. It has the benefit that the member variable can still be final, which is one less thing to worry about when writing other methods. I'll change it to use this method.

However, I doubt it will make much difference to the overall timings because what tends to happen is that very quickly the unique tokens get whittled down to the ones that will eventually define the category and then we don't make any further changes. So the first few merges result in removals but after that there aren't any more.
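For reference, the "shift down" approach from the linked StackOverflow answer can be sketched generically as below; the helper name removeInPlace is illustrative, not from the PR.

```java
import java.util.List;
import java.util.function.Predicate;

class ShiftDownSketch {
    // "Shift down" in-place removal: copy survivors toward the front in a single
    // pass, then trim the unused tail. Each element is touched a constant number
    // of times, so the whole operation is O(n) even when many elements are
    // removed, and the list object itself is reused, so the field holding it
    // can remain final.
    static <T> void removeInPlace(List<T> list, Predicate<T> shouldRemove) {
        int writeIndex = 0;
        for (int readIndex = 0; readIndex < list.size(); readIndex++) {
            T element = list.get(readIndex);
            if (shouldRemove.test(element) == false) {
                list.set(writeIndex++, element);
            }
        }
        // Remove leftover slots from the end, which is O(1) per removal for ArrayList.
        while (list.size() > writeIndex) {
            list.remove(list.size() - 1);
        }
    }
}
```

Compared with calling remove at arbitrary indices, this avoids repeatedly shifting the elements after each removal point.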

Comment on lines 465 to 467
public List<TokenAndWeight> getKeyTokenIds() {
    return baseWeightedTokenIds.stream().filter(this::isTokenIdCommon).collect(Collectors.toList());
}
Member

This seems wasteful. I wonder if we could use a better filter predicate, or something that takes a provided stateful predicate (one that keeps track of the budget).

I suppose USUALLY, this is not a big issue (as our budget is never exceeded), but in the rare case that it is, we allocate a fairly large list for no reason.

Member

Since this is only ever used with SerializableTokenListCategory, it might be good to make it smarter so it does the limitation due to budget.

Contributor Author

Yes, good point. I combined it all in SerializableTokenListCategory and moved the comment with the history there too.
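The budget-aware filtering discussed in this thread might look roughly like the following sketch. The names TokenAndWeight and keyTokens, and the weight-based budget semantics, are illustrative approximations only, not the PR's actual implementation (which lives in SerializableTokenListCategory and uses KEY_BUDGET).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class BudgetedKeyTokensSketch {
    record TokenAndWeight(int tokenId, int weight) {}

    // Collect common tokens only until the budget is spent, instead of
    // materialising the full filtered list first and truncating it afterwards.
    static List<TokenAndWeight> keyTokens(List<TokenAndWeight> tokens, IntPredicate isCommon, int budget) {
        List<TokenAndWeight> result = new ArrayList<>();
        int spent = 0;
        for (TokenAndWeight token : tokens) {
            if (isCommon.test(token.tokenId()) == false) {
                continue; // not a common token, so never a key token
            }
            if (spent + token.weight() > budget) {
                break; // budget exhausted: stop early rather than allocate more
            }
            spent += token.weight();
            result.add(token);
        }
        return result;
    }
}
```

In the usual case where the budget is never exceeded this behaves like the plain stream filter, but in the rare over-budget case it stops allocating as soon as the limit is hit.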

@droberts195 droberts195 requested a review from benwtrent April 12, 2022 16:13
benwtrent (Member) left a comment

I think it looks good as is. Would be nice to see the churn when replacing the current categorizer and do a final pass.

droberts195 (Contributor, Author)

> Would be nice to see the churn when replacing the current categorizer and do a final pass.

How about:

  1. Merge this now.
  2. Open a followup PR that renames it and adjusts the docs but don't merge that one yet. That one will show the code churn in the copied and pasted classes better.
  3. While there are two parallel options in the master branch, do some more comparisons and document the differences.
  4. Shortly before 8.3 feature freeze either merge the rename/docs adjustment PR or revert this one, so that there is only one categorization aggregation in the public 8.3 release.

@droberts195 droberts195 merged commit fede927 into elastic:master Apr 13, 2022
@droberts195 droberts195 deleted the categorize_text2 branch April 13, 2022 13:27
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Apr 13, 2022
This replaces the implementation of the `categorize_text` aggregation
with the new algorithm that was added in elastic#80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs.

The docs are updated to reflect the workings of the new implementation.
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Apr 13, 2022
…n/elasticsearch into datastream-reuse-pipeline-source

* 'datastream-reuse-pipeline-source' of github.com:weizijun/elasticsearch: (28 commits)
  Add JDK 19 to Java testing matrix
  [ML] add nlp config update serialization tests (elastic#85867)
  [ML] A text categorization aggregation that works like ML categorization (elastic#80867)
  [ML] Fix serialisation of text embedding updates (elastic#85863)
  TSDB: fix wrong initial value of tsidOrd in TimeSeriesIndexSearcher (elastic#85713)
  Enforce external id uniqueness during DesiredNode construction (elastic#84227)
  Fix Intellij integration (elastic#85866)
  Upgrade Azure SDK to version 12.14.4 (elastic#83884)
  [discovery-gce] Fix initialisation of transport in FIPS mode (elastic#85817)
  Remove unnecessary docs/changelog/85534.yaml
  Prevent ThreadContext header leak when sending response (elastic#68649)
  Add support for impact_areas to health impacts  (elastic#85830)
  Reduce port range re-use in tests (elastic#85777)
  Fix TranslogTests#testStats (elastic#85828)
  Remove hppc from cat allocation api (elastic#85842)
  Fix BuildTests serialization (elastic#85827)
  Use urgent priority for node shutdown cluster state update (elastic#85838)
  Remove Task classes from HLRC (elastic#85835)
  Remove unused migration classes (elastic#85834)
  Remove uses of Charset name parsing (elastic#85795)
  ...
droberts195 added a commit that referenced this pull request May 23, 2022
…85872)

This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs
(and now includes the fixes of elastic/ml-cpp#2277).

The docs are updated to reflect the workings of the new implementation.
Labels: >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.3.0
6 participants