Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add categorizer stats ML result type #57978

Merged
merged 2 commits into from
Jun 11, 2020

Conversation

droberts195
Copy link
Contributor

This type of result will store stats about how well categorization
is performing. When per-partition categorization is in use, separate
documents will be written for every partition so that it is possible
to see if categorization is working well for some partitions but not
others.

This PR is a minimal implementation to allow the C++ side changes to
be made. More Java side changes related to per-partition
categorization will be in followup PRs. However, even in the long
term I do not see a major benefit in introducing dedicated APIs for
querying categorizer stats. Like forecast request stats the
categorizer stats can be read directly from the job's results alias.

This type of result will store stats about how well categorization
is performing.  When per-partition categorization is in use, separate
documents will be written for every partition so that it is possible
to see if categorization is working well for some partitions but not
others.

This PR is a minimal implementation to allow the C++ side changes to
be made.  More Java side changes related to per-partition
categorization will be in followup PRs.  However, even in the long
term I do not see a major benefit in introducing dedicated APIs for
querying categorizer stats.  Like forecast request stats the
categorizer stats can be read directly from the job's results alias.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent self-requested a review June 11, 2020 11:30
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor things. Overall looks good.

Comment on lines +97 to +102
this.categorizedDocCount = categorizedDocCount;
this.totalCategoryCount = totalCategoryCount;
this.frequentCategoryCount = frequentCategoryCount;
this.rareCategoryCount = rareCategoryCount;
this.deadCategoryCount = deadCategoryCount;
this.failedCategoryCount = failedCategoryCount;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't check that all of these are > 0. This is probably OK since we are getting this data from C++ and there is an implicit trust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. This is dodgy when we're using VLong serialization. It means someone can DoS us by updating a document to contain a negative number.

This same problem applies to most of our other results classes too, so I think it's best to deal with it in a separate PR.

Like you say, the C++ won't send negative values for unsigned counters, so it's not a likely problem.

@droberts195 droberts195 merged commit 355958f into elastic:master Jun 11, 2020
@droberts195 droberts195 deleted the add_categorizer_stats branch June 11, 2020 16:59
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 11, 2020
This type of result will store stats about how well categorization
is performing.  When per-partition categorization is in use, separate
documents will be written for every partition so that it is possible
to see if categorization is working well for some partitions but not
others.

This PR is a minimal implementation to allow the C++ side changes to
be made.  More Java side changes related to per-partition
categorization will be in followup PRs.  However, even in the long
term I do not see a major benefit in introducing dedicated APIs for
querying categorizer stats.  Like forecast request stats the
categorizer stats can be read directly from the job's results alias.

Backport of elastic#57978
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 11, 2020
After merging elastic#58001 the BWC constants added in elastic#57978 are no
longer needed.
droberts195 added a commit that referenced this pull request Jun 12, 2020
This type of result will store stats about how well categorization
is performing.  When per-partition categorization is in use, separate
documents will be written for every partition so that it is possible
to see if categorization is working well for some partitions but not
others.

This PR is a minimal implementation to allow the C++ side changes to
be made.  More Java side changes related to per-partition
categorization will be in followup PRs.  However, even in the long
term I do not see a major benefit in introducing dedicated APIs for
querying categorizer stats.  Like forecast request stats the
categorizer stats can be read directly from the job's results alias.

Backport of #57978
droberts195 added a commit that referenced this pull request Jun 12, 2020
After merging #58001 the BWC constants added in #57978 are no
longer needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants