
[Logs UI] Include the dataset information in categorization warning message #60392

Closed
mukeshelastic opened this issue Mar 17, 2020 · 7 comments · Fixed by #75351
Assignees
Labels
Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.10.0
Milestone

Comments


mukeshelastic commented Mar 17, 2020

ℹ️ This has been split out of #59005.

UPDATE: ML has made it possible to return per-partition errors for problematic partitions, see: #60392 (comment)

Summary

When a dataset categorization job returns categorization_status = warn, we'd like to show a more meaningful warning message with a call to action that addresses the root cause of the warning.

If the status is warn, we will perform per-partition queries to determine which partitions likely cause the high rare-category count, or a high category count relative to the overall count. We will then display a warning message at the top, calling out the specific datasets that have categorization_status = warn. The message will also include a link to the job configuration; when clicked, it will show a warning indicator alongside the index that contains the problematic dataset. The warning message UI will be

[screenshot: warning message mockup]

and the job configuration UI will be

[screenshot: job configuration mockup]

Display a warning that summarizes the results.

ℹ️ Implementation hints

  • [ML] Add support for per-partition categorization jobs #74592 exposes the categorizer stats on both the results service and the HTTP API. The type definitions added therein could help with understanding the data structure.
  • If per-partition categorization is enabled, the categorization_status of the job is a summary of the individual categorizers' statuses. But because these are written independently, they might be temporarily inconsistent.
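The summary relationship described in the hints above can be sketched as a simple reduction over the per-partition stats. This is only an illustration of the idea, not ML's actual implementation; the field names mirror the shape of the categorizer_stats documents:

```python
# Sketch: derive a job-level categorization status from per-partition
# categorizer_stats documents. The "warn wins" reduction rule here is an
# illustrative assumption; field names mirror the categorizer_stats results.

def summarize_categorization_status(partition_stats):
    """Return 'warn' if any partition's categorizer reports warn, else 'ok'.

    partition_stats: list of dicts shaped like categorizer_stats results,
    e.g. {"partition_field_value": "nginx.access", "categorization_status": "ok"}
    """
    if any(s.get("categorization_status") == "warn" for s in partition_stats):
        return "warn"
    return "ok"

stats = [
    {"partition_field_value": "nginx.access", "categorization_status": "ok"},
    {"partition_field_value": "my_app.log", "categorization_status": "warn"},
]
```

Because the per-partition documents are written independently, a consumer should treat the summarized status as eventually consistent with the job-level field.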

Use-case description

TODO

@mukeshelastic mukeshelastic added the Feature:Logs UI Logs UI feature label Mar 17, 2020
@weltenwort weltenwort added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.8.0 labels Mar 18, 2020
@elasticmachine (Contributor) commented:

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@weltenwort weltenwort changed the title Include the dataset information in categorization warning message [Logs UI] Include the dataset information in categorization warning message May 4, 2020
@weltenwort (Member) commented:

The question of how to determine the categories responsible for the warning is still not resolved, so I would dispute this being ready.

As @sophiec20 helpfully suggested in #59005 (comment), this might be best achieved by enhancing the stats collected by the ML functionality while processing the documents.

@jasonrhodes (Member) commented:

A couple of things to update this for future prioritization:

  1. I've checked in with @sophiec20 about whether there is a ticket representing the work mentioned in the linked comment [Logs UI] categorisation setup screen #59005 (comment), which said:

For beyond 7.7, then there are options for a smoother experience from the ml side, such as making categorization_status partition aware or perhaps having some self correcting logic in the job to exclude partitions that are not suited or perhaps having a data validation endpoint.

We should make this ticket dependent on a real ticket that exists for the ML side, or if one doesn't exist, re-think this ticket in light of what's possible.

  2. To that end (doing what's possible now), that same linked comment also mentioned a few other ideas:

For 7.7, so our end-users can get the most benefit from categorizing data that is categorize-able, then I think a pragmatic approach would be to

  • Identify datasets where the category count is very high (likely common) and allow end-users to de-select these.
  • Identify well-known datasets that are not suited to categorization (likely common) and allow end-user to de-select these.
  • Educate end-user (via on-screen help) on what type of data is best suited to categorization and allow them to use their judgement to exclude datasets from being analyzed. Categorization works best on machine written log messages, typically logging written by a developer for the purpose of system troubleshooting.
  • Educate the end-user (via on-screen help) on what other reasons may have caused the job to be in a warn status and allow them to use their judgement to exclude datasets.

Have we done this already / are we interested in any of these improvements while we wait for the ML-side improvements?

cc @mukeshelastic @weltenwort

@jasonrhodes (Member) commented:

OK, I just heard back from ML about this (thanks @droberts195): there is a new value available in a job module's analysis_config block called per_partition_categorization, which can be added like this:

"per_partition_categorization": {
  "enabled": true,
  "stop_on_warn": true
}

as a sibling to categorization_field_name and detectors. This should return errors to us per partition_field_name, or event.dataset in our case.
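For context, a fuller analysis_config sketch might look like the following. Only the per_partition_categorization block comes from the comment above; the bucket span, the count detector on mlcategory, and the message field are illustrative assumptions:

```json
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "per_partition_categorization": {
      "enabled": true,
      "stop_on_warn": true
    },
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "event.dataset"
      }
    ]
  }
}
```

With stop_on_warn enabled, ML can stop categorizing the partitions that reach warn status while the rest of the job keeps running.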

@jasonrhodes (Member) commented:

So here are the decisions we still need to make, I think:

  • Can we query for this information about a job after the fact, to support the Job Configuration screen scenario described below, or should we just allow the job to be created and let ML remove the unwanted partitions behind the scenes?
  • How should we handle already created jobs which may have problematic datasets? They were created in beta so I'm inclined to say that we should just encourage folks to re-create jobs on every release rather than trying to message about that in the UI, until we are GA.
  • Are there any other improvements we need/want to make to educate users about problematic datasets? Seems like this new feature from ML is good enough that we don't need to worry so much about it?

@droberts195 (Contributor) commented:

> If the status is warn, we will perform per-partition queries to determine which partitions likely cause the high rare categories count or a high category count in respect to the overall count and then display a warning message at the top, calling out the specific datasets that have the categorization_status = warn.

For 7.10 you can just check the categorizer_stats result type of the ML job for each partition - added in elastic/elasticsearch#57978.

You won't need to do separate calculations to work out which dataset is responsible, as ML will tell you. If ML's current definition of warn status is bad in some way, then ideally we should change it once in the ML code (see https://github.com/elastic/ml-cpp/blob/b9a3e4b9e0cef324d21572881d6c7dcd3798baa4/lib/model/CTokenListDataCategorizerBase.cc#L609-L650), and then the Logs UI just uses the warn/ok flag to pick up the result of that calculation instead of doing something slightly different. We can work together to hone that definition if you see datasets that inappropriately end up with warn status, or inappropriately keep an ok status when they shouldn't.
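A sketch of such a lookup, as a Kibana Dev Tools console query. The index pattern, the example job_id, and the assumption that categorizer_stats documents are searchable this way are all illustrative; verify against the ML results index of your deployment:

```json
GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "my-log-entry-categories-count-job" } },
        { "term": { "result_type": "categorizer_stats" } },
        { "term": { "categorization_status": "warn" } }
      ]
    }
  }
}
```

Each matching document should carry a partition_field_value identifying the dataset whose categorizer reached warn status.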

@weltenwort (Member) commented:

@droberts195, the new categorizer stats look awesome and should help us a lot. 🤯

> Have we done this already / are we interested in any of these improvements while we wait for the ML-side improvements?

As to what we've done so far that didn't depend on the ML changes, we already implemented a setup enhancement that enables the user to (de)select specific datasets on job (re)creation.

> How should we handle already created jobs which may have problematic datasets? They were created in beta so I'm inclined to say that we should just encourage folks to re-create jobs on every release rather than trying to message about that in the UI, until we are GA.

We have a mechanism in place that informs the user about job definition changes in the UI and prompts for re-creation of the job.

To me it sounds like this is what we would have to do in order to take advantage of the new per-partition warnings:

  • If the job has per-partition categorization enabled, query the categorizer stats document for jobs with categorization status warn (as @droberts195 wrote).
  • Display an informative message on the results page that prompts the user to re-create without the problematic categories.
  • Display the per-dataset warnings in the setup screen during re-creation (as shown in the mockup above).

The new stop_on_warn parameter also looks extremely useful; if we include it in our job config, we will have to adapt the warning messages accordingly.
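The three-step plan above could be sketched roughly as follows. The document shape mirrors the categorizer_stats results; the helper names and the message wording are illustrative assumptions, not the actual Logs UI implementation:

```python
# Sketch: turn per-partition categorizer_stats documents into the warning
# message described above. Document fields mirror categorizer_stats results;
# the helper names and message text are illustrative assumptions.

def problematic_datasets(categorizer_stats_docs):
    """Collect dataset names (partition_field_value) whose status is warn."""
    return sorted({
        doc["partition_field_value"]
        for doc in categorizer_stats_docs
        if doc.get("categorization_status") == "warn"
    })

def warning_message(categorizer_stats_docs):
    """Build the results-page warning, or None if no partition is in warn."""
    datasets = problematic_datasets(categorizer_stats_docs)
    if not datasets:
        return None
    return (
        "Log analysis quality is degraded for the following datasets: "
        + ", ".join(datasets)
        + ". Consider re-creating the job without them."
    )

docs = [
    {"partition_field_value": "nginx.access", "categorization_status": "ok"},
    {"partition_field_value": "my_app.log", "categorization_status": "warn"},
]
```

The same list of problematic datasets could then drive the per-dataset warning indicators in the setup screen during re-creation.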
