
[Logs UI] Include the dataset information in categorization warning message #60392

Closed
mukeshelastic opened this issue Mar 17, 2020 · 7 comments · Fixed by #75351
Assignees
Labels
Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.10.0
Milestone

Comments


mukeshelastic commented Mar 17, 2020

ℹ️ This has been split out of #59005.

UPDATE: ML has made it possible to return per-partition errors for problematic partitions, see: #60392 (comment)

Summary

When a dataset categorization job returns categorization_status = warn, we'd like to show a more meaningful warning message with a call to action that addresses the root cause of the warning.

If the status is warn, we will perform per-partition queries to determine which partitions likely cause the high rare-category count, or a high category count relative to the overall count. We will then display a warning message at the top, calling out the specific datasets that have categorization_status = warn. The message will also include a link to the job configuration; when clicked, it will show a warning indicator alongside the index that contains the problematic dataset. The warning message UI will be

[screenshot: warning message mockup]

and the job configuration UI will be

[screenshot: job configuration mockup]

Display a warning that summarizes the results.

ℹ️ Implementation hints

  • [ML] Add support for per-partition categorization jobs #74592 exposes the categorizer stats on both the results service and the HTTP API. The type definitions added therein could help with understanding the data structure.
  • If per-partition categorization is enabled, the categorization_status of the job is a summary of the individual categorizers' statuses. But because these are written independently, they might be temporarily inconsistent.
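The summary relationship described in the hints above can be sketched as a simple reduction over the per-partition stats. This is only an illustration of the idea, not ML's actual implementation; the field names mirror the shape of the categorizer_stats documents:

```python
# Sketch: derive a job-level categorization status from per-partition
# categorizer_stats documents. The "warn wins" reduction rule here is an
# illustrative assumption; field names mirror the categorizer_stats results.

def summarize_categorization_status(partition_stats):
    """Return 'warn' if any partition's categorizer reports warn, else 'ok'.

    partition_stats: list of dicts shaped like categorizer_stats results,
    e.g. {"partition_field_value": "nginx.access", "categorization_status": "ok"}
    """
    if any(s.get("categorization_status") == "warn" for s in partition_stats):
        return "warn"
    return "ok"

stats = [
    {"partition_field_value": "nginx.access", "categorization_status": "ok"},
    {"partition_field_value": "my_app.log", "categorization_status": "warn"},
]
```

Because the per-partition documents are written independently, a consumer should treat the summarized status as eventually consistent with the job-level field.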

Use-case description

TODO

@mukeshelastic mukeshelastic added the Feature:Logs UI Logs UI feature label Mar 17, 2020
@weltenwort weltenwort added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.8.0 labels Mar 18, 2020
@elasticmachine (Contributor) commented:

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@weltenwort weltenwort changed the title Include the dataset information in categorization warning message [Logs UI] Include the dataset information in categorization warning message May 4, 2020
@weltenwort (Member) commented:

The question of how to determine the categories responsible for the warning is still not resolved, so I would dispute this being ready.

As @sophiec20 helpfully suggested in #59005 (comment), this might be best achieved by enhancing the stats collected by the ML functionality while processing the documents.

@jasonrhodes (Member) commented:

A couple of things to update this for future prioritization:

  1. I've checked in with @sophiec20 about whether there is a ticket representing the work mentioned in the linked comment [Logs UI] categorisation setup screen #59005 (comment), which said:

For beyond 7.7, then there are options for a smoother experience from the ml side, such as making categorization_status partition aware or perhaps having some self correcting logic in the job to exclude partitions that are not suited or perhaps having a data validation endpoint.

We should make this ticket dependent on a real ticket that exists for the ML side, or if one doesn't exist, re-think this ticket in light of what's possible.

  2. To that end (doing what's possible now), that same linked comment also mentioned a few other ideas:

For 7.7, so our end-users can get the most benefit from categorizing data that is categorize-able, then I think a pragmatic approach would be to

  • Identify datasets where the category count is very high (likely common) and allow end-users to de-select these.
  • Identify well-known datasets that are not suited to categorization (likely common) and allow end-user to de-select these.
  • Educate end-user (via on-screen help) on what type of data is best suited to categorization and allow them to use their judgement to exclude datasets from being analyzed. Categorization works best on machine written log messages, typically logging written by a developer for the purpose of system troubleshooting.
  • Educate the end-user (via on-screen help) on what other reasons may have caused the job to be in a warn status and allow them to use their judgement to exclude datasets.

Have we done this already / are we interested in any of these improvements while we wait for the ML-side improvements?

cc @mukeshelastic @weltenwort

@jasonrhodes (Member) commented:

OK, I just heard back from ML about this (thanks @droberts195): there is a new value available in a job module's analysis_config block called per_partition_categorization, which can be added like this:

"per_partition_categorization": {
  "enabled": true,
  "stop_on_warn": true
}

as a sibling to categorization_field_name and detectors. This should return errors to us per partition_field_name, or event.dataset in our case.
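For context, a fuller analysis_config sketch might look like the following. Only the per_partition_categorization block comes from the comment above; the bucket span, the count detector on mlcategory, and the message field are illustrative assumptions:

```json
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "per_partition_categorization": {
      "enabled": true,
      "stop_on_warn": true
    },
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "event.dataset"
      }
    ]
  }
}
```

With stop_on_warn enabled, ML can stop categorizing the partitions that reach warn status while the rest of the job keeps running.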

@jasonrhodes (Member) commented:

So here are the decisions we still need to make, I think:

  • Can we query for this information about a job after the fact, to support the Job Configuration screen scenario described below, or should we just allow the job to be created and let ML remove the unwanted partitions behind the scenes?
  • How should we handle already created jobs which may have problematic datasets? They were created in beta so I'm inclined to say that we should just encourage folks to re-create jobs on every release rather than trying to message about that in the UI, until we are GA.
  • Are there any other improvements we need/want to make to educate users about problematic datasets? Seems like this new feature from ML is good enough that we don't need to worry so much about it?

@droberts195 (Contributor) commented:

> If the status is warn, we will perform per-partition queries to determine which partitions likely cause the high rare categories count or a high category count in respect to the overall count and then display a warning message at the top, calling out the specific datasets that have the categorization_status = warn.

For 7.10 you can just check the categorizer_stats result type of the ML job for each partition - added in elastic/elasticsearch#57978.

You won't need to do separate calculations to work out which dataset is responsible, as ML will tell you. If ML's current definition of warn status is bad in some way, then ideally we should change it once in the ML code (see https://github.com/elastic/ml-cpp/blob/b9a3e4b9e0cef324d21572881d6c7dcd3798baa4/lib/model/CTokenListDataCategorizerBase.cc#L609-L650), and then the Logs UI just uses the warn/ok flag to pick up the result of that calculation instead of doing something slightly different. We can work together to hone that definition if you see datasets that inappropriately end up with warn status, or inappropriately keep an ok status when they shouldn't.
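A sketch of such a lookup, as a Kibana Dev Tools console query. The index pattern, the example job_id, and the assumption that categorizer_stats documents are searchable this way are all illustrative; verify against the ML results index of your deployment:

```json
GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "my-log-entry-categories-count-job" } },
        { "term": { "result_type": "categorizer_stats" } },
        { "term": { "categorization_status": "warn" } }
      ]
    }
  }
}
```

Each matching document should carry a partition_field_value identifying the dataset whose categorizer reached warn status.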

@weltenwort (Member) commented:

@droberts195, the new categorizer stats look awesome and should help us a lot. 🤯

> Have we done this already / are we interested in any of these improvements while we wait for the ML-side improvements?

As to what we've done so far that didn't depend on the ML changes, we already implemented a setup enhancement that enables the user to (de)select specific datasets on job (re)creation.

> How should we handle already created jobs which may have problematic datasets? They were created in beta so I'm inclined to say that we should just encourage folks to re-create jobs on every release rather than trying to message about that in the UI, until we are GA.

We have a mechanism in place that informs the user about job definition changes in the UI and prompts for re-creation of the job.

To me it sounds like this is what we would have to do in order to take advantage of the new per-partition warnings:

  • If the job has per-partition categorization enabled, query the categorizer stats document for jobs with categorization status warn (as @droberts195 wrote).
  • Display an informative message on the results page that prompts the user to re-create without the problematic categories.
  • Display the per-dataset warnings in the setup screen during re-creation (as shown in the mockup above).

The new stop_on_warn parameter also looks extremely useful; if we include it in our job config, we will have to adapt the warning messages accordingly.
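The three-step plan above could be sketched roughly as follows. The document shape mirrors the categorizer_stats results; the helper names and the message wording are illustrative assumptions, not the actual Logs UI implementation:

```python
# Sketch: turn per-partition categorizer_stats documents into the warning
# message described above. Document fields mirror categorizer_stats results;
# the helper names and message text are illustrative assumptions.

def problematic_datasets(categorizer_stats_docs):
    """Collect dataset names (partition_field_value) whose status is warn."""
    return sorted({
        doc["partition_field_value"]
        for doc in categorizer_stats_docs
        if doc.get("categorization_status") == "warn"
    })

def warning_message(categorizer_stats_docs):
    """Build the results-page warning, or None if no partition is in warn."""
    datasets = problematic_datasets(categorizer_stats_docs)
    if not datasets:
        return None
    return (
        "Log analysis quality is degraded for the following datasets: "
        + ", ".join(datasets)
        + ". Consider re-creating the job without them."
    )

docs = [
    {"partition_field_value": "nginx.access", "categorization_status": "ok"},
    {"partition_field_value": "my_app.log", "categorization_status": "warn"},
]
```

The same list of problematic datasets could then drive the per-dataset warning indicators in the setup screen during re-creation.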
