[Logs UI] categorisation setup screen #59005
Comments
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
Thank you for providing so many details, this looks great. A few thoughts come to mind:
Data quality criteria: It would probably be good to write down the specific criteria we want to use to determine whether there is "too little training data" or whether a dataset is "not suitable for categorization". In both cases I assume it would be evaluations of the count or cardinality? @sophiec20, can you provide any guidance on such quality criteria? IIRC you were considering emitting such warnings while running the ML jobs? Can we access these?
Well-known datasets: And what about the well-known filebeat module datasets, which we already know to be unsuitable? Do we want to hard-code a warning list for those?
Combination of warnings: From the UI perspective I wonder if the combined case "Too little training data and not suitable" should be displayed as two separate warnings? Otherwise the user might not be able to tell which is which, and the combinatorial complexity in the implementation grows, especially if we add more warnings in the future.
We discussed this and decided to have a single callout box but adjust the text accordingly. There is also a special case where one dataset has both problems. I think if it does not provide useful data for categorization we should not even mention too little training data - it won't be useful no matter how much data we have. @mukeshelastic will help with the wording (so the text in the issue description is likely to change).
@katrin-freihofner @weltenwort when we detect that a dataset lacks sufficient training data, we lack confidence in the displayed anomaly score. I wonder whether the appropriate user feedback is:
1. Show N/A or something similar in the anomaly score column for each category of the dataset where we detect this case.
2. Show a warning message at the top, exactly as Katrin suggested, but tweak the message to communicate the lack of confidence in the anomaly score and that it is therefore not displayed in the anomaly score column for the detected datasets.
In ML, we have the helper text (above) which is aimed at helping users understand what categorization is designed for. It would be good to align on this if possible - the final sentence anyway.
// too little training data
Anomaly detection learns from trained data. The probability of anomalies has already been adjusted according to the amount of training data seen. So I advise against Logs UI picking an arbitrary value which defines if enough training data has been seen. It depends on the data. It is not the case that we lack confidence in displaying the anomaly score because the model has already built this in. The proposal above links to the
// not suitable for categorization
In 7.7 we now have the following stats categorization stats. A Unfortunately, because categorization is not yet done on a per partition basis, then this status is also not yet partition aware. It gives a view of the overall job. This will be set to In 7.6 we had a basic log category check, which would raise an ML job message if 1000 or more categories existed for a job before 100 buckets of results have been created. Because the Logs UI job is partitioned and has
// well-known datasets
In the end, we did not add this into ML. We did not feel that the business logic ought to be written into the back-end APIs. However I would still think that it has value in the Logs UI application which already has logic in-built to handle different dataset types. This hard-coded list could be extended over time and based on telemetry. It would help with the web access log data which I suspect might be used with categorization but is actually structured data.
// combination of warnings
I do not believe that the "too little data" message should be a warning.
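As a rough illustration of the 7.6 category check mentioned in this comment (1000 or more categories before 100 buckets of results), the condition could be sketched as follows. The function and constant names are hypothetical and are not taken from the ML code base; only the two thresholds come from the comment.

```typescript
// Hypothetical sketch of the 7.6 check described in the comment.
// The names are illustrative; only the two thresholds come from the thread.
const CATEGORY_COUNT_THRESHOLD = 1000;
const BUCKET_COUNT_THRESHOLD = 100;

/**
 * Returns true when a job has already produced 1000 or more categories
 * before 100 buckets of results have been created.
 */
function shouldRaiseCategoryCountMessage(categoryCount: number, bucketCount: number): boolean {
  return categoryCount >= CATEGORY_COUNT_THRESHOLD && bucketCount < BUCKET_COUNT_THRESHOLD;
}
```

Note that this is a job-level condition; as the comment points out, it says nothing about individual partitions.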
One thing I forgot about: we do have a job validation check in the ML UI which pertains to too little data. If there are fewer than 25 buckets or 2 hrs of data (whichever is greater), then we warn prior to job creation that there is too little data for the model to be initialized, and therefore no anomalies will be written until sufficient data has been seen. In other words, there is no historical data to analyse and you'll have to wait for the job to continue in real-time until you start to see anomalies. I assumed the comments above were about the early lifetime of the job which comes after the model initialisation, but that may not have been the case.
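A minimal sketch of that validation rule, assuming the available data is described by a start and end timestamp in milliseconds and the job by its bucket span. The helper name and signature are hypothetical and do not reflect the actual ML UI implementation.

```typescript
// Hypothetical helper; the name and signature are assumptions for illustration.
const MIN_BUCKETS = 25;
const MIN_DURATION_MS = 2 * 60 * 60 * 1000; // 2 hours

/**
 * Returns true when the available time range is shorter than
 * 25 buckets or 2 hours, whichever is greater.
 */
function hasTooLittleDataForModelInit(
  dataStartMs: number,
  dataEndMs: number,
  bucketSpanMs: number
): boolean {
  const requiredMs = Math.max(MIN_BUCKETS * bucketSpanMs, MIN_DURATION_MS);
  return dataEndMs - dataStartMs < requiredMs;
}
```

With a 15-minute bucket span, for example, 25 buckets amount to 6.25 hours, so that is the larger requirement; with a 1-minute bucket span the 2-hour minimum applies instead.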
Thank you for the detailed response, @sophiec20! With the awesome new model stats it sounds like we could do something like the following:
Does that make sense? I think the idea behind warning about "too little data" would be to indicate that some datasets might never have enough documents for training due to their rare occurrence. But maybe that's not useful enough to confuse the user with that detail?
@weltenwort the steps 1-4 above sound good. In addition, due to These are other reasons that indicate that the message is not suited for categorizing.
These cannot be assessed using elasticsearch queries as they are metrics captured as we model. The ML UI categorization wizard does do some pre-flight data validations using For 7.7, so our end-users can get the most benefit from categorizing data that is categorize-able, then I think a pragmatic approach would be to
For beyond 7.7, then there are options for a smoother experience from the ml side, such as making |
Describe the feature
There are two cases where we need to improve the categorization UX:
For both scenarios, we want to display a warning to the user. With the button in this warning callout, the ML job setup can be updated.
Warning message (EuiCallout - Warning)
Too little training data
A single dataset
Title
[dataset.name] does not provide enough training data
Message
Longer periods of time will improve the categorization results for [dataset.name]. Update the configuration to improve your results. Learn more
Button
Update configuration
-> Links to setup screen
Multiple datasets
Title
Multiple datasets do not provide enough training data
Message
We have too little training data for the following datasets: [dataset.name], [dataset.name]. Longer periods of time will improve the categorization results. Learn more
Button
Update configuration
-> Links to setup screen
Data is not suitable for categorization
A single dataset
Title
[dataset.name] does not provide data for meaningful categorization
Message
Because of the structure the log messages in [dataset.name] have, they can not be categorized in a meaningful way. Update your job configuration to improve the results. Learn more.
Button
Update configuration
-> Links to setup screen
Multiple datasets
Title
Multiple datasets do not provide data for meaningful categorization
Message
Because of the structure the log messages in [dataset.name] have, they can not be categorized in a meaningful way. Update your job configuration to improve the results. Learn more.
Button
Update configuration
-> Links to setup screen
Too little training data and not suitable
Title
Multiple datasets don’t provide data for meaningful categorization or provide too little training data
Message
Because of the structure the log messages in [dataset.name], [dataset.name] and [dataset.name] have, they can not be categorized in a meaningful way or there is too little training data. Learn more.
Button
Update configuration
-> Links to setup screen
-> The "Learn more" links should point to a docs page. @mukeshelastic would you please provide the link?
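The callout variants above could be derived from a single list of per-dataset warnings, which keeps one callout box with adjusted text. The following is only a sketch: the types and the `getCalloutContent` helper are hypothetical, the strings are adapted from the texts above (the wording is still expected to change), and the "Learn more" link and button are left out.

```typescript
// Hypothetical types and helper; not taken from the Kibana code base.
type DatasetProblem = 'tooLittleTrainingData' | 'notSuitable';

interface DatasetWarning {
  datasetName: string;
  problem: DatasetProblem;
}

interface CalloutContent {
  title: string;
  message: string;
}

// Joins names as "a", "a and b" or "a, b and c".
function joinNames(names: string[]): string {
  if (names.length <= 1) {
    return names.join('');
  }
  return names.slice(0, -1).join(', ') + ' and ' + names[names.length - 1];
}

function getCalloutContent(warnings: DatasetWarning[]): CalloutContent | undefined {
  if (warnings.length === 0) {
    return undefined; // nothing to warn about, so no callout is rendered
  }
  const names = warnings.map((warning) => warning.datasetName);
  const subject = names.length === 1 ? names[0] : 'Multiple datasets';
  const verb = names.length === 1 ? 'does' : 'do';
  const hasTooLittle = warnings.some((w) => w.problem === 'tooLittleTrainingData');
  const hasNotSuitable = warnings.some((w) => w.problem === 'notSuitable');

  if (hasTooLittle && hasNotSuitable) {
    // Combined case: both kinds of problem occur across the datasets.
    return {
      title:
        'Multiple datasets do not provide data for meaningful categorization or provide too little training data',
      message: `The log messages in ${joinNames(names)} cannot be categorized in a meaningful way or there is too little training data.`,
    };
  }
  if (hasTooLittle) {
    return {
      title: `${subject} ${verb} not provide enough training data`,
      message: `Longer periods of time will improve the categorization results for ${joinNames(names)}.`,
    };
  }
  return {
    title: `${subject} ${verb} not provide data for meaningful categorization`,
    message: `Because of their structure, the log messages in ${joinNames(names)} cannot be categorized in a meaningful way.`,
  };
}
```

The sketch assumes each dataset appears at most once in the list. For a dataset with both problems, the caller would report it as `notSuitable` only, in line with the comment that too little training data should not be mentioned when the data cannot be categorized anyway.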
Setup screen
The changes in the setup screen affect the index selection. With the new version, it should be possible to select or deselect an index as well as each of the datasets within it individually.
Default selection
The default state will not change. When a user first enters the setup all indices and their datasets are selected.
Warning message (left column)
The warning message should be the same as described above for the categorization view. If there is too little training data for a dataset, the warning message appears.
Additionally, the alert icon shows which of the indices/datasets have problems (see screenshot above). Hovering over the icon shows a tooltip explaining the warning.
Index
Too little training data
One or more datasets in this index do not provide enough training data.
Data not suitable
One or more datasets in this index can not be categorized in a meaningful way.
Dataset
Too little training data
The dataset does not provide enough training data.
Data not suitable
The data in this dataset can not be categorized in a meaningful way.
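One possible way to derive the index-level tooltips from the state of the datasets within an index is sketched below. The type, the lookup tables and `getIndexTooltips` are hypothetical names, and the tooltip strings follow the texts above with light wording adjustments.

```typescript
// Hypothetical names; a sketch of how the tooltip texts could be looked up.
type CategorizationProblem = 'tooLittleTrainingData' | 'notSuitable';

const DATASET_TOOLTIPS: Record<CategorizationProblem, string> = {
  tooLittleTrainingData: 'The dataset does not provide enough training data.',
  notSuitable: 'The data in this dataset can not be categorized in a meaningful way.',
};

const INDEX_TOOLTIPS: Record<CategorizationProblem, string> = {
  tooLittleTrainingData: 'One or more datasets in this index do not provide enough training data.',
  notSuitable: 'One or more datasets in this index can not be categorized in a meaningful way.',
};

/**
 * Takes the problem (or undefined for "no problem") of every dataset in an
 * index and returns the index-level tooltip texts, in a fixed order.
 */
function getIndexTooltips(datasetProblems: Array<CategorizationProblem | undefined>): string[] {
  const order: CategorizationProblem[] = ['tooLittleTrainingData', 'notSuitable'];
  return order
    .filter((problem) => datasetProblems.indexOf(problem) !== -1)
    .map((problem) => INDEX_TOOLTIPS[problem]);
}
```

An index whose datasets have no problems yields an empty array, in which case no alert icon would be shown for it.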
Design issue
Figma file