[DOCS] Adds text about data types and categorization to Anomaly Detection overview page #809

Closed
wants to merge 10 commits into from
22 changes: 22 additions & 0 deletions docs/en/stack/ml/anomaly-detection/categorization-data.asciidoc
@@ -0,0 +1,22 @@
[role="xpack"]
[[ml-datatypes-categorization]]
=== Data types and categorization

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. However, categorization
doesn't work equally well on all data types. It works best on machine-written
messages and application output that typically consist of repeated elements,
such as log messages written for system troubleshooting. Log categorization
groups unstructured log messages into categories; you can then use
{anomaly-detect} to model and identify rare or unusual counts of log message
categories. For more information about the process, see
{ml-docs}/ml-configuring-categories.html[Categorizing log messages].
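
As a rough illustration of the workflow described above, the following sketch
builds a request body for an {anomaly-detect} job that counts messages per
category. The field name `message`, the `30m` bucket span, and the `@timestamp`
time field are assumptions chosen for the example, not values taken from this
page. Treat it as a starting point, not a recommended configuration.

```python
import json

# Hypothetical job body. The field name "message", the 30m bucket span, and
# the "@timestamp" time field are illustrative assumptions.
job_body = {
    "description": "Counts of categorized log messages",
    "analysis_config": {
        "bucket_span": "30m",
        # The text field whose values are tokenized and grouped into categories.
        "categorization_field_name": "message",
        "detectors": [
            # Model the count of messages in each category.
            {"function": "count", "by_field_name": "mlcategory"}
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Serialized body, as it could be sent to the create anomaly detection jobs API.
request_body = json.dumps(job_body, indent=2)
print(request_body)
```

The detector splits its analysis on `mlcategory`, so each category of log
message gets its own count model.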

Categorization is tuned to work best on data like log messages: it takes token
order into account, doesn't consider synonyms, and includes stop words in its
analysis. Complete sentences in human communication or literary text (for
example emails, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Because categorization is tuned for machine
data, it gives poor results on such human-generated data. For example, the
categorization job would create so many categories that they couldn't be handled
effectively. Categorization is _not_ natural language processing (NLP).
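
The following toy sketch illustrates why machine-written messages collapse into
a few categories while human-written sentences don't. It is _not_ the algorithm
that {ml} uses: it simply drops tokens that contain non-letters (as a stand-in
for variable parts such as IP addresses and IDs) and groups messages whose
remaining tokens match exactly and in the same order. The `tokenize` and
`categorize` helpers and the sample messages are hypothetical.

```python
from collections import defaultdict


def tokenize(message):
    """Split on whitespace, trim basic punctuation, and keep only alphabetic
    tokens. Token order is preserved and stop words are kept."""
    tokens = (tok.strip(".,!?") for tok in message.split())
    return tuple(tok for tok in tokens if tok.isalpha())


def categorize(messages):
    """Group messages whose ordered token sequences are identical."""
    categories = defaultdict(list)
    for message in messages:
        categories[tokenize(message)].append(message)
    return dict(categories)


logs = [
    "Connection to host 10.0.0.1 timed out after 30 ms",
    "Connection to host 10.0.0.7 timed out after 45 ms",
    "User 1001 logged in",
    "User 2002 logged in",
]
prose = [
    "The server stopped responding this morning.",
    "This morning the server stopped responding.",
    "Our machine quit answering earlier today.",
]

log_categories = categorize(logs)
prose_categories = categorize(prose)
print(len(log_categories), "categories for", len(logs), "log messages")
print(len(prose_categories), "categories for", len(prose), "sentences")
```

The four log messages fall into two categories, whereas the three sentences,
which say roughly the same thing with different word order or synonyms, each
end up in a category of their own.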
1 change: 1 addition & 0 deletions docs/en/stack/ml/anomaly-detection/overview.asciidoc
@@ -6,3 +6,4 @@ include::analyzing.asciidoc[]

include::forecasting.asciidoc[]

include::categorization-data.asciidoc[]