[DOCS] Adds text about data types to the categorization docs (elastic…

szabosteve authored and lcawl committed Jan 17, 2020
1 parent e3e2082 commit e74b271

Showing 1 changed file with 37 additions and 14 deletions:
docs/reference/ml/anomaly-detection/categories.asciidoc
[role="xpack"]
[[ml-configuring-categories]]
=== Categorizing data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. However, categorization
doesn't work equally well on all data types. It works best on machine-written
messages and application output, typically data that consists of repeated
elements, such as log messages used for system troubleshooting. Log
categorization groups unstructured log messages into categories; you can then
use {anomaly-detect} to model and identify rare or unusual counts of log
message categories.
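
For example, after messages are categorized, a detector like the following
sketch can find unusual message categories; `mlcategory` is the by-field name
that references the categorization results, and the choice of the `rare`
function here is illustrative:

[source,js]
----
{
  "function" : "rare",
  "by_field_name" : "mlcategory"
}
----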

Categorization is tuned to work best on data like log messages by taking token
order into account, not considering synonyms, and including stop words in its
analysis. Complete sentences in human communication or literary text (for
example emails, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results on such human-generated data: the categorization job
would create so many categories that they couldn't be handled effectively.
Categorization is _not_ natural language processing (NLP).

[float]
[[ml-categorization-log-messages]]
==== Categorizing log messages

Application log events are often unstructured and contain variable data.
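For example, a hypothetical message of this kind (the exception names and the
embedded SQL statement are illustrative):

----
org.jdbi.v2.exceptions.UnableToExecuteStatementException:
java.sql.SQLTimeoutException: Statement cancelled due to timeout
[statement:"SELECT id, customer_id, status FROM orders WHERE status='PENDING'",
located:"OrderDao.java:112"]
----

Messages like this contain a static skeleton (the exception and the surrounding
text) plus variable data (the SQL statement and identifiers), which is what
categorization groups on.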
You can use the `categorization_filters` property to exclude sections of the
field value from being considered when defining categories. The categorization
filters are applied in the order they are listed in the job configuration, which
allows you to disregard multiple sections of the categorization field value. In
the following example, we have decided that we do not want the detailed SQL to
be considered in the message categorization. This particular categorization
filter removes the SQL statement from the categorization algorithm.
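
A sketch of such a job follows; the job ID, the field names, and the exact
regular expression are illustrative:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name" : "message",
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }],
    "categorization_filters" : [ "\\[statement:.*\\]" ]
  },
  "analysis_limits" : {
    "categorization_examples_limit" : 5
  },
  "data_description" : {
    "time_field" : "time"
  }
}
----

The regular expression masks the bracketed `statement` section of the message
before tokenization, so only the static parts of the message determine its
category.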

If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties in {kib}.
NOTE: To add the `categorization_examples_limit` property, you must use the
*Edit JSON* tab and copy the `analysis_limits` object from the API example
above.

[float]
[[ml-configuring-analyzer]]
===== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported.
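
You can change that behavior by specifying the `categorization_analyzer`
property in the job configuration. The following sketch shows one way to do so;
the job ID and the exact filter pattern are illustrative, and the stopword list
is representative:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_new_logs2
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }],
    "categorization_analyzer" : {
      "char_filter" : [
        { "type" : "pattern_replace", "pattern" : "\\[statement:.*\\]" } <1>
      ],
      "tokenizer" : "ml_classic", <2>
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
          "Saturday", "Sunday", "Mon", "Tue", "Wed", "Thu", "Fri",
          "Sat", "Sun", "January", "February", "March", "April",
          "May", "June", "July", "August", "September", "October",
          "November", "December", "Jan", "Feb", "Mar", "Apr", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "data_description" : {
    "time_field" : "time"
  }
}
----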
<1> The `pattern_replace` character filter here achieves exactly the same as
the `categorization_filters` in the first example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property
value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain dates, you
might get better results by filtering the day or month words in your language
instead.
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
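
For example, the following validation request is a sketch (the configuration is
hypothetical) that specifies only a tokenizer, so no character filters or token
filters are applied at all:

[source,console]
----
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic"
    },
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "time"
  }
}
----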

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters].

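A representative definition follows; the exact regular expressions and the
stopword list are illustrative rather than verbatim:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }],
    "categorization_analyzer" : {
      "tokenizer" : {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter" : [
        { "type" : "pattern_replace", "pattern" : "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern" : "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern" : "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern" : "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords" : [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
          "Saturday", "Sunday", "Mon", "Tue", "Wed", "Thu", "Fri",
          "Sat", "Sun", "January", "February", "March", "April",
          "May", "June", "July", "August", "September", "October",
          "November", "December", "Jan", "Feb", "Mar", "Apr", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "GMT", "UTC"
        ] }
      ]
    }
  },
  "data_description" : {
    "time_field" : "time"
  }
}
----
<1> Tokens consist of alphanumeric characters, underscores, hyphens, and dots.
<2> Tokens that begin with a digit are replaced with the empty string and
effectively removed.
<3> Tokens that consist entirely of hexadecimal digits, dots, and hyphens are
effectively removed.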
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language.

[float]
[[ml-viewing-categories]]
===== Viewing categorization results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}, for example in the *Anomaly Explorer*.
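
You can also retrieve the category definitions through the API; a minimal
sketch with an illustrative job ID:

[source,console]
----
GET _ml/anomaly_detectors/it_ops_new_logs/results/categories
{
  "page" : { "size" : 1 }
}
----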
