match_only_text and general necessity for case insensitive and exact match (search) #1837

neu5ron · 2022-03-11T13:12:02Z

please see updated outline #1837 (comment)
however, the below can be used to note some of the shortcomings of match_only_text

Description

The match_only_text data type causes undesired search results when you want to perform accurate searches that are also case insensitive. This data type does not appear to be the best fit for security/log data.
Searching for "ends with" or "starts with" does not keep/respect positioning, and most importantly accuracy is lost: when searching for anything other than numbers/letters (e.g. $, ., {), the search characters are ignored.

I understand the "solution" may be for a custom analyzer or to use wildcard, however the problem is that ECS is being applied without customers/users understanding this issue OR even changing the mappings. Thus the default and most commonly loaded/widely used mappings are giving users inaccurate search results.

The desire for security use cases, and I assume most logging use cases, is a) case insensitivity and b) accurate/exact results.

I would recommend one of two things:
a) adopting a community text analyzer (see: https://github.com/neu5ron/es_stk). Which is adopted in things such as Security Onion.
b) moving everything defined as `match_only_text` to wildcard.

Example

Test Data

I loaded some sample values into Elasticsearch with an explicitly defined mapping of match_only_text on the field cli (note: I used the field cli, but it can be any field, whether ECS or not, as long as that mapping is applied).

POST /_bulk
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe"""}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe $"""}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rundll32.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rund1132.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rund1132.exe $"""}

Result to Find

The value I want to find is C:\Users\test\rundll32.exe $

Search - in Human Form

I expect to find this result using the following logic (in human form):

contains rundll32.exe
followed by a $

Search - in Elastic Form

After converting this logic into an actual Elastic query the syntax looks like:
cli.text:"*rundll32.exe*\$"

Search - Results

The search returns 6 matches when there should be only 1.
There is only one value in which rundll32.exe is followed by a $ at the end.
Not only does the search return values that don't end in $, it also returns values that do not contain a $ at all.
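A rough way to see where the six matches come from is to imitate the analysis step in plain Python. This is only a sketch under stated assumptions: it treats the default analyzer as "lowercase, then keep runs of letters and digits" (the real standard analyzer follows Unicode word-boundary rules and can keep some punctuation inside tokens on other inputs), and it models a phrase match as "the query tokens appear consecutively". The helper names `approx_tokens` and `phrase_match` are hypothetical.

```python
import re

# The eight test values from the bulk request above.
DOCS = [
    r"C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe",
    r"rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $",
    r"rundll32.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $",
]


def approx_tokens(text):
    """Crude stand-in for the default analyzer: lowercase, keep letter/digit runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(doc, query):
    """True if the query tokens occur consecutively, in order, in the doc tokens."""
    d, q = approx_tokens(doc), approx_tokens(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


QUERY = r"*rundll32.exe*\$"
hits = [doc for doc in DOCS if phrase_match(doc, QUERY)]

print(approx_tokens(QUERY))  # the '*', '\' and '$' never become tokens
print(len(hits))             # number of test values that "match"
```

Under this approximation the `$` and the wildcards disappear before matching, so every value containing rundll32.exe qualifies. The `_analyze` API is the authoritative way to inspect the tokens a real cluster produces.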

Follow along complete test

https://github.com/neu5ron/es_stk/wiki#testing-yourself


ebeahan · 2022-03-14T22:01:42Z

Thanks for the detailed issue, @neu5ron.

`match_only_text` vs. `text`

At first, I thought this issue was reporting search inaccuracies using match_only_text compared to text. However, I'm seeing the same search results (with the six matches) for either text or match_only_text fields.

Expand to see `text` vs `match_only_text` testing summary

Create `text` template

PUT _template/text
{
  "index_patterns": [
    "text"
  ],
  "mappings": {
    "properties": {
      "cli": {
        "type": "text"
      }
    }
  }
}

Create `match_only_text` template

PUT _template/match-only-text
{
  "index_patterns": [
    "match-only-text"
  ],
  "mappings": {
    "properties": {
      "cli": {
        "type": "match_only_text"
      }
    }
  }
}

Test data

PUT /_bulk
{ "index" : { "_index": "text"} }
{"cli": """C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "text"} }
{"cli": """C:\Users\test\rundll32.exe"""}
{ "index" : { "_index": "text"} }
{"cli": """rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "text"} }
{"cli": """C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "text"} }
{"cli": """C:\Users\test\rundll32.exe $"""}
{ "index" : { "_index": "text"} }
{"cli": """rundll32.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "text"} }
{"cli": """rund1132.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "text"} }
{"cli": """rund1132.exe $"""}
{ "index" : { "_index": "match-only-text"} }
{"cli": """C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "match-only-text"} }
{"cli": """C:\Users\test\rundll32.exe"""}
{ "index" : { "_index": "match-only-text"} }
{"cli": """rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "match-only-text"} }
{"cli": """C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "match-only-text"} }
{"cli": """C:\Users\test\rundll32.exe $"""}
{ "index" : { "_index": "match-only-text"} }
{"cli": """rundll32.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "match-only-text"} }
{"cli": """rund1132.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "match-only-text"} }
{"cli": """rund1132.exe $"""}

Check mappings

GET /text/_mapping/field/cli
{
  "text" : {
    "mappings" : {
      "cli" : {
        "full_name" : "cli",
        "mapping" : {
          "cli" : {
            "type" : "text"
          }
        }
      }
    }
  }
}

GET /match-only-text/_mapping/field/cli
{
  "match-only-text" : {
    "mappings" : {
      "cli" : {
        "full_name" : "cli",
        "mapping" : {
          "cli" : {
            "type" : "match_only_text"
          }
        }
      }
    }
  }
}

`text` search query

GET /text/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "cli": "*rundll32.exe*\\$"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}

`text` result

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 0.44975153,
    "hits" : [
      {
        "_index" : "text",
        "_id" : "YGtNiX8BmUhPRDsIFEBC",
        "_score" : 0.44975153,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe"""
        }
      },
      {
        "_index" : "text",
        "_id" : "Y2tNiX8BmUhPRDsIFEBC",
        "_score" : 0.44975153,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe $"""
        }
      },
      {
        "_index" : "text",
        "_id" : "YWtNiX8BmUhPRDsIFEBC",
        "_score" : 0.36145675,
        "_source" : {
          "cli" : "rundll32.exe   -exec       a bypass ^T^e^S^t  "
        }
      },
      {
        "_index" : "text",
        "_id" : "ZGtNiX8BmUhPRDsIFEBC",
        "_score" : 0.36145675,
        "_source" : {
          "cli" : "rundll32.exe $   -exec       a bypass ^T^e^S^t  "
        }
      },
      {
        "_index" : "text",
        "_id" : "X2tNiX8BmUhPRDsIFEBB",
        "_score" : 0.31506658,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  """
        }
      },
      {
        "_index" : "text",
        "_id" : "YmtNiX8BmUhPRDsIFEBC",
        "_score" : 0.31506658,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  """
        }
      }
    ]
  }
}

`match_only_text` search query

GET /match-only-text/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "cli": "*rundll32.exe*\\$"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}

`match_only_text` result

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "match-only-text",
        "_id" : "Z2tNiX8BmUhPRDsIFEBC",
        "_score" : 1.0,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  """
        }
      },
      {
        "_index" : "match-only-text",
        "_id" : "aGtNiX8BmUhPRDsIFEBC",
        "_score" : 1.0,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe"""
        }
      },
      {
        "_index" : "match-only-text",
        "_id" : "aWtNiX8BmUhPRDsIFEBC",
        "_score" : 1.0,
        "_source" : {
          "cli" : "rundll32.exe   -exec       a bypass ^T^e^S^t  "
        }
      },
      {
        "_index" : "match-only-text",
        "_id" : "amtNiX8BmUhPRDsIFEBC",
        "_score" : 1.0,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  """
        }
      },
      {
        "_index" : "match-only-text",
        "_id" : "a2tNiX8BmUhPRDsIFEBC",
        "_score" : 1.0,
        "_source" : {
          "cli" : """C:\Users\test\rundll32.exe $"""
        }
      },
      {
        "_index" : "match-only-text",
        "_id" : "bGtNiX8BmUhPRDsIFEBC",
        "_score" : 1.0,
        "_source" : {
          "cli" : "rundll32.exe $   -exec       a bypass ^T^e^S^t  "
        }
      }
    ]
  }
}

My results don't change the inaccuracies reported in the issue summary. However, I first wanted to better understand whether these issues are specific to match_only_text or are more a problem with the standard analyzer that both text and match_only_text use in this configuration.

Full-text search in ECS

@neu5ron, you're probably still well-aware of these facts from past ECS discussion. I'm adding this context here for the sake of completeness for anyone not as familiar with ECS. 😄

Currently, only two fields in ECS define match_only_text as their primary field mapping: message and error.message. Both fields are intended to hold more "human-readable" messages and are more likely to benefit from being analyzed. Several more ECS fields use a .text multi-field. Having an analyzed field can be valuable in addition to the primary keyword field (e.g., host.os.name has a host.os.name.text multi-field which uses match_only_text).

Most other ECS fields will use keyword with the remaining using wildcard. Keyword enables faster exact match filtering and aggregations (for Kibana visualizations). More details about why ECS uses this convention are covered in the docs.

Test data indexed using `keyword`

Indexing the same test data but using keyword instead of match_only_text or text didn't cause the same inaccuracies. I did adjust the original KQL search string:

`wildcard` data type in ECS

ECS has migrated some select fields to use the wildcard data type. Wildcard is a drop-in replacement for keyword, but it comes with additional indexing and storage considerations. This blog post describes the feature in-depth and has a detailed table comparing keyword vs. wildcard performance.

One of the wildcard migrated fields was process.command_line, which would typically contain the cli values used in some of this testing. More about the motivations to use wildcard on command-line fields can be found in the wildcard RFC proposal.

Wrap up

I'll wrap up by touching on some of the statements:

The desire for security use cases, and I assume most logging use cases, is a) case insensitivity and b) accurate/exact results.

@neu5ron Can you confirm if you're experiencing search issues with match_only_text not seen using text? Or is the issue that both types use the standard analyzer with grammar-based tokenization?

a) adopting a community text analyzer (see: https://github.com/neu5ron/es_stk). Which is adopted in things such as Security Onion.

I know this is a topic that's been discussed before.

There are some early ideas about defining schemas for specific use-cases. As that work evolves, it might open up more flexibility for more use-case-specific settings, like the analyzers linked.

b) moving everything defined as `match_only_text` to wildcard.

Both text and match_only_text are for full-text search, but wildcard isn't. Wildcard is a specialized keyword type. Wildcard fields can help lessen some of the pain points of keyword or a .text multi-field, but they aren't a complete replacement.

neu5ron · 2022-03-21T08:02:45Z

Hey @ebeahan thanks for the reply. Especially appreciate some of the background/context for others.
An additional article that may help some as well: https://socprime.com/blog/elastic-for-security-analysts-part-1-searching-strings/

I would like to try to clear up a few things for the discussion going forward and to preface that - I apologize for causing confusion by using the word "inaccuracy" (to which I updated the title and some of the wording of the original issue).
This will also answer the confusion I caused whether it was a standard analyzer issue.

Desired State

For (cyber) security and logging use cases there is necessity of:

1) (Search) exact match.
   For things such as symbols, numbers, punctuation, etc. (i.e. anything that is not just words/letters).
2) (Search) case insensitive.
3) Aggregations.
   (For visualizations, AKA anything outside of match/search.) I think this mostly goes without saying, but as the point gets brought up later I wanted to note it.
4) Aggregation max character length.
   Not to be confused with maximum character length related to the ability to just search/match the data, but rather to display/return data in a visualization/ML/etc. which requires use of an aggregation. This is a separate discussion that has been noted in other ECS issues (Increasing of ignore_above for keyword #105) and is alleviated in use of certain data types, however when this discussion gets brought up it tends to take away from the points of 1) and 2).

Elasticsearch ECS Data Types

keyword

exact match: yes
case insensitive: no
aggregations: yes
aggregation max character length: limited at 32766

additional noteworthy info:

con:
- it is case sensitive, so case insensitive matching requires use of regex (worth mentioning: the regex syntax is not the full PCRE spec)
- aggregation max character length. keep for separate topic that should not be used as a weight in the discussion
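The regex workaround mentioned for keyword can be sketched as follows. This is one common approach, not something prescribed in this thread: every ASCII letter is expanded into a two-character class and every symbol is escaped. It assumes a regex engine that accepts character classes and backslash escapes and that matches against the whole value; Python's `re.fullmatch` is used here only as a stand-in, and the function name `case_insensitive_regex` is hypothetical.

```python
import re


def case_insensitive_regex(literal):
    """Build a pattern that matches `literal` regardless of ASCII letter case."""
    parts = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")  # e.g. 'r' -> '[rR]'
        elif ch.isalnum():
            parts.append(ch)                              # digits and non-ASCII letters as-is
        else:
            parts.append("\\" + ch)                       # escape every symbol literally
    return "".join(parts)


# "anything, then rundll32.exe followed by a space and a $ at the end"
pattern = ".*" + case_insensitive_regex("rundll32.exe $")
print(pattern)

# Whole-value matching, similar in spirit to a regexp query on a keyword field.
print(bool(re.fullmatch(pattern, r"C:\Users\test\RunDLL32.EXE $")))
print(bool(re.fullmatch(pattern, r"C:\Users\test\rundll32.exe")))
```

The pattern grows quickly and is hard to read, which is part of why this is listed as a con.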

match_only_text

exact match: no
case insensitive: yes
aggregations: no
aggregation max character length: n/a

additional noteworthy info:

con:
- will not match exact/precise of anything that is not a letter/word (shown in thread of this github issue)

wildcard

exact match: yes
case insensitive: no
aggregations: yes
aggregation max character length: unlimited (relatively) at 2147483647

additional noteworthy info:

con:
- use of case insensitive flag is not available in the main components and overall user facing components of Kibana such as Discover or Visualization.
pro:
- aggregation max character length. separate topic that should not be used as a weight in the discussion

ECS Data Type Usage

The 3 noted data types from above are used within the ECS templates (going off the main branch as of commit hash 7496470bf422451744cef8308c1782baab8086bf):

keyword: 985 (max character (byte) length set to 1024 or lower on 984 of the 985 fields)
match_only_text: 61
wildcard: 18

Data Types for Desired State

1) Exact match

It should go without saying this is solved through the use of wildcard OR keyword
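To illustrate why these two types give exact results: a pattern is compared against the entire stored value, so symbols take part in the match. The sketch below imitates that in plain Python, assuming `*` means any run of characters and `?` means one character (escape handling is left out). The function `wildcard_match`, its `case_insensitive` argument, and the pattern are hypothetical and only for illustration.

```python
import re

# The eight test values from the original report.
DOCS = [
    r"C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe",
    r"rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $",
    r"rundll32.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $",
]


def wildcard_match(value, pattern, case_insensitive=False):
    """Match `pattern` against the whole value; '*' = any run, '?' = one character."""
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern
    )
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.fullmatch(regex, value, flags) is not None


# "contains rundll32.exe and ends with a $"
matches = [doc for doc in DOCS if wildcard_match(doc, "*rundll32.exe*$")]
print(matches)

# Case still matters unless it is explicitly relaxed.
print(wildcard_match(r"C:\Users\test\RUNDLL32.EXE $", "*rundll32.exe*$"))
print(wildcard_match(r"C:\Users\test\RUNDLL32.EXE $", "*rundll32.exe*$", case_insensitive=True))
```

Only the single value ending in `$` is returned here, which is the behavior point 1) asks for, while the last two lines show that point 2) remains open.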

2) Case insensitive

Solution is TBD... discussed in depth later.
However, we know that keyword is not the solution and match_only_text seems like it was meant to be an option for case insensitive search, but as outlined it is not useful for security/cyber/logging use cases as it strips/ignores/removes symbols/punctuation/etc.

3) Aggregations

keyword or wildcard

this is relatively solved, as already in the state of ECS keyword or wildcard is used where appropriate and for the vast majority of fields.

4) Aggregation max character length

separate topic

Solving Case Insensitive Search

The recommendation that users can implement their own custom (text) analyzer is great but it's just not happening and in return users have a false sense of security because they are now running searches expecting results that they will never get.
Even the use of wildcard as the solution has become problematic. I think it's evident it is not the solution given the fact it is only used 18 times despite its release over 1 1/2 years ago.

You did mention "There are some early ideas about defining schemas for specific use-cases", but I would like to say this case insensitive search situation has been around for 18+ months. Also, I would assume there are not a ton of ECS implementations outside of cyber. Or more to the point, cyber has always been the main use case (especially given the drivers/employees/etc. within Elastic who created/maintain/oversee/contributed to ECS). Sure, I can see that data normalization and a schema are not specific to cyber and therefore should obviously not be squandered and ruined for all other possible use cases, so correct me if I am wrong in assuming ECS is primarily cyber focused in relation to the early adoption ideas.

On these deployments/uses of Elastic that don't have this solved, one would think that it could be lack of training, pro services, subscriptions, or the like. However, I have seen this issue across 15 Elastic deployments in just the last year, across just about every possible scenario:

paid elastic subscriptions and OSS
ECE, self, and or Elastic Cloud
Professional Services provided by Elastic themselves, an internal team, or other 3rd Party
Very intelligent professionals who are well versed in (cyber) detection
Very intelligent professionals who are well versed with the Elastic stack
Industries such as Health, Gov/Fed, Transportation, University, Financial, etc

The most common thing is that users are just using the ECS templates and have no clue about case sensitive search restrictions, or, more so, they think ECS has solved it given a few blog posts over the past year in relation to wildcard and match_only_text. Not to mention, even if ECS templates solved case insensitivity through wildcard, it is only used on 2% of the fields.

The reason I bring up this non technical background for the solution, is I think Elastic has not just the responsibility to solve this but they more than have the means to solve the issue. Especially given the (cyber) community has solved this and is more than willing to help. We all just want to get to searching and using the data and helping people do the same.

With that said, is it possible to have a custom analyzer integrated into ECS?
I think if it was possible to get the standard analyzer for text field types promoted into its very own data type in the foundational Elasticsearch mappings, then ECS using a custom analyzer does not seem outside the realm of possibility.

Etc

(data) type	type count
date	60
object	8
keyword	985
wildcard	18
long	118
ip	15
match_only_text	61
geo_point	8
boolean	21
flattened	11
scaled_float	3
constant_keyword	3
float	5
nested	13
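For anyone who wants to check the proportions, the shares can be recomputed from the counts in the table above. This is a small sketch using only those numbers; the variable names are arbitrary.

```python
# Field counts per data type, copied from the table above.
counts = {
    "date": 60, "object": 8, "keyword": 985, "wildcard": 18, "long": 118,
    "ip": 15, "match_only_text": 61, "geo_point": 8, "boolean": 21,
    "flattened": 11, "scaled_float": 3, "constant_keyword": 3,
    "float": 5, "nested": 13,
}

total = sum(counts.values())

# Share of all fields for the three types discussed in this thread.
for name in ("keyword", "match_only_text", "wildcard"):
    share = counts[name] / total
    print(f"{name}: {counts[name]} of {total} fields ({share:.1%})")
```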

MikePaquette · 2022-03-25T12:25:39Z

Thanks @neu5ron for raising this issue so clearly and constructively. We agree that our users need and expect reliable, explainable methods for case insensitive and exact-match searching on ECS-compliant data, and that the current state presents several challenges as you've described.

I have initiated an internal Elastic effort to focus on a holistic solution for our users, and we will share our ideas and plans with you and the rest of the ECS community as they develop. Thanks again for your continued contributions to ECS.

jpountz · 2022-04-06T16:52:43Z

Some thoughts on this problem:

One problem with using a different analyzer is that we'd be forcing everyone to perform exact search. I'm sure that some users want exact search at times, but I also wouldn't be surprised if some users appreciated running Google-like searches on their logs where they don't have to care much about how data is formatted.
Also I'm unsure if a different analyzer is the right solution. When I read the problem statement, it actually feels to me like you would like to have a way to perform exact search regardless of the analyzer that is configured.

Thinking out loud, instead of switching everything to a different field type that has the semantics you want for queries, maybe another option would be to introduce a new query type that always matches on whole values, regardless of its actual type or of the configured analyzer.

mbudge · 2022-04-12T17:53:01Z

For the past 3 years we've followed this process to add case insensitive search to indexes storing security logs

Export index mapping/component template from python
Use a python script to add the lowercase normaliser to all keyword fields
Set up the beats indexes using our modified templates

We've never had an issue querying keyword fields with the lowercase normaliser applied.

It's a simple solution which is already supported in the stack.
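The script itself is not shown in this thread, so the following is only a minimal sketch of what such a step could look like. It assumes the exported mapping is a plain dict with nested `properties` and optional multi-fields under `fields`, and that a normalizer named `lowercase` is available to the index. The function name `add_normalizer` and the sample mapping are hypothetical.

```python
import copy
import json


def add_normalizer(mapping, normalizer="lowercase"):
    """Return a copy of `mapping` with `normalizer` set on every keyword field."""
    result = copy.deepcopy(mapping)  # leave the exported template untouched

    def walk(node):
        for field in node.get("properties", {}).values():
            if field.get("type") == "keyword":
                field.setdefault("normalizer", normalizer)
            for sub in field.get("fields", {}).values():  # multi-fields
                if sub.get("type") == "keyword":
                    sub.setdefault("normalizer", normalizer)
            walk(field)  # descend into object/nested sub-properties

    walk(result)
    return result


# Hypothetical fragment of an exported template mapping.
mapping = {
    "properties": {
        "user": {"properties": {"name": {"type": "keyword", "ignore_above": 1024}}},
        "message": {"type": "match_only_text"},
        "host": {"properties": {"os": {"properties": {
            "name": {"type": "keyword",
                     "fields": {"text": {"type": "match_only_text"}}},
        }}}},
    }
}

print(json.dumps(add_normalizer(mapping), indent=2))
```

Because `setdefault` is used, fields that already declare a normalizer keep it. A normalizer is applied when documents are indexed, so the modified template would need to be in place before the data is written.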

On the rare occasions where an operator might want to do a case sensitive search against security logs (maybe user-agent), ECS can implement a keyword multi-field without the lowercase normaliser applied. But everything else, like usernames, hostnames, domain names, URLs, process names, process paths, command line args... it can all be lowercase.

EDR tools like crowdstrike handle case insensitive search.

neu5ron · 2022-11-10T19:38:17Z

as a heads up, exposing case insensitivity in Kibana or KQL or EQL or Lucene does not fix the issue mentioned above with match_only_text field types

rwaight · 2022-11-15T14:39:30Z

Hey @neu5ron, is there a reason this was closed? It is something that should certainly be resolved, but I do not see PRs that would address the issues you have raised.

as a heads up, exposing case insensitivity in Kibana or KQL or EQL or Lucene does not fix the issue mentioned above with match_only_text field types

I realize that elastic/kibana#134143 is not meant to resolve the issues with match_only_text field types, but I still opened that issue in hopes of improving the experience within Kibana and providing users with the ability to toggle the case_insensitive option.

neu5ron added the bug Something isn't working label Mar 11, 2022

neu5ron changed the title ~~match_only_text text data type causes many inaccuracies in search~~ match_only_text and general necessity for case insensitive and exact match (search) Mar 21, 2022

rwaight mentioned this issue Jun 9, 2022

Expose Elasticsearch case insensitivity in EQL/KQL elastic/kibana#134143

Open

neu5ron closed this as completed Nov 10, 2022
