match_only_text and general necessity for case insensitive and exact match (search) #1837
Comments
Thanks for the detailed issue, @neu5ron.
Hey @ebeahan, thanks for the reply. I especially appreciate some of the background/context for others. I would like to clear up a few things for the discussion going forward, and to preface that: I apologize for causing confusion by using the word "inaccuracy" (I have since updated the title and some of the wording of the original issue).
Desired State
For (cyber) security and logging use cases there is a necessity for:
Elasticsearch ECS Data Types
keyword
additional noteworthy info:
match_only_text
additional noteworthy info:
wildcard
additional noteworthy info:
ECS Data Type Usage
The 3 data types noted above are used within the ECS templates (going off the main branch as of commit hash 7496470bf422451744cef8308c1782baab8086bf)
Data Types for Desired State
1) Exact match
It should go without saying this is solved through the use of
2) Case insensitive
Solution is TBD... discussed in depth later.
3) Aggregations
This is relatively solved, as in the current state of ECS keyword or wildcard is used where appropriate and for the vast majority of fields.
4) Aggregation max character length
Separate topic.
Solving Case Insensitive Search
The recommendation that users can implement their own
You did mention
On these deployments/uses of Elastic that don't have this solved, one would think that it could be a lack of training, pro services, subscriptions, or the like. However, I have seen this issue across 15 Elastic deployments in just the last year, across just about every possible scenario:
The most common case is that users are just using the ECS templates and have no clue about the case sensitive search restrictions, or, more often, they think ECS has solved it given a few blog posts over the past year about wildcard and match_only_text. Not to mention that even if the ECS templates solved case insensitivity through wildcard, it is only used on 2% of the fields. The reason I bring up this non-technical background is that I think Elastic has not just the responsibility to solve this, but more than enough means to solve it, especially given that the (cyber) community has solved this and is more than willing to help. We all just want to get to searching and using the data and helping people do the same. With that said, is it possible to have a custom analyzer integrated into ECS?
Thanks @neu5ron for raising this issue so clearly and constructively. We agree that our users need and expect reliable, explainable methods for case insensitive and exact-match searching on ECS-compliant data, and that the current state presents several challenges as you've described. I have initiated an internal Elastic effort to focus on a holistic solution for our users, and we will share our ideas and plans with you and the rest of the ECS community as they develop. Thanks again for your continued contributions to ECS.
Some thoughts on this problem:
Thinking out loud, instead of switching everything to a different field type that has the semantics you want for queries, maybe another option would be to introduce a new query type that always matches on whole values, regardless of its actual type or of the configured analyzer.
For the past 3 years we've followed this process to add case insensitive search to indexes storing security logs
We've never had an issue querying keyword fields with the lowercase normaliser applied. It's a simple solution which is already supported in the stack. On the rare occasions where an operator might want to do a case sensitive search against security logs (maybe user-agent), ECS can implement a keyword multi-field without the lowercase normaliser applied. But everything else, like usernames, hostnames, domain names, URLs, process names, process paths, command line args... it can all be lowercase. EDR tools like CrowdStrike handle case insensitive search.
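The approach described in this comment could look roughly like the following index definition, written here as a Python dictionary that would be sent as the body of an index-creation request. This is only a sketch under assumptions: the normalizer name `lowercase_normalizer`, the sub-field name `cased`, and the choice of `process.command_line` as the example field are all hypothetical and not taken from ECS. The `normalize` helper is a local stand-in that only approximates what a lowercase normalizer does to a value.

```python
import json

# Hypothetical names: "lowercase_normalizer", the "cased" sub-field and the
# example field process.command_line are illustrative assumptions.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "process": {
                "properties": {
                    "command_line": {
                        # Whole-value keyword, lowercased at index and query time.
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer",
                        # Multi-field without the normalizer for the rare
                        # case sensitive search.
                        "fields": {"cased": {"type": "keyword"}},
                    }
                }
            }
        }
    },
}


def normalize(value: str) -> str:
    """Rough local approximation of a lowercase normalizer."""
    return value.lower()


if __name__ == "__main__":
    print(json.dumps(index_body, indent=2))
```

With a mapping of this shape, a term query against the main field would match regardless of case, while the `cased` sub-field keeps the original value for exact, case sensitive lookups.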
As a heads up, exposing case insensitivity in Kibana or KQL or EQL or Lucene does not fix the issue mentioned above with
Hey @neu5ron, is there a reason this was closed? It is something that should certainly be resolved, but I do not see PRs that would address the issues you have raised.
I realize that elastic/kibana#134143 is not meant to resolve the issues with
Please see the updated outline in #1837 (comment).
However, the below can be used to note some of the shortcomings of match_only_text.
Description
The match_only_text text data type causes undesired search results when wanting to perform accurate searches that are also case insensitive. This data type does not appear to be the optimal solution for security/log data. Searching for "ends with" or "starts with" does not respect positioning; most important is the lack of accuracy. When searching for characters other than numbers/letters (e.g. $, ., {, etc.) the search characters are ignored.
I understand the "solution" may be a custom analyzer or using wildcard. However, the problem is that ECS is being applied without customers/users understanding this issue or even changing the mappings. Thus the default, most commonly loaded and most widely used mappings are giving users inaccurate search results.
The desire for security use cases, and I assume most logging use cases, is a) case insensitivity and b) accurate/exact results.
I would recommend one of two things:
a) adopting a community text analyzer (see: https://github.com/neu5ron/es_stk), which has been adopted in projects such as Security Onion.
b) moving everything defined as match_only_text to wildcard.
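As a rough illustration of option b), the following sketch builds a mapping that uses the wildcard field type and a wildcard query with the `case_insensitive` parameter. The field name `cli` follows the example later in this issue; the variable names are hypothetical, and this is not the ECS template itself, only an assumption about how such a change might be expressed.

```python
import json

# Hypothetical mapping: the "cli" field stored as a wildcard field, so queries
# are matched against the whole original value rather than analyzed tokens.
wildcard_mapping = {
    "mappings": {
        "properties": {
            "cli": {"type": "wildcard"}
        }
    }
}

# Wildcard query: contains "rundll32.exe" and ends with "$", ignoring case.
wildcard_query = {
    "query": {
        "wildcard": {
            "cli": {
                "value": "*rundll32.exe*$",
                "case_insensitive": True,
            }
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(wildcard_mapping, indent=2))
    print(json.dumps(wildcard_query, indent=2))
```

Because the pattern is applied to the full stored value, characters such as `$` and the position of the match are part of the comparison.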
Example
Test Data
I loaded some sample values into Elasticsearch with an explicitly defined mapping of match_only_text on the field cli (note: I used the field cli, but it can be any field, whether ECS or not, as long as that mapping is applied).
Result to Find
The value I want to find is
C:\Users\test\rundll32.exe $
Search - in Human Form
I expect to find this result using the following logic (in human form):
contains rundll32.exe
ends with $
Search - in Elastic Form
After converting this logic into an actual Elastic query the syntax looks like:
cli.text:"*rundll32.exe*\$"
Search - Results
The search returns 6 matches when there should be only 1. There is only one occurrence where rundll32.exe ends with a $. Not only does it find results that don't end in $, it returns results that do not contain a $ at all.
Follow along complete test
https://github.com/neu5ron/es_stk/wiki#testing-yourself
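The behaviour reported in the search results can be approximated offline with a small Python sketch. It is not how Elasticsearch is implemented: the `tokens` function is a crude stand-in for an analyzer that lowercases and splits on anything that is not a letter or digit, `token_match` only checks that all query tokens are present, and the sample values other than `C:\Users\test\rundll32.exe $` are hypothetical and not the actual test data from the wiki.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical sample values; only the first one comes from the issue text.
VALUES = [
    r"C:\Users\test\rundll32.exe $",
    r"C:\Users\test\rundll32.exe",
    r"C:\Windows\System32\RUNDLL32.EXE /s",
    r"C:\Windows\System32\cmd.exe $",
]


def tokens(text: str) -> list:
    """Crude analyzer stand-in: lowercase, keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_match(value: str, query: str) -> bool:
    """True when every query token occurs somewhere in the value's tokens."""
    return set(tokens(query)) <= set(tokens(value))


def whole_value_match(value: str, pattern: str) -> bool:
    """Case insensitive glob match against the entire original value."""
    return fnmatchcase(value.lower(), pattern.lower())


token_hits = [v for v in VALUES if token_match(v, "rundll32.exe $")]
whole_hits = [v for v in VALUES if whole_value_match(v, "*rundll32.exe*$")]

if __name__ == "__main__":
    print("tokens of '$':", tokens("$"))
    print("token-based hits:", len(token_hits))
    print("whole-value hits:", len(whole_hits))
```

In this simplified model the `$` produces no token at all, so the token-based search cannot require it or its position, while the whole-value match returns only the single value that actually ends with `$`.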