
Add support for wildcard field type #5639

Closed
epiphone opened this issue Dec 27, 2022 · 16 comments · Fixed by #13461
Labels
enhancement (Enhancement or improvement to existing feature or request) · feature (New feature or request) · help wanted (Extra attention is needed) · Indexing (Indexing, Bulk Indexing and anything related to indexing) · Search:Query Capabilities · Search (Search query, autocomplete ...etc) · v2.15.0 (Issues and PRs related to version 2.15.0)

Comments

@epiphone

Elasticsearch added the wildcard field type in v7.9. Are there any plans to support the field type in OpenSearch?
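For context, declaring the field type in Elasticsearch looks like the following mapping (index and field names here are placeholders, not from the original question):

```json
{
  "mappings": {
    "properties": {
      "my_wildcard_field": {
        "type": "wildcard"
      }
    }
  }
}
```

The field then accepts `wildcard` and `regexp` queries directly, without an analyzer configuration.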

Thanks!

@epiphone epiphone added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 27, 2022
@tlfeng
Collaborator

tlfeng commented Dec 27, 2022

A clarification for others: Elasticsearch added the wildcard field type as an X-Pack feature in version 7.9.0, so it was not an open-source feature at that point. See the Elasticsearch 7.9 user guide: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/keyword.html#wildcard-field-type

@tlfeng tlfeng added the Indexing Indexing, Bulk Indexing and anything related to indexing label Dec 27, 2022
@dblock dblock changed the title wildcard field type Add support for wildcard field type Dec 30, 2022
@macrakis

@epiphone Could you describe your use case?
I do understand that it speeds up wildcard searches, but why is wildcard search performance critical in your application? How many unique values are in your dataset and how big are they?
Are there any good workarounds?

@josefschiefer27

Some good use cases for wildcard:

  • Matching error messages and stack traces
  • Matching URL and file paths
  • Matching fields that have encoded content (e.g. "(8-10)||86128||Women's Apparel||...")

@epiphone
Author

@macrakis my use case is a large index of user-submitted names where most names are short (<100 characters) and unique, and I want to query the names by arbitrary substrings.

As a workaround I'm using an ngram tokenizer which works well enough but is more complicated to set up than the wildcard field type.
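For comparison, an ngram-based workaround along those lines might look like the sketch below (index settings, names, and gram sizes are illustrative, not the poster's actual configuration):

```json
{
  "settings": {
    "index.max_ngram_diff": 1,
    "analysis": {
      "tokenizer": {
        "substring_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "substring_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "substring_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```

Note the extra moving parts this involves: gram sizes must be tuned, the index grows with every stored gram, and query terms longer than `max_gram` need additional handling. That setup overhead is exactly what the wildcard field type avoids.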

@macrakis

Josef, Epiphone, thanks for your answers -- very helpful!

So it sounds like you need to find arbitrary substrings in your corpus, not just strings starting at token boundaries.

That would be not just "clerc" in "Leclerc", but also "ecle", not just "org/open" in "server/src/main/java/org/opensearch/index/query" but also "ense" in that pathname.

Could the problems be solved by different tokenization?

@josefschiefer27

josefschiefer27 commented Jan 31, 2023

What makes the 'wildcard' data type nice is that it is optimized for wildcard and regexp queries on fields with large values or high cardinality, without changing the search experience (e.g. searching via *ense*) and without worrying about tokenization.
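As a sketch, a substring search like the *ense* example above is just a standard wildcard query against the field (field name hypothetical):

```json
{
  "query": {
    "wildcard": {
      "my_wildcard_field": {
        "value": "*ense*"
      }
    }
  }
}
```

The same query also works against a plain keyword field; the point of the wildcard type is that it answers such leading-wildcard queries efficiently instead of scanning every term.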

@stevesimpson418

stevesimpson418 commented Feb 8, 2023

We have a use case for this where we need to index large XML documents that are > 32766 bytes. Our users want to be able to search for a string in an XML document, e.g. *failuremessage122*, or just *failure*, or even *fail*. A keyword field type would make sense (despite the poor leading-wildcard performance), but this is not possible because the values exceed the 32766-byte limit.
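For background, 32766 bytes is Lucene's cap on a single indexed term. A keyword mapping either rejects longer values at index time or, if `ignore_above` is set, silently leaves them out of the index, so they become unsearchable either way. A minimal sketch (field name and threshold are illustrative):

```json
{
  "mappings": {
    "properties": {
      "xml_body": {
        "type": "keyword",
        "ignore_above": 8191
      }
    }
  }
}
```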

Tokenisation is also problematic with XML docs; we also hit token explosion, with > 10000 terms generated when using most of the analyzers.

The XML logstash filter was considered but has similar issues with large documents producing a huge amount of fields. We don't always know ahead of time which elements we need to search for so that pre-processing of data isn't really an option.

Support for a "wildcard" field type would really improve our user experience

@vindurriel

The wildcard field type has been supported sans X-Pack since Elasticsearch 7.11:
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/keyword.html#wildcard-field-type
any plans to support it in OpenSearch?

@dblock
Member

dblock commented Mar 3, 2023

AFAIK nobody is working on this.

If someone wants to give it a shot, there are folks contributing a flattened field type via #1018, and it looks like there's a draft PR in #6507 that can be used as inspiration.

Please note that we cannot accept any code from ES > 7.10.2, which was the last version under APLv2. Would welcome an independent implementation that doesn't look at anything under an incompatible license.

@macohen macohen added help wanted Extra attention is needed Search Search query, autocomplete ...etc labels Mar 23, 2023
@macrakis

Re using wildcard field for XML (#5639 (comment)), I wonder if you could use the XML logstash filter and then the Flat field type which is coming out in 2.7? (#1018 (comment)).

@stevesimpson418

@macrakis I've had a look through the docs for a Flat field type and without ruling it out completely I'd have some concerns:

  • Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
  • Performance should be similar to a keyword field. I imagine this will be awful for wildcard searches where we need to search for "foo" OR "bar" in the XML.

That being said I'd be willing to give it a try in our development environments when this feature is released

@msfroh
Collaborator

msfroh commented Feb 29, 2024

There was some good discussion over on #12500, which highlighted the value of wildcard fields.

Also, Elastic's blog post about the feature provides a really good explanation: https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field

@sandervandegeijn

Another reason to implement it: if you want to use ECS 8.12, it's used in the standard component templates. Trying to load them:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"No handler for type [wildcard] declared on field [content]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: No handler for type [wildcard] declared on field [content]","caused_by":{"type":"mapper_parsing_exception","reason":"No handler for type [wildcard] declared on field [content]"}},"status":400}

(from _component_template/ecs_8.0.0_http)

@stowns

stowns commented May 10, 2024

> What makes the 'wildcard' data-type nice is that it is optimized for fields with large values or high cardinality for wildcard and regexp queries without changing the search experiences (e.g. searching via *ense*) and without worrying about tokenization.

This is similar to our use-case. We are storing large json objects (log data) where the json keys are not known in advance. We are using flat_object for this but cannot store values larger than 32kb. The wildcard type allows for values > 32kb and would save us from having to drop fields > 32kb before indexing.

@getsaurabh02
Member

There is a draft PR out now: #13461 (comment)

@getsaurabh02 getsaurabh02 added the v2.15.0 Issues and PRs related to version 2.15.0 label May 28, 2024
@sandervandegeijn

Fantastic, thanks!

Projects
Status: 2.15.0 (Release window opens on June 10th, 2024 and closes on June 25th, 2024)
Status: Done