Add support for wildcard field type #5639
Comments
A clarification for others: Elasticsearch added the wildcard field type as an X-Pack feature in version 7.9.0, so it is not an open-source feature. See the Elasticsearch 7.9 reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/keyword.html#wildcard-field-type |
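For readers unfamiliar with the feature, here is a minimal sketch of what the request bodies look like in Elasticsearch 7.9+, written as plain Python dicts so nothing is sent over the network. The helper names `wildcard_mapping` and `wildcard_query` and the field name `name` are made up for illustration; only the `"type": "wildcard"` mapping and the `wildcard` query shape come from the Elasticsearch reference linked above.

```python
import json


def wildcard_mapping(field_name: str) -> dict:
    """Build an index-creation body that maps one field as `wildcard`.

    Hypothetical helper: it only assembles the JSON body, it does not
    talk to a cluster.
    """
    return {"mappings": {"properties": {field_name: {"type": "wildcard"}}}}


def wildcard_query(field_name: str, pattern: str) -> dict:
    """Build a search body using the `wildcard` query with a glob pattern."""
    return {"query": {"wildcard": {field_name: {"value": pattern}}}}


if __name__ == "__main__":
    # "name" is an assumed field name; "*ense*" is the pattern used in this thread.
    print(json.dumps(wildcard_mapping("name"), indent=2))
    print(json.dumps(wildcard_query("name", "*ense*"), indent=2))
```

The point of the field type is that the query side stays exactly this simple: the user keeps writing `*ense*`, and the index structure behind the field changes.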
@epiphone Could you describe your use case? |
Some good use cases for wildcard:
|
@macrakis my use case is a large index of user-submitted names where most names are short (<100 characters) and unique, and I want to query the names by arbitrary substrings. As a workaround I'm using an ngram tokenizer, which works well enough but is more complicated to set up than the wildcard field type. |
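To give a sense of the extra setup the ngram workaround involves, here is a rough sketch of an index body that defines a custom ngram tokenizer and analyzer. It is not the commenter's actual configuration: the names `substring_ngrams` and `substring_analyzer`, the field name, and the gram sizes are all assumptions. The `ngram` tokenizer type, its `min_gram`/`max_gram` parameters, and the `index.max_ngram_diff` setting are standard analysis settings.

```python
def ngram_index_body(field_name: str, min_gram: int = 3, max_gram: int = 3) -> dict:
    """Assemble index settings and mappings for substring search via n-grams.

    All component names below are illustrative. Nothing is sent to a cluster.
    """
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("need 1 <= min_gram <= max_gram")
    return {
        "settings": {
            # The allowed gap between min_gram and max_gram is capped by this
            # index setting, so it is raised when the requested gap is larger
            # than 1 (assumed here to be the default cap).
            "index": {"max_ngram_diff": max(1, max_gram - min_gram)},
            "analysis": {
                "tokenizer": {
                    "substring_ngrams": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "substring_analyzer": {
                        "type": "custom",
                        "tokenizer": "substring_ngrams",
                        "filter": ["lowercase"],
                    }
                },
            },
        },
        "mappings": {
            "properties": {
                field_name: {"type": "text", "analyzer": "substring_analyzer"}
            }
        },
    }
```

Compared with a one-line `wildcard` mapping, this needs a tokenizer, an analyzer, an index-level setting, and usually some care on the query side as well, which is the complexity the comment refers to.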
Josef, Epiphone, thanks for your answers -- very helpful! So it sounds like you need to find arbitrary substrings in your corpus, not just strings starting at token boundaries. That would mean matching not just "clerc" in "Leclerc" but also "ecle", and not just "org/open" in "server/src/main/java/org/opensearch/index/query" but also "ense" in that pathname. Could these problems be solved by different tokenization? |
What makes the 'wildcard' data type nice is that it is optimized for wildcard and regexp queries on fields with large values or high cardinality, without changing the search experience (e.g. searching via *ense*) and without having to worry about tokenization. |
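One way to picture how a field can answer `*ense*` without tokenization is the general n-gram-plus-verification technique: index every trigram of the value, use the trigrams of the search string to narrow down candidates, then confirm each candidate against the original value. The toy below is only an illustration of that idea in plain Python, not the Elasticsearch implementation; the class `ToySubstringIndex` and everything in it is hypothetical.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """Return the set of all 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class ToySubstringIndex:
    """In-memory substring search: trigram postings plus a verification pass."""

    def __init__(self):
        self.docs = []                    # original values, by doc id
        self.postings = defaultdict(set)  # trigram -> set of doc ids

    def add(self, value: str) -> int:
        doc_id = len(self.docs)
        self.docs.append(value)
        for gram in trigrams(value.lower()):
            self.postings[gram].add(doc_id)
        return doc_id

    def search(self, needle: str) -> list:
        """Return sorted ids of docs containing `needle` (case-insensitive)."""
        needle = needle.lower()
        grams = trigrams(needle)
        if grams:
            # Candidates must contain every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Needle shorter than 3 characters: no trigram filter, scan all.
            candidates = set(range(len(self.docs)))
        # Verification: trigram overlap does not guarantee a real match.
        return sorted(i for i in candidates if needle in self.docs[i].lower())
```

With "Leclerc" and the pathname from the earlier comment indexed, a search for "ecle" or "ense" only verifies the handful of values that share the needle's trigrams rather than scanning every value.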
We have a use case for this where we need to index large XML documents that are > 32766 bytes. Our users want to be able to search for a string in an XML document, e.g. *failuremessage122* or just *failure* or even *fail*. A keyword field type would make sense (despite the poor leading-wildcard performance), but this is not possible because the documents exceed the 32766-byte limit. Tokenisation is also problematic with XML docs: we get token explosion, with > 10000 terms generated when using most of the analyzers. The XML logstash filter was considered but has similar issues, with large documents producing a huge number of fields. We don't always know ahead of time which elements we need to search for, so pre-processing the data isn't really an option. Support for a "wildcard" field type would really improve our user experience. |
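A small aside on the 32766 figure: it is a limit in bytes, not characters, so documents with non-ASCII text hit it sooner than their length suggests. The check below is a hedged sketch for screening values before indexing; `MAX_TERM_BYTES` and `fits_in_keyword` are names invented here, the value is the one quoted in the comment, and measuring the UTF-8 encoding is an assumption about how the limit is applied.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the comment above


def fits_in_keyword(value: str) -> bool:
    """True if the UTF-8 encoding of `value` is within the quoted term limit."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES


if __name__ == "__main__":
    ascii_doc = "a" * 30000      # 30000 characters, 30000 bytes
    accented_doc = "é" * 30000   # 30000 characters, 60000 bytes in UTF-8
    print(fits_in_keyword(ascii_doc), fits_in_keyword(accented_doc))
```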
The wildcard field type has been supported sans X-Pack since Elasticsearch 7.11 |
AFAIK nobody is working on this. If someone wants to give it a shot, there are folks contributing a flattened field type via #1018, and it looks like there's a draft PR in #6507 that can be used as inspiration. Please note that we cannot accept any code from ES > 7.10.2, which was the last version under APLv2. We would welcome an independent implementation that doesn't look at anything under an incompatible license. |
Re using wildcard field for XML (#5639 (comment)), I wonder if you could use the XML logstash filter and then the Flat field type which is coming out in 2.7? (#1018 (comment)). |
@macrakis I've had a look through the docs for a Flat field type and without ruling it out completely I'd have some concerns:
That being said I'd be willing to give it a try in our development environments when this feature is released |
There was some good discussion over on #12500, which highlighted the value of wildcard fields. Also, Elastic's blog post about the feature provides a really good explanation: https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field |
Another reason to implement it: if you want to use ECS 8.12, it's used in the standard component templates. Trying to load them:
|
This is similar to our use-case. We are storing large json objects (log data) where the json keys are not known in advance. We are using |
There is a draft PR out now: #13461 (comment) |
Fantastic thanks! |
Elasticsearch added the wildcard field type in v7.9. Are there any plans to support the field type in OpenSearch? Thanks!