New terms_enum API for discovering terms in the index. #66452

Merged
merged 31 commits into from
May 6, 2021
9ffbb03
A TermsEnum API for discovering terms in the index.
markharwood Nov 18, 2020
b95cf81
Added HLRC support and related integration test
markharwood Feb 17, 2021
ab826b4
Added client classes for HLRC.
markharwood Feb 17, 2021
7a5e654
License fix
markharwood Feb 17, 2021
93dfe30
Remove HLRC code for now - requires less-than-ideal package names whi…
markharwood Feb 18, 2021
58bbf41
Return empty arrays when no results rather than no `terms` property a…
markharwood Mar 10, 2021
7fb781d
Fix bundling of shardIds for nodes, add success/fail accounting of nu…
markharwood Mar 11, 2021
4e9da39
Type fixes
markharwood Mar 23, 2021
cac1bb3
Types warning
markharwood Mar 23, 2021
810d638
Removed hot/warm tier tests (in anticipation of new queryable _tier f…
markharwood Apr 6, 2021
156302f
Move rest-api-spec and related YML test to new standard home for this…
markharwood Apr 8, 2021
62af758
Unused import
markharwood Apr 8, 2021
03f79de
Move test to xpack
markharwood Apr 8, 2021
53e12cd
Return early on network thread if can’t match any shards.
markharwood Apr 12, 2021
b36f477
Removed sort by popularity option
markharwood Apr 12, 2021
3b1b6d9
Unused import
markharwood Apr 12, 2021
5250cc7
Addressing some review comments (thanks Jim/Adrien!)
markharwood Apr 13, 2021
3288641
Docs tidy up
markharwood Apr 14, 2021
2c00968
Provide full stack traces for errors, change TODO comment
markharwood Apr 14, 2021
6a68b70
Move location of YAML test - was causing errors when seated alongside…
markharwood Apr 19, 2021
1fe0a11
Security enhancement - allow access where DLS rewrites to match_all. …
markharwood Apr 23, 2021
2f59860
Remove acquisition of searcher from security check code
markharwood Apr 23, 2021
4c38b78
Changed termenum to termsenum. REST endpoint is now _terms_enum
markharwood Apr 26, 2021
c40a0db
Checkstyle fix
markharwood Apr 26, 2021
6b9f41c
Addressing review comments - formatting, thread pool choices and more
markharwood Apr 27, 2021
814e45e
Oops. Thought I’d resolved this review comment but hadn’t
markharwood Apr 27, 2021
9897518
Changed timeout setting to a TimeValue
markharwood Apr 27, 2021
2cb91df
Checkstyle fix
markharwood Apr 27, 2021
6d55f99
In flattened fields make only the value (not the field name) subject …
markharwood Apr 30, 2021
cf70053
Moved initialisation of data node timing of request from NodeTermsEnu…
markharwood Apr 30, 2021
22312cf
Remove outdated TODOs
markharwood May 6, 2021
97 changes: 97 additions & 0 deletions docs/reference/search/terms-enum.asciidoc
@@ -0,0 +1,97 @@
[[search-terms-enum]]
=== Terms enum API

The terms enum API can be used to discover terms in the index that begin with
a partial string. It is intended for auto-complete use cases:

[source,console]
--------------------------------------------------
POST stackoverflow/_terms_enum
{
"field" : "tags",
"string" : "kiba"
}
--------------------------------------------------
// TEST[setup:stackoverflow]


The API returns the following response:

[source,console-result]
--------------------------------------------------
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"terms": [
"kibana"
],
"complete" : true
}
--------------------------------------------------

The `complete` flag is `false` if a time or space limit was reached before the
full set of available terms could be examined.
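For readers calling the endpoint from Java, the request shown above could be built with the JDK's `java.net.http` client. This is a minimal sketch, not part of the PR: the host `localhost:9200` and the class and method names are assumptions for illustration, and the request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of this PR: builds the documented request
// with the JDK HTTP client. The host below is an assumed local node.
final class TermsEnumRequestSketch {

    // Builds the JSON body with the two mandatory properties, "field" and "string".
    static String body(String field, String prefix) {
        return "{\"field\":\"" + escape(field) + "\",\"string\":\"" + escape(prefix) + "\"}";
    }

    // Minimal JSON escaping: handles backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Constructs (but does not send) a POST to /<index>/_terms_enum.
    static HttpRequest build(String index, String field, String prefix) {
        return HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/" + index + "/_terms_enum"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(field, prefix)))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("stackoverflow", "tags", "kiba");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("tags", "kiba"));
    }
}
```

Sending the request would additionally need an `HttpClient` and a running cluster, which this sketch deliberately leaves out.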

[[search-terms-enum-api-request]]
==== {api-request-title}

`GET /<target>/_terms_enum`


[[search-terms-enum-api-desc]]
==== {api-description-title}

The terms enum API can be used to discover terms in the index that begin with the provided
string. It is designed for low-latency look-ups used in auto-complete scenarios.


[[search-terms-enum-api-path-params]]
==== {api-path-parms-title}

`<target>`::
(Mandatory, string)
Comma-separated list of data streams, indices, and index aliases to search.
Wildcard (`*`) expressions are supported.
+
To search all data streams or indices in a cluster, omit this parameter or use
`_all` or `*`.

[[search-terms-enum-api-request-body]]
==== {api-request-body-title}

[[terms-enum-field-param]]
`field`::
(Mandatory, string)
The field to match.

[[terms-enum-string-param]]
`string`::
(Mandatory, string)
The string to match at the start of indexed terms.

[[terms-enum-size-param]]
`size`::
(Optional, integer)
How many matching terms to return. Defaults to 10.

[[terms-enum-timeout-param]]
`timeout`::
(Optional, <<time-units,time value>>)
The maximum length of time to spend collecting results. Defaults to "1s" (one second).
If the timeout is exceeded, the `complete` flag is set to `false` in the response and the results may
be partial or empty.

[[terms-enum-case_insensitive-param]]
`case_insensitive`::
(Optional, boolean)
When `true`, the provided search string is matched against index terms without case sensitivity.
Defaults to `false`.

[[terms-enum-index_filter-param]]
`index_filter`::
(Optional, <<query-dsl,query object>>) Filters out an index shard if the provided
query rewrites to `match_none` on that shard.
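The interaction of `string`, `size`, and `case_insensitive` can be modelled in memory. The sketch below is an illustration only: it uses a sorted `TreeSet` in place of the index, it assumes results come back in sorted term order, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// In-memory model of the request parameters; not how Elasticsearch implements them.
final class TermsEnumParamsModel {

    static List<String> matchingTerms(NavigableSet<String> terms, String string, int size, boolean caseInsensitive) {
        List<String> result = new ArrayList<>();
        // Case-sensitive matches are contiguous in sorted order, so start at the prefix.
        // Case-insensitive matches are not contiguous, so the whole set is scanned.
        Iterable<String> candidates = caseInsensitive ? terms : terms.tailSet(string, true);
        for (String term : candidates) {
            if (result.size() >= size) {
                break; // "size" caps the number of returned terms
            }
            if (term.regionMatches(caseInsensitive, 0, string, 0, string.length())) {
                result.add(term);
            } else if (!caseInsensitive) {
                break; // walked past the end of the contiguous prefix range
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(List.of("Kibble", "kibana", "kiwi", "logstash"));
        System.out.println(matchingTerms(terms, "kib", 10, false));
        System.out.println(matchingTerms(terms, "kib", 10, true));
    }
}
```

The early `break` in the case-sensitive branch is what keeps a prefix lookup cheap on sorted data: the scan stops at the first term that no longer shares the prefix.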

35 changes: 35 additions & 0 deletions rest-api-spec/src/main/resources/rest-api-spec/api/termsenum.json
@@ -0,0 +1,35 @@
{
"termsenum":{
"documentation":{
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-terms-enum.html",
"description": "The terms enum API can be used to discover terms in the index that begin with the provided string. It is designed for low-latency look-ups used in auto-complete scenarios."
},
"stability":"beta",
"visibility":"public",
"headers":{
"accept": [ "application/json"],
"content_type": ["application/json"]
},
"url":{
"paths":[
{
"path": "/{index}/_terms_enum",
"methods": [
"GET",
"POST"
],
"parts": {
"index": {
"type": "list",
"description": "A comma-separated list of index names to search; use `_all` or empty string to perform the operation on all indices"
}
}
}
]
},
"params":{},
"body":{
"description":"field name, string which is the prefix expected in matching terms, timeout and size for max number of results"
}
}
}
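The spec declares the `index` path part as a list. A client might turn such a list into the request path roughly as follows; this is a sketch under assumptions, with a hypothetical class name, no URL-encoding of index names, and `_all` substituted for an empty list as the part's description suggests.

```java
import java.util.List;

// Hypothetical client-side helper; not generated from the spec in this PR.
final class TermsEnumPathSketch {

    // Joins index names with commas and falls back to "_all" when none are given.
    static String path(List<String> indices) {
        String target = indices.isEmpty() ? "_all" : String.join(",", indices);
        return "/" + target + "/_terms_enum";
    }

    public static void main(String[] args) {
        System.out.println(path(List.of("stackoverflow", "logs-*")));
        System.out.println(path(List.of()));
    }
}
```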
@@ -17,8 +17,18 @@
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.MinimizationOperations;
import org.apache.lucene.util.automaton.Operations;
import org.elasticsearch.common.lucene.Lucene;
import org.elasticsearch.common.lucene.search.AutomatonQueries;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.analysis.IndexAnalyzers;
import org.elasticsearch.index.analysis.NamedAnalyzer;
@@ -248,6 +258,25 @@ public KeywordFieldType(String name, NamedAnalyzer analyzer) {
this.scriptValues = null;
}

@Override
public TermsEnum getTerms(boolean caseInsensitive, String string, SearchExecutionContext queryShardContext) throws IOException {
IndexReader reader = queryShardContext.searcher().getTopReaderContext().reader();

Terms terms = MultiTerms.getTerms(reader, name());
if (terms == null) {
// Field does not exist on this shard.
return null;
}
Automaton a = caseInsensitive
? AutomatonQueries.caseInsensitivePrefix(string)
: Automata.makeString(string);
a = Operations.concatenate(a, Automata.makeAnyString());
a = MinimizationOperations.minimize(a, Integer.MAX_VALUE);

CompiledAutomaton automaton = new CompiledAutomaton(a);
return automaton.getTermsEnum(terms);
}

@Override
public String typeName() {
return CONTENT_TYPE;
@@ -470,4 +499,6 @@ protected String contentType() {
public FieldMapper.Builder getMergeBuilder() {
return new Builder(simpleName(), indexAnalyzers, scriptCompiler).init(this);
}


}
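The `getTerms` override in this file builds an automaton for "the typed string followed by any string", optionally ignoring case, and returns `null` when the field has no terms on the shard. A plain-Java analogue of that shape is sketched below. It swaps Lucene's automaton classes for `java.util.regex` and a sorted set, so it shows the matching rule only, not the performance characteristics; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Regex stand-in for the prefix automaton; an analogy, not the Lucene implementation.
final class PrefixMatcherSketch {

    // Literal prefix followed by "any string", optionally ignoring case.
    static Pattern prefixPattern(String prefix, boolean caseInsensitive) {
        int flags = Pattern.DOTALL;
        if (caseInsensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        // Pattern.quote keeps characters such as '+' or '.' literal.
        return Pattern.compile(Pattern.quote(prefix) + ".*", flags);
    }

    // Returns null when there are no terms at all, mirroring the
    // "field does not exist on this shard" branch.
    static List<String> getTerms(SortedSet<String> fieldTerms, String prefix, boolean caseInsensitive) {
        if (fieldTerms == null) {
            return null;
        }
        Pattern pattern = prefixPattern(prefix, caseInsensitive);
        List<String> matches = new ArrayList<>();
        for (String term : fieldTerms) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> tags = new TreeSet<>(List.of("Kibana", "kibana", "kiwi"));
        System.out.println(getTerms(tags, "kib", false));
        System.out.println(getTerms(tags, "kib", true));
    }
}
```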
@@ -13,6 +13,7 @@
import org.apache.lucene.index.PrefixCodedTerms;
import org.apache.lucene.index.PrefixCodedTerms.TermIterator;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
@@ -429,4 +430,20 @@ public enum CollapseType {
KEYWORD,
NUMERIC
}

/**
* This method is used to support auto-complete services and implementations
* are expected to find terms beginning with the provided string very quickly.
* If fields cannot look up matching terms quickly they should return null.
* The returned TermsEnum should implement next(), term() and docFreq() methods
* but postings etc are not required.
* @param caseInsensitive if matches should be case insensitive
* @param string the partially complete word the user has typed (can be empty)
* @param queryShardContext the shard context
* @return null or an enumeration of matching terms and their doc frequencies
* @throws IOException Errors accessing data
*/
public TermsEnum getTerms(boolean caseInsensitive, String string, SearchExecutionContext queryShardContext) throws IOException {
return null;
}
}
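The Javadoc in this hunk sets up a contract: the base implementation returns `null` to signal that a field cannot enumerate terms quickly, and subclasses opt in by overriding. A caller therefore has to treat `null` as "unsupported". The sketch below shows that contract with stand-in types; the class names, and the use of `Iterator<String>` in place of Lucene's `TermsEnum`, are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in types illustrating the "null means unsupported" contract.
final class NullableTermsContractSketch {

    // Default behaviour: no fast term lookup is available for this field type.
    static class BaseFieldType {
        Iterator<String> getTerms(boolean caseInsensitive, String string) {
            return null;
        }
    }

    // A field type that opts in by overriding the default.
    static class KeywordLikeFieldType extends BaseFieldType {
        private final List<String> sortedTerms;

        KeywordLikeFieldType(List<String> sortedTerms) {
            this.sortedTerms = sortedTerms;
        }

        @Override
        Iterator<String> getTerms(boolean caseInsensitive, String string) {
            List<String> hits = new ArrayList<>();
            for (String term : sortedTerms) {
                if (term.regionMatches(caseInsensitive, 0, string, 0, string.length())) {
                    hits.add(term);
                }
            }
            return hits.iterator();
        }
    }

    // Caller side: a null enumeration contributes no terms rather than failing.
    static List<String> collect(BaseFieldType fieldType, String string, int size) {
        List<String> out = new ArrayList<>();
        Iterator<String> terms = fieldType.getTerms(false, string);
        if (terms == null) {
            return out;
        }
        while (terms.hasNext() && out.size() < size) {
            out.add(terms.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(new BaseFieldType(), "ki", 10));
        System.out.println(collect(new KeywordLikeFieldType(List.of("kibana", "kiwi", "logstash")), "ki", 10));
    }
}
```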
@@ -9,14 +9,27 @@
package org.elasticsearch.index.mapper.flattened;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.OrdinalMap;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermState;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.MinimizationOperations;
import org.apache.lucene.util.automaton.Operations;
import org.elasticsearch.common.lucene.Lucene;
import org.elasticsearch.common.lucene.search.AutomatonQueries;
import org.elasticsearch.common.unit.Fuzziness;
@@ -241,6 +254,29 @@ public Query wildcardQuery(String value,
public Query termQueryCaseInsensitive(Object value, SearchExecutionContext context) {
return AutomatonQueries.caseInsensitiveTermQuery(new Term(name(), indexedValueForSearch(value)));
}

@Override
public TermsEnum getTerms(boolean caseInsensitive, String string, SearchExecutionContext queryShardContext) throws IOException {
IndexReader reader = queryShardContext.searcher().getTopReaderContext().reader();
Terms terms = MultiTerms.getTerms(reader, name());
if (terms == null) {
// Field does not exist on this shard.
return null;
}

Automaton a = Automata.makeString(key + FlattenedFieldParser.SEPARATOR);
if (caseInsensitive) {
a = Operations.concatenate(a, AutomatonQueries.caseInsensitivePrefix(string));
} else {
a = Operations.concatenate(a, Automata.makeString(string));
a = Operations.concatenate(a, Automata.makeAnyString());
}
a = MinimizationOperations.minimize(a, Integer.MAX_VALUE);

CompiledAutomaton automaton = new CompiledAutomaton(a);
// Wrap result in a class that strips field names from discovered terms
return new TranslatingTermsEnum(automaton.getTermsEnum(terms));
}

@Override
public BytesRef indexedValueForSearch(Object value) {
@@ -270,6 +306,95 @@ public ValueFetcher valueFetcher(SearchExecutionContext context, String format)
return SourceValueFetcher.identity(rootName + "." + key, context, format);
}
}


// Wraps a raw Lucene TermsEnum to strip values of fieldnames
static class TranslatingTermsEnum extends TermsEnum {
TermsEnum delegate;

TranslatingTermsEnum(TermsEnum delegate) {
this.delegate = delegate;
}

@Override
public BytesRef next() throws IOException {
// Strip the term of the fieldname value
BytesRef result = delegate.next();
if (result != null) {
result = FlattenedFieldParser.extractValue(result);
}
return result;
}

@Override
public BytesRef term() throws IOException {
// Strip the term of the fieldname value
BytesRef result = delegate.term();
if (result != null) {
result = FlattenedFieldParser.extractValue(result);
}
return result;
}


@Override
public int docFreq() throws IOException {
return delegate.docFreq();
}

//=============== All other TermsEnum methods not supported =================

@Override
public AttributeSource attributes() {
throw new UnsupportedOperationException();
}

@Override
public boolean seekExact(BytesRef text) throws IOException {
throw new UnsupportedOperationException();
}

@Override
public SeekStatus seekCeil(BytesRef text) throws IOException {
throw new UnsupportedOperationException();
}

@Override
public void seekExact(long ord) throws IOException {
throw new UnsupportedOperationException();
}

@Override
public void seekExact(BytesRef term, TermState state) throws IOException {
throw new UnsupportedOperationException();
}

@Override
public long ord() throws IOException {
throw new UnsupportedOperationException();
}

@Override
public long totalTermFreq() throws IOException {
throw new UnsupportedOperationException();
}

@Override
public PostingsEnum postings(PostingsEnum reuse, int flags) throws IOException {
throw new UnsupportedOperationException();
}

@Override
public ImpactsEnum impacts(int flags) throws IOException {
throw new UnsupportedOperationException();
}

@Override
public TermState termState() throws IOException {
throw new UnsupportedOperationException();
}

}

/**
* A field data implementation that gives access to the values associated with
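`TranslatingTermsEnum` is a delegating wrapper: it forwards iteration to the underlying enumeration and strips the flattened field's key from each term before returning it. The same pattern can be sketched with a plain `Iterator<String>`. This is an analogy rather than the PR's code; the class name is hypothetical and the `'\0'` separator is an assumption about how keys and values are joined.

```java
import java.util.Iterator;
import java.util.List;

// Delegating iterator that strips "key + separator" from each keyed term.
final class TranslatingIteratorSketch implements Iterator<String> {

    // Assumed separator between the key and the value of a keyed term.
    static final char SEPARATOR = '\0';

    private final Iterator<String> delegate;

    TranslatingIteratorSketch(Iterator<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public String next() {
        String keyed = delegate.next();
        // indexOf returns -1 when no separator is present, so the whole term is kept.
        return keyed.substring(keyed.indexOf(SEPARATOR) + 1);
    }

    public static void main(String[] args) {
        Iterator<String> raw = List.of("tags\0kibana", "tags\0kiwi").iterator();
        Iterator<String> values = new TranslatingIteratorSketch(raw);
        while (values.hasNext()) {
            System.out.println(values.next());
        }
    }
}
```

Wrapping rather than copying keeps the enumeration lazy: terms are translated one at a time as the consumer asks for them.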
@@ -166,4 +166,15 @@ static BytesRef extractKey(BytesRef keyedValue) {
}
return new BytesRef(keyedValue.bytes, keyedValue.offset, length);
}

static BytesRef extractValue(BytesRef keyedValue) {
int length;
for (length = 0; length < keyedValue.length; length++){
if (keyedValue.bytes[keyedValue.offset + length] == SEPARATOR_BYTE) {
break;
}
}
int valueStart = keyedValue.offset + length + 1;
// The value length is relative to this BytesRef, not to the start of the backing array.
return new BytesRef(keyedValue.bytes, valueStart, keyedValue.length - length - 1);
}
}
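Separator-based extraction on a slice of a larger byte array is easy to get wrong when the slice does not start at index 0. The sketch below works on a raw `byte[]` with an explicit offset and length, standing in for Lucene's `BytesRef`, and computes the value length relative to the slice (slice length minus key length minus one). The class name and the zero-byte separator are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

// Byte-level value extraction on a slice; a stand-in for BytesRef handling.
final class ExtractValueSketch {

    // Assumed separator byte between key and value.
    static final byte SEPARATOR_BYTE = 0;

    static String extractValue(byte[] bytes, int offset, int length) {
        // Find the key length by scanning the slice for the separator.
        int keyLength = 0;
        while (keyLength < length && bytes[offset + keyLength] != SEPARATOR_BYTE) {
            keyLength++;
        }
        if (keyLength == length) {
            throw new IllegalArgumentException("no separator in keyed value");
        }
        int valueStart = offset + keyLength + 1;  // absolute index into the backing array
        int valueLength = length - keyLength - 1; // relative to the slice, independent of offset
        return new String(bytes, valueStart, valueLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The slice "tags\0kibana" sits at offset 2 inside a larger array.
        byte[] backing = "xxtags\0kibanayy".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractValue(backing, 2, 11));
    }
}
```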