New terms_enum API for discovering terms in the index. #66452

markharwood · 2020-12-16T16:08:32Z

The caller supplies a search string, which is used as a prefix to match terms in a given field of the index.
A timeout can limit the amount of time spent looking for matches.
The API is designed for Kibana auto-complete use cases.
Kibana requests to this API would typically look like this:

localhost:9200/myindex/_terms_enum
{
  "field": "myfield",
  "string": "Microsof",
  "index_filter": {
    "bool": {
      "must": [
        { "range": "my Kibana time-picker range" },
        { "terms": { "_tier": ["data_warm", "data_hot"] } }
      ]
    }
  }
}

The time range would skip any indices that fall entirely outside the range, but it does not filter the values of individual documents in overlapping indices. The tier clause would avoid hitting frozen/cold indices.
An optional timeout value can also be passed (the default is "1s", one second).
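As a rough illustration of the request shape described above, here is a minimal Python sketch that only builds the path and JSON body (it sends nothing). The helper name `build_terms_enum_request`, the `@timestamp` field and the `now-15m` range are assumptions made for the example; the `timeout` key is assumed from the description of the optional timeout value.

```python
import json


def build_terms_enum_request(index, field, prefix, tiers=None, time_range=None, timeout="1s"):
    """Return (path, body) for a _terms_enum call. Hypothetical helper; nothing is sent."""
    must = []
    if time_range is not None:
        # e.g. the range chosen in the Kibana time picker
        must.append({"range": time_range})
    if tiers:
        # restrict the search to indices on the given data tiers
        must.append({"terms": {"_tier": list(tiers)}})

    body = {"field": field, "string": prefix, "timeout": timeout}
    if must:
        body["index_filter"] = {"bool": {"must": must}}
    return f"/{index}/_terms_enum", body


path, body = build_terms_enum_request(
    "myindex",
    "myfield",
    "Microsof",
    tiers=["data_warm", "data_hot"],
    time_range={"@timestamp": {"gte": "now-15m"}},  # assumed field name and range
)
print(path)
print(json.dumps(body, indent=2))
```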

The response looks like this:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "terms": [
    "Microsoft",
    "Microsoft Windows"
  ],
  "complete": true
}

Any request that hits the timeout will return "complete": false.
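A client could read the `complete` flag along these lines. This is only a sketch: `summarize_terms_enum` and the wording of the hint are hypothetical, and the input dict mirrors the sample response above.

```python
def summarize_terms_enum(response):
    """Pull out the suggestions and decide whether to warn about partial results."""
    terms = list(response.get("terms", []))
    complete = bool(response.get("complete", False))
    failed_shards = response.get("_shards", {}).get("failed", 0)

    hint = None
    if not complete:
        # The flag is false when, for example, the timeout was hit.
        hint = "Results may be incomplete; type more characters to narrow the search."
    return {"terms": terms, "complete": complete, "failed_shards": failed_shards, "hint": hint}


sample = {
    "_shards": {"total": 1, "successful": 1, "failed": 0},
    "terms": ["Microsoft", "Microsoft Windows"],
    "complete": True,
}
summary = summarize_terms_enum(sample)
print(summary["terms"], summary["hint"])
```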

elasticmachine · 2020-12-16T16:08:36Z

Pinging @elastic/es-search (Team:Search)

mayya-sharipova

@markharwood Exciting API, I envision it being very useful. I did an initial review pass, mostly questions.

docs/reference/search/term-enum.asciidoc

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TermCount.java

...plugin/core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TermEnumResponse.java

...k/plugin/core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TermEnumAction.java

markharwood · 2020-12-22T12:06:14Z

When it comes to the HLRC I'm unsure where best to place the logic:

Do I make a copy of the Request/Response classes for use in the HLRC? (This helps the long-term goal of decoupling server classes from the client.)
Do I create a new XxxxxClient class or add methods to an existing class, e.g. IndicesClient? (Bear in mind this new method is basic-licensed and perhaps shouldn't be mixed in with OSS methods.)

jimczi

I did a first pass to review the main API and the options.
I think we're mixing multiple use cases and allowing too many options. I left some comments to simplify the API. I'll do a second pass to review the concrete actions in a follow-up.

docs/reference/search/term-enum.asciidoc

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TermCount.java

jimczi

Sorry for the delay.
I left another round of comments regarding the _terms action.

docs/reference/search/term-enum.asciidoc

...in/core/src/main/java/org/elasticsearch/xpack/core/termenum/action/ShardTermEnumRequest.java

...core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TransportTermEnumAction.java

server/src/main/java/org/elasticsearch/threadpool/ThreadPool.java

...core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TransportTermEnumAction.java

docs/reference/search/term-enum.asciidoc

markharwood · 2021-02-03T11:57:57Z

@elastic-jb Just updated this PR if you want to A/B test the sort-by-popularity mode against the current Kibana terms-agg approach.

This PR only considers up to a max of 10k matching terms on a node and returns complete : false if that limit is reached.
This is not a guaranteed API at this stage, so it is being made available for evaluating result quality/speed against the existing terms-agg approach.
If you want to compare results interactively, I have a Python Flask webserver you can run. It lets you pick an index+field:

then try various searches, like this search for bands starting with A:

Note, in the above example of music artists sorted by popularity, the results are incomplete because there are more than 10k bands starting with a. Typing any additional character reduces the matches and makes the scan return complete:true.
Even when incomplete I expect the accuracy of these results will be greater than the current terms-agg approach (which is based on a sample of the first random N docs). I expect the speed to be significantly faster with this PR too.
I'll do some benchmarks on various datasets to measure the speed and quality difference between this PR and existing terms-agg approach. Comparing speed is easy. The quality is harder but should be measurable by the sorts of measures used in traditional search results evaluation - the deviation of a candidate configuration from a "gold standard" set of ordered results. In this case the gold standard is not some human-judged ranking of quality but something we can produce automatically - a strict doc_count popularity order produced by fully accurate terms aggs that are not subject to any time or space limits.
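One simple way to express "deviation from a gold standard" is the fraction of the gold top-k that a candidate result list also contains. The sketch below is an assumption about how such a comparison could be scored, not the measure actually used; the doc counts are made up for illustration.

```python
def gold_ranking(doc_counts):
    """Strict popularity order: highest doc_count first, ties broken alphabetically."""
    return sorted(doc_counts, key=lambda term: (-doc_counts[term], term))


def overlap_at_k(candidate, gold, k):
    """Fraction of the gold top-k terms that also appear in the candidate top-k."""
    gold_top = gold[:k]
    if not gold_top:
        return 1.0
    candidate_top = set(candidate[:k])
    return sum(1 for term in gold_top if term in candidate_top) / len(gold_top)


# Made-up doc counts standing in for a fully accurate terms aggregation.
counts = {"abba": 50, "arctic monkeys": 90, "a-ha": 30}
gold = gold_ranking(counts)
candidate = ["abba", "a-ha"]  # hypothetical output of the approach under test
score = overlap_at_k(candidate, gold, k=2)
print(gold, score)
```

Here the candidate misses the most popular term, so only one of the two gold top-2 terms is recovered and the score is 0.5. A rank-aware measure would penalise ordering mistakes as well.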

markharwood · 2021-02-03T17:48:53Z

@jimczi @elastic-jb @lizozom @giladgal

I tried this PR out compared with the existing Kibana impl (terms agg with terminate_after:100k docs).
To compare like-with-like I used a sort-by-popularity mode but with a 10k limit on number of matching terms considered in the index.

I had plans to automate the comparisons but there were some interesting findings just from manually playing with some datasets.

1) The existing Kibana impl is case sensitive

Unless the user has chosen to add normalisation to the mapping at index-time, there is no case insensitivity to searches:

Note the existing Kibana impl could be changed to make the include regex it uses to filter terms case insensitive but, as we'll see, there are other concerns.

2) The existing Kibana impl misses stuff

Sampling the first 100k docs has no guarantee of finding the term the user is looking for as in this example:

3) The existing Kibana impl is slow in ways we didn't realise

One of my datasets has many values per doc: person profiles have a list of band "likes" related to the user. This acts as a big multiplier on the number of regex tests the terms agg does on the 100k docs in the sample. This dataset was noticeably slower than my tests on other datasets with similar numbers of docs but single-value fields. It doesn't matter how long the searched prefix is: this took nearly 2 seconds for every request, so it is a fixed cost. The new impl was much faster:

4) The new impl can be inaccurate (but there's a simple solution)

For a high-cardinality field and a short search string, the results from the new impl can be inaccurate. In this example search for "a", there are many bands with that prefix, so rather than considering everything from "aa" to "az" we hit the fixed 10k limit of terms somewhere around "am". We still found popular bands like abba but are missing the most popular, arctic monkeys. This incompleteness is flagged in the response:

However, the solution is simple - when the result is incomplete the user should be encouraged to "type more" and, in this case, with a single key press the results become fully accurate and are faster:

This is a big improvement on the existing implementation, where typing more characters improves neither speed nor accuracy.
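The effect of the fixed term limit can be sketched with a toy sorted term dictionary. This is not the PR's implementation: `enumerate_terms`, the tiny `max_terms` value and the doc counts are all assumptions chosen to reproduce the behaviour described above on a small scale.

```python
from bisect import bisect_left


def enumerate_terms(sorted_terms, doc_counts, prefix, max_terms=10_000, size=10):
    """Scan terms in index order from the prefix, stop at max_terms, then sort by popularity."""
    matched = []
    complete = True
    for i in range(bisect_left(sorted_terms, prefix), len(sorted_terms)):
        term = sorted_terms[i]
        if not term.startswith(prefix):
            break  # left the prefix range, so the scan saw every match
        if len(matched) >= max_terms:
            complete = False  # more matches exist than we are willing to consider
            break
        matched.append(term)
    matched.sort(key=lambda t: (-doc_counts[t], t))
    return matched[:size], complete


# Made-up term dictionary, already in sorted order as in an index.
terms = ["aa", "abba", "air", "am", "arctic monkeys", "az"]
counts = {"aa": 1, "abba": 50, "air": 20, "am": 2, "arctic monkeys": 90, "az": 3}

short_result = enumerate_terms(terms, counts, "a", max_terms=4)
longer_result = enumerate_terms(terms, counts, "ar", max_terms=4)
print(short_result)
print(longer_result)
```

With the limit of 4, the scan for "a" stops at "am" and never reaches "arctic monkeys", so the result is flagged incomplete. One more character shrinks the range enough for the scan to finish.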

...core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TransportTermEnumAction.java

markharwood · 2021-02-18T10:25:21Z

Two things came out of discussions today:

We should permanently remove the logic in this PR about filtering tiers once we have the canMatch test supporting a new query for testing the tier an index is on
We should remove the HLRC support until we can move the code out of xpack (currently we call security APIs directly to avoid DLS indices but to avoid this we'll need a new security API to filter indices where DLS is on for the current user)

markharwood added the :Search/Search Search-related issues that do not fall into other categories label Dec 16, 2020

markharwood self-assigned this Dec 16, 2020

elasticmachine added the Team:Search Meta label for search team label Dec 16, 2020

mayya-sharipova reviewed Dec 16, 2020

markharwood added the WIP label Dec 18, 2020

jimczi requested changes Jan 4, 2021

markharwood force-pushed the fix/KQLcompleteBasic branch 2 times, most recently from 0a1a2d3 to b48a473 Compare January 7, 2021 12:19

jimczi reviewed Jan 12, 2021

markharwood force-pushed the fix/KQLcompleteBasic branch 7 times, most recently from fa6eaf1 to 7e2ce35 Compare January 18, 2021 14:31

mayya-sharipova reviewed Jan 18, 2021

docs/reference/search/term-enum.asciidoc Outdated

markharwood force-pushed the fix/KQLcompleteBasic branch from eb50f69 to 9d955e1 Compare January 25, 2021 17:06

jtibshirani reviewed Feb 4, 2021

...core/src/main/java/org/elasticsearch/xpack/core/termenum/action/TransportTermEnumAction.java Outdated

markharwood force-pushed the fix/KQLcompleteBasic branch 4 times, most recently from 8d528a9 to 6869b7f Compare February 16, 2021 15:14

markharwood force-pushed the fix/KQLcompleteBasic branch from d481104 to 7bbf5b2 Compare February 18, 2021 10:51

lizozom mentioned this pull request Feb 25, 2021

[Autocomplete] Integrate new autocomplete API elastic/kibana#92783

Closed

markharwood added 16 commits May 6, 2021 09:40

Unused import

3b1b6d9

Addressing some review comments (thanks Jim/Adrien!)

5250cc7

Docs tidy up

3288641

Provide full stack traces for errors, change TODO comment

2c00968

Move location of YAML test - was causing errors when seated alongside…

6a68b70

… core/src/yamlRestTest

Security enhancement - allow access where DLS rewrites to match_all. …

1fe0a11

…Added security tests

Remove acquisition of searcher from security check code

2f59860

Changed termenum to termsenum. REST endpoint is now _terms_enum

4c38b78

Checkstyle fix

c40a0db

Addressing review comments - formatting, thread pool choices and more

6b9f41c

Oops. Thought I’d resolved this review comment but hadn’t

814e45e

Changed timeout setting to a TimeValue

9897518

Checkstyle fix

2cb91df

In flattened fields make only the value (not the field name) subject …

6d55f99

…to case insensitive search

Moved initialisation of data node timing of request from NodeTermsEnu…

cf70053

…mRequest constructor to TransportTermsEnumAction#asyncNodeOperation

Remove outdated TODOs

22312cf

markharwood force-pushed the fix/KQLcompleteBasic branch from bcf76a4 to 22312cf Compare May 6, 2021 08:40

markharwood merged commit 73e0662 into elastic:master May 6, 2021

markharwood added the backport pending label May 6, 2021

markharwood removed the backport pending label May 10, 2021

sethmlarson mentioned this pull request May 14, 2021

Rename 'termsenum' API to 'terms_enum' for better readability #73119

Merged

lukasolson mentioned this pull request May 14, 2021

Use new terms enum API for autocomplete value suggestions elastic/kibana#100174

Merged

5 tasks

mayya-sharipova mentioned this pull request Jun 9, 2021

Field key suggester API for flattened field #73968

Closed

FrankHassanabad mentioned this pull request Jul 8, 2021

New TermsEnum returns 404 on non-existent subtraction indexes e.g. (logs-*,-elastic-cloud-logs-*) instead of 200 #75155

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

probakowski added the >feature label Jul 30, 2021

probakowski changed the title ~~New TermsEnum API for discovering terms in the index.~~ New terms_enum API for discovering terms in the index. Jul 30, 2021

jrodewig mentioned this pull request Aug 2, 2021

[DOCS] Add highlights for terms enum and data tiers migration APIs #75958

Merged
