New terms_enum API for discovering terms in the index. #66452

Merged
merged 31 commits into from
May 6, 2021


markharwood
Contributor

@markharwood markharwood commented Dec 16, 2020

The supplied search string is used as a prefix to match terms found in a given field in the index.
A timeout can limit the amount of time spent looking for matches.
Designed for Kibana auto-complete use cases.
Kibana requests to this API would typically look like this:

localhost:9200/myindex/_terms_enum
{
  "field": "myfield",
  "string": "Microsof",
  "index_filter": {
    "bool": {
      "must": [
        { "range": "my Kibana time-picker range" },
        { "terms": { "_tier": ["data_warm", "data_hot"] } }
      ]
    }
  }
}

The time range would avoid any indices that fall outside of the range but does not filter any doc values in overlapping indices. The tier clause would avoid hitting frozen/cold indices.
An optional timeout time value can also be passed (default is "1s", one second).

The response looks like this:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "terms": [
    "Microsoft",
    "Microsoft Windows"
  ],
  "complete": true
}

Any request that hits the timeout will return "complete": false.
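
As a rough client-side sketch of the request and response shapes above (not code from this PR): the helper names `build_terms_enum_body` and `summarize_response` are hypothetical, and the `timeout` key name is an assumption based on the description of the optional timeout value.

```python
import json

def build_terms_enum_body(field, prefix, timeout="1s", index_filter=None):
    """Build a request body shaped like the example above.

    The "timeout" key name is an assumption; "field", "string" and
    "index_filter" are taken from the example request.
    """
    body = {"field": field, "string": prefix, "timeout": timeout}
    if index_filter is not None:
        body["index_filter"] = index_filter
    return body

def summarize_response(response):
    """Return (terms, hint); hint is a message when the results are partial."""
    terms = response.get("terms", [])
    if response.get("complete", False):
        return terms, None
    return terms, "Results may be incomplete; type more characters."

# Body for the example search string.
body = build_terms_enum_body("myfield", "Microsof")
print(json.dumps(body))

# The example response shown above.
sample = {
    "_shards": {"total": 1, "successful": 1, "failed": 0},
    "terms": ["Microsoft", "Microsoft Windows"],
    "complete": True,
}
terms, hint = summarize_response(sample)
print(terms, hint)
```

A caller such as an auto-complete widget could show the hint whenever it is not None.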

@markharwood markharwood added the :Search/Search Search-related issues that do not fall into other categories label Dec 16, 2020
@markharwood markharwood self-assigned this Dec 16, 2020
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

Contributor

@mayya-sharipova mayya-sharipova left a comment

@markharwood Exciting API, I envision it being very useful. I did an initial review pass and left mostly questions.

@markharwood
Contributor Author

When it comes to the HLRC I'm unsure where best to place the logic:

  1. Do I make a copy of Request/Response classes for use in HLRC? (Helps long term goal of decoupling server classes from client).
  2. Do I create a new XxxxxClient class or add methods to existing class e.g. IndicesClient? (Bear in mind this new method is basic-licensed and perhaps shouldn't be mixed in with OSS methods).

Contributor

@jimczi jimczi left a comment

I did a first pass to review the main API and the options.
I think we're mixing multiple use cases and allowing too many options. I left some comments to simplify the API. I'll make a second pass to review the concrete actions in a follow-up.

@markharwood markharwood force-pushed the fix/KQLcompleteBasic branch 2 times, most recently from 0a1a2d3 to b48a473 Compare January 7, 2021 12:19
Contributor

@jimczi jimczi left a comment

Sorry for the delay.
I left another round of comments regarding the _terms action.

@markharwood markharwood force-pushed the fix/KQLcompleteBasic branch 7 times, most recently from fa6eaf1 to 7e2ce35 Compare January 18, 2021 14:31
@markharwood
Contributor Author

@elastic-jb Just updated this PR if you want to A/B test the sort-by-popularity mode against the current Kibana terms-agg approach.

This PR only considers up to a max of 10k matching terms on a node and returns complete : false if that limit is reached.
This is not a guaranteed API at this stage, so it is being made available for comparing result quality and speed against the existing terms agg approach.
If you want to compare results interactively, I have a Python Flask webserver you can run. It lets you pick an index and field:
[Screenshot: Select_index]
then try various searches, such as this search for bands starting with "a":

[Screenshot: Home_-_Microblog]

Note that in the above example of music artists sorted by popularity, the results are incomplete because there are more than 10k bands starting with "a". Typing any additional character reduces the matches and makes the scan return complete: true.
Even when incomplete, I expect these results to be more accurate than the current terms-agg approach (which is based on a sample of the first N docs encountered). I expect this PR to be significantly faster too.
I'll do some benchmarks on various datasets to measure the speed and quality difference between this PR and the existing terms-agg approach. Comparing speed is easy. Quality is harder, but it should be measurable with the kinds of measures used in traditional search results evaluation: the deviation of a candidate configuration from a "gold standard" set of ordered results. In this case the gold standard is not a human-judged ranking but something we can produce automatically: a strict doc_count popularity order produced by fully accurate terms aggs that are not subject to any time or space limits.
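
To make the "deviation from a gold standard" idea concrete, here is a minimal sketch of one possible measure, overlap at k. The choice of metric and the name `overlap_at_k` are assumptions for illustration, not the measure used in this PR's benchmarks, and the band lists are made-up sample data.

```python
def overlap_at_k(candidate, gold, k):
    """Fraction of the gold-standard top-k terms also found in the candidate top-k.

    1.0 means the candidate returned every gold top-k term (order ignored);
    0.0 means it returned none of them.
    """
    gold_top = gold[:k]
    if not gold_top:
        return 1.0
    candidate_top = set(candidate[:k])
    hits = sum(1 for term in gold_top if term in candidate_top)
    return hits / len(gold_top)

# Hypothetical rankings: gold is the strict doc_count order, candidate is
# what an approximate implementation returned.
gold = ["arctic monkeys", "abba", "ac/dc"]
candidate = ["abba", "ac/dc", "a-ha"]
score = overlap_at_k(candidate, gold, 3)
print(round(score, 3))
```

An order-sensitive measure could be substituted if the position of each suggestion matters as much as its presence.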

@markharwood
Contributor Author

@jimczi @elastic-jb @lizozom @giladgal

I compared this PR with the existing Kibana impl (terms agg with terminate_after: 100k docs).
To compare like with like, I used a sort-by-popularity mode but with a 10k limit on the number of matching terms considered in the index.

I had plans to automate the comparisons but there were some interesting findings just from manually playing with some datasets.

1) The existing Kibana impl is case sensitive

Unless the user has chosen to add normalisation to the mapping at index time, searches are not case insensitive:
[Screenshot: Dallas crimes - Kibana case sensitive]
Note that the existing Kibana impl could be changed to make the include regex it uses to filter terms case insensitive but, as we'll see, there are other concerns.
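
One way an include regex could be made case insensitive, sketched under the assumption that the regex dialect has no case-insensitive flag, is to expand each letter into a two-character class. The function name is hypothetical, and the pattern is only checked here against Python's `re` module, whose escaping rules may differ from the regex syntax the terms agg accepts.

```python
import re

def case_insensitive_prefix_regex(prefix):
    """Build a pattern matching any term that starts with prefix, ignoring case.

    Letters become character classes such as [dD]; everything else is escaped.
    """
    parts = []
    for ch in prefix:
        lower, upper = ch.lower(), ch.upper()
        # Only expand characters with distinct single-character case forms.
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            parts.append("[" + lower + upper + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts) + ".*"

pattern = case_insensitive_prefix_regex("dal")
print(pattern)
print(bool(re.fullmatch(pattern, "DALLAS")))
```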

2) The existing Kibana impl misses stuff

Sampling the first 100k docs offers no guarantee of finding the term the user is looking for, as in this example:
[Screenshot: Dallas crimes - Kibana missing data]
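
The effect can be illustrated with a toy model of doc sampling. This is a simplified sketch, not the terms agg itself: `terms_in_sample` is a hypothetical helper, each doc is modelled as a list of field values, and the crime categories are made-up sample data.

```python
def terms_in_sample(docs, prefix, sample_size):
    """Collect matching terms from only the first sample_size docs."""
    found = set()
    for values in docs[:sample_size]:
        for value in values:
            if value.startswith(prefix):
                found.add(value)
    return sorted(found)

# Five docs share one term; a sixth doc, outside the sample, holds another.
docs = [["BURGLARY"]] * 5 + [["BIKE THEFT"]]
sampled = terms_in_sample(docs, "B", 5)
everything = terms_in_sample(docs, "B", len(docs))
print(sampled)
print(everything)
```

Any term that only occurs in docs beyond the sample boundary is invisible to the sampled search, however specific the prefix is.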

3) The existing Kibana impl is slow in ways we didn't realise

One of my datasets has many values per doc: person profiles have a list of band "likes" related to the user. This acts as a big multiplier on the number of regex tests the terms agg performs on the 100k docs in the sample. This dataset was noticeably slower than my tests on other datasets with similar numbers of docs but single-value fields. The length of the searched prefix doesn't matter: every request took nearly 2 seconds, so it is a fixed cost. The new impl was much faster:
[Screenshot: Music fans - Kibana multivalue fields slow]

4) The new impl can be inaccurate (but there's a simple solution)

For a high-cardinality field and a short search string, the results from the new impl can be inaccurate. In this example search for "a", there are so many bands with that prefix that, rather than considering everything from "aa" to "az", we hit the fixed 10k limit of terms somewhere around "am". We still found popular bands like abba but missed the most popular, arctic monkeys. This incompleteness is flagged in the response:
[Screenshot: Music fans - new api misses top]
However, the solution is simple: when the result is incomplete, the user should be encouraged to "type more". In this case, a single key press makes the results fully accurate and faster:

[Screenshot: Music fans - new api longer strings are same]

This is a big improvement on the existing implementation, where typing more characters improves neither speed nor accuracy.
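
The behaviour described in this section can be sketched as a capped prefix scan over a sorted term list. This is a simplified illustration rather than the PR's implementation: an in-memory sorted list stands in for the index's term dictionary, the function name `enumerate_terms` is hypothetical, and the cap is 3 instead of 10k.

```python
from bisect import bisect_left

def enumerate_terms(sorted_terms, prefix, max_terms):
    """Scan sorted terms from the first possible match, up to max_terms.

    Returns (matches, complete); complete is False when the cap was reached
    while further matching terms remained.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        if len(matches) == max_terms:
            return matches, False
        matches.append(term)
    return matches, True

terms = sorted(["abba", "ac/dc", "aerosmith", "arcade fire", "arctic monkeys", "beck"])
short_result = enumerate_terms(terms, "a", 3)
longer_result = enumerate_terms(terms, "ar", 3)
print(short_result)
print(longer_result)
```

With the short prefix the scan stops before reaching "arctic monkeys" and reports an incomplete result; one more character narrows the range enough for the scan to finish within the cap.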

@markharwood markharwood force-pushed the fix/KQLcompleteBasic branch 4 times, most recently from 8d528a9 to 6869b7f Compare February 16, 2021 15:14
@markharwood
Contributor Author

Two things came out of discussions today:

  1. We should permanently remove the logic in this PR about filtering tiers once the canMatch test supports a new query for testing which tier an index is on.
  2. We should remove the HLRC support until we can move the code out of xpack. Currently we call security APIs directly to avoid DLS indices; to stop doing that we'll need a new security API to filter indices where DLS is on for the current user.

Labels
>feature release highlight :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.14.0 v8.0.0-alpha1
10 participants