Highlighters are slow with thousands of fields #36452

Closed
melissachang opened this issue Dec 10, 2018 · 6 comments
Labels
:Search Relevance/Highlighting How a query matched a document Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@melissachang

I have a multi_match query that searches across all fields. Without highlighting, it takes 2 seconds.

I'd like to use highlight to tell me which fields match. But with highlighting, the query takes several minutes. I gave up after 2 minutes, so I'm not sure whether the query ever finishes. I tried all three highlighter types (unified, plain, fvh).

So instead of using highlight, I'm manually iterating through the source documents and finding the matching field. This takes only about 0.2 seconds.
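For readers who want to try the same workaround, here is a minimal sketch of what that client-side scan could look like. The function name `matching_fields` is hypothetical, and the regular expression is only a rough stand-in for a single-term `phrase_prefix` match; it does not reproduce Elasticsearch's analyzers and it only looks at top-level string fields.

```python
import re
from typing import Any, Dict, List


def matching_fields(source: Dict[str, Any], query: str) -> List[str]:
    """Return names of top-level string fields containing a word that starts with `query`."""
    # Assumption: a word boundary followed by the query text, case-insensitive,
    # approximates a single-term phrase_prefix match closely enough for display.
    pattern = re.compile(r"\b" + re.escape(query), re.IGNORECASE)
    return [
        name
        for name, value in source.items()
        if isinstance(value, str) and pattern.search(value)
    ]
```

It would be applied to each hit's `_source`, for example `matching_fields(hit["_source"], "pre")` for every `hit` in `response["hits"]["hits"]`.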

(Curious about how an Elasticsearch query works -- as part of the query process, does Elasticsearch know which field matches? Say document D contains field F that matches query Q. When Elasticsearch determines that D is a result for Q, does Elasticsearch know that F contains Q, as part of that process?)

Would it be possible to create a highlight type that is fast? It would either return only the field that matched, or the field name and entire contents of the field.

Here are some timing stats. My index has 121k documents. Across all documents there are 7k fields; a particular document will have a small subset of the 7k fields.

~/data-explorer (master): curl "localhost:9200/_cat/indices?v&s=index"
health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   nurse_s_health_study     kMyBnxPVTLC7W5ufetg3UQ   5   1     121701            0      5.6gb          5.6gb
yellow open   nurse_s_health_study_fvh snoFPeGQSWqXZL8rGG-1qg   5   1     121701            0      8.9gb          8.9gb

no highlighter

GET /nurse_s_health_study/_search
{
  "query": {
    "multi_match": {
      "query": "pre",
      "type": "phrase_prefix"
    }
  },
  "size": 10
}

2.2s

unified highlighter

GET /nurse_s_health_study/_search
{
  "query": {
    "multi_match": {
      "query": "pre",
      "type": "phrase_prefix"
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "type": "unified"
      }
    }
  },
  "size": 10
}

Gave up after 2 mins.

plain highlighter

GET /nurse_s_health_study/_search
{
  "query": {
    "multi_match": {
      "query": "pre",
      "type": "phrase_prefix"
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "type": "plain"
      }
    }
  },
  "size": 10
}

Gave up after 4 minutes.

fvh highlighter

I reindexed with term_vector = with_positions_offsets.
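The mapping used for that reindex is not shown in the issue. As a hedged sketch, a create-index body along these lines would store term vectors for every dynamically mapped string field. The helper name `fvh_index_body`, the template name `strings_with_term_vectors`, and the `_doc` type name are assumptions for illustration; the nesting under a type name applies to 6.x-style mappings.

```python
import json
from typing import Any, Dict


def fvh_index_body(doc_type: str = "_doc") -> Dict[str, Any]:
    """Build a create-index body that stores term vectors for every string field."""
    return {
        "mappings": {
            # Assumption: 6.x-style mapping nested under a single type name.
            doc_type: {
                "dynamic_templates": [
                    {
                        # Hypothetical template name, chosen for this example.
                        "strings_with_term_vectors": {
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "text",
                                "term_vector": "with_positions_offsets",
                            },
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(fvh_index_body(), indent=2))
```

Storing positions and offsets for thousands of text fields makes the index larger, which is consistent with the 8.9gb versus 5.6gb sizes listed above.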

GET /nurse_s_health_study_fvh/_search
{
  "query": {
    "multi_match": {
      "query": "pre",
      "type": "phrase_prefix"
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "type": "fvh"
      }
    }
  },
  "size": 10
}

Gave up after 2 minutes.
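The four requests above differ only in the highlight type, so they can be generated and timed from one small script. This is a sketch under assumptions: `search_body` and `timed_search` are hypothetical helper names, and the timeout is a client-side socket timeout, not a limit enforced by Elasticsearch.

```python
import json
import time
import urllib.request
from typing import Any, Dict, Optional


def search_body(highlighter: Optional[str], size: int = 10) -> Dict[str, Any]:
    """Build the multi_match request from this issue, optionally with a highlighter."""
    body: Dict[str, Any] = {
        "query": {"multi_match": {"query": "pre", "type": "phrase_prefix"}},
        "size": size,
    }
    if highlighter is not None:
        # Same wildcard field pattern as in the requests above.
        body["highlight"] = {"fields": {"*": {"type": highlighter}}}
    return body


def timed_search(url: str, body: Dict[str, Any], timeout_s: float = 300.0) -> float:
    """POST the body to a _search URL and return elapsed wall-clock seconds."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.monotonic()
    # Raises an exception if the server does not answer within timeout_s.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        response.read()
    return time.monotonic() - start
```

For example, `timed_search("http://localhost:9200/nurse_s_health_study/_search", search_body("unified"))` would time the unified highlighter request against the local node used in this report.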

@matriv matriv added the :Search Relevance/Highlighting How a query matched a document label Dec 11, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jimczi
Contributor

jimczi commented Dec 11, 2018

@melissachang I tried to reproduce this in 6.4 with the information you gave (7K fields and a few of them per document) and highlighting was fast. We added a shortcut in this version to bypass fields that don't appear in the document when highlighting, so I guess that you're using a version without this enhancement.
However, the issue with lots of fields should not affect the plain and fvh highlighters, so I am not sure this is what caused the slow queries in your case. Are you sure that the timings you reported are not all linked to the unified highlighter? I tried the same reproduction in 6.3.2: the query with the unified highlighter took several minutes to complete, but the other highlighters responded in less than a second.
Could you please tell us which version you used in your tests and check whether the new version (6.5) solves your issue?
I am also a bit sad when I see the title of your issue. We try our best to provide tools that can be used for various use cases, so when something isn't working as you want, it is probably a bug or something wrong in your configuration (using 7k fields doesn't help here ;) ). Can you change the title to reflect the real issue here? Highlighting is slow on mappings with thousands of fields. This is a bug that we hope we fixed in 6.4 for the unified highlighter, but since it shouldn't affect the other highlighters, I suspect that something else is at play here, so we'll need more information.

@melissachang melissachang changed the title from "New highlight type that isn't slow" to "Highlighters are slow with thousands of fields" Dec 11, 2018
@melissachang
Author

Apologies, it didn't occur to me to file a bug report instead of a feature request. (And because I was using the feature request template, I didn't include Elasticsearch version.) I am using the docker image docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.2. I am trying 6.5.3 now. I'll update this issue afterwards.

@melissachang
Author

melissachang commented Dec 12, 2018

(I realized that when I was searching for similar issues, I came across #34015, "Leverage the Lucene's Matches API in a new highlighter type", which influenced me to file a feature request for a new highlighter type.)

With 6.5.3:
unified - Gave up after 4 mins
plain - Worked after 4 mins 15 sec. No logs from elasticsearch.
fvh - Haven't tried yet, haven't reindexed with necessary flags.

Unfortunately my data is private. I'll try to find similar public data and repro. I'll let you know if I do.

@melissachang
Author

Unfortunately, I wasn't able to create an index that reproduces this problem.

Here are some properties of my index:

bash-4.4# curl "localhost:9200/_cat/indices?v&s=index"
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   nurse_s_health_study        FxMn0ugWQ9G3vhM4agzfHw   5   1     121701            0     11.2gb          5.3gb

Across all documents, there are a total of 6624 fields. ~4k of the fields are string (as opposed to numeric). A single document may have 2k fields, give or take.

If anyone comes across a similar index, please try out the above queries.

I work on a tool that indexes Google BigQuery tables. If anyone comes across a public BigQuery table with the above properties (> 6k columns, > 120k rows), I'm happy to run my indexer and try to repro this bug.

@jimczi
Contributor

jimczi commented Apr 4, 2019

As explained in this comment we have a shortcut to bypass highlighting if the field is empty or null in the current document. I tried to reproduce the slow query in >6.4 and it responded in less than a second so I think that something else is at play in your setup. I am going to close this issue but we can revisit if you provide a clear reproduction since the example in the description should be fixed by #32090.
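To illustrate the idea behind that shortcut (this is not the Elasticsearch code from #32090, only a conceptual sketch with a hypothetical function name), the wildcard pattern can expand to thousands of mapped fields, but only those with a value in the current document need to be considered:

```python
from typing import Any, Dict, Iterable, List


def fields_worth_highlighting(
    source: Dict[str, Any], candidate_fields: Iterable[str]
) -> List[str]:
    """Keep only candidate fields that have a non-empty value in this document."""
    kept = []
    for name in candidate_fields:
        value = source.get(name)
        # Skip fields that are missing, null, or empty in this document.
        if value is None or value == "" or value == []:
            continue
        kept.append(name)
    return kept
```

With a few present fields per document the per-hit work stays small. A document with around 2k populated fields, as described above, would still leave many fields to process, which may be one reason the reported timings did not improve.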

@jimczi jimczi closed this as completed Apr 4, 2019
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024