
Add support for returning documents with completion suggester #19536

Merged

Conversation

Contributor

@areek areek commented Jul 21, 2016

This is a followup to #13576 (comment), enriching completion suggestions
with documents.

This commit enables completion suggester to return documents
associated with suggestions. Now the document source is returned
with every suggestion, which respects source filtering options.

In case of suggest queries spanning more than one shard, the
suggest is executed in two phases, where the last phase fetches
the relevant documents from shards, implying executing suggest
requests against a single shard is more performant due to the
document fetch overhead when the suggest spans multiple shards.
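The two-phase flow described above can be sketched in Python (a toy model with a fake shard class and hypothetical method names; this is not the actual Elasticsearch implementation):

```python
# Toy model of the two-phase suggest described above. FakeShard and all
# method names are hypothetical; this is not Elasticsearch code.

class FakeShard:
    """Stands in for one shard: doc_id -> (suggestion_text, score, source)."""

    def __init__(self, docs):
        self.docs = docs

    def suggest(self, prefix, size):
        # Local top-N (doc_id, score) pairs; no _source is loaded here.
        hits = [(doc_id, score)
                for doc_id, (text, score, _src) in self.docs.items()
                if text.lower().startswith(prefix.lower())]
        hits.sort(key=lambda h: h[1], reverse=True)
        return hits[:size]

    def fetch_source(self, doc_id):
        return self.docs[doc_id][2]


def suggest_two_phase(shards, prefix, size):
    # Phase 1: gather each shard's local top-N without fetching _source.
    candidates = []
    for shard_id, shard in enumerate(shards):
        for doc_id, score in shard.suggest(prefix, size):
            candidates.append((score, shard_id, doc_id))

    # Reduce to the global top-N across all shards.
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:size]

    # Phase 2: fetch _source only for the global winners. With a single
    # shard this extra round-trip collapses away, which is why
    # single-shard suggest requests are cheaper.
    return [shards[sid].fetch_source(did) for _score, sid, did in top]
```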

Example completion suggest response:

{
  "song-suggest": [
    {
      "text": "nev",
      "offset": 0,
      "length": 3,
      "options": [ {
          "text": "Nevermind",
          "_index": "music",
          "_type": "song",
          "_id": "52",
          "_score": 52,
          "_source": {
              "song": "Nevermind",
              "artist": "Nirvana"
          }
        }
      ]
    }
  ]
}
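
For context, a request that could produce a response of this shape might look like the following sketch (the completion field name `suggest` is an assumption, and the exact endpoint and syntax should be checked against the completion suggester docs; in 5.x this could go to e.g. `POST music/_suggest`):

```json
{
  "song-suggest": {
    "prefix": "nev",
    "completion": {
      "field": "suggest",
      "size": 5
    }
  }
}
```

Since `_source` is now returned with each option, the usual source filtering options apply to trim the returned fields.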

NOTE: now that we support returning documents with suggestions, we can remove the payload option

relates #10746

}
return groupedSuggestions;
}

public static List<Suggestion<? extends Entry<? extends Option>>> reduce(Map<String, List<Suggest.Suggestion>> groupedSuggestions) {
Contributor Author

We should extract the reduce logic from suggest; the suggest abstraction is not the right place for it, maybe somewhere in SearchPhaseController? I think this will simplify Suggest and help with nuking the crazy generics. Maybe we should simplify this in a subsequent PR?

Contributor

++

@areek areek added the review label Jul 21, 2016
@areek areek force-pushed the enhancement/completion_suggester_documents branch from f390633 to e41f54f Compare July 21, 2016 13:53
@s1monw
Contributor

s1monw commented Jul 21, 2016

@areek thanks so much! It took me a while to figure out what you did with the named suggestions etc. I do think there is an easier way to do what you wanted to do. Instead of using a name-to-ScoreDoc[] map, we can just use offsets into the ScoreDoc[], which would allow us to keep most of the code as is and just resolve the actual documents once we finish up, after the fact. I.e. the query would implicitly use offset 0, and each suggestion would get an offset assigned at the beginning, since we know the size of each suggestion and the query ahead of time.
The code that actually fetches the docs can still simply be a single array, so that code can stay! We might need to add some offset/length parameters to stuff like #fillDocIdsToLoad, but I think that is less intrusive than our current path?

I love the tests man thanks for that
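
The offset scheme sketched in the comment above (one flat ScoreDoc[] with the query implicitly at offset 0, and each suggestion handed an offset/length window) could look roughly like this in Python; all names are hypothetical:

```python
# Sketch of the offsets-into-one-array idea from the comment above.
# All names are hypothetical; this is not the actual Elasticsearch code.

def layout_score_docs(query_docs, suggestions):
    """suggestions is a list of (name, docs) pairs whose sizes are known
    up front. Returns one flat array plus an (offset, length) window per
    consumer; the query implicitly occupies offset 0."""
    flat = list(query_docs)
    windows = {"query": (0, len(query_docs))}
    for name, docs in suggestions:
        windows[name] = (len(flat), len(docs))
        flat.extend(docs)
    return flat, windows


def docs_for(flat, windows, name):
    # Consumers (e.g. a fillDocIdsToLoad-style helper) only need the
    # offset and length instead of a per-name map of doc id lists.
    offset, length = windows[name]
    return flat[offset:offset + length]
```

The fetch code can then keep operating on the single flat array, passing offset/length pairs around instead of a name-to-array map.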

@rmuir
Contributor

rmuir commented Jul 22, 2016

Can we improve the abstractions here? For example, the relatively straightforward ScoreDoc[] and IntArrayList get hidden behind less-obvious layers named SortedDocs and DocIdsToLoad, and neither of these have any javadocs. It makes it hard to tell what is going on with the code.

@areek areek added the WIP label Jul 22, 2016
@areek
Contributor Author

areek commented Jul 22, 2016

@rmuir I am working on removing the DocIdsToLoad abstraction and changing SortedDocs to store offsets instead of a map of IntArrayList, as per @s1monw's comment. Will make sure to add javadocs to clarify. Thanks for the feedback :)

@areek areek force-pushed the enhancement/completion_suggester_documents branch 10 times, most recently from 8484eb3 to 748a036 Compare July 25, 2016 22:08
@areek
Contributor Author

areek commented Jul 25, 2016

@s1monw @rmuir I updated the PR to represent the search and suggest score doc arrays via an offsets array in the SortedDocs abstraction. Would love to get feedback on the current approach!

@@ -74,7 +76,7 @@
protected final AtomicArray<FirstResult> firstResults;
private volatile AtomicArray<ShardSearchFailure> shardFailures;
private final Object shardFailuresMutex = new Object();
-protected volatile ScoreDoc[] sortedShardList;
+protected volatile SortedDocs sortedShardList;
Contributor

Can we rename this to sortedShardDocs?

@@ -35,6 +40,8 @@
import java.util.Map;
import java.util.Set;

import static org.elasticsearch.search.suggest.Suggest.COMPARATOR;
Contributor

The generics here are awful. I regret so much that I added them. We should really clean this up; it makes stuff so complicated for no good reason.

Contributor Author

++, I will try to clean up the generics in Suggest in subsequent PRs

@s1monw
Contributor

s1monw commented Aug 4, 2016

this looks awesome @areek, it's simplified so much from the original patch. All the extra classes and abstractions are gone! Good job! I added some comments about sharing more code etc. I also think we should have more tests on the CompletionSuggester end, like simple unit tests of the filter methods etc.
What I am really missing is a place where we document how the fetching works, and that the order of the docIDs is crucial. I think we should add it where we create the docId array? I think we are super close here, can you remove the WIP label?

@s1monw
Contributor

s1monw commented Aug 4, 2016

@clintongormley can you please check if we need to work more on docs here?

@areek areek removed the WIP label Aug 4, 2016
@areek areek force-pushed the enhancement/completion_suggester_documents branch 4 times, most recently from e71e0ac to 1d29080 Compare August 5, 2016 06:40
@areek
Contributor Author

areek commented Aug 5, 2016

Thanks @s1monw for the review! I have added some documentation about how fetching works in SearchPhaseController#sortedDocs and what is expected by SearchPhaseController#merge, added simple unittests for Suggest and CompletionSuggester (more tests can be added later?).

@areek areek force-pushed the enhancement/completion_suggester_documents branch from 1d29080 to b83b495 Compare August 5, 2016 06:54
@s1monw
Contributor

s1monw commented Aug 5, 2016

LGTM - I think CI failed so I guess you want to merge master in again and run it again! good job.

@areek areek force-pushed the enhancement/completion_suggester_documents branch from b83b495 to 260cc3d Compare August 5, 2016 18:56
@areek areek force-pushed the enhancement/completion_suggester_documents branch from 260cc3d to fee013c Compare August 5, 2016 21:52
@areek areek merged commit 469eb25 into elastic:master Aug 5, 2016
@areek
Contributor Author

areek commented Aug 5, 2016

@clintongormley if you have any feedback regarding the docs, let me know; I will do a separate PR for it.

areek added a commit to areek/elasticsearch that referenced this pull request Aug 8, 2016
The payload option was introduced with the new completion
suggester implementation in v5, as a stopgap solution
to return additional metadata with suggestions.

Now we can return associated documents with suggestions
(elastic#19536) through the fetch phase, using the stored
_source field. The additional fetch phase ensures that we
only fetch the _source for the global top-N suggestions,
instead of fetching the _source of the top results for each shard.
@speedplane
Contributor

I'm using ES 1.7 now and the completion suggester is currently one of my largest memory bottlenecks, so thank you, I'm definitely looking forward to this! Is there documentation yet that describes this new feature?

In my index, to save space, I do not have a _store, so if I want to use this feature, my understanding is that:

  1. I'll have to create a new index that has _store turned on.
  2. During indexing, I put the payloads into this new index. Their ID can be keyed off of their value.
  3. Also when indexing, I point the completion suggester to the documents in the new index.

Do I have this generally correct?

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Sep 27, 2016
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Sep 27, 2016
deprecate _ttl/_timestamp and remove them from our tests as per elastic/elasticsearch#18980 so that migrated 2.x indices do not have their code altered (just yet)

explicit 5.x spec generation

fix failing nodes test because  is removed as per elastic/elasticsearch#19218

fixed failing integration tests due to lang no longer defaulting to groovy elastic/elasticsearch#19960

fields => stored fields, updated failing cathelp tests due to endpoint changing

suggest response is now generic and gets _source returned in accordance with elastic/elasticsearch#19536

histogram key double not long

source filtering include and exclude are now plural

script fields tests did not explicitly specify groovy

search's StoredFields still sent

get task api tests wreaked havoc on the readonly tests

scripted metric did not specify lang

set script.max_compilations_per_minute on node

fix top hits not setting groovy explicitly

multi search now responds with 404 properly

multitermvector tests making sure it took more than 0 is no longer reliable, beta1 is too fast :)

foreach put pipeline processors is no longer an array as per elastic/elasticsearch#19402

revert field=>stored_fields rename on update request

remove property name with dot failure assertion integration test, no longer valid since elastic/elasticsearch#19899

use existing elasticsearch node in test framework could still spawn a new java process

revert field=>stored_fields rename on update request

get pipeline api is now dictionary based as per elastic/elasticsearch#19685

xpack beta1 related fixes

reindex tests not setting all waithandles and taking 3 minutes for no good reason

missing fieldsecurity class

fix post integration test failures unit test failures

add back run as tests now that we send the right header in the beta1 world
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 10, 2016
* removed deleted file from csproj

* make sure code is generated of master after mass picking commits of 5.x branch
@trompx

trompx commented Nov 3, 2016

Hello,

Compared to the old completion suggester, do we still need to optimize indices to prevent duplicate results when we update some documents?

As quoted here: https://discuss.elastic.co/t/autocompletion-suggester-duplicate-record-in-suggestion-return/16950/4

The main reason, why you see the suggestion twice is, that even though a
document is deleted and cannot be found anymore, the suggest data
structures are only cleaned up during merges/optimizations. Running
optimize should fix this.

Anyway awesome work on this new suggester. By any chance have you run any benchmark to compare suggestion speed 2.2 vs 5.0?

@areek
Contributor Author

areek commented Nov 3, 2016

Hey @trompx,

Compared to the old completion suggester, do we still need to optimize indices to prevent duplicate results when we update some documents?

No, we don't need to, as the new completion suggester is near-real-time and is expected to take deleted documents into account (even if the deleted document hasn't been merged away).

Anyway awesome work on this new suggester. By any chance have you run any benchmark to compare suggestion speed 2.2 vs 5.0?

Thanks :), the benchmark for the new implementation can be found here
