Group docIds by segment in FetchPhase to better use LRU cache #57273
Conversation
Thanks @boicehuang, I think this change can help scroll queries that need to fetch multiple sequential documents.
I left one comment; I don't think we need an indirection to scan the doc ids.
FetchSubPhase.HitContext hitContext = new FetchSubPhase.HitContext();
// group docIds by segment in order to better use LRU cache
Map<Integer, List<Integer>> segmentTasks = new HashMap<>();
Map<Integer, Integer> docIdToIndex = new HashMap<>();
for (int index = 0; index < context.docIdsToLoadSize(); index++) {
We could sort the doc ids to load once and move the LeafReaderContext while iterating? In fact, that's what we do already in FetchDocValuesPhase and other fetch sub-phases. Sorting the doc ids would remove the need to sort hits in every sub-phase.
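A minimal sketch of that pattern (illustrative names, not the actual FetchDocValuesPhase code): once the doc ids are sorted, org.apache.lucene.index.ReaderUtil.subIndex only ever moves forward, so each segment is located and read at most once.

List<LeafReaderContext> leaves = reader.leaves(); // reader: the top-level IndexReader
int lastLeafOrd = -1;
LeafReaderContext leaf = null;
for (int docId : sortedDocIds) { // assumes sortedDocIds is sorted ascending
    int leafOrd = ReaderUtil.subIndex(docId, leaves);
    if (leafOrd != lastLeafOrd) {
        leaf = leaves.get(leafOrd); // advances monotonically because the ids are sorted
        lastLeafOrd = leafOrd;
    }
    int segmentDocId = docId - leaf.docBase; // translate global doc id to segment-local
    // ... load stored fields / doc values for segmentDocId from this leaf
}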
We also need to preserve the original order of hits in the response, so sorting the doc ids is only internal (for fetching stored values and executing sub-phases). The original order must be restored when setting the hits in the response.
Thanks for your comment @jimczi. Sorting the doc ids to load seems better. But if we change the order of the doc ids, the order of the search hits will change accordingly, which would cause a lot of test cases to fail, such as testInsideTerms.
But if we change the order of the doc ids, the order of the search hits will change accordingly, which would cause a lot of test cases to fail, such as testInsideTerms.

See my previous comment:
We can change the order in the fetch phase, but we have to preserve the original order in the response. The final hits must be re-sorted based on their original order in the request (context.docIdsToLoad).
Pinging @elastic/es-search (:Search/Search)
@jimczi, I have updated my PR.
Thanks @boicehuang, this looks good to me. Can you add a small comment in FetchSubPhase#hitsExecute to note that the hits are sorted by doc ids?
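For reference, the requested note might look roughly like this (a sketch of the interface javadoc, not the exact wording that landed):

/**
 * Executes the hits-level phase. Note: the hits are passed in sorted by doc id,
 * which improves locality when sub-phases read per-segment data structures.
 */
void hitsExecute(SearchContext context, SearchHit[] hits) throws IOException;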
@elasticmachine ok to test
Sorry, but I missed the fact that we can have duplicate doc ids in the docIdsToLoad array. We use the same array to fetch the top hits and the suggested hits (using suggest) without deduplication. This should be a rare case (asking for top hits and suggested hits in the same request), but it was luckily caught by suggest/40_typed_keys/Test typed keys parameter for suggesters.
So a simple hash map cannot work. We need to preserve the original order and handle duplicates gracefully. One way to do that is to use an org.apache.lucene.util.IntroSorter, which can keep multiple arrays aligned. Don't hesitate if you have any questions, and sorry again for missing this requirement in the first review.
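For illustration, a sorter that keeps a doc-id array and an original-index array aligned could look roughly like this (a sketch with assumed array names docIds and originalIndex, not the PR's final code):

// Sorting by doc id while swapping both arrays together keeps duplicates
// and original positions recoverable.
new org.apache.lucene.util.IntroSorter() {
    int pivot;

    @Override
    protected void swap(int i, int j) {
        int tmp = docIds[i]; docIds[i] = docIds[j]; docIds[j] = tmp;
        tmp = originalIndex[i]; originalIndex[i] = originalIndex[j]; originalIndex[j] = tmp;
    }

    @Override
    protected int compare(int i, int j) {
        return Integer.compare(docIds[i], docIds[j]);
    }

    @Override
    protected void setPivot(int i) {
        pivot = docIds[i];
    }

    @Override
    protected int comparePivot(int j) {
        return Integer.compare(pivot, docIds[j]);
    }
}.sort(0, docIds.length);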
Thanks, @jimczi. Can we use a HashMap<Integer, ArrayList> to preserve the original order of hits? The ArrayList is used to store the different indexes of duplicate doc ids.
The suggesters tests pass in my local environment. I have updated my PR, but why do the following checks still fail?
I left more comments.
hits[index] = searchHit;
sortedHits[index] = searchHit;
for (int i = 0; i < docIdToIndex.get(docId).size(); i++) {
    hits[docIdToIndex.get(docId).get(i)] = searchHit;
You can retrieve the array list once?
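That is, something along these lines (sketch):

// Look up the positions list once per doc id instead of on every iteration.
List<Integer> positions = docIdToIndex.get(docId);
for (int i = 0; i < positions.size(); i++) {
    hits[positions.get(i)] = searchHit;
}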
Arrays.sort(sortedDocIds);

// preserve the original order of hits in inverted index
Map<Integer, ArrayList<Integer>> docIdToIndex = new HashMap<>();
I wonder if it'd be better to use a static inner class and a custom comparator?
It seems better to preserve the original order with a custom comparator than with a hash map. I am going to update it.
You need to merge master into your branch. We've upgraded the Lucene version in master and the 7.x branch, so the bwc tests are failing in your PR. You also have two checkstyle errors:
@@ -172,7 +191,7 @@ public void execute(SearchContext context) {
        }

        for (FetchSubPhase fetchSubPhase : fetchSubPhases) {
-           fetchSubPhase.hitsExecute(context, hits);
+           fetchSubPhase.hitsExecute(context, sortedHits);
It would be nice if these fetch sub-phases, which implement hitsExecute (plural), could also benefit from locality. Currently they each loop through the hits array separately, so any cached data from processing a hit may be lost by the time the next fetch phase is run.
I think this is a distinct idea from this PR though, so I filed the separate issue #58155.
Thanks for updating, I left one suggestion.
int[] docIds = Arrays.copyOfRange(context.docIdsToLoad(), context.docIdsToLoadFrom(), context.docIdsToLoadSize());
int[] sortedDocIds = docIds.clone();
Arrays.sort(sortedDocIds);
Sorry I wasn't clear. I was more thinking of something like this:
static class DocIdAndIndex implements Comparable<DocIdAndIndex> {
    final int docId;
    final int index;

    DocIdAndIndex(int docId, int index) {
        this.docId = docId;
        this.index = index;
    }

    @Override
    public int compareTo(DocIdAndIndex o) {
        return Integer.compare(docId, o.docId);
    }
}

....

DocIdAndIndex[] docs = new DocIdAndIndex[context.docIdsToLoadSize()];
for (int index = 0; index < context.docIdsToLoadSize(); index++) {
    docs[index] = new DocIdAndIndex(context.docIdsToLoad()[context.docIdsToLoadFrom() + index], index);
}
Arrays.sort(docs);
You can then use docs to retrieve the original index, and you don't have the array twice?
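Concretely, the fetch loop could then fill both arrays in one pass, roughly like this (a sketch reusing the names above; createSearchHit is a hypothetical stand-in for the hit-building code):

SearchHit[] hits = new SearchHit[docs.length];       // original request order
SearchHit[] sortedHits = new SearchHit[docs.length]; // doc-id order, passed to the sub-phases
for (int index = 0; index < docs.length; index++) {
    SearchHit searchHit = createSearchHit(context, docs[index].docId); // hypothetical helper
    sortedHits[index] = searchHit;
    hits[docs[index].index] = searchHit; // restore the hit to its requested position
}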
Thanks @jimczi. I have one question here. If we use a custom comparator, the average performance of timsort or quicksort is O(n log n), but the complexity of constructing a hashmap is O(n). Maybe we can get better performance if we use the latter?
I don't see how you'd avoid the initial sort by doc ids? The proposed solution requires sorting the array once and avoids building a hashmap; isn't that better?
In the first commit, I built the hashmap from the array without sorting it; it only takes O(n) to iterate the array once. I think using a hashmap may be better? Do we have to do the sorting? Does using a hashmap have a different impact on every sub-phase?
We still need to provide the array of SearchHit sorted by doc ids to the sub fetch phases (hitsExecute), so I don't see how the hashmap would be enough. The current change sorts the array once before executing the sub fetch phases, so it's an enhancement. Also note that the array is limited to 10k entries by default, since we have a soft limit on the number of hits that can be retrieved.
@jimczi, sorry for the late reply. Using a comparator is the best way to deal with it. I have updated my PR. Can you have a look?
@elasticmachine ok to test
@elasticmachine run elasticsearch-ci/2
@elasticmachine ok to test
LGTM, thanks @boicehuang!
Group docIds by segment in FetchPhase to better use LRU cache (#57273): this change sorts the docIdsToLoad once instead of in each sub-phase.
Currently, the doc-ids array in FetchPhase is unordered. Suppose there are 6 docs in 3 segments A, B, and C in a one-node cluster, and the fetch order of segments is A, B, C, B, C, A. This makes poor use of the system cache: the data of segment A loaded from disk the first time is evicted by reads of segments B and C (or by other read operations), so the second fetch of segment A still has to read from disk.
This PR addresses the issue by grouping doc ids by segment. We have verified it with range queries; fetch performance is better than before.
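As a toy illustration of why sorting groups doc ids by segment (made-up doc id ranges, not measured data): each Lucene segment owns a contiguous range of global doc ids starting at its docBase, so sorting the global ids clusters all reads for one segment together.

// Hypothetical example: segments A, B, C with docBase 0, 100, 200.
int[] docIdsToLoad = {5, 105, 205, 110, 210, 7}; // visits segments A, B, C, B, C, A
Arrays.sort(docIdsToLoad);                       // {5, 7, 105, 110, 205, 210}
// After sorting: A, A, B, B, C, C -- each segment is read while its data is still cached.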