Group docIds by segment in FetchPhase to better use LRU cache #57273

Merged · 14 commits · Jun 30, 2020

Conversation

@boicehuang (Contributor) commented May 28, 2020

Currently, the doc-ids array in FetchPhase is unordered. Suppose there are 6 docs in 3 segments A, B, and C in a one-node cluster, and that the fetch order of segments is A, B, C, B, C, A. This makes poor use of the system cache: the data for segment A, loaded from disk on the first fetch, is overwritten by reads of segments B, C, or other operations, so the second fetch of segment A has to read from disk again.

This PR addresses the issue by grouping doc ids by segment. We have verified the change with range queries; fetch performance is better than before.
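
To illustrate the idea with a minimal, hypothetical sketch (not the PR's code): Lucene assigns each segment a contiguous range of global doc ids (segment base plus segment-local id), so sorting the global doc ids is enough to group them by segment. The segment layout below is invented for illustration.

    import java.util.Arrays;

    public class GroupBySegmentSketch {
        public static void main(String[] args) {
            // Hypothetical layout: segment A covers ids 0-99, B covers 100-199,
            // C covers 200-299; the request touches them in order A, B, C, B, C, A.
            int[] docIdsToLoad = {3, 105, 210, 107, 215, 5};
            int[] sorted = docIdsToLoad.clone();
            Arrays.sort(sorted);
            // Prints [3, 5, 105, 107, 210, 215]: A, A, B, B, C, C, so each
            // segment's data is read while it is still hot in the cache.
            System.out.println(Arrays.toString(sorted));
        }
    }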

@jimczi (Contributor) left a comment

Thanks @boicehuang, I think this change can help scroll queries that need to fetch multiple sequential documents.
I left one comment; I don't think we need an indirection to scan the doc ids.

FetchSubPhase.HitContext hitContext = new FetchSubPhase.HitContext();
// group docIds by segment in order to better use LRU cache
Map<Integer, List<Integer>> segmentTasks = new HashMap<>();
Map<Integer, Integer> docIdToIndex = new HashMap<>();
for (int index = 0; index < context.docIdsToLoadSize(); index++) {
@jimczi (Contributor) commented:

We could sort the doc ids to load once and advance the LeafReaderContext while iterating? In fact, that's what we already do in FetchDocValuesPhase and other fetch sub-phases. Sorting the doc ids would remove the need to sort hits in every sub-phase.
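
A minimal sketch of that pattern (assuming sorted global doc ids and an open IndexReader; not the PR's exact code): advance to the next leaf only when a doc id crosses the current leaf's range, instead of looking the leaf up for every hit.

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.ReaderUtil;

    static void fetchInDocIdOrder(IndexReader reader, int[] sortedDocIds) throws IOException {
        List<LeafReaderContext> leaves = reader.leaves();
        LeafReaderContext leaf = null;
        int endDoc = 0; // first global doc id past the current leaf
        for (int docId : sortedDocIds) {
            if (leaf == null || docId >= endDoc) {
                // Binary-search the leaf containing docId, then cache its range.
                leaf = leaves.get(ReaderUtil.subIndex(docId, leaves));
                endDoc = leaf.docBase + leaf.reader().maxDoc();
            }
            int segmentLocalId = docId - leaf.docBase;
            // ... load stored fields / doc values for segmentLocalId from leaf ...
        }
    }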

@jimczi (Contributor) commented:

We also need to preserve the original order of hits in the response, so sorting the doc ids is only internal (for fetching stored values and executing sub-phases). The original order must be restored when setting the hits in the response.

@boicehuang (Contributor, Author) commented:

Thanks for your comment @jimczi. Sorting the doc ids to load seems better, but if we change the order of the doc ids, the order of the search hits changes with it, which would cause a lot of test cases to fail, such as testInsideTerms.

@jimczi (Contributor) commented:

> But if we change the order of the doc ids, the order of the search hits changes with it, which would cause a lot of test cases to fail, such as testInsideTerms.

See my previous comment. We can change the order in the fetch phase, but we have to preserve the original order in the response: the final hits must be re-sorted based on their original order in the request (context.docIdsToLoad).
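
A hedged sketch of that bookkeeping (names like fetchHit are hypothetical, not the PR's code): sort a permutation of positions by doc id, fetch in sorted order, and write each hit back to its original slot.

    // Doc ids as they appear in the request, duplicates allowed.
    int[] requestOrder = {210, 3, 107, 3};
    Integer[] perm = {0, 1, 2, 3}; // positions 0..n-1 into requestOrder
    java.util.Arrays.sort(perm, (a, b) -> Integer.compare(requestOrder[a], requestOrder[b]));
    SearchHit[] hits = new SearchHit[requestOrder.length];
    for (int p : perm) {
        // fetchHit is a hypothetical helper that loads one document;
        // hits are fetched in doc-id order but stored in request order.
        hits[p] = fetchHit(requestOrder[p]);
    }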

@dliappis dliappis added the :Search/Search Search-related issues that do not fall into other categories label May 28, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label May 28, 2020
@boicehuang (Contributor, Author) commented:

@jimczi, I have updated my PR.

@jimczi (Contributor) left a comment

Thanks @boicehuang, this looks good to me. Can you add a small comment in FetchSubPhase#hitsExecute to note that the hits are sorted by doc ids?
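
A possible wording for that reminder (purely illustrative; the exact signature in the PR may differ):

    /**
     * Executes the fetch logic once over the whole hits array.
     * NOTE: {@code hits} is sorted by the underlying Lucene doc id, not by the
     * order of the hits in the request; the original order is restored later
     * when the response is built.
     */
    default void hitsExecute(SearchContext context, SearchHit[] hits) throws IOException {
    }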

@boicehuang boicehuang requested a review from jimczi June 8, 2020 09:20
@jimczi (Contributor) commented Jun 8, 2020:

@elasticmachine ok to test

@jimczi (Contributor) left a comment

Sorry, but I missed the fact that we can have duplicate doc ids in the docIdsToLoad array. We use the same array to fetch the top hits and the suggested hits (using suggest) without deduplication. This should be a rare case (asking for top hits and suggested hits in the same request), but it was luckily caught by suggest/40_typed_keys/Test typed keys parameter for suggesters.
So a simple hash map cannot work: we need to preserve the original order and handle duplicates gracefully. One way to do that is to use an org.apache.lucene.util.IntroSorter, which allows keeping multiple arrays aligned while sorting. Don't hesitate to ask if you have any questions, and sorry again for missing this requirement in the first review.
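
A hedged sketch of that approach (assuming parallel int arrays; not the PR's final code): IntroSorter lets you sort the doc ids while swapping a second array of original positions in lockstep, so duplicates survive and the request order can be restored afterwards.

    import org.apache.lucene.util.IntroSorter;

    // Sorts docIds ascending while keeping originalIndex aligned.
    static void sortAligned(int[] docIds, int[] originalIndex) {
        new IntroSorter() {
            int pivot;

            @Override
            protected void swap(int i, int j) {
                int tmp = docIds[i]; docIds[i] = docIds[j]; docIds[j] = tmp;
                tmp = originalIndex[i]; originalIndex[i] = originalIndex[j]; originalIndex[j] = tmp;
            }

            @Override
            protected int compare(int i, int j) {
                return Integer.compare(docIds[i], docIds[j]);
            }

            @Override
            protected void setPivot(int i) {
                pivot = docIds[i];
            }

            @Override
            protected int comparePivot(int j) {
                return Integer.compare(pivot, docIds[j]);
            }
        }.sort(0, docIds.length);
    }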

@boicehuang (Contributor, Author) commented Jun 14, 2020:

Thanks, @jimczi. Can we use a HashMap<Integer, ArrayList<Integer>> to preserve the original order of hits? The ArrayList stores the indexes of duplicate doc ids.

            // preserve the original order of hits (handles duplicate doc ids)
            Map<Integer, ArrayList<Integer>> docIdToIndex = new HashMap<>();
            for (int index = 0; index < context.docIdsToLoadSize(); index++) {
                int docId = context.docIdsToLoad()[context.docIdsToLoadFrom() + index];
                docIdToIndex.computeIfAbsent(docId, k -> new ArrayList<>()).add(index);
            }

The suggester tests pass in my local environment. I have updated my PR, but why do the following checks still fail?

@jimczi (Contributor) left a comment

I left more comments.

hits[index] = searchHit;
sortedHits[index] = searchHit;
for (int i = 0; i < docIdToIndex.get(docId).size(); i++) {
    hits[docIdToIndex.get(docId).get(i)] = searchHit;
@jimczi (Contributor) commented:

You can retrieve the array list once?
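
In code, the suggestion amounts to hoisting the lookup out of the loop (a sketch against the snippet above):

    // Look the list up once instead of calling docIdToIndex.get(docId)
    // twice per iteration.
    List<Integer> indexes = docIdToIndex.get(docId);
    for (int i = 0; i < indexes.size(); i++) {
        hits[indexes.get(i)] = searchHit;
    }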

Arrays.sort(sortedDocIds);

// preserve the original order of hits (handles duplicate doc ids)
Map<Integer, ArrayList<Integer>> docIdToIndex = new HashMap<>();
@jimczi (Contributor) commented:

I wonder if it'd be better to use a static inner class and a custom comparator?

@boicehuang (Contributor, Author) commented:

It seems better to preserve the original order with a custom comparator than with a hash map. I am going to update it.

@jimczi (Contributor) commented Jun 15, 2020:

> I have updated my PR, but why do the following checks still fail?

You need to merge master into your branch: we've upgraded the Lucene version in master and the 7.x branch, so the bwc tests are failing in your PR. You also have two checkstyle errors:


[ant:checkstyle] [ERROR] /dev/shm/elastic+elasticsearch+pull-request-1/server/src/main/java/org/elasticsearch/search/fetch/subphase/FetchDocValuesPhase.java:42:8: Unused import - java.util.HashMap. [UnusedImports]
[ant:checkstyle] [ERROR] /dev/shm/elastic+elasticsearch+pull-request-1/server/src/main/java/org/elasticsearch/search/fetch/subphase/ScriptFieldsPhase.java:35:8: Unused import - java.util.HashMap. [UnusedImports]

@@ -172,7 +191,7 @@ public void execute(SearchContext context) {
 }

 for (FetchSubPhase fetchSubPhase : fetchSubPhases) {
-    fetchSubPhase.hitsExecute(context, hits);
+    fetchSubPhase.hitsExecute(context, sortedHits);
@jtibshirani (Contributor) commented Jun 15, 2020:

It would be nice if the fetch sub-phases that implement hitsExecute (plural) could also benefit from locality. Currently they each loop through the hits array separately, so any cached data from processing a hit may be lost by the time the next fetch sub-phase runs.

I think this is a distinct idea from this PR though, so I filed the separate issue #58155.

@jimczi (Contributor) left a comment

Thanks for updating, I left one suggestion.

int[] docIds = Arrays.copyOfRange(context.docIdsToLoad(), context.docIdsToLoadFrom(), context.docIdsToLoadSize());
int[] sortedDocIds = docIds.clone();
Arrays.sort(sortedDocIds);

@jimczi (Contributor) commented:

Sorry I wasn't clear. I was more thinking of something like this:

static class DocIdAndIndex implements Comparable<DocIdAndIndex> {
    final int docId;
    final int index;

    DocIdAndIndex(int docId, int index) {
        this.docId = docId;
        this.index = index;
    }

    @Override
    public int compareTo(DocIdAndIndex o) {
        return Integer.compare(docId, o.docId);
    }
}
...
DocIdAndIndex[] docs = new DocIdAndIndex[context.docIdsToLoadSize()];
for (int index = 0; index < context.docIdsToLoadSize(); index++) {
    docs[index] = new DocIdAndIndex(context.docIdsToLoad()[context.docIdsToLoadFrom() + index], index);
}
Arrays.sort(docs);

You can then use docs to retrieve the original index, and you don't need the array twice?
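
Completing that sketch (fetchHit is a hypothetical helper, not the PR's code), the original position travels with each doc id through the sort:

    SearchHit[] hits = new SearchHit[docs.length];
    for (DocIdAndIndex doc : docs) {
        // Fetch in ascending doc-id order for cache locality, then place
        // the hit at its original position in the request.
        hits[doc.index] = fetchHit(doc.docId);
    }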

@boicehuang (Contributor, Author) commented Jun 18, 2020:

Thanks @jimczi. I have one question here. With a custom comparator, the average complexity of TimSort or quicksort is O(n log n), but constructing a hash map is O(n). Could we get better performance with the latter?

@jimczi (Contributor) commented:

I don't see how you'd avoid the initial sort by doc ids. The proposed solution requires sorting the array once and avoids building a hash map; isn't that better?

@boicehuang (Contributor, Author) commented Jun 18, 2020:

In the first commit, I built the hash map from the array without sorting it, which only takes O(n) to iterate the array once. I think using a hash map may be better? Do we have to do the sorting? Does using a hash map have a different impact on each sub-phase?

@jimczi (Contributor) commented:

We still need to provide the array of SearchHit sorted by doc ids to the sub fetch phases (hitsExecute), so I don't see how the hash map alone would be enough. The current change sorts the array once before executing the sub fetch phases, so it's an enhancement. Also note that the array is limited to 10k entries by default, since we have a soft limit on the number of hits that can be retrieved.

@boicehuang (Contributor, Author) commented:
@jimczi, sorry for the late reply. Using a comparator is the best way to deal with this. I have updated my PR; can you have a look?

@jimczi (Contributor) commented Jun 29, 2020:

@elasticmachine ok to test

@elastic elastic deleted a comment from elasticmachine Jun 29, 2020
@jimczi (Contributor) commented Jun 29, 2020:

@elasticmachine run elasticsearch-ci/2

@jimczi (Contributor) commented Jun 29, 2020:

@elasticmachine ok to test

@jimczi (Contributor) left a comment

LGTM, thanks @boicehuang!

@jimczi jimczi merged commit fa192a7 into elastic:master Jun 30, 2020
jimczi pushed a commit that referenced this pull request Jun 30, 2020
Group docIds by segment in FetchPhase to better use LRU cache (#57273)

This change sorts the docIdsToLoad once instead of in each sub-phase.
Labels: >enhancement, :Search/Search, Team:Search, v7.9.0, v8.0.0-alpha1