Improve Lucene's I/O concurrency #13179
Comments
This adds `IndexInput#prefetch`, which is an optional operation that instructs the `IndexInput` to start fetching bytes from storage in the background. These bytes will be picked up by follow-up calls to the `IndexInput#readXXX` methods. In the future, this will help Lucene move from a maximum of one I/O operation per search thread to one I/O operation per search thread per `IndexInput`. Typically, when running a query on two terms, the I/O into the terms dictionary is sequential today. In the future, we would ideally do these I/Os in parallel using this new API. Note that this will require API changes to some classes including `TermsEnum`. I settled on this API because it's simple and wouldn't require making all Lucene APIs asynchronous to take advantage of extra I/O concurrency, which I worry would make the query evaluation logic too complicated. Currently, only `NIOFSDirectory` implements this new API. I played with `MMapDirectory` as well and found an approach that worked better in the benchmark I've been playing with, but I'm not sure it makes sense to implement this API on this directory as it either requires adding an explicit buffer on `MMapDirectory`, or forcing data to be loaded into the page cache even though the OS may have decided that it's not a good idea due to too few cache hits. This change will require follow-ups to start using this new API when working with terms dictionaries, postings, etc. Relates apache#13179
This adds `IndexInput#prefetch`, which is an optional operation that instructs the `IndexInput` to start fetching bytes from storage in the background. These bytes will be picked up by follow-up calls to the `IndexInput#readXXX` methods. In the future, this will help Lucene move from a maximum of one I/O operation per search thread to one I/O operation per search thread per `IndexInput`. Typically, when running a query on two terms, the I/O into the terms dictionary is sequential today. In the future, we would ideally do these I/Os in parallel using this new API. Note that this will require API changes to some classes including `TermsEnum`. I settled on this API because it's simple and wouldn't require making all Lucene APIs asynchronous to take advantage of extra I/O concurrency, which I worry would make the query evaluation logic too complicated. This change will require follow-ups to start using this new API when working with terms dictionaries, postings, etc. Relates #13179 Co-authored-by: Uwe Schindler <[email protected]>
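A minimal sketch of the prefetch-then-read contract these commit messages describe. `SketchInput` and its fields are hypothetical stand-ins written for illustration; this is not Lucene's `IndexInput`, whose actual signature and behavior may differ. The "storage" is an in-memory byte array and the background fetch is simulated with a `CompletableFuture`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class PrefetchSketch {

  /** Hypothetical stand-in for an input with an optional prefetch hint (assumption, not Lucene code). */
  static final class SketchInput {
    private final byte[] storage; // pretend this lives on slow storage
    private final Map<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();
    final AtomicInteger prefetchHits = new AtomicInteger();

    SketchInput(byte[] storage) {
      this.storage = storage;
    }

    private byte[] load(long offset, int length) {
      byte[] out = new byte[length];
      System.arraycopy(storage, (int) offset, out, 0, length);
      return out;
    }

    /** Optional hint: start loading in the background and return immediately. */
    void prefetch(long offset, int length) {
      pending.computeIfAbsent(offset, o -> CompletableFuture.supplyAsync(() -> load(o, length)));
    }

    /** Picks up prefetched bytes when they cover the request, otherwise loads synchronously. */
    byte[] readBytes(long offset, int length) {
      CompletableFuture<byte[]> f = pending.remove(offset);
      if (f != null) {
        byte[] fetched = f.join(); // waits only if the background load is still running
        if (fetched.length >= length) {
          prefetchHits.incrementAndGet();
          return Arrays.copyOf(fetched, length);
        }
      }
      return load(offset, length);
    }
  }

  public static void main(String[] args) {
    // Two independent inputs, e.g. one per term being looked up (illustrative data).
    SketchInput a = new SketchInput("apache".getBytes(StandardCharsets.UTF_8));
    SketchInput b = new SketchInput("lucene".getBytes(StandardCharsets.UTF_8));

    // Issue both hints first, so both loads can be in flight at the same time...
    a.prefetch(0, 6);
    b.prefetch(0, 6);

    // ...then read; the calling code stays synchronous.
    System.out.println(new String(a.readBytes(0, 6), StandardCharsets.UTF_8));
    System.out.println(new String(b.readBytes(0, 6), StandardCharsets.UTF_8));
  }
}
```

The point of the shape is that the caller's control flow stays synchronous: only the ordering changes (all hints first, then all reads), which is what lets a single search thread have more than one I/O outstanding.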
@jpountz This looks interesting. Along similar lines, in OpenSearch we are working on building a warm index, which will not have all of its data available locally at all times and will download data on demand from a remote store at search time. To improve search performance, I am thinking about a mechanism to prefetch blocks of data from the remote store to local disk before they are accessed. In addition, aggregations in OpenSearch can also benefit from this. The aggregation collector performs the doc value lookup on the fields of the matched documents. This happens sequentially, document by document, as driven by collectors. This could be changed to first collect all the matching documents, then prefetch the blocks for the matched documents, followed by the actual collection in AggregationCollector (similar to
Thanks for looking at this!
This should work, though I'm wary of making it the new way that collectors need to interact with doc values if they want to be able to take advantage of prefetching. E.g. we also have collectors for top hits sorted by field, where collecting all hits ahead of time would kill the benefits of dynamic pruning. I wonder if there are approaches that don't require collecting all matches up-front? Access to doc values is forward-only, so prefetching the first page only and then relying on some form of read ahead would hopefully do what we need?
Couldn't we do both with the suggested prefetch operation on For the general top doc collector / bulk scorer, the doc value prefetch could look "just ahead", prefetching as it goes (maybe we buffer the next few doc IDs from the first-phase scorer and prefetch those?). Am I correct in understanding that prefetching an already-fetched page is (at least approximately) a no-op? If we want to collect all the doc IDs during the collect phase (as Lucene's |
To add to the above, keeping it an optional API will let different collectors type also decide if it wants to make the |
We tried to make it cheap (see e.g. the logic to disable calling madvise after a while if the data seems to fully fit in the page cache anyway), but so is reading a doc-value that fits in the page cache via To avoid this per-doc overhead, I imagine that we would need to add some prefetch() API on I'm still debating with myself whether this would be valuable enough vs. just giving a hint to the IndexInput that it would be a good idea to read ahead because it's being used by postings or doc values that have a forward-only access pattern.
FWIW this would break a few things, e.g. we have collectors that only compute the score when needed (e.g. when sorting by field then score). But if we need to buffer docs up-front, then we don't know at this point in time if scores are going to be needed or not, so we need to score more docs. Maybe it's still the right trade-off, I'm mostly pointing out that this would be a bigger trade-off than what we've done for prefetching until now.
Maybe such an approach would be ok for application code that can make assumptions about how much page cache it has, but I'm expecting Lucene code to avoid ever prefetching many MBs at once, because this increases chances that the first bytes that got prefetched got paged out before we could use them. This is one reason why I like the approach of just giving a hint to the If one of you would like to take a stab at an approach to prefetching doc values, I'd be happy to look at a PR. |
Ya I was thinking that this
If I understand correctly, the read ahead mechanism in
Sure, I can take a stab for say |
This is correct. For the record, this wastage may sound disappointing, but it also helps with making I/O more concurrent. For instance, say you have a conjunction on two clauses: "a AND b" (which could be postings, but also doc-value-based iterators, e.g. via a
This sounds fine, we need to start somewhere. FWIW the main consumers of the |
…ess pattern.

This introduces a new API that allows directories to optimize access to `IndexInput`s that have a forward-only access pattern by reading ahead of the current position. It would be applicable to:
- Postings lists,
- Doc values,
- Norms,
- Points.

Relates apache#13179
@jpountz Thanks for sharing this. Originally I was thinking the prefetch optimization only in collect phase but I am trying to understand if it can be used in iterators side of things as well. To understand better I am looking into So far my general understanding is all the scoring and collection of docs via Collectors happens in the method Based on my above understanding, I am thinking below and would love your feedback
I think that for scenarios like 2 and 3 above, where we know the exact doc matches, an explicit prefetch could be more useful than readahead.
This can be done, but I'd note that this would be a significant change to our APIs since
FWIW one thing that is on my mind is that both postings and doc values take in the order of 1 or 2 bytes per document. So even a query that matches 0.1% of docs, evenly distributed in the doc ID space, would still end up fetching all pages in practice. So a very smart prefetching may only perform better than naive prefetching in the following cases:
But then I'd still expect some naive readahead logic to perform ok in such cases. For the extremely sparse case, it would fetch up to X times too many pages, where X is the number of pages that get read ahead. For reasonable values of X, this should be ok. The other thing that is on my mind is that this sort of approach allows us to do it completely at the OS level, which gives additional efficiency.
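The estimate above can be checked with a small calculation. This is only a sketch under assumed numbers: a 4096-byte page, 10 million documents, and X interpreted as the number of extra pages fetched after each needed page. None of these values come from the discussion itself.

```java
import java.util.BitSet;

class ReadaheadSketch {

  /** Number of pages needed to hold numDocs entries of bytesPerDoc bytes each. */
  static int totalPages(long numDocs, int bytesPerDoc, int pageSize) {
    return (int) ((numDocs * bytesPerDoc + pageSize - 1) / pageSize);
  }

  /** Pages holding at least one match when every stride-th doc matches (evenly distributed). */
  static BitSet pagesNeeded(long numDocs, int bytesPerDoc, int pageSize, long stride) {
    BitSet pages = new BitSet(totalPages(numDocs, bytesPerDoc, pageSize));
    for (long doc = 0; doc < numDocs; doc += stride) {
      pages.set((int) (doc * bytesPerDoc / pageSize));
    }
    return pages;
  }

  /** Pages fetched if each needed page also pulls in `extra` following pages (naive readahead). */
  static BitSet pagesFetched(BitSet needed, int total, int extra) {
    BitSet fetched = new BitSet(total);
    for (int p = needed.nextSetBit(0); p >= 0; p = needed.nextSetBit(p + 1)) {
      fetched.set(p, Math.min(total, p + extra + 1)); // end index is exclusive
    }
    return fetched;
  }

  public static void main(String[] args) {
    long numDocs = 10_000_000L; // assumed index size
    int pageSize = 4096;        // assumed page size

    // 0.1% of docs match, 1 byte per doc: how many pages contain a match?
    int total = totalPages(numDocs, 1, pageSize);
    int dense = pagesNeeded(numDocs, 1, pageSize, 1_000).cardinality();
    System.out.println("0.1% match: " + dense + " of " + total + " pages needed");

    // Extremely sparse case, 0.001% of docs match, with 4 extra pages of readahead.
    BitSet sparse = pagesNeeded(numDocs, 1, pageSize, 100_000);
    int fetched = pagesFetched(sparse, total, 4).cardinality();
    System.out.println("0.001% match: " + sparse.cardinality() + " needed, " + fetched + " fetched");
  }
}
```

With these assumptions, a page holds more documents than the gap between matches in the 0.1% case, so every page is needed and smarter prefetching has nothing to skip. Only in the much sparser case does readahead fetch pages that go unused, and the waste scales with the readahead window.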
Thanks for explaining this to me. It seems this would mean changing the iteration and scoring behavior to work on a range of docs rather than one doc at a time (which is the current behavior in Lucene). It would probably work fine for collectors that do not require any scoring, but that is not the general use case, so exact prefetching would be limited to collectors only.
Agreed. I had similar thought with readahead but was trying to see if there are ways to avoid X. But as you said in general cases it will probably end up fetching all the pages anyways. For read ahead in DocValues case, I am thinking that when docValue is fetched for current doc, probably we can provide the hint there to the IndexInput to perform readahead. This can be useful for IndexInputs to perform some read ahead which interacts with remote store. However, in default case, I think the OS will take care of readahead on the read from a specific offset so it could be NoOp there. But same could be done even when any seek is happening on an IndexInput if it knows that it should follow the sequential access pattern. So I guess your latest PR is providing that hint and probably we don't need any separate |
There can always be follow-up improvements, but I think it's time to close this issue. Most usage of Lucene is now able to perform multiple concurrent I/O operations from the same search thread. This has been released in Lucene 10.
Description
Currently, Lucene's I/O concurrency is bound by the search concurrency. If `IndexSearcher` runs on N threads, then Lucene will never perform more than N I/Os concurrently. Unless you significantly overprovision your search thread pool, which is bad for other reasons, Lucene will bottleneck on I/O latency without even maxing out the IOPS of the host.

I don't think that Lucene should fully embrace asynchronousness in its APIs, or query evaluation would become overly complicated. But I still expect that we have a lot of room for improvement to allow each search thread to perform multiple I/Os concurrently under the hood when needed.
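To make the bound above concrete, here is a back-of-the-envelope sketch based on Little's law (throughput = concurrency / latency). The thread count, per-I/O latency, and number of in-flight I/Os per thread are illustrative assumptions, not measurements from this issue.

```java
class IoConcurrencyBound {

  /**
   * Upper bound on I/Os per second when each search thread keeps at most
   * inFlightPerThread I/Os outstanding and each I/O takes latencyMicros.
   */
  static double maxIops(int searchThreads, int inFlightPerThread, long latencyMicros) {
    return (double) searchThreads * inFlightPerThread * 1_000_000L / latencyMicros;
  }

  public static void main(String[] args) {
    int threads = 8;          // assumed search thread pool size
    long latencyMicros = 200; // assumed latency of one random read

    // One blocking I/O per thread at a time: the situation described above.
    System.out.printf("1 in flight per thread: %.0f IOPS%n", maxIops(threads, 1, latencyMicros));

    // Four I/Os in flight per thread, e.g. through prefetch hints on several inputs.
    System.out.printf("4 in flight per thread: %.0f IOPS%n", maxIops(threads, 4, latencyMicros));
  }
}
```

Under these assumptions the ceiling is set by threads and latency alone. If the host can sustain more IOPS than that ceiling, the remaining capacity is unreachable unless each thread can keep several I/Os outstanding.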
Some examples:
- […] `apache OR lucene`, could the I/O lookups in the `tim` file (terms dictionary) be performed concurrently for both terms?
- […] `doc` file (postings) have been resolved, could we start loading the first bytes from these postings lists from disk concurrently?
- […] `fdt` file (stored fields) for all these documents concurrently?

This would require API changes in our `Directory` APIs, and some low-level `IndexReader` APIs (`TermsEnum`, `StoredFieldsReader`?).

- […] `IndexInput#prefetch` for terms dictionary lookups. #13359
- […] `ReadAdvice#NORMAL` on files that have a forward-only access pattern. #13450