-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should we explore DiskANN for aKNN vector search? #12615
Comments
I do think Lucene's read-only segment based architecture leads itself to support quantization (required for DiskANN). It would be an interesting experiment to see how indexing times, segment merges, etc. are all handled with DiskANN. I wonder how much the speed difference is:
|
@benwtrent DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene. |
@robertvanwinkle1138 this is just one of my concerns. Additionally, I think we would still have the "segment merging problem". I do think we can get HNSW merges down by keeping connections between graphs, I just haven't had cycles myself to dig into that. There is a Lucene issue around making HNSW merges faster somewhere... |
A hybrid disk-memory algorithm would have very strong benefits. I did run a few tests recently that confirmed HNSW does not function very well when memory gets constrained (which I think everyone already knew). I wonder though, instead of DiskANN, what about a partitioning based approach such as SPANN? I think a partitioning based approach for Lucene would make merging, updating, filtering and indexing a lot easier. Also, it seems it would have better disk-access patterns. In the paper, they do show that in a memory constrained environment, they were able to outperfrom DiskANN. I guess the tradeoff might be that partitioning based approaches would struggle to achieve really low latency search when in memory compared to graph-based approaches. Additionally, partitioning approaches would require a potentially expensive "training" or "preprocessing" step such as k-Means and performance could degrade if data distribution drifts and the models are not updated. But, if PQ is implemented, the same considerations would need to be taken as well. |
@jmazanec15 I agree that SPANN seems more attractive. I would argue though we don't need to do clustering (in the paper they do clustering, but with minimal effectiveness), but could maybe get away with randomly selecting a subset of vectors per segment. |
The SPANN paper does not address efficient filtered queries. For example, Lucene's HNSW calculates the similarity score for every record, regardless of the record matching the filter. Filtered − DiskANN [1] describes a solution for efficient filtered queries. QDrant has a filter solution however the methodology described in their blog is opaque.
|
QDrant's HNSW filter solution is the exact same as Lucene's. You can look at the code, they don't filter candidate exploration but filer result collection. You are correct that filtering with SPANN would be different. Though I am not sure its intractable. It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings. But I think we can do that before scoring. So, as candidate posting lists are gathered, ensure they have some candidates. The SPANN repository supports filtering, and its OSS, so we could always just read what they did. |
Interesting thanks.
Couldn't that be done with the current Lucene HNSW implementation?
This search with filter method seems to throw an error. https://github.com/microsoft/SPTAG/blob/main/AnnService/src/Core/SPANN/SPANNIndex.cpp#L262 |
LOL, I thought it was supported, I must have read a github issue and made an assumption.
I think so. The tricky thing is making sure enough matching postings are gathered to give an accurate number for recall. Having to go back to the HNSW graph to get more postings list could get expensive if we keep getting to few matches. |
(listening to @jbellis talk at Community over Code). |
Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector) |
What say you @jbellis :-) |
Responding top to bottom,
(1) 90% of it is the PQ, yes. I assume that storing the vector inline w/ the graph also helps some but I did not test that separately. You could definitely get a big speed up just using PQ on top of HNSW. (2) Single segment in both cases. (JVector leaves segment management as an exercise for the user.) |
I don't remember the numbers here, maybe 10% slower? It wasn't material enough to make me worry about it. (This is with using an HNSW-style incremental build, not the two-pass build described in the paper.)
It was about 30% worse several months ago with my concurrent hnsw patch, should be significantly improved now since JVector (1) doesn't need the CompletionTracker to keep the layers consistent, b/c it's single layer, (2) added a DenseIntMap concurrent collection to replace ConcurrentHashMap for the nodes. |
This was a big problem for our initial deployment, we'd filter down to a few valid items and then spend forever searching the graph for them. Had to add a fairly accurate estimate of how many comparisons the index would need, and use that to decide whether to brute-force the comparison instead. (This is in Cassandra, not JVector, specifically VectorMemtableIndex.expectedNodesVisited.) I don't remember seeing code for this in Lucene but I mostly only looked at the HNSW code so I could have missed it. |
I'm happy to support anyone who wants to try this, including modifying JVector to make it a better fit if necessary. I won't have time to tackle this myself in the medium-term, however. |
So, I replicated the jvector benchmark (the lucene part) using the new int8 quantization. Note, this is with I reserved 12GB to heap, thus reducing off-heap memory to at most 20GB. (I only have 32GB of ram)
Since kNN allows the segments to be searched in parallel, I used 8 threads for the query rewrite
I am currently force-merging to a single segment to see what a single graph gives us. FYI: the data set would require > 40GB of ram to be held in memory. With int8 quantization, its down to around 10GB. EDIT: I force merged down to 8 segments. A single segment here is just huge, I thought it prudent to test something more paletable in segment size. Though, these segments are still fairly large (~8 GB). Still using multiple threads to search so that each segment has its own thread:
But, searching with just a single thread over 8 segment is still not too shabby (when comparing to the QPS claimed by JVector)
|
Hey @benwtrent and all, just wanted to let you know that I'm experimenting some with different index structures for larger than memory indexes. I have a working implementation of Vamana based off the existing HNSW implementation (with vectors colocated with the adjacency lists in storage) in this branch, which I'm currently working on integrating scalar quantization with, and a benchmarking framework here which can run various Lucene and JVector configurations. I don't have many numbers to share yet besides the fact that the Vamana graph implementation in a single segment seems competitive while in memory with Lucene HNSW (single segment) and JVector on small data sets. For glove-100-angular from ann-benchmarks (k=10):
I plan on testing vamana without PQ, vamana with PQ (a la DiskANN), as well as SPANN. Happy to collaborate with anyone interested. |
@kevindrosendahl this looks really interesting! Thank you for digging in and starting the experimentation! I haven't had a chance to read your branch yet, but hope to soon. |
Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to vamana is that it's actually fairly simple in the end. For the most part, I just removed all references to levels and changed the diversity check to incorporate vamana's Then on the storage size, you just write vectors into the graph instead of into the |
@kevindrosendahl if I am reading the code correctly, it does the following:
I have a concern around index sorting. How does building the graph & subsequent What do we think about always keeping One more question, have you tested your implementation in the situation where |
Perhaps much of the jvector performance improvement is simply from on heap caching. |
Another notable difference in the Lucene implementation is delta variable byte encoding of node ids. The increase in disk space requires the user to purchase more RAM per server. Also variable byte encoding is significantly slower to decode than raw integers. In the example listed the jvec index size is a whopping 32% greater. |
The implementation will write the vectors at their requested encoding/quantization level into the graph. So if your vectors are
Interesting re: sorting. I hadn't given that much deep thought yet. To level-set a little on how I'm planning on approaching this, I think it's still a bit of an open question what performance is theoretically possible, and what type of index could possibly get us there. I'm trying first to get a sense of the possibilities there, so my main focus first is on measuring the single segment performance in order to rule out ideas not worth pursuing before investing too much time in them. If a single segment index works well, I think the next step would probably be to make sure it would work well with concurrent query execution over multiple segments, and then to start thinking about how it could be productionalized (sorted indexes, merges, etc). Thoughts on this approach?
I've just got some first results for larger than memory performance, will post a separate comment. |
I've got my framework set up for testing larger than memory indexes and have some somewhat interesting first results. TL;DR:
Next steps:
SetupI used the Cohere/wikipedia-22-12-es-embeddings (768 dimensions) with L2 similarity. I split the dataset into 10,000 vectors randomly sampled from the dataset as the test set, with the remaining 10,114,929 vectors as the training set. I built and ran the query tests on a Note that the indexes aren't quite apples-to-apples, as the Lucene HNSW index configuration ( Index size
Queries fully in memory (124 GB of RAM available)
Queries with 8 GB system memory, 4 GB heap
Conclusions
|
Thank you for setting this up @kevindrosendahl -- these are very interesting results. It's great that PQ goes quite a ways before hurting recall. |
Thank you @kevindrosendahl this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to quantization. But, this does confuse me as Vamana is supposedly good when things don't all fit in memory. It may be due to how we fetch things (MMAP). I wonder if Vamana is any good at all when using MMAP... For your testing infra, int8 quantization might close the gap at that scale. FYI, as significant (and needed) refactor occurred recently for int8 quantization & HNSW, so your testing branch might be significantly impacted :(.
I am surprised it decreases as the number of sub-spaces increases. This makes me thing that JVector's PQ implementation is weird. Or is Regardless, is there any oversampling that is occurring when PQ is enabled in JVector?
PQ is a sharp tool, at scale it can have significant draw backs (eventually you have to start dramatically oversampling as centroids get very noisy). Though, I am not sure there is a significant recall cliff. Two significant issues with a Lucene implementation I can think of are:
There are interesting approaches to PQ w/ graph exploration and a different PQ implementation via Microsoft that is worthwhile OPQ |
I would guess that either there is a bug or you happen to be testing with a really unusual dataset. PQ is fundamentally a lossy compression and can't magically create similarity that didn't exist in the original.
Today it's up to the caller (so on the Cassandra side, in Astra) but it's possible that it should move into JVector.
I suppose it's not impossible that you could compress first and then build but I have not seen anyone do it yet. JVector follows DiskANN's lead and builds the graph using uncompressed vectors. |
Yeah, upon reflection the result makes sense. DiskANN only uses the in-memory PQ vectors for the distance calculations for deciding which new candidates it should consider visiting in the future. This is the critical piece which reduces I/O. The in-graph vectors mean that when you have decided the next candidate, you get the full fidelity vector for free on the I/O to get the adjacency list, but at that point you've already done the distance calculation against that candidate. If you were to grab the full fidelity vector from the graph for the distance calculation like the non-PQ vamana implementations do, you've moved the I/O complexity back from What that full fidelity vector is useful for is re-scoring, so you can just keep that full fidelity vector in memory until the end of the search (jvector reference) and resort the final result list using their true distances (jvector reference). I'm not sure how impactful this re-ranking is, hoping to A/B it when I get PQ working, but assuming it's impactful, the index structure could save up to The main thing mmap may prevent us from "easily" doing is parallelizing the I/O, which I believe the DiskANN paper does but jvector does not currently (possibly due to the results looking pretty decent without it, and it being hard with mmap/Java). Out of curiosity I may try to
This is correct,
I ran some tests with int8 quantization, nothing very surprising in the results. If you can get the full index to fit in memory it performs great, but performance degrades significantly once it falls out of memory proportional to I/O. The scalar quant vectors were 7.3 GB with the HNSW graph being 329 MB. For reference, jvector
So fundamentally with these graph structures, the key is to have the vectors needed for distance calculations of potential neighbor candidates in memory. Scalar quantization helps the cause here by reducing the size of the vectors by 4. PQ can help even more by even more drastically reducing the memory impact. JVector further ensures that the quantized vectors are in memory by storing them on the heap.
Is there any expected performance difference, or is it mainly organizational? The code in my branch is not in a state I would be comfortable suggesting for upstreaming, so if it's "just" a problem for that, I'm ok to keep my branch a bit outdated, and we could make any suitable ideas fit in if/when we thought upstreaming could be worthwhile.
Yeah, seems like for larger-than-memory indexes, some form of clustering or quantization is going to be a must have. I do really want to try out SPANN as well. It seems analogous to lucene's existing FST-as-an-index-to-a-postings-list design, and my hunch is that there may be more tricks you can play at merge time to reduce the number of times you need to re-cluster things and potentially the computational complexity. It's also just a lot simpler conceptually. As an aside, I'm not sure I'll have too much time to devote to this this week, but hoping to continue making forward progress. If anyone has any other datasets ranging from ~30-50 GB that could be useful for testing that would be helpful too. |
@kevindrosendahl, unrelated to the thread I see that you have added a column named page faults. Can you provide me some details around how you got the page faults? I am doing some other tests and want to know what is the way to get the page faults. |
@navneet1v I've been using oshi in my testing framework, particularly OSThread::getMajorFaults. If you need to do it with zero external dependencies, you'd need to implement it for each environment you'd be running on. On linux you can gather the info from |
Hi all, thanks for the patience, some interesting and exciting results to share. TL;DR:
Algorithm AnalysisThe DiskANN algorithm consists of two phases:
The original algorithm (and JVector's implementation) store the full fidelity vector and the adjacency list for a graph node together on disk. This means that when you retrieve the adjacency list during the first phase, the full fidelity vector will also be in memory due to locality. Immediately after retrieving the adjacency list, you cache the full fidelity vector. When re-ranking in phase 2, you simply use the cached vectors you've acquired during the first phase, and can rerank without any I/O. There are two interesting nuances to this algorithm:
If you instead do not store the vectors in the graph and do not cache them during traversal, the above two are a little different, and give some intuition into the performance of the pinned graph/PQ vectors with
ResultsThe following results are from indexes built against the Cohere/wiki-22-12-en-embeddings data set (35 million vectors, 768 dimensions) using L2 distance. The indexes are all around ~100 GB. The performance tests were run on the same setup as the prior results ( Some high level information about the indexes:
NB: in Lucene HNSW Fully in Memory
Charted QPS (left-to-right in the graphs matches top-to-bottom in the above table) 16GB System Memory
Charted QPS (left-to-right in the graphs matches top-to-bottom in the above table, Lucene HNSW is omitted from the graphs) 8GB System Memory
Charted QPS (left-to-right in the graphs matches top-to-bottom in the above table, Lucene HNSW is omitted from the graphs) Result AnalysisHigh level takeaways:
Other Notes
|
@kevindrosendahl Pretty interesting, thanks for the low level details, How is segment merging implemented by Lucene Vamana? Correct me if I'm wrong, looks like the Java ANN implementations examine the node ids in a more or less random order, filtering requires a |
Great to finally see you in the Lucene repo @kevindrosendahl after all these years. 🍰 The work you have done here is stellar and the whole world welcomes the diligence. I hope we can keep this momentum. I have a few concerns that are not intended to block the project I second the question, "How is segment merging implemented by Lucene Vamana?" I have a suspicion that extending I have general concerns about large copy overheads and the restrictions of POSIX AIO from experience dealing with the failures of storing data on disk, as well as random disk failures. One particular concern I have, as you specifically might have guessed, would be around soak testing. Has anyone looked into this issue since November, or has anyone run the tests for an extended period? The other concern I have is that I suspect the community would probably need to keep an eye on |
I didn't do anything special for Vamana in these experiments, the index construction and merging are practically identical to HNSW. The implication of course here is that the vectors need to be in memory when building the merged graph or performance falls off a cliff. To make this useful operationally you'd want to ensure your max segment size does not exceed available memory, and ideally you would build the index on a dedicated node to not displace pages being used to serve queries. The DiskANN paper suggests clustering the vectors and assigning each vector to the two closest clusters, building a graph for each vector, then overlaying the graphs and doing another round of pruning. You can then limit the amount of memory required based off the number of vectors and clusters. I didn't explore this, but it'd be a worthwhile improvement in the long term.
That's correct, the graph is explored (roughly) by proximity order rather than doc id. It's an interesting question about SPANN, intuitively it seems like you may be able to do as you describe by keeping each vector postings list sorted by doc id order, then essentially doing a streaming merge sort on the relevant centroids' postings as you consume the candidates and iterate through the passed in sorted doc id iterator.
I believe all of today's IO layer implementations in Lucene (including mmap on java 21) copy the bytes from the page cache onto the heap one way or another. This could potentially be improved in the future by passing a
Yeah, I wouldn't suggest that this be the default implementation, or perhaps even included in Lucene core itself at all given how platform specific it is (and this POC at least required a small amount ~50 LOC of C glue code linking in
There's clearly a ton of perf left on the table by using sequential I/O and only the page cache in this case, but it's of course also much simpler both in terms of implementation and operational complexity. |
i would be extremely careful around io_uring, it is disabled in many environments (e.g. by default in container environments) for security reasons:
To me, use of io_uring seems like a totally separate issue from KNN search. i don't think it is a good idea to entangle the two things. |
@kevindrosendahl This is really cool! I had a couple questions around product quantization implementation. I see in |
So, I did some of my own experiments. I tested Vamana (vectors in-graph) & HNSW, both with In low memory environments, HNSW performed better (confirming the results here: #12615 (comment)). When the vectors are in the graph for Vamana, there were many more page faults (NOTE: I was not using PQ, but trying an apples-to-apples comparison of HNSW & vamana in the same conditions). Additionally, looking at previous results (#12615 (comment)):
This indicates that there is very little benefit to Vamana. For DiskANN, one of the bragged benefits is being able to "get raw vectors for free" with disk-read-ahead when searching the in memory graph (PQ). If reranking with PQ'd search with vectors outside of the graph performs almost as well (without io_uring), it stands to reason that HNSW with PQ would do just as good. And with a better/smarter PQ implementation, less reranking may be necessary (combining with OPQ or something). I don't see any stand-out evidence that Vamana has a significant advantage over HNSW when it comes to being a graph based vector index. |
Think I agree with your points @benwtrent, will just jot down my thinking on HNSW vs Vamana vs DiskANN in case it's useful. HNSW and Vamana are "competing" proximity graphs, which differ mainly in the number of layers in the graph ( I think of DiskANN as the algorithm consisting of an initial ANN search using compressed vectors followed by a reranking phase on full fidelity vectors. There are a number of decisions that can be made for where to store the graph, compressed vectors, and full fidelity vectors. If you choose to store the full fidelity vectors in-line with the graph (as suggested by the original DiskANN paper), then Vamana may be more appealing than HNSW due to its 1:1 node:vector relationship. However, the results above seem to show that this implementation didn't benefit much from placing vectors inline with the graph. Given all other benefits of storing vectors in a flat file in ordinal order (including the potential for asynchronous I/O) that would seem like the pragmatic choice, in which case you could pretty easily use an HNSW graph as the proximity graph for the DiskANN algorithm. @jmazanec15 the constants used were taken from JVector, whose performance/behavior I was initially trying to emulate in Lucene. I didn't spend much time fiddling with them. |
I do not think about them as competing. They're different implementations with tradeoffs for certain uses, and the data here will evolve over time. Your preliminary work opens a new window that I hope you continue to explore for the benefit of all of us and all the people that depend on Lucene.
My short time dealing with academia was not about simplicity so this made me laugh. Thank you for that. |
hi, all: |
Description
I came across this compelling sounding JVector project which looks to have awesome QPS performance.
It uses DiskANN instead of HNSW (what Lucene uses now).
Maybe we should explore another aKNN Codec implementation using this? It'd also be a good test of the Codec pluggability of our KNN implementation.
The text was updated successfully, but these errors were encountered: