Make HNSW merges faster #12440
Comments
What is your take on the existing merge optimization #12050? (*) To assess how effective it is, I ran a small test. Otherwise, a naive question: have we considered integrating other, native, libraries (faiss, raft, nmslib...) like what is done in OpenSearch (at a higher abstraction level, though)?
I think it's a good start. One problem I have is that typical Lucene segment merging attempts to merge segments of equal size together, making "segment tiers" of similar sizes. Maybe an adequate "make merges faster" optimization is a better segment merge policy that takes advantage of the inherent strengths of this optimization.
This is true, but it depends on the merge policy and which segments are merged. When merging 10 segments that are of equal size, this optimization has almost no impact.
I am unsure about this. A new codec could be made that integrates those native libraries, but it should fit within the Lucene segment model and not use JNI. From what I can tell, those integrations do neither of those things. Additionally, there shouldn't be any external dependencies (if directly integrated into the Lucene repo). See discussion: #12502. Another option for "making merges faster" is to just provide scalar quantization for users. This will make merges faster as a whole since the required computations will be much cheaper. It bugs me that we have all this distributed work across segments that just gets ignored. Native implementation or not, merging similarly sized HNSW graphs from 9 segments into 1 will still be costly.
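As a rough illustration of the scalar quantization point above: reducing float vectors to bytes makes the distance computations performed during merge much cheaper. The class and min/max scheme below are assumptions for illustration, not Lucene's actual quantization format.

```java
// Minimal sketch of min/max scalar quantization for float vectors.
// This is NOT Lucene's implementation; it only illustrates why merge-time
// distance computations get cheaper once vectors are reduced to bytes.
public final class ScalarQuantizerSketch {
  private final float min;
  private final float scale; // maps [min, max] onto [0, 127]

  public ScalarQuantizerSketch(float min, float max) {
    this.min = min;
    this.scale = 127f / (max - min);
  }

  public byte[] quantize(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      int q = Math.round((vector[i] - min) * scale);
      out[i] = (byte) Math.max(0, Math.min(127, q));
    }
    return out;
  }

  // Integer dot product over quantized values: far cheaper than float math,
  // a stand-in for the similarity calls performed while merging graphs.
  public static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```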
Indeed, looking at the TieredMergePolicy, it seems that it will privilege merges with the lowest skew, while we would want the reverse for HNSW merges, or at least an algorithm that takes the cost of merging HNSW graphs into account.
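To make the merge-policy point concrete, here is a minimal sketch of scoring candidate merges by estimated HNSW merge cost (with the #12050-style optimization, the largest graph is kept and only the remaining vectors are re-inserted) rather than by size skew alone. The cost model and class are assumptions for illustration, not an existing Lucene MergePolicy:

```java
import java.util.List;

// Hypothetical cost model for choosing which segments to merge together.
// Idea: with the existing optimization, the largest graph in a merge is kept
// and only the remaining vectors are re-inserted, so the "HNSW cost" of a
// candidate merge is roughly the number of vectors OUTSIDE the largest segment.
// This is an assumption for illustration, not an existing Lucene MergePolicy.
final class HnswMergeCostEstimator {

  /** Estimated re-insertion work for merging the given segment sizes. */
  static long estimatedCost(List<Long> segmentVectorCounts) {
    long total = 0;
    long largest = 0;
    for (long count : segmentVectorCounts) {
      total += count;
      largest = Math.max(largest, count);
    }
    // Vectors in the largest segment are "free"; everything else is rebuilt.
    return total - largest;
  }

  public static void main(String[] args) {
    // Ten equal segments: almost nothing is saved by keeping the largest graph.
    System.out.println(estimatedCost(List.of(100L, 100L, 100L, 100L, 100L,
                                             100L, 100L, 100L, 100L, 100L))); // 900
    // One dominant segment: most of the work is skipped.
    System.out.println(estimatedCost(List.of(900L, 50L, 50L)));               // 100
  }
}
```

With this kind of scoring, ten equal-sized segments look maximally expensive while a merge dominated by one large segment looks cheap - the opposite of what a skew-minimizing policy prefers.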
Another idea I have is a bit wild: what if we do fewer merges? For example, if segments 1, 2, 3, 4 want to merge and form a new segment, can we just leave the HNSW graphs as-is, or only merge the smaller ones and leave the big ones as-is? And when searching, we could do a greedy search (always go to the closest node, whichever graph it is in), or we could just let users opt into multithreading to make up for the added latency.
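A minimal sketch of what "search every graph and trade threads for latency" could look like: run each per-segment graph search concurrently and merge the per-graph top-k results. GraphSearcher and ScoredDoc are hypothetical stand-ins for illustration, not Lucene APIs:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of searching several per-segment HNSW graphs in parallel and merging
// their top-k results. GraphSearcher and ScoredDoc are hypothetical stand-ins
// used only for illustration; this is not Lucene's actual API.
final class MultiGraphSearchSketch {

  interface GraphSearcher {
    List<ScoredDoc> search(float[] query, int k);
  }

  record ScoredDoc(int docId, float score) {}

  static List<ScoredDoc> searchAll(List<GraphSearcher> graphs, float[] query, int k,
                                   ExecutorService executor) throws Exception {
    List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
    for (GraphSearcher graph : graphs) {
      futures.add(executor.submit(() -> graph.search(query, k)));
    }
    // Merge per-graph results into a single global top-k (min-heap by score).
    PriorityQueue<ScoredDoc> topK =
        new PriorityQueue<>(Comparator.comparingDouble(ScoredDoc::score));
    for (Future<List<ScoredDoc>> future : futures) {
      for (ScoredDoc hit : future.get()) {
        topK.offer(hit);
        if (topK.size() > k) {
          topK.poll(); // drop the current weakest hit
        }
      }
    }
    List<ScoredDoc> result = new ArrayList<>(topK);
    result.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
    return result;
  }
}
```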
I agree on having a specific merge strategy as you describe. As graph construction is O(n log n), the relative cost (per node) of merging in a small segment is much less than merging in a big segment.
An idea, instead of trying to merge the subgraphs, is to do a union of subgraphs:
The advantages of that approach would be that:
But of course, this may become bloated as we start to have many disconnected subgraphs. At some point you'll want to do a real merge.
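To make the union idea concrete, here is a minimal sketch of concatenating two graphs' adjacency lists with an ordinal offset, so each subgraph keeps its internal structure and no cross-links are built. The flat int[][] layout is an assumption for illustration, not the codec's actual format:

```java
// Sketch of "merging" two graphs by a pure union: each subgraph keeps its own
// neighbor lists, and the second graph's node ordinals are shifted by an offset.
// No cross-links are created, so the result may be disconnected (the trade-off
// discussed above). The flat int[][] layout is an assumption for illustration.
final class GraphUnionSketch {

  static int[][] union(int[][] graphA, int[][] graphB) {
    int offset = graphA.length;
    int[][] merged = new int[graphA.length + graphB.length][];
    // Graph A keeps its ordinals as-is.
    System.arraycopy(graphA, 0, merged, 0, graphA.length);
    // Graph B's nodes and neighbor ordinals are shifted by |A|.
    for (int node = 0; node < graphB.length; node++) {
      int[] shifted = new int[graphB[node].length];
      for (int j = 0; j < shifted.length; j++) {
        shifted[j] = graphB[node][j] + offset;
      }
      merged[offset + node] = shifted;
    }
    return merged;
  }
}
```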
Having a look at the first paper you shared, On the merge of kNN graphs: they propose two algorithms. The second one is called Joint-merge and is exactly what we are doing today: add the raw dataset of the second graph into the fully constructed first graph. The first one is called Symmetric Merge (see the paper for its basic idea). It's worth exploring some variation of this in my opinion.
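For contrast, a sketch of the Joint-merge behaviour described above (essentially what we do today): keep one fully constructed graph and re-insert the other segment's raw vectors into it. HnswBuilder here is a hypothetical interface for illustration, not Lucene's HnswGraphBuilder API:

```java
// Sketch of the "Joint-merge" behaviour described in the comment above:
// start from one fully constructed graph and insert the other segment's raw
// vectors into it one by one. HnswBuilder is a hypothetical interface used
// only for illustration; it is not Lucene's HnswGraphBuilder API.
final class JointMergeSketch {

  interface HnswBuilder {
    void addNode(int ordinal, float[] vector); // runs a full graph insertion
  }

  /** Re-insert every vector of the other segment into the kept graph. */
  static void jointMerge(HnswBuilder keptGraph, float[][] otherSegmentVectors, int ordinalOffset) {
    for (int i = 0; i < otherSegmentVectors.length; i++) {
      // Each insertion costs a full HNSW search plus diversity pruning,
      // which is why merging does not scale linearly with segment size.
      keptGraph.addNode(ordinalOffset + i, otherSegmentVectors[i]);
    }
  }
}
```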
Would this potentially lead to a bias toward exploring the small graphs? For instance, when searching, the small graph may converge to its nearest neighbors before the larger graph and then prevent the intermediate nodes in the larger graph from ever getting visited?
@mbrette this is interesting. Definitely worth looking at. It is also worth noting that in the DiskANN paper (https://suhasjs.github.io/files/diskann_neurips19.pdf), when constructing their final structure from N graphs where each node is inserted into x graphs based on proximity to each graph's representative vector (or centroid), they found that a union of edges as a merge worked out well, empirically: "Empirically, it turns out that the overlapping nature of the different clusters provides sufficient connectivity for the GreedySearch algorithm to succeed even if the query’s nearest neighbors are actually split between multiple shards. We would like to remark that there have been earlier works [9, 22] which construct indices for large datasets by merging several smaller, overlapping indices. However, their ideas for constructing the overlapping clusters are different, and a more detailed comparison of these different techniques needs to be done." With these approaches, I wonder how well they preserve overall structure over time - i.e. would there be any quality drift.
@jmazanec15 What is the current way to measure Lucene kNN recall/performance these days?
This would be tricky. It scales linearly per graph, and technically recall is higher when searching multiple graphs than when searching a single one. @zhaih In short, I think this has merit, but we could look at how it can improve multi-segment search by keeping track of min_score or adjusting it dynamically.
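A minimal sketch of the min_score idea: per-segment graph searches publish their local k-th best score to a shared accumulator, so every segment can skip candidates that cannot make the global top-k. The accumulator below is an assumption for illustration, not an existing Lucene component:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of sharing a global "minimum competitive score" across concurrent
// per-segment graph searches: each searcher publishes its local k-th best
// score, and all searchers can skip candidates that cannot beat the global
// floor. This is an illustration only, not an existing Lucene component.
final class GlobalMinScoreSketch {

  // Float bits stored in an AtomicInteger so updates stay lock-free.
  private final AtomicInteger minCompetitiveBits =
      new AtomicInteger(Float.floatToIntBits(Float.NEGATIVE_INFINITY));

  /** Called by a segment searcher whenever its local k-th best score improves. */
  void publish(float localKthBestScore) {
    int newBits = Float.floatToIntBits(localKthBestScore);
    // Keep the maximum: any candidate scoring below the best local k-th best
    // cannot be in the global top-k, so the floor only ever rises.
    minCompetitiveBits.accumulateAndGet(newBits,
        (current, proposed) ->
            Float.intBitsToFloat(proposed) > Float.intBitsToFloat(current) ? proposed : current);
  }

  /** A candidate below this score cannot make the global top-k. */
  float minCompetitiveScore() {
    return Float.intBitsToFloat(minCompetitiveBits.get());
  }
}
```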
🤔 I am not sure about this. It seems to me that allowing the graph to be disconnected like this (pure union) would be very difficult to keep track of within the Codec. Now we not only have each node's doc_id, but also which graph it belongs to, etc. This would cause significant overhead and doesn't seem worth the effort.
I agree @mbrette, which is why I linked the paper :). I think if we do something like nnDescent while still adhering to the diversity of neighborhood, it could help. HNSW is fairly resilient and I wouldn't expect horrific recall changes. One tricky thing there would be when to promote/demote a node to a higher/lower level in the graph; I am not sure nodes should simply stay at their levels as graph size increases. I haven't thought about what the logic could be for determining which NSWs get put on which layer. It might be as simple as re-running the probabilistic level assignment for every node. Since nodes at higher levels are also on the lower levels, we could still get a significant performance improvement, as we would only have to "start from scratch" on nodes that are promoted from a lower level to a higher level (which is very rare). Demoted nodes could still keep their previous connections (or keep whatever percentage of the connections we allow in our implementation).
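If we did re-run level assignment at merge time, the draw is just the standard HNSW exponential level distribution. A sketch, using the mL = 1/ln(M) normalization from the original HNSW paper (an illustration, not Lucene's exact code):

```java
import java.util.SplittableRandom;

// Sketch of the standard probabilistic HNSW level assignment that could be
// re-run per node at merge time. mL = 1 / ln(M) follows the original HNSW
// paper; this is an illustration, not Lucene's exact implementation.
final class LevelAssignmentSketch {

  private final double ml;
  private final SplittableRandom random = new SplittableRandom();

  LevelAssignmentSketch(int maxConnections) {
    this.ml = 1.0 / Math.log(maxConnections);
  }

  /** Draw a level; most nodes land on level 0, promotions are exponentially rare. */
  int assignLevel() {
    double u = random.nextDouble(); // uniform in [0, 1)
    return (int) (-Math.log(1.0 - u) * ml);
  }
}
```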
I tried to reproduce your test from #11354 (comment), but was not able to (recall = 0.574). Lucene Util is the way to do this. What issues are you having? It can be sort of frustrating to get up and running, but once you've got the flow down and it builds, it's useful. I will happily help with this or even run what benchmarking I can myself once you have some POC work.
I had another idea: I wonder if we can initialize HNSW via coarse-grained clusters. Depending on the clustering algorithm used, we could use clusters built from various segments to bootstrap the HNSW graph building. This obviously would be "new research".
Description
It is well known that when merging segments, HNSW is not merged in linear time. We lose a fair bit of useful information as all the NSWs are effectively thrown away (except for a single graph). This means we have useful distributed work from each segment that is ignored.
Do we think it's even possible to improve HNSW merging?
There is existing research around merging hierarchical graphs, but not specifically HNSW:
One other option that comes to mind is keeping NSW scores and using them on merge - so, taking advantage of a FINGER type of idea. I am not 100% sold on using this during search as it increases memory usage. But if we store these in a separate file (not loaded during search) that is loaded during merge, we could reduce the number of calculations required.
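A sketch of what such a merge-only sidecar could hold: each node's neighbors together with the similarity scores we already computed, so the merger can seed its candidate queues without recomputing distances. The record and layout are assumptions for illustration, not an existing codec file:

```java
import java.util.List;

// Sketch of a merge-time-only structure keeping the similarity scores already
// computed for each node's neighbors, so a merge can seed candidate queues
// without recomputing distances. In practice this would live in a separate
// file that is never loaded during search; the record and layout here are
// assumptions for illustration.
final class NeighborScoresSketch {

  /** One node's adjacency with cached scores; arrays are parallel by index. */
  record NodeNeighbors(int node, int[] neighbors, float[] scores) {}

  /** Total distance computations saved if all cached scores are reused at merge. */
  static long reusableComparisons(List<NodeNeighbors> graph) {
    long saved = 0;
    for (NodeNeighbors entry : graph) {
      saved += entry.scores().length;
    }
    return saved;
  }
}
```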
Of course, all these types of changes require due diligence and do NOT obviate the need for a new algorithm in Lucene that is more page-fault friendly (IVFPQ, SPANN, DiskANN, or something).