
Make HNSW merges faster #12440

Open
benwtrent opened this issue Jul 13, 2023 · 13 comments

benwtrent (Member) commented Jul 13, 2023

Description

It is well known that when merging segments, HNSW graphs are not merged in linear time. We lose a fair bit of useful information, as all the NSWs are effectively thrown away (except for a single graph). This means we have useful distributed work from each segment that is ignored.

Do we think it's even possible to improve HNSW?

There is existing research around merging hierarchical graphs (e.g., the "On the merge of knn graphs" paper discussed below), but not specifically HNSW.

One other option that comes to mind is keeping the NSW scores and using them on merge, taking advantage of a FINGER-type idea. I am not 100% sold on using this during search as it increases memory usage. But if we store these scores in a separate file (not loaded during search) that is loaded during merge, we could reduce the number of calculations required.
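As a rough illustration of that side-file idea, here is a minimal sketch of persisting each node's neighbor ordinals together with their precomputed scores so a merge could reuse them instead of recomputing distances. The class and file layout are hypothetical, not an actual Lucene codec format:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NeighborScoresSidecarSketch {
  record NodeNeighbors(int[] neighbors, float[] scores) {}

  /** Writes one node's neighbor list: neighborCount, then (neighborOrdinal, score) pairs. */
  static void writeNode(DataOutputStream out, int[] neighbors, float[] scores) throws IOException {
    out.writeInt(neighbors.length);
    for (int i = 0; i < neighbors.length; i++) {
      out.writeInt(neighbors[i]);
      out.writeFloat(scores[i]); // precomputed score, intended to be read at merge time only
    }
  }

  /** Reads one node's neighbors and scores back; only the merger would call this, never search. */
  static NodeNeighbors readNode(DataInputStream in) throws IOException {
    int count = in.readInt();
    int[] neighbors = new int[count];
    float[] scores = new float[count];
    for (int i = 0; i < count; i++) {
      neighbors[i] = in.readInt();
      scores[i] = in.readFloat();
    }
    return new NodeNeighbors(neighbors, scores);
  }
}
```

The point is only that the scores already paid for during indexing would be cheap to persist and would never need to be resident during search.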

Of course, all these types of changes require due diligence and do NOT obviate the need for a new algorithm in Lucene that is more page-fault friendly (IVFPQ, SPANN, DiskANN, or something similar).

mbrette commented Aug 17, 2023

What is your take on existing merge optimization #12050?
The approach seems very effective (*), however it does not work if there are deleted documents, which is likely to happen in most usage scenarios.

(*) To assess how effective it is, I ran a small test.
I ran a unit test using HnswGraphBuilder and OnHeapHnswGraph, simulating a big segment of 1M vectors (128 dims) merging with a smaller segment of 10k vectors.
If I compare initializing the HNSW graph from the big segment vs. rebuilding the graph, it is 100 times faster - even though initializing the HNSW graph recomputes neighbor scores.
A given vector/document will go through many segment merges in its lifetime, so the benefits of this optimization accrue a lot.
Caveat: I used random vectors.

Otherwise, a naive question: have we considered integrating other (native) libraries (faiss, raft, nmslib, ...) like what is done in OpenSearch (at a higher abstraction level though)?

benwtrent (Member Author)

What is your take on existing merge optimization #12050?

I think it's a good start. One problem I have is that typical Lucene segment merging attempts to merge segments of equal size together, making "segment tiers" of similar sizes. Maybe an adequate "make merges faster" optimization is a better segment merge policy that takes advantage of the inherent benefits of this optimization.

A given vector/document will go through many segment merges in its lifetime, so the benefits of this optimization accrue a lot.
Caveat: I used random vectors.

This is true, but it depends on the merge policy and which segments are merged. When merging 10 segments that are of equal size, this optimization has almost no impact.

have we considered integrating other (native) libraries (faiss, raft, nmslib, ...) like what is done in OpenSearch (at a higher abstraction level though)?

I am unsure about this. A new codec could be made integrating those native libraries, but they should fit within the Lucene segment model and not use JNI. From what I can tell, those integrations don't do either of those things.

Additionally, there shouldn't be any external dependencies (if directly integrated into the Lucene repo). See Discussion: #12502

Another option for "making merges faster" is to just provide scalar quantization for users. This will make merges as a whole faster, as the required computations will be much cheaper.
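To make that concrete, here is a minimal sketch of why quantization cheapens merge-time comparisons, assuming a simple min/max int8 scheme (this is illustrative, not Lucene's actual quantization format): per-comparison work drops from float math to small integer arithmetic.

```java
// Sketch: quantize float vectors to bytes once, then compare with integer math.
public class ScalarQuantizationSketch {
  static byte[] quantize(float[] v, float min, float max) {
    byte[] q = new byte[v.length];
    float scale = 127f / (max - min);
    for (int i = 0; i < v.length; i++) {
      float clamped = Math.min(Math.max(v[i], min), max);
      q[i] = (byte) Math.round((clamped - min) * scale);
    }
    return q;
  }

  // Integer dot product over the quantized values; much cheaper than float distance math.
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] a = quantize(new float[] {0.1f, 0.9f, -0.3f}, -1f, 1f);
    byte[] b = quantize(new float[] {0.2f, 0.8f, -0.1f}, -1f, 1f);
    System.out.println(dotProduct(a, b));
  }
}
```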

It bugs me that we have all this distributed work across segments that just gets ignored. Whether or not this is a native implementation, merging similarly sized HNSW graphs from 9 segments into 1 will still be costly.

mbrette commented Aug 18, 2023

Indeed, looking at the TieredMergePolicy, it seems that it will privilege merges with the lowest skew, while we would want the reverse for HNSW merges, or at least an algorithm that takes into account the cost of merging HNSW graphs.

merges with lower skew, smaller size and those reclaiming more deletes, are favored.
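As a back-of-the-envelope sketch (not TieredMergePolicy's actual scoring), a merge policy could weigh HNSW rebuild cost roughly like this, which makes a skewed merge that can reuse its largest graph look much cheaper than merging many similar-sized segments:

```java
import java.util.List;

public class HnswMergeCostSketch {
  /**
   * segmentSizes: number of vectors per segment; largestReusable: whether the biggest
   * graph can be kept (e.g., no deletions) and only the remaining vectors re-inserted.
   */
  static double estimateInsertions(List<Integer> segmentSizes, boolean largestReusable) {
    long total = segmentSizes.stream().mapToLong(Integer::longValue).sum();
    long largest = segmentSizes.stream().mapToLong(Integer::longValue).max().orElse(0);
    long toInsert = largestReusable ? total - largest : total;
    // HNSW insertion is roughly O(log n) per vector into a graph of n nodes.
    return toInsert * (Math.log(Math.max(total, 2)) / Math.log(2));
  }

  public static void main(String[] args) {
    // Skewed merge (1M + 10k) reusing the big graph vs. ten equal ~101k segments.
    System.out.println(estimateInsertions(List.of(1_000_000, 10_000), true));
    System.out.println(estimateInsertions(
        List.of(101_000, 101_000, 101_000, 101_000, 101_000,
                101_000, 101_000, 101_000, 101_000, 101_000), false));
  }
}
```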

zhaih (Contributor) commented Sep 1, 2023

One problem with optimization #12050 is that with a hybrid text-embedding index you can hardly have even one big segment without deletions, which means real-world index merges can hardly make use of it... I opened #12533 to gather some ideas on how we can better deal with deletions.

zhaih (Contributor) commented Sep 1, 2023

Another idea I have is a bit wild: what if we do less merging?

For example, if segments 1, 2, 3, and 4 want to merge and form a new segment, can we just leave the HNSW graphs as-is, or only merge the smaller ones and leave the big ones as-is? And when we search, we can do a greedy search (always go to the closest node, whichever graph it is in), or we can just let the user choose to use multithreading in exchange for the latency?
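As a rough sketch of what "search every retained graph" could look like (GraphSearcher and ScoredDoc are hypothetical stand-ins, not existing Lucene APIs): search each graph for its top k, possibly in parallel, and merge the results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class MultiGraphSearchSketch {
  record ScoredDoc(int docId, float score) {}

  interface GraphSearcher {
    List<ScoredDoc> search(float[] query, int k);
  }

  static List<ScoredDoc> searchAll(List<GraphSearcher> graphs, float[] query, int k) {
    // Min-heap of size k keeps the best hits seen across all graphs.
    PriorityQueue<ScoredDoc> best =
        new PriorityQueue<>(Comparator.comparingDouble(ScoredDoc::score));
    for (GraphSearcher graph : graphs) { // could also run these searches in parallel threads
      for (ScoredDoc hit : graph.search(query, k)) {
        best.offer(hit);
        if (best.size() > k) {
          best.poll(); // drop the current worst
        }
      }
    }
    List<ScoredDoc> results = new ArrayList<>(best);
    results.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
    return results;
  }
}
```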

mbrette commented Sep 1, 2023

I agree on having a specific merge strategy as you describe. As graph construction is O(n log n), the relative cost (per node) of merging in a small segment is much less than merging in a big segment.
The caveat is that in a hybrid text-embedding scenario (as you called out in #12533), this strategy may be detrimental for the keyword part. It can be useful if you maintain different collections for vector indexes and keyword indexes.

mbrette commented Sep 1, 2023

An idea, instead of trying to merge the subgraphs, is to do a union of subgraphs:
When we merge, we build a disconnected graph which is the union of all the segment graphs.
At search time, you explore each subgraph to come up with results.
If we do this naively, this would be the same as parallel exploration of segments.
To improve this:
When we search in this graph, we run a greedy search like we do today, with 2 small tweaks:

  • First, you make sure that you include candidates from every subgraph.
  • Second, you don't want the candidates list to be dominated by one subgraph (for example because it is the one you start exploring), so you need a strategy that gives a fair share to each subgraph. Maybe explore not greedily by best score, but each subgraph in turn. Or maybe maintain your results and candidate lists such that the stopping criterion for adding candidates is not only based on a single best score over all candidates, but also accounts for the best score from each subgraph (see the sketch at the end of this comment).

The advantages of that approach would be:

  • You can explore each subgraph in parallel, like you would do with many segments, except that it does not influence the segment strategy for the keyword indexes.
  • You can somewhat limit the increase in the number of nodes you have to explore by having a stopping criterion that is cross-subgraph.

But of course, this may become bloated as we start to have many disconnected subgraphs. At some point you'll want to do a real merge.
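Here is a small sketch of the "fair share per subgraph" tweak mentioned above, assuming a hypothetical per-subgraph candidate queue (none of this is an existing Lucene structure): instead of always popping the globally best candidate, selection rotates across subgraphs so no single subgraph dominates exploration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FairShareCandidatesSketch {
  record Candidate(int node, int subgraph, float score) {}

  private final List<PriorityQueue<Candidate>> perSubgraph = new ArrayList<>();
  private int nextSubgraph = 0;

  FairShareCandidatesSketch(int numSubgraphs) {
    for (int i = 0; i < numSubgraphs; i++) {
      // Each subgraph keeps its own best-first queue of candidates.
      perSubgraph.add(new PriorityQueue<>(
          Comparator.comparingDouble(Candidate::score).reversed()));
    }
  }

  void add(Candidate c) {
    perSubgraph.get(c.subgraph()).offer(c);
  }

  /** Pops the best candidate from the next non-empty subgraph, round-robin. */
  Candidate popNext() {
    for (int i = 0; i < perSubgraph.size(); i++) {
      int idx = (nextSubgraph + i) % perSubgraph.size();
      PriorityQueue<Candidate> queue = perSubgraph.get(idx);
      if (!queue.isEmpty()) {
        nextSubgraph = (idx + 1) % perSubgraph.size();
        return queue.poll();
      }
    }
    return null; // all subgraphs exhausted
  }
}
```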

mbrette commented Sep 1, 2023

Having a look at the first paper you shared, On the merge of knn graphs: they propose 2 algorithms. The second one is called Joint-merge and is exactly what we are doing today: add the raw dataset of the second graph into the fully constructed first graph.

The first one is called Symmetric Merge. The basic idea is:
Given 2 graphs, for each node in each graph, remove the lower half of its neighbor list and replace it with random nodes from the other graph.
Then apply the NN-Descent algorithm, an algorithm older than HNSW, defined in Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures.
NN-Descent is an iterative algorithm that converges to an approximate k-NN state.
For each node, it compares the node with its neighbors' neighbors (the 2-degree neighbors) and updates the neighbor list with the closest ones. It then iterates over all nodes again until some stopping criterion is reached (e.g., the number of changes in the graph drops below some threshold). There are some optimizations/heuristics called out in the paper, but this is the idea.
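A rough sketch of one NN-Descent pass, assuming a plain adjacency-list graph and squared Euclidean distance (real implementations add sampling and new/old flags that are omitted here):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NnDescentSketch {
  static float distance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** One pass over all nodes; returns how many neighbor lists changed (the stopping criterion). */
  static int nnDescentPass(float[][] vectors, List<List<Integer>> neighbors, int k) {
    int changes = 0;
    for (int node = 0; node < vectors.length; node++) {
      // Candidates are the current neighbors plus the neighbors' neighbors (2-degree).
      Set<Integer> candidates = new HashSet<>(neighbors.get(node));
      for (int n : neighbors.get(node)) {
        candidates.addAll(neighbors.get(n));
      }
      candidates.remove(node);
      final int self = node;
      List<Integer> updated = new ArrayList<>(candidates);
      updated.sort(Comparator.comparingDouble(c -> distance(vectors[self], vectors[c])));
      updated = new ArrayList<>(updated.subList(0, Math.min(k, updated.size())));
      if (!new HashSet<>(updated).equals(new HashSet<>(neighbors.get(node)))) {
        neighbors.set(node, updated);
        changes++;
      }
    }
    return changes; // run passes until this drops below some threshold
  }
}
```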

It's worth exploring some variation of this in my opinion.

jmazanec15 (Contributor)

Another idea I have is a bit wild: what if we do less merging?

For example, if segments 1, 2, 3, and 4 want to merge and form a new segment, can we just leave the HNSW graphs as-is, or only merge the smaller ones and leave the big ones as-is? And when we search, we can do a greedy search (always go to the closest node, whichever graph it is in), or we can just let the user choose to use multithreading in exchange for the latency?

Would this potentially lead to a bias toward exploring small graphs? For instance, when searching, the small graph may converge to its nearest neighbors before the larger graph and then prevent the intermediate nodes in the larger graph from ever getting visited.

jmazanec15 (Contributor)

It's worth exploring some variation of this in my opinion.

@mbrette this is interesting. Definitely worth looking at. It is also worth noting that in the DiskANN paper (https://suhasjs.github.io/files/diskann_neurips19.pdf), when constructing their final structure from N graphs where each node is inserted into x graphs based on proximity to the graph's representative vector (or centroid), they found that a union of edges worked well empirically as a merge.

“Empirically, it turns out that the overlapping nature of the different clusters provides sufficient connectivity for the GreedySearch algorithm to succeed even if the query’s nearest neighbors are actually split between multiple shards. We would like to remark that there have been earlier works [9, 22] which construct indices for large datasets by merging several smaller, overlapping indices. However, their ideas for constructing the overlapping clusters are different, and a more detailed comparison of these different techniques needs to be done.”

With these approaches, I wonder how well they would preserve overall structure over time - i.e., would there be any quality drift?

mbrette commented Sep 4, 2023

@jmazanec15 What is the current way to measure Lucene kNN recall/performance these days?
I tried to reproduce your test from #11354 (comment), but was not able to (recall = 0.574).

benwtrent (Member Author)

For example, if segments 1, 2, 3, and 4 want to merge and form a new segment, can we just leave the HNSW graphs as-is, or only merge the smaller ones and leave the big ones as-is? And when we search, we can do a greedy search (always go to the closest node, whichever graph it is in), or we can just let the user choose to use multithreading in exchange for the latency?

This would be tricky. This scales linearly per graph, and technically recall is higher when searching multiple graphs than a single one. It seems that the efSearch (Lucene's k) per nearest-neighbor search should be adjusted internally if this is the approach taken. Thinking further, it seems like a nice experiment to run regardless, to see how multi-segment search speed can be improved (potentially keeping the minimal matching score or candidates between graph searches...).

@zhaih In short, I think this has merit, but we could look at how it can improve multi-segment search by keeping track of a min_score or dynamically adjusting efSearch when multiple segments are being searched.
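To sketch what "keeping the minimal matching score between graph searches" might look like (SegmentGraphSearcher and Hit are hypothetical, not existing Lucene APIs): segments are searched in sequence and the current k-th best score is carried forward so later graphs can prune exploration earlier.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CrossSegmentSearchSketch {
  record Hit(int docId, float score) {}

  interface SegmentGraphSearcher {
    /** Searches one segment's graph, ignoring candidates scoring below minScore. */
    List<Hit> search(float[] query, int k, float minScore);
  }

  static List<Hit> search(List<SegmentGraphSearcher> segments, float[] query, int k) {
    PriorityQueue<Hit> topK = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    float minScore = Float.NEGATIVE_INFINITY;
    for (SegmentGraphSearcher segment : segments) {
      for (Hit hit : segment.search(query, k, minScore)) {
        topK.offer(hit);
        if (topK.size() > k) {
          topK.poll(); // keep only the k best so far
        }
      }
      if (topK.size() == k) {
        minScore = topK.peek().score(); // later segments can prune below this score
      }
    }
    return topK.stream().sorted(Comparator.comparingDouble(Hit::score).reversed()).toList();
  }
}
```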

An idea, instead of trying to merge the subgraphs, is to do a union of subgraphs:
When we merge, we build a disconnected graph which is the union of all the segment graphs.

@mbrette

🤔 I am not sure about this. It seems to me that allowing the graph to be disconnected like this (a pure union) would be very difficult to keep track of within the Codec. Now we not only have each node's doc_id, but also which graph it belongs to, etc. This would cause significant overhead and seems like it wouldn't be worth the effort.

It's worth exploring some variation of this in my opinion.

I agree @mbrette, which is why I linked the paper :). I think if we do something like NN-Descent but still adhere to the diversity-of-neighborhood heuristic, it could help. HNSW is fairly resilient and I wouldn't expect horrific recall changes.

One tricky thing there would be deciding when we promote/demote a node to a higher/lower level in the graph. I am not sure nodes should simply stay at their levels as the graph size increases.

I haven't thought about what the logic could be for determining which NSW nodes get put on which layer. It might be as simple as re-running the probabilistic level assignment for every node. Since nodes at higher levels are also on the lower levels, we could still get a significant performance improvement, as we would only have to "start from scratch" on nodes that are promoted from a lower level to a higher level (which is very rare). Demoted nodes could still keep their previous connections (or keep whatever percentage of the connections we allow in our implementation).
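For reference, re-running the level assignment is cheap: each node independently draws level = floor(-ln(U) * mL) with mL = 1/ln(M), as in the HNSW paper. A small sketch below (the 1M-node loop and M = 16 are just illustrative numbers) shows how few nodes would end up promoted above the bottom layer:

```java
import java.util.Random;

public class LevelAssignmentSketch {
  static int assignLevel(Random random, int maxConn) {
    double mL = 1.0 / Math.log(maxConn);
    // 1.0 - nextDouble() keeps the draw in (0, 1] so the log is always finite.
    return (int) Math.floor(-Math.log(1.0 - random.nextDouble()) * mL);
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    int maxConn = 16; // HNSW's M parameter
    int promotions = 0;
    for (int node = 0; node < 1_000_000; node++) {
      int oldLevel = 0; // most nodes live only on the bottom layer
      int newLevel = assignLevel(random, maxConn);
      if (newLevel > oldLevel) {
        promotions++; // these nodes would need fresh upper-level connections
      }
    }
    System.out.println("promoted nodes: " + promotions); // roughly 1/M of the nodes
  }
}
```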

What is the current way to measure Lucene kNN recall/performance these days?
I tried to reproduce your test from #11354 (comment), but was not able to (recall = 0.574).

@mbrette

Lucene Util is the way to do this. What issues are you having? It can be sort of frustrating to get up and running, but once you get the flow down and it builds, it's useful. I will happily help with this, or even run what benchmarking I can myself once you have some POC work.

benwtrent (Member Author)

I had another idea: I wonder if we can initialize HNSW via coarse-grained clusters. Depending on the clustering algorithm used, we could use clusters built from various segments to bootstrap the HNSW graph building.

This obviously would be "new research".
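Purely as a speculative sketch of that idea (everything here is hypothetical, not an existing Lucene mechanism): reuse centroids gathered from the source segments to group the incoming vectors, then feed the graph builder cluster by cluster so each insertion starts near previously inserted, similar vectors.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusterBootstrapSketch {
  static float distance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Groups vector ordinals by their nearest centroid, yielding an insertion plan. */
  static List<List<Integer>> planInsertions(float[][] vectors, float[][] centroids) {
    List<List<Integer>> clusters = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) {
      clusters.add(new ArrayList<>());
    }
    for (int v = 0; v < vectors.length; v++) {
      int best = 0;
      for (int c = 1; c < centroids.length; c++) {
        if (distance(vectors[v], centroids[c]) < distance(vectors[v], centroids[best])) {
          best = c;
        }
      }
      clusters.get(best).add(v);
    }
    return clusters; // insert cluster by cluster, each seeded from its centroid's region
  }
}
```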
