Init HNSW merge with graph containing deleted documents #12533

Open
zhaih opened this issue Sep 1, 2023 · 3 comments

@zhaih
Contributor

zhaih commented Sep 1, 2023

Description

Currently, when we merge HNSW graphs we are able to start from an existing graph rather than inserting every node from scratch, thanks to #12050. But we have a constraint that the initial graph cannot be one that contains deletions, since we do not have a good way to handle deletions within a graph without reconstructing it. This makes the optimization a bit wasteful in practice: the bigger a segment is, the less likely it is to have no deletions, especially in a hybrid text-embedding index.

Opening this issue to discuss how we could solve this problem or find a workaround.
My very naive ideas:

  1. We could remove the deleted node from the graph, fully connect all of its neighbours, and run a diversity check on those neighbors to remove extra links. In case the deleted node is an entry node, we can insert the node closest to the deleted node up to the level where the deleted node existed.
  2. We tolerate a certain amount of deletions (say 10% ~ 20%) inside the HNSW graph and just use the deleted nodes as connections.

I personally like (1) better, since (2) means we need to change the search behavior accordingly, which sounds a bit fragile. Any other ideas are welcome!
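Here is a minimal sketch of what (1) could look like, on a simplified single-layer graph. A plain adjacency map and raw vectors stand in for the real graph and vector values, and the `removeAndPatch` / `selectDiverse` helpers and the exact diversity rule are purely illustrative, not an existing API. The entry-node case (promoting the deleted entry point's closest neighbor up to the same level) is left out for brevity.

```java
import java.util.*;

public class PatchDeletedNode {

  /**
   * Remove `deleted` from a single-layer adjacency map, fully connect its old
   * neighbors to each other, then prune each affected neighbor list with a
   * simple diversity check (keep a candidate only if it is closer to the node
   * than to every neighbor already kept), capped at maxConn links.
   */
  static void removeAndPatch(Map<Integer, List<Integer>> graph,
                             float[][] vectors, int deleted, int maxConn) {
    List<Integer> neighbors = graph.remove(deleted);
    if (neighbors == null) {
      return; // node was not in the graph
    }
    for (int n : neighbors) {
      Set<Integer> candidates = new HashSet<>(graph.getOrDefault(n, List.of()));
      candidates.addAll(neighbors);   // fully connect the old neighborhood
      candidates.remove(deleted);
      candidates.remove(n);           // no self-links
      graph.put(n, selectDiverse(n, candidates, vectors, maxConn));
    }
    // drop any remaining in-links that still point at the deleted node
    for (List<Integer> links : graph.values()) {
      links.remove(Integer.valueOf(deleted));
    }
  }

  /** Greedy diversity-based neighbor selection, in the spirit of HNSW pruning. */
  static List<Integer> selectDiverse(int node, Set<Integer> candidates,
                                     float[][] vectors, int maxConn) {
    List<Integer> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingDouble(cand -> squareDist(vectors[node], vectors[cand])));
    List<Integer> kept = new ArrayList<>();
    for (int c : sorted) {
      double dToNode = squareDist(vectors[node], vectors[c]);
      boolean diverse = kept.stream()
          .allMatch(k -> dToNode < squareDist(vectors[c], vectors[k]));
      if (diverse) {
        kept.add(c);
      }
      if (kept.size() == maxConn) {
        break;
      }
    }
    return kept;
  }

  static double squareDist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```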

@mbrette

mbrette commented Sep 1, 2023

Approach (2) may not be that much of a change. Let's say you mark those nodes as "tombstones" so that we know they are deleted in the graph itself. Exploring the graph with those tombstones would be the same as what we do today with the acceptOrds filter: skip those nodes in the results, but keep them in the candidate list so their neighbors can still be explored.

The problem with that approach is that if you have a large ratio of deleted documents, the graph becomes bloated.
It could be part of a hybrid strategy where we keep those tombstones until we reach a threshold, after which we need to remove them - if we can - or rebuild the graph on the next merge.
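For concreteness, here is a rough sketch of what tombstone-aware search could look like on a simplified single-layer graph. The `Scored` record, the adjacency map, and the `isDeleted` predicate are illustrative stand-ins, not Lucene's searcher API; the point is only that tombstoned nodes are expanded but never collected.

```java
import java.util.*;
import java.util.function.IntPredicate;

public class TombstoneSearch {

  record Scored(int node, double dist) {}

  /**
   * Best-first search over a single-layer adjacency map. Tombstoned (deleted)
   * nodes are expanded like any other node, so they still provide
   * connectivity, but they are never added to the result set.
   */
  static int[] search(Map<Integer, List<Integer>> graph, float[][] vectors,
                      float[] query, int entryNode, int topK, IntPredicate isDeleted) {
    PriorityQueue<Scored> candidates =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::dist));            // min-heap
    PriorityQueue<Scored> results =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::dist).reversed()); // max-heap
    Set<Integer> visited = new HashSet<>();

    candidates.add(new Scored(entryNode, squareDist(query, vectors[entryNode])));
    visited.add(entryNode);

    while (!candidates.isEmpty()) {
      Scored current = candidates.poll();
      if (results.size() >= topK && current.dist() > results.peek().dist()) {
        break; // every remaining candidate is worse than the current top-k
      }
      if (!isDeleted.test(current.node())) {
        results.add(current);                 // collect only live nodes
        if (results.size() > topK) {
          results.poll();
        }
      }
      for (int neighbor : graph.getOrDefault(current.node(), List.of())) {
        if (visited.add(neighbor)) {
          candidates.add(new Scored(neighbor, squareDist(query, vectors[neighbor])));
        }
      }
    }
    return results.stream().mapToInt(Scored::node).toArray();
  }

  static double squareDist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```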

@jmazanec15
Contributor

  1. We could remove the deleted node from the graph, fully connect all of its neighbours, and run a diversity check on those neighbors to remove extra links. In case the deleted node is an entry node, we can insert the node closest to the deleted node up to the level where the deleted node existed.

What do you mean by fully connect its neighbors? Would this basically mean figuring out the to-be-deleted node's in-edges and reinserting those nodes into the graph using the normal edge-selection strategy, excluding the deleted nodes, to "patch" the broken connections? We looked into this a little bit recently, but the number of reinserts grows pretty fast. It might be promising, though, to start finding replacement neighbors from the neighbor that is being removed (as opposed to starting from the global entry point). I think with this approach we would need to figure out a way to avoid quality drift after the graph has been manipulated this way over several generations - the edge-selection strategy is different from the one used when building the graph. For instance, refinement over time may mean that the long-distance-hop neighbors added early on would start to disappear. Would the diversity check help in this case? Also, I think at a certain point it will be better to just rebuild the graph from scratch, suggesting a threshold might need to be selected.
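One practical wrinkle with the in-edge idea: the graph only records each node's own (outgoing) neighbor list, so there is no direct way to look up in-edges, and collecting them takes a pass over the whole adjacency structure. A small sketch of that sub-step, again on an illustrative Map-based graph rather than the real graph API:

```java
import java.util.*;

public class InEdges {

  /**
   * For each deleted node, collect the set of live nodes that still point to
   * it; these are the nodes whose neighbor lists would need to be "patched".
   */
  static Map<Integer, Set<Integer>> collectInEdges(
      Map<Integer, List<Integer>> graph, Set<Integer> deleted) {
    Map<Integer, Set<Integer>> inEdges = new HashMap<>();
    for (Map.Entry<Integer, List<Integer>> entry : graph.entrySet()) {
      int from = entry.getKey();
      if (deleted.contains(from)) {
        continue; // only live sources need patching
      }
      for (int to : entry.getValue()) {
        if (deleted.contains(to)) {
          inEdges.computeIfAbsent(to, k -> new HashSet<>()).add(from);
        }
      }
    }
    return inEdges;
  }
}
```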

  2. We tolerate a certain amount of deletions (say 10% ~ 20%) inside the HNSW graph and just use the deleted nodes as connections.

There was some discussion around this in hnswlib: nmslib/hnswlib#4 (comment). In practice, this probably would work well - but I am not really sure how to choose the right number of deletions to tolerate. But I agree with @mbrette - it might be good to take a hybrid approach.
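A tiny sketch of what that hybrid decision could look like; the 20% cutoff in the usage example is only a placeholder, not a tuned value:

```java
public class MergeStrategy {

  enum Plan { REUSE_GRAPH_WITH_TOMBSTONES, REBUILD_FROM_SCRATCH }

  /** Reuse the existing graph while the deleted ratio stays under the threshold. */
  static Plan choose(int liveDocs, int deletedDocs, double maxDeletedRatio) {
    double deletedRatio = (double) deletedDocs / (liveDocs + deletedDocs);
    return deletedRatio <= maxDeletedRatio
        ? Plan.REUSE_GRAPH_WITH_TOMBSTONES
        : Plan.REBUILD_FROM_SCRATCH;
  }

  public static void main(String[] args) {
    System.out.println(choose(90_000, 10_000, 0.20)); // REUSE_GRAPH_WITH_TOMBSTONES
    System.out.println(choose(60_000, 40_000, 0.20)); // REBUILD_FROM_SCRATCH
  }
}
```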

@jmazanec15
Contributor

Additionally, the FreshDiskANN paper did some work in this space. They ran a test for NSG where they iteratively repeat the following process for a certain number of cycles and track the recall:

  1. delete 5% of the index
  2. patch the incident nodes that were impacted via local neighborhoods (similar to @zhaih's (1))
  3. reinsert the deleted nodes
  4. measure recall
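A skeletal version of that cycle, with the per-step index operations expressed as a hypothetical interface purely to make the four steps concrete (none of these methods exist in Lucene or in the paper's code):

```java
import java.util.*;

public class DeleteReinsertCycle {

  /** Hypothetical index operations, named only to mirror the four steps above. */
  interface Index {
    Set<Integer> deleteRandomFraction(double fraction, Random random);      // step 1
    void patchIncidentNodes(Set<Integer> deleted);                          // step 2: local repair
    void reinsert(Set<Integer> deleted);                                    // step 3
    double measureRecall(float[][] queries, int[][] groundTruth, int topK); // step 4
  }

  static List<Double> run(Index index, int cycles, float[][] queries, int[][] groundTruth) {
    Random random = new Random(42);
    List<Double> recalls = new ArrayList<>();
    for (int cycle = 0; cycle < cycles; cycle++) {
      Set<Integer> deleted = index.deleteRandomFraction(0.05, random);
      index.patchIncidentNodes(deleted);
      index.reinsert(deleted);
      recalls.add(index.measureRecall(queries, groundTruth, 10));
    }
    return recalls;
  }
}
```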

They ran a similar one for HNSW where they do not patch the edges. In both cases, they saw some degradation:

[Screenshot: recall degradation plots from the FreshDiskANN paper]

Their intuition for why this happens is that the graphs become sparser as this process repeats, leading to less navigability. The graphs become sparser because the pruning policy is stricter.

In their system, they do employ an algorithm similar to @zhaih's (1), where they connect the incident edges and prune based on some criteria, and it shows promise.
