Init HNSW merge with graph containing deleted documents #12533

Open
zhaih opened this issue Sep 1, 2023 · 3 comments

@zhaih
Contributor

zhaih commented Sep 1, 2023

Description

Currently, when we merge HNSW graphs we are able to start from an existing graph rather than inserting every node from scratch, thanks to #12050. But we have a constraint that the initial graph cannot be one that contains deletions, since we do not have a good way to handle deletions within a graph without reconstructing it. This makes the optimization a bit wasteful in practice: the bigger a segment is, the less likely it is to have no deletions, especially in a hybrid text-embedding index.

Opening this issue to discuss how we could solve this problem or find a workaround.
My very naive ideas:

  1. We could remove the deleted node from the graph, fully connect all of its neighbours, and run a diversity check on those neighbors to remove extra links. In case the deleted node is an entry node, we can insert the node closest to the deleted node up to the level where the deleted node existed.
  2. We tolerate a certain amount of deletions (say 10% ~ 20%) inside the HNSW graph and just use the deleted nodes as connections.

I personally like (1) better, since (2) means we need to change the search behavior accordingly, which sounds a bit fragile. Any other ideas are welcome!
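Here is a minimal sketch of what (1) could look like, on a simplified single-layer graph. A plain adjacency map and raw vectors stand in for the real graph and vector values, and the `removeAndPatch` / `selectDiverse` helpers and the exact diversity rule are purely illustrative, not an existing API. The entry-node case (promoting the deleted entry point's closest neighbor up to the same level) is left out for brevity.

```java
import java.util.*;

public class PatchDeletedNode {

  /**
   * Remove `deleted` from a single-layer adjacency map, fully connect its old
   * neighbors to each other, then prune each affected neighbor list with a
   * simple diversity check (keep a candidate only if it is closer to the node
   * than to every neighbor already kept), capped at maxConn links.
   */
  static void removeAndPatch(Map<Integer, List<Integer>> graph,
                             float[][] vectors, int deleted, int maxConn) {
    List<Integer> neighbors = graph.remove(deleted);
    if (neighbors == null) {
      return; // node was not in the graph
    }
    for (int n : neighbors) {
      Set<Integer> candidates = new HashSet<>(graph.getOrDefault(n, List.of()));
      candidates.addAll(neighbors);   // fully connect the old neighborhood
      candidates.remove(deleted);
      candidates.remove(n);           // no self-links
      graph.put(n, selectDiverse(n, candidates, vectors, maxConn));
    }
    // drop any remaining in-links that still point at the deleted node
    for (List<Integer> links : graph.values()) {
      links.remove(Integer.valueOf(deleted));
    }
  }

  /** Greedy diversity-based neighbor selection, in the spirit of HNSW pruning. */
  static List<Integer> selectDiverse(int node, Set<Integer> candidates,
                                     float[][] vectors, int maxConn) {
    List<Integer> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingDouble(cand -> squareDist(vectors[node], vectors[cand])));
    List<Integer> kept = new ArrayList<>();
    for (int c : sorted) {
      double dToNode = squareDist(vectors[node], vectors[c]);
      boolean diverse = kept.stream()
          .allMatch(k -> dToNode < squareDist(vectors[c], vectors[k]));
      if (diverse) {
        kept.add(c);
      }
      if (kept.size() == maxConn) {
        break;
      }
    }
    return kept;
  }

  static double squareDist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```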

@mbrette

mbrette commented Sep 1, 2023

Approach (2) may not be that much of a change. Let's say you mark those nodes as "tombstones" so that we know they are deleted in the graph itself. Exploring the graph with those tombstones would be the same as what we do today with the acceptOrds filter: skip those nodes in the results, but keep them in the candidate list so their neighbors can still be explored.

The problem with that approach is that if you have a large ratio of deleted documents, the graph becomes bloated.
It could be part of a hybrid strategy where we keep those tombstones until we reach a threshold, after which we need to remove them - if we can - or rebuild the graph on the next merge.
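For concreteness, here is a rough sketch of what tombstone-aware search could look like on a simplified single-layer graph. The `Scored` record, the adjacency map, and the `isDeleted` predicate are illustrative stand-ins, not Lucene's searcher API; the point is only that tombstoned nodes are expanded but never collected.

```java
import java.util.*;
import java.util.function.IntPredicate;

public class TombstoneSearch {

  record Scored(int node, double dist) {}

  /**
   * Best-first search over a single-layer adjacency map. Tombstoned (deleted)
   * nodes are expanded like any other node, so they still provide
   * connectivity, but they are never added to the result set.
   */
  static int[] search(Map<Integer, List<Integer>> graph, float[][] vectors,
                      float[] query, int entryNode, int topK, IntPredicate isDeleted) {
    PriorityQueue<Scored> candidates =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::dist));            // min-heap
    PriorityQueue<Scored> results =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::dist).reversed()); // max-heap
    Set<Integer> visited = new HashSet<>();

    candidates.add(new Scored(entryNode, squareDist(query, vectors[entryNode])));
    visited.add(entryNode);

    while (!candidates.isEmpty()) {
      Scored current = candidates.poll();
      if (results.size() >= topK && current.dist() > results.peek().dist()) {
        break; // every remaining candidate is worse than the current top-k
      }
      if (!isDeleted.test(current.node())) {
        results.add(current);                 // collect only live nodes
        if (results.size() > topK) {
          results.poll();
        }
      }
      for (int neighbor : graph.getOrDefault(current.node(), List.of())) {
        if (visited.add(neighbor)) {
          candidates.add(new Scored(neighbor, squareDist(query, vectors[neighbor])));
        }
      }
    }
    return results.stream().mapToInt(Scored::node).toArray();
  }

  static double squareDist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```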

@jmazanec15
Contributor

  1. We could remove the deleted node from the graph, fully connect all of its neighbours, and run a diversity check on those neighbors to remove extra links. In case the deleted node is an entry node, we can insert the node closest to the deleted node up to the level where the deleted node existed.

What do you mean by fully connect its neighbors? Would this basically mean figuring out the to-be-deleted node's in-edges and reinserting those nodes into the graph using the normal edge-selection strategy, excluding the deleted nodes, to "patch" the broken connections? We looked into this a little bit recently, but the number of reinserts grows pretty fast. It might be promising, though, to start finding replacement neighbors from the neighbor that is being removed (as opposed to starting from the global entry point). I think with this approach we would need to figure out a way to avoid quality drift after the graph has been manipulated this way over several generations - the edge-selection strategy is different from the one used when building the graph. For instance, refinement over time may mean that the long-distance-hop neighbors added early on would start to disappear. Would the diversity check help in this case? Also, I think at a certain point it will be better to just rebuild the graph from scratch, suggesting a threshold might need to be selected.
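One practical wrinkle with the in-edge idea: the graph only records each node's own (outgoing) neighbor list, so there is no direct way to look up in-edges, and collecting them takes a pass over the whole adjacency structure. A small sketch of that sub-step, again on an illustrative Map-based graph rather than the real graph API:

```java
import java.util.*;

public class InEdges {

  /**
   * For each deleted node, collect the set of live nodes that still point to
   * it; these are the nodes whose neighbor lists would need to be "patched".
   */
  static Map<Integer, Set<Integer>> collectInEdges(
      Map<Integer, List<Integer>> graph, Set<Integer> deleted) {
    Map<Integer, Set<Integer>> inEdges = new HashMap<>();
    for (Map.Entry<Integer, List<Integer>> entry : graph.entrySet()) {
      int from = entry.getKey();
      if (deleted.contains(from)) {
        continue; // only live sources need patching
      }
      for (int to : entry.getValue()) {
        if (deleted.contains(to)) {
          inEdges.computeIfAbsent(to, k -> new HashSet<>()).add(from);
        }
      }
    }
    return inEdges;
  }
}
```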

  2. We tolerate a certain amount of deletions (say 10% ~ 20%) inside the HNSW graph and just use the deleted nodes as connections.

There was some discussion around this in hnswlib: nmslib/hnswlib#4 (comment). In practice, this probably would work well - but I am not really sure how to choose the right number of deletions to tolerate. But I agree with @mbrette - it might be good to take a hybrid approach.
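A tiny sketch of what that hybrid decision could look like; the 20% cutoff in the usage example is only a placeholder, not a tuned value:

```java
public class MergeStrategy {

  enum Plan { REUSE_GRAPH_WITH_TOMBSTONES, REBUILD_FROM_SCRATCH }

  /** Reuse the existing graph while the deleted ratio stays under the threshold. */
  static Plan choose(int liveDocs, int deletedDocs, double maxDeletedRatio) {
    double deletedRatio = (double) deletedDocs / (liveDocs + deletedDocs);
    return deletedRatio <= maxDeletedRatio
        ? Plan.REUSE_GRAPH_WITH_TOMBSTONES
        : Plan.REBUILD_FROM_SCRATCH;
  }

  public static void main(String[] args) {
    System.out.println(choose(90_000, 10_000, 0.20)); // REUSE_GRAPH_WITH_TOMBSTONES
    System.out.println(choose(60_000, 40_000, 0.20)); // REBUILD_FROM_SCRATCH
  }
}
```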

@jmazanec15
Contributor

Additionally, the FreshDiskANN paper did some work in this space. They ran a test for NSG where they iteratively repeat the following process for a certain number of cycles and track the recall:

  1. delete 5% of the index
  2. patch the incident nodes that were impacted via local neighborhoods (similar to @zhaih's (1))
  3. reinsert the deleted nodes
  4. measure recall
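A skeletal version of that cycle, with the per-step index operations expressed as a hypothetical interface purely to make the four steps concrete (none of these methods exist in Lucene or in the paper's code):

```java
import java.util.*;

public class DeleteReinsertCycle {

  /** Hypothetical index operations, named only to mirror the four steps above. */
  interface Index {
    Set<Integer> deleteRandomFraction(double fraction, Random random);      // step 1
    void patchIncidentNodes(Set<Integer> deleted);                          // step 2: local repair
    void reinsert(Set<Integer> deleted);                                    // step 3
    double measureRecall(float[][] queries, int[][] groundTruth, int topK); // step 4
  }

  static List<Double> run(Index index, int cycles, float[][] queries, int[][] groundTruth) {
    Random random = new Random(42);
    List<Double> recalls = new ArrayList<>();
    for (int cycle = 0; cycle < cycles; cycle++) {
      Set<Integer> deleted = index.deleteRandomFraction(0.05, random);
      index.patchIncidentNodes(deleted);
      index.reinsert(deleted);
      recalls.add(index.measureRecall(queries, groundTruth, 10));
    }
    return recalls;
  }
}
```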

They ran a similar one for HNSW where they do not patch the edges. In both cases, they saw some degradation:

[Screenshot: recall degradation plots from the FreshDiskANN paper]

Their intuition for why this happens is that the graphs become sparser as this process repeats, leading to less navigability. The graphs become sparser because the pruning policy is stricter.

In their system, they do employ an algorithm similar to @zhaih's (1), where they connect the incident edges and prune based on some criteria, and it shows promise.
