Init HNSW merge with graph containing deleted documents #12533
Comments
Approach (2) may not require much change. Let's say you mark those nodes as "tombstones" so that we know they are deleted in the graph itself. Exploring the graph with those tombstones would be the same as what we do today with the acceptOrds filter: skipping those nodes in the results, but keeping them in the candidate list so we still explore their neighbors. The problem with that approach is that if you have a large ratio of deleted documents, the graph becomes bloated.
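To make the tombstone behavior concrete, here is a minimal sketch, assuming hypothetical `Graph`/`Scorer` interfaces rather than Lucene's actual `HnswGraphSearcher`, of a best-first search on one layer where tombstoned nodes are still expanded (so the graph stays navigable) but never enter the result heap, analogous to acceptOrds filtering:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/**
 * Sketch only: tombstone-aware best-first search on a single graph layer.
 * Graph and Scorer are hypothetical placeholders, not Lucene's real API.
 */
final class TombstoneSearchSketch {

  interface Graph {
    int[] neighbors(int node); // out-neighbors of node on this layer
  }

  interface Scorer {
    float score(int node); // similarity to the query; higher means closer
  }

  record Candidate(int node, float score) {}

  static PriorityQueue<Candidate> search(
      Graph graph, Scorer scorer, int entryPoint, int topK, BitSet tombstones) {
    // results: min-heap on score, so the weakest kept hit is evicted first
    PriorityQueue<Candidate> results =
        new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));
    // candidates: max-heap on score, so the most promising node is expanded first
    PriorityQueue<Candidate> candidates =
        new PriorityQueue<>((a, b) -> Float.compare(b.score(), a.score()));
    Set<Integer> visited = new HashSet<>();

    candidates.add(new Candidate(entryPoint, scorer.score(entryPoint)));
    visited.add(entryPoint);

    while (!candidates.isEmpty()) {
      Candidate c = candidates.poll();
      // stop once the best remaining candidate cannot improve a full result heap
      if (results.size() >= topK && c.score() < results.peek().score()) {
        break;
      }
      // tombstoned nodes are skipped from the results but still expanded below
      if (!tombstones.get(c.node())) {
        results.add(c);
        if (results.size() > topK) {
          results.poll();
        }
      }
      for (int n : graph.neighbors(c.node())) {
        if (visited.add(n)) {
          candidates.add(new Candidate(n, scorer.score(n)));
        }
      }
    }
    return results;
  }
}
```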
What do you mean by fully connecting its neighbors? Would this mean basically figuring out the to-be-deleted nodes' in-edges and re-inserting them into the graph using the normal edge-selection strategy, excluding the deleted nodes, to "patch" the broken connections? We looked into this a little bit recently, but the number of re-inserts grows pretty fast. It might be promising, though, to start finding replacement neighbors from the neighbor that is being removed (as opposed to starting from the global entry point). I think with this approach we would need to figure out a way to avoid quality drift after the graph has been manipulated this way over several generations, since the edge-selection strategy is different from the one used when building the graph. For instance, refinement over time may mean that the long-distance hop neighbors added early on would start to disappear. Would the diversity check help in this case? Also, I think at a certain point it will be better to just rebuild the graph from scratch, suggesting a threshold might need to be selected.
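As a rough illustration of the patching idea, here is a sketch, with hypothetical `MutableGraph`/`DistanceFunction` types rather than Lucene's API, where each surviving in-neighbor of a deleted node is offered the deleted node's own out-neighbors as replacement candidates and then re-pruned with an HNSW-style diversity check:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/**
 * Sketch only: patch edges around deleted nodes by reconnecting incident edges
 * and re-pruning with a diversity check. Types are hypothetical placeholders.
 */
final class DeletePatchSketch {

  interface MutableGraph {
    List<Integer> neighbors(int node);            // current out-neighbors
    void setNeighbors(int node, List<Integer> n); // overwrite the adjacency list
    int size();
  }

  interface DistanceFunction {
    float distance(int a, int b); // smaller means closer
  }

  static void patchDeletions(
      MutableGraph graph, DistanceFunction dist, BitSet deleted, int maxConn) {
    for (int d = deleted.nextSetBit(0); d >= 0; d = deleted.nextSetBit(d + 1)) {
      // Find in-neighbors of d by scanning; a real implementation would keep a
      // reverse-edge index instead of a full scan per deleted node.
      for (int u = 0; u < graph.size(); u++) {
        if (deleted.get(u) || !graph.neighbors(u).contains(d)) {
          continue;
        }
        // Candidate set: u's surviving neighbors plus d's surviving neighbors.
        List<Integer> candidates = new ArrayList<>();
        for (int v : graph.neighbors(u)) {
          if (v != d && !deleted.get(v)) candidates.add(v);
        }
        for (int v : graph.neighbors(d)) {
          if (v != u && !deleted.get(v) && !candidates.contains(v)) candidates.add(v);
        }
        graph.setNeighbors(u, selectDiverse(u, candidates, dist, maxConn));
      }
    }
  }

  // Diversity-style selection: keep a candidate only if it is closer to u than
  // to every neighbor already kept, capped at maxConn.
  private static List<Integer> selectDiverse(
      int u, List<Integer> candidates, DistanceFunction dist, int maxConn) {
    candidates.sort((a, b) -> Float.compare(dist.distance(u, a), dist.distance(u, b)));
    List<Integer> kept = new ArrayList<>();
    for (int c : candidates) {
      boolean diverse = true;
      for (int k : kept) {
        if (dist.distance(c, k) < dist.distance(u, c)) {
          diverse = false;
          break;
        }
      }
      if (diverse) {
        kept.add(c);
        if (kept.size() == maxConn) break;
      }
    }
    return kept;
  }
}
```

In practice the repair cost would need to be bounded (for example by the threshold mentioned above), beyond which rebuilding the graph from scratch is likely cheaper.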
There was some discussion around this in hnswlib: nmslib/hnswlib#4 (comment). In practice this would probably work well, but I'm not really sure how to choose the correct number of deletions. I agree with @mbrette, though: it might be good to take a hybrid approach.
Additionally, the FreshDiskANN paper did some work in this space. They ran a test for NSG where they iteratively repeat a delete-and-reinsert cycle a certain number of times and track the recall.
They ran a similar test for HNSW where they do not patch the edges. In both cases they saw some degradation. Their intuition is that this happens because the graphs become sparser as the cycles proceed (the pruning policy is stricter), leading to less navigability. In their system they employ an algorithm similar to @zhaih's idea (1): they connect the incident edges of deleted nodes and prune based on a criterion that shows promise.
Description
Currently, when we're merging HNSW graphs we're able to start with an existing graph rather than inserting nodes from scratch, thanks to #12050. But we have a constraint that the init graph cannot be one that has deletions, since we do not have a good way to handle deletions within a graph without reconstructing it. This makes the nice optimization a bit wasteful, because in reality the bigger the segment, the less likely it is to have no deletions, especially in a hybrid text-embedding index.
Opening this issue to discuss how we could solve this problem or find a workaround.
My very naive ideas:
1. Remove the deleted nodes from the init graph and patch the holes by reconnecting each deleted node's neighbors, re-running edge selection so the remaining nodes stay well connected.
2. Keep the deleted nodes in the graph as tombstones and skip them at search time, excluding them from the results but still traversing through them.

I personally like 1 better, as for 2 we need to change the search behavior accordingly, which sounds a bit fragile. Any other ideas are welcome!