[BUG]: The batch, the sync and the missing vector #2062
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking.
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Nice description!
@@ -292,3 +296,20 @@ def ann_accuracy(
    # Ensure that the query results are sorted by distance
    for distance_result in query_results["distances"]:
        assert np.allclose(np.sort(distance_result), distance_result)


def segments_len_match(api: ServerAPI, collection: Collection) -> None:
nice
@@ -298,8 +298,12 @@ def collections(
        metadata.update(test_hnsw_config)
    if with_persistent_hnsw_params:
        metadata["hnsw:batch_size"] = draw(st.integers(min_value=3, max_value=2000))
        # batch_size > sync_threshold doesn't make sense
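Not part of the diff, just a minimal Hypothesis sketch of how `hnsw:sync_threshold` could be drawn so it respects the constraint the comment above refers to (`persistent_hnsw_params` is a hypothetical helper, not code from this PR):

```python
# Illustrative sketch only: keep batch_size <= sync_threshold when drawing both values.
import hypothesis.strategies as st

@st.composite
def persistent_hnsw_params(draw):
    metadata = {}
    metadata["hnsw:batch_size"] = draw(st.integers(min_value=3, max_value=2000))
    # batch_size > sync_threshold doesn't make sense, so anchor the lower bound
    metadata["hnsw:sync_threshold"] = draw(
        st.integers(min_value=metadata["hnsw:batch_size"], max_value=2000)
    )
    return metadata
```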
nice
                metadata=collection.metadata,
                embedding_function=collection.embedding_function,
            )
        except Exception as e:
What's this doing and why? Seems comment-worthy.
`hnsw:batch_size` is only persisted in PersistentClient/client-server setups. So, instead of making a more complex change to the test rig, this is one way to detect whether the client is persisted.
- Added a new test to verify the specific use case we've fixed.
WARNING: Property tests are expected to fail! There is another bug that will get stacked on top of this.
Force-pushed from b613f2e to 805bc79.
Any status updates on this bug, @tazarov?
Closing: stale and implemented as part of #2512.
Description of changes
Summarize the changes made by this PR.
- Fixed `get()` with vector data being lost

Test plan
How are these changes tested?
- `pytest` for python, `yarn test` for js, `cargo test` for rust

Documentation Changes
Affected issues
The following is a list of Discord discussions related to this issue:
Root Cause Analysis
TLDR: Under specific (note: specific, not special) conditions, the metadata and vector segments go out of sync due to batching mechanics, causing vector data to be lost.
The Detail
A simple scenario: a user adds data to a collection, enough for their data to be moved from the bruteforce index to HNSW (i.e. `batch_size`, which defaults to 100, is exceeded). At some point, the user decides they need to update a document (already in HNSW) and replace it with a fresh copy, a fairly common use case to make RAG systems useful. Down the line, the user uses `delete()` to remove the desired document's id and `add()` to insert the new document. Chroma offers `upsert()`, but judging by the number of affected issues and discussions, experience shows some people prefer delete/add mechanics over upsert. At the moment of insertion of the new data, they are greeted with `Add of existing embedding ID:`. It seems like a warning, and most people, including myself, didn't think much of it (I even went as far as to create a PR to bypass the warnings in WAL replays - https://github.com/chroma-core/chroma/pull/1763/files). The reality is that underneath, the HNSW batching mechanism was silently discarding vector data for recently deleted vectors, causing the metadata and vector segments to go out of sync and leading to the following three types of problems with a subsequent `get(include=["embeddings"])`:
- `IndexError: list assignment index out of range`
- `TypeError: 'NoneType' object is not subscriptable`
- Results silently returned with mismatched lengths (missing embeddings)
How to reproduce
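A minimal sketch of the scenario described above, assuming a `PersistentClient` on an affected `0.4.x` build; the path, counts, and embeddings are illustrative:

```python
# Minimal reproduction sketch; values for hnsw:batch_size/hnsw:sync_threshold and
# the persist path are illustrative, not the exact ones used in the tests.
import chromadb

client = chromadb.PersistentClient(path="./repro_db")
col = client.create_collection(
    "repro",
    metadata={"hnsw:batch_size": 10, "hnsw:sync_threshold": 20},
)

# 1. Add enough records that some vectors move from the bruteforce index into HNSW.
ids = [f"id-{i}" for i in range(15)]
col.add(ids=ids, embeddings=[[float(i), float(i)] for i in range(15)])

# 2. Delete a record that already lives in HNSW, then re-add it.
col.delete(ids=["id-1"])
col.add(ids=["id-1"], embeddings=[[1.0, 1.0]])  # logs "Add of existing embedding ID: id-1"

# 3. Fetching embeddings now fails (IndexError/TypeError) or silently returns
#    fewer embeddings than ids, depending on the id_to_index key ordering.
print(col.get(include=["embeddings"]))
```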
What is affected
The defect affects `PersistentClient` and Chroma server.

Why isn't in-memory affected:
In-memory indices are not affected because the batch is applied and synchronized at the end of each transaction. Consider the following two locations in `_write_records` of `local_hnsw.py`:
- chroma/chromadb/segment/impl/vector/local_hnsw.py, line 291 in c3db12e
- chroma/chromadb/segment/impl/vector/local_hnsw.py, line 321 in c3db12e
What really happens
Let's start by visualizing things to illustrate how the defect works:

The happy path
The happy path is the following layout, which is a normal vector segment layout. We have some data in HNSW and some in the bruteforce (BF) index. The Batch keeps track of things being added and deleted so that we can sync them happily after a `batch_size` overflow of the BF.

A regular query result from the vector segment would look like this for the above layout:
There are two loops in the `get_vectors()` method:
- chroma/chromadb/segment/impl/vector/local_persistent_hnsw.py, lines 315 to 323 in 99381f2
- chroma/chromadb/segment/impl/vector/local_persistent_hnsw.py, lines 325 to 332 in 99381f2
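In simplified form (this is an illustration of the two-pass assembly, not the actual segment code; `bf_lookup` and `hnsw_lookup` stand in for the real index structures):

```python
# Simplified illustration of the two loops in get_vectors().
from typing import Dict, List, Optional

Vector = List[float]

def assemble_results(requested_ids: List[str],
                     bf_lookup: Dict[str, Vector],
                     hnsw_lookup: Dict[str, Vector]) -> List[Optional[Vector]]:
    results: List[Optional[Vector]] = [None] * len(requested_ids)
    id_to_index: Dict[str, int] = {}

    # Loop 1: serve ids that are still in the bruteforce batch, remember the rest.
    for i, rid in enumerate(requested_ids):
        if rid in bf_lookup:
            results[i] = bf_lookup[rid]
        else:
            id_to_index[rid] = i

    # Loop 2: fetch the remaining ids from HNSW and slot them back by position.
    for rid in (r for r in id_to_index if r in hnsw_lookup):
        results[id_to_index[rid]] = hnsw_lookup[rid]

    # An id that is in neither lookup (the out-of-sync case) leaves a hole;
    # depending on where that key sits, the real code trips over it differently.
    return results
```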
When things operate under normal conditions, as seen above, the `id_to_index` dictionary and the `results` align perfectly.

Now, let's look at what happens when a vector is removed:
The above shows the vector segment layout (state) after a `delete()` operation. An important fact to observe here is that while ID `1` goes into the deleted items in the batch, it is not yet removed from the HNSW index (including its metadata held in `_id_to_label`, `_label_to_id` and `_id_to_seq_id`). Keep this in mind; it's important in the next diagram. Sending a `get()` at this stage will return the correct results, as HNSW vectors are fetched with IDs coming from the metadata index (chroma/chromadb/api/segment.py, line 540 in 99381f2).

The metadata segment is successfully updated to remove the ID from the sqlite tables:
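One way to confirm the metadata-side delete on disk is to peek at the sqlite file directly; a sketch assuming the default single-node layout (file, table, and column names may differ between versions, and the path reuses the illustrative repro above):

```python
# Hedged sketch: inspect the sqlite store to confirm the metadata segment removed the ID.
import sqlite3

conn = sqlite3.connect("./repro_db/chroma.sqlite3")
rows = conn.execute(
    "SELECT embedding_id FROM embeddings WHERE embedding_id = ?", ("id-1",)
).fetchall()
print(rows)  # empty right after delete(); the row reappears after the subsequent add()
conn.close()
```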
So what happens when we `add()`:

The WAL (Embedding Queue) works in a pub-sub way where each segment registers for updates. Each time a user adds data to Chroma, the embedding queue distributes that to all segment subscriptions. In single-node Chroma, there are just two segments for each collection: metadata and vector. To ensure that your data is safely stored in the segments, Chroma sequentially and synchronously notifies each segment. The sequencing provides no guarantee of which segment gets the update first (chroma/chromadb/db/mixins/embeddings_queue.py, lines 359 to 363 in a265673).
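Conceptually, the fan-out looks like the sketch below; this is an illustration of the pub-sub shape, not the actual `embeddings_queue` code:

```python
# Conceptual illustration of the WAL fan-out: each subscribed segment receives
# the same records, one after another, in-process and synchronously.
from typing import Callable, List, Sequence

Subscriber = Callable[[Sequence[dict]], None]

class EmbeddingQueueSketch:
    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        self._subscribers.append(callback)

    def submit(self, records: Sequence[dict]) -> None:
        # A silent rejection in one segment does not stop the others
        # from applying the same records - which is how the two segments diverge.
        for notify in self._subscribers:
            notify(records)
```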
As seen in the diagram above, the metadata segment is updated fine, as it no longer has any reference to `1`, while the vector segment rejects the update because it can still see the ID in its `_id_to_label` HNSW metadata. It is important to observe that the rejection does not result in an exception but a mere warning, which in a client/server setup does not even make it to the client.

So here we are: the metadata and vector segments are out of sync. This out-of-sync state is not immediately visible; `add()`, `query()`, etc. all work just fine until you get to `get()`. That is where you are confronted with the errors above when you also try to include the embeddings (vectors).
That is where you get confronted with the errors above when you also try to include the embeddings (vectors).But why does this problem surface in three different ways? The answer is deceitfully simple - key arrangement of
id_to_index
dictionary (chroma/chromadb/segment/impl/vector/local_persistent_hnsw.py
Line 314 in a265673
The arrangement largely depends on the IDs used; in our experiments, we used
UUIDv4
which appears to be the more common approach people take to generating IDs in Chroma. The inherent random nature of uuids makes key ordering withinid_to_index
unpredictable. In our experimentation, we’ve observed the following three states of the keys withinid_to_index
:As exhibited by the diagram in the out-of-sync layout of vector segment the baseline IDs come from metadata segment but the color coding indicates which subset of the vector segment they belong to - BF or HNSW or missing (in red) if in neither.
In (1), the missing ID is at the beginning, so the batch ID fetching and assignment in `results` are not affected, which lets the results surface in `SegmentAPI`, where a `TypeError: 'NoneType' object is not subscriptable` is thrown because the first item in `results` is `None`.

In (2), the missing ID is somewhere in the middle of the keys, so an `IndexError: list assignment index out of range` is raised within the vector segment during the batch fetching and assignment of results.

In (3), the missing ID is at the end of the `id_to_index` keys, which lets the missing result go through both the vector segment and `SegmentAPI` completely unnoticed, and results with mismatched lengths are returned to the client.

Here's the distribution of the errors:
Key takeaways
- The inconsistent way the problem surfaces (`TypeError`, `IndexError`, or silently) makes the issue a bit difficult to diagnose, especially on unlucky distributions of uuids or whatever IDs are being used in tests
- `upsert()` on a missing ID fixes the out-of-sync state for that record
- `query()`, which relies on metadata pre-filtering and HNSW filtering, does not appear to be affected by an execution error. However, user expectations might not be met, given the document is visible in Chroma but a search for a similar or exact item does not appear to match it
- The defect is present in `0.4.x`
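The `upsert()` takeaway above translates into a simple per-record repair on affected versions; a sketch reusing the illustrative collection handle from the reproduction sketch (any affected collection would do):

```python
# Sketch of the per-record repair: upsert the affected id with its fresh
# embedding/document so the vector segment picks it up again.
col.upsert(
    ids=["id-1"],
    embeddings=[[1.0, 1.0]],
    documents=["fresh copy of the document"],
)
```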
Solutions
We've explored four possible solutions as follows:
- `get()` and `query()` - We decided not to go for this approach for the following reasons:

Follow-ups
Testing
Existing tests fail to catch the error for the following reasons:
- In `test_embeddings.py`, the observation is that in all state machine iterations the existing embeddings rarely exceed 50, whereas we never configure `hnsw:batch_size` (defaults to 100), so the segment never moves vectors to HNSW.
- We introduce `with_persistent_hnsw_params` to allow a lower record count to cross the `batch_size` threshold.