-
Notifications
You must be signed in to change notification settings - Fork 15.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community : [bugfix] Use document ids as keys in AzureSearch vectorstore #25486
community : [bugfix] Use document ids as keys in AzureSearch vectorstore #25486
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
@eyurtsev please review or assign to someone! Thanks! |
Is there anything I can do to move this PR forward? Please let me know! |
Pretty please, could someone please take a look? We're using this fix in our pipeline and have to keep using forked version until the PR gets through. Please let me know if I can do anything to speed up the process! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks fine to me, but could you add a test to the existing AzureSearch
tests that fails with existing code, and passes following this update?
@ccurme added a unit test. Please take a look. Thanks! |
else: | ||
key = str(uuid.uuid4()) | ||
# Encoding key for Azure Search valid characters | ||
key = base64.urlsafe_b64encode(bytes(key, "utf-8")).decode("ascii") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now if a user provides keys
, we encode them. Here we're changing that. Is this a breaking change? Should we encode in all circumstances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documents are passed to azure.search.documents.client.upload_documents()
. Nowhere in the documentation is mentioned that the ids should be url encoded. My guess is that the keys were url encoded just as an extra precaution.
I left it there for the case when the keys are generated by uuid4()
however in cases when keys/ids are passed in, url encoding them breaks stuff.
For instance when indexer is used to manage documents in the azure vector store, it passes ids
to .add_texts()
and expects that those ids
are then used as-is (and stores them in the record manager under provided ids). When .add_embeddings()
url-encodes them, the ids in vector store do not match ids in record manager. This breaks indexing.
In general when I pass ids/keys to a method I don't expect them to be silently changed in the background. Better alternative would be checking validity of provided keys/ids, however the documentation stats that the key
is of type Edm.String
which is defined here as <any UTF-8 character>' Note: See definition of UTF8-char in [RFC3629]
There is a limit of 1024 chars which could be checked, and the key
s should be unique what could be checked but only for the documents currently being uploaded - and therefore I think it is better left for the user to ensure uniqueness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccurme sorry for late reply! I actually replied to your question 4 days ago but didn't hit "start review" (I assumed the comment would be still visible). Only today I found out that the comment does not show up if the review is not "started".
Please take a look and let me know whether it answers your questions and whether you agree. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccurme Please let me know if I can expand on the explanation I gave. Or please let me know what can I do to move forward. The issue I'm trying to fix is that as it stands right now, the indexer does not work with Azure at all. Thank you!
…ore (langchain-ai#25486) # Description [Vector store base class](https://github.com/langchain-ai/langchain/blob/4cdaca67dc51dba887289f56c6fead3c1a52f97d/libs/core/langchain_core/vectorstores/base.py#L65) currently expects `ids` to be passed in and that is what it passes along to the AzureSearch vector store when attempting to `add_texts()`. However AzureSearch expects `keys` to be passed in. When they are not present, AzureSearch `add_embeddings()` makes up new uuids. This is a problem when trying to run indexing. [Indexing code expects](https://github.com/langchain-ai/langchain/blob/b297af5482ae7c6d26779513d637ec657a1cd552/libs/core/langchain_core/indexing/api.py#L371) the documents to be uploaded using provided ids. Currently AzureSearch ignores `ids` passed from `indexing` and makes up new ones. Later when `indexer` attempts to delete removed file, it uses the `id` it had stored when uploading the document, however it was uploaded under different `id`. **Twitter handle: @martintriska1**
Description
Vector store base class currently expects
ids
to be passed in and that is what it passes along to the AzureSearch vector store when attempting toadd_texts()
. However AzureSearch expectskeys
to be passed in. When they are not present, AzureSearchadd_embeddings()
makes up new uuids. This is a problem when trying to run indexing. Indexing code expects the documents to be uploaded using provided ids. Currently AzureSearch ignoresids
passed fromindexing
and makes up new ones. Later whenindexer
attempts to delete removed file, it uses theid
it had stored when uploading the document, however it was uploaded under differentid
.Twitter handle: @martintriska1