-
Notifications
You must be signed in to change notification settings - Fork 15.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community : [bugfix] Use document ids as keys in AzureSearch vectorstore #25486
Merged
ccurme
merged 13 commits into
langchain-ai:master
from
MacanPN:triska/use-document-ids-as-keys-in-vectorstore
Sep 19, 2024
Merged
Changes from 7 commits
Commits
Show all changes
13 commits
Select commit
Hold shift + click to select a range
e3f0114
Bugfix for AzureSearch vector store
MacanPN 1692279
typo fix
MacanPN d00bbc1
auto-format
MacanPN 731511b
Merge branch 'master' into triska/use-document-ids-as-keys-in-vectors…
MacanPN 11b093f
Merge branch 'master' into triska/use-document-ids-as-keys-in-vectors…
MacanPN 302509d
Merge branch 'master' into triska/use-document-ids-as-keys-in-vectors…
MacanPN b47b046
Merge branch 'master' into triska/use-document-ids-as-keys-in-vectors…
MacanPN 48f724f
added unit test
MacanPN 3b8563f
tiny formatting update
MacanPN de28870
formatting
MacanPN 9650275
fixed typing issues in test
MacanPN 5872676
add "ignore [no-untyped-def]" in the test
MacanPN 7c6c232
formatting: literally added a space before comment :-/
MacanPN File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now if a user provides
keys
, we encode them. Here we're changing that. Is this a breaking change? Should we encode in all circumstances?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documents are passed to
azure.search.documents.client.upload_documents()
. Nowhere in the documentation is mentioned that the ids should be url encoded. My guess is that the keys were url encoded just as an extra precaution.I left it there for the case when the keys are generated by
uuid4()
however in cases when keys/ids are passed in, url encoding them breaks stuff.For instance when indexer is used to manage documents in the azure vector store, it passes
ids
to.add_texts()
and expects that thoseids
are then used as-is (and stores them in the record manager under provided ids). When.add_embeddings()
url-encodes them, the ids in vector store do not match ids in record manager. This breaks indexing.In general when I pass ids/keys to a method I don't expect them to be silently changed in the background. Better alternative would be checking validity of provided keys/ids, however the documentation stats that the
key
is of typeEdm.String
which is defined here as<any UTF-8 character>' Note: See definition of UTF8-char in [RFC3629]
There is a limit of 1024 chars which could be checked, and the
key
s should be unique what could be checked but only for the documents currently being uploaded - and therefore I think it is better left for the user to ensure uniqueness.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccurme sorry for late reply! I actually replied to your question 4 days ago but didn't hit "start review" (I assumed the comment would be still visible). Only today I found out that the comment does not show up if the review is not "started".
Please take a look and let me know whether it answers your questions and whether you agree. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccurme Please let me know if I can expand on the explanation I gave. Or please let me know what can I do to move forward. The issue I'm trying to fix is that as it stands right now, the indexer does not work with Azure at all. Thank you!