community : [bugfix] Use document ids as keys in AzureSearch vectorstore #25486

MacanPN · 2024-08-16T12:32:00Z

Description

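The [vector store base class](https://github.com/langchain-ai/langchain/blob/4cdaca67dc51dba887289f56c6fead3c1a52f97d/libs/core/langchain_core/vectorstores/base.py#L65) currently expects `ids` to be passed in, and that is what it passes along to the AzureSearch vector store when calling `add_texts()`. However, AzureSearch expects `keys` to be passed in. When they are not present, AzureSearch `add_embeddings()` makes up new UUIDs.

This is a problem when running indexing. The [indexing code expects](https://github.com/langchain-ai/langchain/blob/b297af5482ae7c6d26779513d637ec657a1cd552/libs/core/langchain_core/indexing/api.py#L371) the documents to be uploaded under the provided ids. Currently AzureSearch ignores the `ids` passed from `indexing` and makes up new ones. Later, when the indexer attempts to delete a removed file, it uses the `id` it stored when uploading the document, but the document was uploaded under a different `id`, so the delete does not find it.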
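To make the failure mode concrete, here is a minimal sketch in plain Python. `FakeStore` and `index_then_delete` are hypothetical names invented for illustration; they are not the AzureSearch or LangChain indexing APIs, and they only mimic the behaviour described above (the caller's ids being replaced by generated UUIDs).

```python
import uuid


class FakeStore:
    """Toy stand-in for a vector store; NOT the real AzureSearch class."""

    def __init__(self, honor_ids: bool) -> None:
        self.honor_ids = honor_ids
        self.docs = {}  # maps stored key -> text

    def add_texts(self, texts, ids=None):
        stored = []
        for i, text in enumerate(texts):
            if self.honor_ids and ids is not None:
                key = ids[i]  # use the caller's id as the key
            else:
                key = str(uuid.uuid4())  # caller's id is silently dropped
            self.docs[key] = text
            stored.append(key)
        return stored

    def delete(self, ids):
        # True only if every requested id was actually present in the store.
        found = [self.docs.pop(i, None) is not None for i in ids]
        return all(found)


def index_then_delete(store):
    # The indexer remembers the ids it passed in, not the ones the store chose.
    remembered = ["doc-1"]
    store.add_texts(["hello"], ids=remembered)
    return store.delete(remembered)
```

With `honor_ids=False` the later delete cannot find the document, which is the mismatch this PR describes; with `honor_ids=True` the remembered id and the stored key agree.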

Twitter handle: @martintriska1

MacanPN · 2024-08-19T08:28:23Z

@eyurtsev please review or assign to someone! Thanks!

MacanPN · 2024-08-20T13:22:41Z

Is there anything I can do to move this PR forward? Please let me know!

MacanPN · 2024-08-28T07:12:26Z

Pretty please, could someone please take a look? We're using this fix in our pipeline and have to keep using forked version until the PR gets through. Please let me know if I can do anything to speed up the process!

@baskaryan @efriis @eyurtsev @ccurme @vbarda @hwchase17

ccurme · 2024-08-28

The change looks fine to me, but could you add a test to the existing AzureSearch tests that fails with existing code, and passes following this update?

MacanPN · 2024-08-29T15:15:51Z

@ccurme added a unit test. Please take a look. Thanks!

ccurme · 2024-08-29T19:15:29Z

libs/community/langchain_community/vectorstores/azuresearch.py

+            else:
+                key = str(uuid.uuid4())
+                # Encoding key for Azure Search valid characters
+                key = base64.urlsafe_b64encode(bytes(key, "utf-8")).decode("ascii")


Right now, if a user provides keys, we encode them. Here we're changing that. Is this a breaking change? Should we encode in all circumstances?
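For readers following the question, the two behaviours can be sketched side by side. This is a reconstruction from the diff fragment and the review comment, not a copy of `azuresearch.py`; the helper names `encode_key`, `resolve_key_before`, and `resolve_key_after` are hypothetical.

```python
import base64
import uuid
from typing import Optional


def encode_key(key: str) -> str:
    # Same encoding call as in the diff fragment above.
    return base64.urlsafe_b64encode(bytes(key, "utf-8")).decode("ascii")


def resolve_key_before(provided: Optional[str]) -> str:
    # Assumed prior behaviour: every key is encoded, provided or generated.
    key = provided if provided is not None else str(uuid.uuid4())
    return encode_key(key)


def resolve_key_after(provided: Optional[str]) -> str:
    # Assumed behaviour with this PR: provided keys pass through untouched;
    # only generated UUIDs are encoded.
    if provided is not None:
        return provided
    return encode_key(str(uuid.uuid4()))
```

Under these assumptions, `resolve_key_before("doc-1")` yields `"ZG9jLTE="` while `resolve_key_after("doc-1")` yields `"doc-1"`, so a caller that stored `"doc-1"` elsewhere would only find the document under the second behaviour.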

MacanPN · 2024-08-30

The documents are passed to azure.search.documents.client.upload_documents(). Nowhere in the documentation is it mentioned that the ids should be URL-encoded. My guess is that the keys were URL-encoded just as an extra precaution.
I left the encoding in place for keys generated by uuid4(); however, when keys/ids are passed in, URL-encoding them breaks things.
For instance, when the indexer is used to manage documents in the Azure vector store, it passes ids to .add_texts() and expects those ids to be used as-is (it stores them in the record manager under the provided ids). When .add_embeddings() URL-encodes them, the ids in the vector store do not match the ids in the record manager. This breaks indexing.
In general, when I pass ids/keys to a method, I don't expect them to be silently changed in the background. A better alternative would be to check the validity of the provided keys/ids; however, the documentation states that the key is of type Edm.String, which is defined as "any UTF-8 character" (Note: see the definition of UTF8-char in [RFC3629]).
There is a limit of 1024 chars, which could be checked. The keys should also be unique, which could be checked, but only for the documents currently being uploaded, so I think it is better left to the user to ensure uniqueness.
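The validation alternative mentioned in this reply could look roughly like the sketch below. It is not part of the PR; `find_key_problems` is a hypothetical helper, and the 1024-character limit is taken from the comment above rather than verified here. As the reply notes, uniqueness can only be checked within the batch being uploaded.

```python
from typing import List, Sequence

# Limit quoted in the comment above; treated here as an assumption.
MAX_KEY_LENGTH = 1024


def find_key_problems(keys: Sequence[str]) -> List[str]:
    """Return human-readable problems found in one batch of keys."""
    problems = []
    seen = set()
    for key in keys:
        if len(key) > MAX_KEY_LENGTH:
            problems.append(
                f"key longer than {MAX_KEY_LENGTH} characters: {key[:20]}..."
            )
        if key in seen:
            # Only detects duplicates inside this batch, not against the index.
            problems.append(f"duplicate key in batch: {key}")
        seen.add(key)
    return problems
```

An empty list means the batch passed both checks; it says nothing about keys already present in the index.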

MacanPN · 2024-09-03

@ccurme sorry for the late reply! I actually replied to your question 4 days ago but didn't hit "start review" (I assumed the comment would still be visible). Only today did I find out that the comment does not show up if the review is not "started".
Please take a look and let me know whether it answers your questions and whether you agree. Thanks!

MacanPN · 2024-09-09

@ccurme Please let me know if I can expand on the explanation I gave, or what I can do to move this forward. The issue I'm trying to fix is that, as it stands right now, the indexer does not work with Azure at all. Thank you!

petergoldstein · 2024-09-10T14:12:08Z

@ccurme Are you available to give this another look? I think @MacanPN 's responses address your question.

This is a blocker for me, and I'd really like to see it merged. If there's anything I can do to move that along, please let me know. Thanks!

MacanPN added 3 commits August 15, 2024 21:24

Bugfix for AzureSearch vector store

e3f0114

typo fix

1692279

auto-format

d00bbc1

dosubot bot added the size:S, community, Ɑ: vector store, and 🤖:bug labels Aug 16, 2024

Merge branch 'master' into triska/use-document-ids-as-keys-in-vectorstore

731511b

MacanPN added 3 commits August 19, 2024 15:25

Merge branch 'master' into triska/use-document-ids-as-keys-in-vectorstore

11b093f

Merge branch 'master' into triska/use-document-ids-as-keys-in-vectorstore

302509d

Merge branch 'master' into triska/use-document-ids-as-keys-in-vectorstore

b47b046

ccurme self-assigned this Aug 28, 2024

ccurme reviewed Aug 28, 2024

added unit test

48f724f

dosubot bot added the size:M label and removed the size:S label Aug 29, 2024

MacanPN added 5 commits August 29, 2024 15:56

tiny formatting update

3b8563f

formatting

de28870

fixed typing issues in test

9650275

add "ignore [no-untyped-def]" in the test

5872676

formatting: literally added a space before comment :-/

7c6c232

ccurme reviewed Aug 29, 2024

ccurme approved these changes Sep 19, 2024

dosubot bot added the lgtm label Sep 19, 2024

ccurme merged commit 3fc0ea5 into langchain-ai:master Sep 19, 2024
27 checks passed
