community: CrateDB: Vector Store #27710

amotl · 2024-10-29T13:33:55Z

About

Description: Vector Store adapter for CrateDB.
Coming from: Add support for CrateDB to LangChain LLM framework crate-workbench/langchain#1
Documentation: community: CrateDB: Documentation about Vector Store, Document Loader, and Conversational Memory #27713
Addressed to: @eyurtsev

Status

We consider the patch ready for review and merging, with a few spots to be handled in a later iteration.
Please let us know if you would like any other details addressed before the initial merge.
A few backlog items have been collected here: Backlog for GA crate-workbench/langchain#30.

Sandbox

A short walkthrough on how to run the software tests on your workstation.

docker run --rm -it --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  crate:latest -Cdiscovery.type=single-node

git clone https://github.com/crate-workbench/langchain.git --branch=cratedb-up/1/vector-store
cd langchain
uv venv
source .venv/bin/activate
cd libs/community
uv pip install --upgrade --prerelease=allow --editable=. poetry sqlalchemy-cratedb
poetry install --no-interaction --no-ansi --with dev,test,test_integration

pytest -vvv tests/integration_tests/vectorstores/test_cratedb.py
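
Before running the test suite, it can help to confirm that the container started above is reachable. The following is only a minimal sketch, not part of the patch: the helper name `cratedb_is_up` is hypothetical, and it assumes that CrateDB answers plain HTTP requests with a JSON document on the published port 4200.

```python
import http.client
import json
import urllib.request


def cratedb_is_up(url: str = "http://localhost:4200/", timeout: float = 2.0) -> bool:
    """Return True if the endpoint at `url` answers HTTP 200 with a JSON body.

    Hypothetical helper for the sandbox walkthrough; the default URL assumes
    the `--publish=4200:4200` mapping from the `docker run` command.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            json.load(response)  # raises ValueError if the body is not JSON
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeouts, HTTP errors, or a non-JSON body.
        return False


if __name__ == "__main__":
    print("CrateDB reachable:", cratedb_is_up())
```

If this prints `False`, check that the container is still running and that port 4200 is not taken by another process.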

libs/community/langchain_community/vectorstores/__init__.py

amotl · 2024-10-29T20:32:52Z

libs/community/langchain_community/vectorstores/cratedb/base.py

+class CrateDBVectorSearch(PGVector):
+    """`CrateDB` vector store.


FYI: The CrateDB implementation is heavily based on PGVector's, with a few adjustments. Generalizations and improvements to PGVector were submitted earlier:

community: pgvector: Slight refactoring to make code a bit more reusable #16243

community: pgvector: Use SQLAlchemy's bulk_save_objects method to improve insert performance #16244

amotl · 2024-10-29T20:35:52Z

libs/community/langchain_community/vectorstores/cratedb/base.py

+    @staticmethod
+    def _euclidean_relevance_score_fn(score: float) -> float:
+        """Return a similarity score on a scale [0, 1]."""
+        # The 'correct' relevance function
+        # may differ depending on a few things, including:
+        # - the distance / similarity metric used by the VectorStore
+        # - the scale of your embeddings (OpenAI's are unit normed. Many
+        #  others are not!)
+        # - embedding dimensionality
+        # - etc.
+        # This function converts the euclidean norm of normalized embeddings
+        # (0 is most similar, sqrt(2) most dissimilar)
+        # to a similarity function (0 to 1)
+
+        # Original:
+        # return 1.0 - distance / math.sqrt(2)
+        return score / math.sqrt(2)


Please review this.

/cc @surister, @matriv, @ckurze, @kneth

Are those matters relevant and applicable here?

Vector Store: Provide distance functions as scalar functions crate/crate#15835

Add vector_similarity scalar function (euclidean based) crate/crate#15832

Isn't _score based on BM25? If so, is 1 - score or 1 / score closer to the API documentation?

Isn't _score based on BM25?

I think it is some sort of hybrid. @surister or @matriv may be able to elaborate further?

If so, is 1 - score or 1 / score closer to the API documentation?

I also would like to defer this question to subject matter experts. ;]

If someone wants to look into this, the following guideline may help to exercise the code paths that use the _euclidean_relevance_score_fn() method.

These two test cases exercise it:

tests/integration_tests/vectorstores/test_cratedb.py::test_cratedb_relevance_score
tests/integration_tests/vectorstores/test_cratedb.py::test_cratedb_retriever_search_threshold

You can run just these two with this pytest command:

pytest -vvv tests/integration_tests/vectorstores/test_cratedb.py -k "test_cratedb_relevance_score or test_cratedb_retriever_search_threshold"

Our full-text search function match gives a _score, which is the result of Lucene's implementation of BM25. We also have knn_match and vector_similarity giving us a _score; even though they have the same name, they are different things.
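
To make the question in this thread more concrete, here is a small sketch that contrasts the two conversions quoted in the hunk above: the commented-out original, which maps a distance to a relevance score, and the variant in the patch, which scales an incoming score. The function names are made up for illustration, and the example values assume an incoming score that lies in (0, 1].

```python
import math

SQRT2 = math.sqrt(2)


def relevance_from_distance(distance: float) -> float:
    """Original formula from the hunk: distance 0 maps to relevance 1.0."""
    return 1.0 - distance / SQRT2


def relevance_from_score(score: float) -> float:
    """Variant from the patch: scales the incoming score by 1 / sqrt(2)."""
    return score / SQRT2


if __name__ == "__main__":
    # Compare both conversions for a few sample inputs.
    for value in (0.0, 0.5, 1.0):
        print(value, relevance_from_distance(value), relevance_from_score(value))
```

Under that assumption, the variant can never return more than about 0.707, even for a perfect match, and the two functions rank in opposite directions: the first treats small inputs as most relevant, the second treats large inputs as most relevant. Which one is appropriate depends on whether the database hands back a distance or a similarity.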

amotl · 2024-10-29T22:04:27Z

libs/community/extended_testing_deps.txt

@@ -14,6 +14,7 @@ chardet>=5.1.0,<6
 cloudpathlib>=0.18,<0.19
 cloudpickle>=2.0.0
 cohere>=4,<6
+crate==1.0.0.dev1


About

Last summer, we removed the SQLAlchemy dialect from the fundamental DB API driver package. The next release of the crate package, version 1.0.0, will conclude the transition. Afterwards, it will be fine to just pull in the new sqlalchemy-cratedb package.

References

Remove SQLAlchemy dialect crate/crate-python#616

[2024] Migration to sqlalchemy-cratedb crate/crate-python#620

We have been doing some yak shaving towards a 1.0.0 release.

Release crate-python 1.0.0 crate-workbench/langchain#29

amotl · 2024-10-29T22:07:11Z

libs/community/langchain_community/vectorstores/cratedb/base.py

+            results: List[Any] = (
+                session.query(  # type: ignore[attr-defined]
+                    self.EmbeddingStore,
+                    # TODO: Original pgvector code uses `self.distance_strategy`.
+                    #       CrateDB currently only supports EUCLIDEAN.
+                    #       self.distance_strategy(embedding).label("distance")
+                    sqlalchemy.literal_column(
+                        f"{self.EmbeddingStore.__tablename__}._score"
+                    ).label("_score"),
+                )
+                .filter(filter_by)
+                # CrateDB applies `KNN_MATCH` within the `WHERE` clause.
+                .filter(
+                    sqlalchemy.func.knn_match(
+                        self.EmbeddingStore.embedding, embedding, k
+                    )
+                )
+                .order_by(sqlalchemy.desc("_score"))
+                .join(
+                    self.CollectionStore,
+                    self.EmbeddingStore.collection_id == self.CollectionStore.uuid,
+                )
+                .limit(k)
+            )


I don't know why this spot hasn't been marked with a huge FIXME admonition, but I guess it is just NOT OK to use _score here, and this has just been applied in the interim, to get at least something out of it? Isn't this exactly the spot where that feature request was coming from, to be able to use the actual vector distance?

Vector Store: Provide distance functions as scalar functions crate/crate#15835

/cc @ckurze, @surister

Can vector_similarity be used?

I dearly hope so!

In the best case, just using a corresponding SQLAlchemy incantation (correctly) will work without much ado.

import sqlalchemy as sa
sa.func.vector_similarity(...)

Swapping in.

sqlalchemy.func.vector_similarity(
    self.EmbeddingStore.embedding, embedding
).label("_score"),

Not there yet: the original query hits an edge case.

Vector Store: UnsupportedOperationException: Can't handle Symbol [ParameterSymbol: $1]] when using JOINs and parameters to an aliased and sorted vector_similarity() together crate/crate#16912

However, when adjusting the query slightly to work around that edge case, it seems to start working well in general.

CrateDB: Vector Store -- make it work using CrateDB's vector_similarity() crate-workbench/langchain#31

Fixed per 476d718, effectively making this patch functionally complete/sound, with improvements pending for a later iteration.
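
For readers who want to see the query shape without SQLAlchemy, here is a rough sketch of the kind of SQL this thread is about: knn_match in the WHERE clause to select candidates, and vector_similarity in the SELECT list to rank them. It is not the statement the adapter generates. The table name `embeddings`, the columns `embedding` and `document`, and the helper name are all hypothetical, and the `?` placeholders assume a qmark-style driver.

```python
def build_similarity_query(table: str = "embeddings", column: str = "embedding") -> str:
    """Build an illustrative SQL string; table and column names are hypothetical.

    Identifiers are interpolated without escaping, so only pass trusted names.
    Parameter order: query vector, query vector, k, limit.
    """
    return (
        f"SELECT document, vector_similarity({column}, ?) AS similarity "
        f"FROM {table} "
        f"WHERE knn_match({column}, ?, ?) "
        "ORDER BY similarity DESC "
        "LIMIT ?"
    )


if __name__ == "__main__":
    print(build_similarity_query())
```

Note that this sketch has no JOIN, so it does not touch the edge case reported in crate/crate#16912.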

I'm sorry, but I don't understand the problem. What is it that we want? Vector distance? If so, we do not directly have it: our vector_similarity gives us a similarity based on Euclidean distance, calculated like sim = 1 / euc_distance.

If what we want is a similarity normalized to (0, 1], then yes, just returning our vector_similarity should be good enough. Just bear in mind that, per our docs, 0 is not included. I am not sure whether this is in any way relevant for LangChain (maybe 0 can be used to flag total dissimilarity? I don't know).

Before, the adapter used CrateDB's built-in `_score` field for ranking. Now, it uses the dedicated `vector_similarity()` function to compute the similarity between two vectors.
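
As a rough illustration of a Euclidean-based similarity with the (0, 1] range mentioned in this thread, here is a pure-Python sketch. It assumes the Lucene-style formula 1 / (1 + squared distance); the thread states the formula only loosely, so please check the CrateDB documentation for the exact definition. The function name is made up and this is not CrateDB's implementation.

```python
from typing import Sequence


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in (0, 1], assuming 1 / (1 + squared Euclidean distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


if __name__ == "__main__":
    print(euclidean_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors
    print(euclidean_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal unit vectors
```

Under this assumption, identical vectors score exactly 1.0, and the score approaches but never reaches 0 as the vectors move apart, which matches the note that 0 is not included.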

surister

LGTM

amotl force-pushed the cratedb-up/1/vector-store branch 5 times, most recently from 379ce72 to 513249a Compare October 29, 2024 20:27

amotl force-pushed the cratedb-up/1/vector-store branch 2 times, most recently from d13f281 to 46750b6 Compare October 29, 2024 20:46

amotl added 3 commits October 31, 2024 07:40

CrateDB: Vector Store

6b6ad4e

CrateDB: Vector Store -- rename to CrateDBVectorStore

ebabf7e

CrateDB: Vector Store -- improve inline documentation

ffda5c8

amotl force-pushed the cratedb-up/1/vector-store branch from 46750b6 to ffda5c8 Compare October 31, 2024 06:57

CrateDB: Vector Store -- make it work using CrateDB's vector_similarity

476d718

amotl requested a review from kneth November 4, 2024 15:12

amotl marked this pull request as ready for review November 4, 2024 15:15

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. community Related to langchain-community Ɑ: vector store Related to vector store module labels Nov 4, 2024

surister approved these changes Nov 4, 2024
