Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: CrateDB: Vector Store #27710

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

amotl
Copy link
Contributor

@amotl amotl commented Oct 29, 2024

About

Status

  • We are considering the patch ready for review and merging, with a few spots to be handled on a later iteration.
  • Please let us know if you want to see any other details to be addressed before the initial merge.
  • A few backlog items have been collected here: Backlog for GA crate-workbench/langchain#30.

Sandbox

A little walkthrough how to exercise the software tests on your workstation.

docker run --rm -it --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  crate:latest -Cdiscovery.type=single-node
git clone https://github.com/crate-workbench/langchain.git --branch=cratedb-up/1/vector-store
cd langchain
uv venv
source .venv/bin/activate
cd libs/community
uv pip install --upgrade --prerelease=allow --editable=. poetry sqlalchemy-cratedb
poetry install --no-interaction --no-ansi --with dev,test,test_integration
pytest -vvv tests/integration_tests/vectorstores/test_cratedb.py

Copy link

vercel bot commented Oct 29, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 4, 2024 1:41pm

Comment on lines 45 to 56
class CrateDBVectorSearch(PGVector):
"""`CrateDB` vector store.
Copy link
Contributor Author

@amotl amotl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: The CrateDB implementation is heavily based on PGVector's, with a few adjustments. Previous generalizations and improvements to PGVector have been submitted the other day already.

Comment on lines 437 to 468
@staticmethod
def _euclidean_relevance_score_fn(score: float) -> float:
"""Return a similarity score on a scale [0, 1]."""
# The 'correct' relevance function
# may differ depending on a few things, including:
# - the distance / similarity metric used by the VectorStore
# - the scale of your embeddings (OpenAI's are unit normed. Many
# others are not!)
# - embedding dimensionality
# - etc.
# This function converts the euclidean norm of normalized embeddings
# (0 is most similar, sqrt(2) most dissimilar)
# to a similarity function (0 to 1)

# Original:
# return 1.0 - distance / math.sqrt(2)
return score / math.sqrt(2)
Copy link
Contributor Author

@amotl amotl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review this.

/cc @surister, @matriv, @ckurze, @kneth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't _score based on BM25? If so, is 1 - score or 1 / score closer to the API documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't _score based on BM25?

I think it is some sort of hybrid. @surister or @matriv may be able to elaborate further?

If so, is 1 - score or 1 / score closer to the API documentation?

I also would like to defer this question to subject matter experts. ;]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone wants to look into this, this guideline may help to exercise relevant code paths that leverage the _euclidean_relevance_score_fn() method.

It is those two test cases that will:

tests/integration_tests/vectorstores/test_cratedb.py::test_cratedb_relevance_score
tests/integration_tests/vectorstores/test_cratedb.py::test_cratedb_retriever_search_threshold

You can invoke them on the spot, exclusively, by using this pytest command:

pytest -vvv tests/integration_tests/vectorstores/test_cratedb.py -k "test_cratedb_relevance_score or test_cratedb_retriever_search_threshold"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our full-text search function match gives a _score , which is the result of lucene's implementation of bm25. We also have knn_match and vector_similarity giving us a _score, even though then have they same name, they are different things.

@@ -14,6 +14,7 @@ chardet>=5.1.0,<6
cloudpathlib>=0.18,<0.19
cloudpickle>=2.0.0
cohere>=4,<6
crate==1.0.0.dev1
Copy link
Contributor Author

@amotl amotl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About

Last summer, we removed the SQLAlchemy dialect from the fundamental DB API driver package. The next release of the crate package, version 1.0.0, will conclude the transition. Afterwards, it will be fine to just pull in the new sqlalchemy-cratedb package.

References

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have been doing some yak shaving towards a 1.0.0 release.

Comment on lines 312 to 350
results: List[Any] = (
session.query( # type: ignore[attr-defined]
self.EmbeddingStore,
# TODO: Original pgvector code uses `self.distance_strategy`.
# CrateDB currently only supports EUCLIDEAN.
# self.distance_strategy(embedding).label("distance")
sqlalchemy.literal_column(
f"{self.EmbeddingStore.__tablename__}._score"
).label("_score"),
)
.filter(filter_by)
# CrateDB applies `KNN_MATCH` within the `WHERE` clause.
.filter(
sqlalchemy.func.knn_match(
self.EmbeddingStore.embedding, embedding, k
)
)
.order_by(sqlalchemy.desc("_score"))
.join(
self.CollectionStore,
self.EmbeddingStore.collection_id == self.CollectionStore.uuid,
)
.limit(k)
)
Copy link
Contributor Author

@amotl amotl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this spot hasn't been marked with a huge FIXME admonition, but I guess it is just NOT OK to use _score here, and this has just been applied in the interim, to get at least something out of it? Isn't this exactly the spot where that feature request was coming from, to be able to use the actual vector distance?

/cc @ckurze, @surister

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can vector_similarity be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dearly hope so!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the best case, just using a corresponding SQLAlchemy incantation (correctly) will work without much ado.

import sqlalchemy as sa

sa.func.vector_similarity(...)

Copy link
Contributor Author

@amotl amotl Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swapping in.

sqlalchemy.func.vector_similarity(
    self.EmbeddingStore.embedding, embedding
).label("_score"),

Not there yet, the original query is hitting an edge case.

However, when adjusting the query slightly to work around that edge case, it seems to start working well in general.

Copy link
Contributor Author

@amotl amotl Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed per 476d718, effectively making this patch functionally complete/sound, with improvements pending for a later iteration.

Copy link

@surister surister Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry but I don't understand the problem, what is it that we want? vector distance? if so, we do no directly have it, our vector_similarity gives us similarity based on euclidian distance, similarity is calculated like sim = 1 / euc_distance

If what we want is similarity, normalized between (0, 1] then yes, just returning our vector_similarity should be good enough, just bare in mind that per our docs, 0 is not included, not sure if this is in any way relevant for langchain,(maybe 0 can be used to flag total dissimilarity?, don't know)

Before, the adapter used CrateDB's built-in `_score` field for ranking.
Now, it uses the dedicated `vector_similarity()` function to compute the
similarity between two vectors.
@amotl amotl requested a review from kneth November 4, 2024 15:12
@amotl amotl marked this pull request as ready for review November 4, 2024 15:15
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. community Related to langchain-community Ɑ: vector store Related to vector store module labels Nov 4, 2024
Copy link

@surister surister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community size:XXL This PR changes 1000+ lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
Status: Needs support
Development

Successfully merging this pull request may close these issues.

3 participants