-
Notifications
You must be signed in to change notification settings - Fork 15.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community: CrateDB: Vector Store #27710
base: master
Are you sure you want to change the base?
community: CrateDB: Vector Store #27710
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
379ce72
to
513249a
Compare
class CrateDBVectorSearch(PGVector): | ||
"""`CrateDB` vector store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: The CrateDB implementation is heavily based on PGVector's, with a few adjustments. Previous generalizations and improvements to PGVector have been submitted the other day already.
libs/community/langchain_community/vectorstores/cratedb/base.py
Outdated
Show resolved
Hide resolved
@staticmethod | ||
def _euclidean_relevance_score_fn(score: float) -> float: | ||
"""Return a similarity score on a scale [0, 1].""" | ||
# The 'correct' relevance function | ||
# may differ depending on a few things, including: | ||
# - the distance / similarity metric used by the VectorStore | ||
# - the scale of your embeddings (OpenAI's are unit normed. Many | ||
# others are not!) | ||
# - embedding dimensionality | ||
# - etc. | ||
# This function converts the euclidean norm of normalized embeddings | ||
# (0 is most similar, sqrt(2) most dissimilar) | ||
# to a similarity function (0 to 1) | ||
|
||
# Original: | ||
# return 1.0 - distance / math.sqrt(2) | ||
return score / math.sqrt(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those matters relevant and applicable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't _score
based on BM25? If so, is 1 - score
or 1 / score
closer to the API documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone wants to look into this, this guideline may help to exercise relevant code paths that leverage the _euclidean_relevance_score_fn()
method.
It is those two test cases that will:
tests/integration_tests/vectorstores/test_cratedb.py::test_cratedb_relevance_score
tests/integration_tests/vectorstores/test_cratedb.py::test_cratedb_retriever_search_threshold
You can invoke them on the spot, exclusively, by using this pytest command:
pytest -vvv tests/integration_tests/vectorstores/test_cratedb.py -k "test_cratedb_relevance_score or test_cratedb_retriever_search_threshold"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our full-text search function match
gives a _score
, which is the result of lucene's implementation of bm25. We also have knn_match
and vector_similarity
giving us a _score
, even though then have they same name, they are different things.
d13f281
to
46750b6
Compare
@@ -14,6 +14,7 @@ chardet>=5.1.0,<6 | |||
cloudpathlib>=0.18,<0.19 | |||
cloudpickle>=2.0.0 | |||
cohere>=4,<6 | |||
crate==1.0.0.dev1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About
Last summer, we removed the SQLAlchemy dialect from the fundamental DB API driver package. The next release of the crate package, version 1.0.0, will conclude the transition. Afterwards, it will be fine to just pull in the new sqlalchemy-cratedb package.
References
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have been doing some yak shaving towards a 1.0.0 release.
results: List[Any] = ( | ||
session.query( # type: ignore[attr-defined] | ||
self.EmbeddingStore, | ||
# TODO: Original pgvector code uses `self.distance_strategy`. | ||
# CrateDB currently only supports EUCLIDEAN. | ||
# self.distance_strategy(embedding).label("distance") | ||
sqlalchemy.literal_column( | ||
f"{self.EmbeddingStore.__tablename__}._score" | ||
).label("_score"), | ||
) | ||
.filter(filter_by) | ||
# CrateDB applies `KNN_MATCH` within the `WHERE` clause. | ||
.filter( | ||
sqlalchemy.func.knn_match( | ||
self.EmbeddingStore.embedding, embedding, k | ||
) | ||
) | ||
.order_by(sqlalchemy.desc("_score")) | ||
.join( | ||
self.CollectionStore, | ||
self.EmbeddingStore.collection_id == self.CollectionStore.uuid, | ||
) | ||
.limit(k) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why this spot hasn't been marked with a huge FIXME admonition, but I guess it is just NOT OK to use _score
here, and this has just been applied in the interim, to get at least something out of it? Isn't this exactly the spot where that feature request was coming from, to be able to use the actual vector distance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can vector_similarity
be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dearly hope so!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the best case, just using a corresponding SQLAlchemy incantation (correctly) will work without much ado.
import sqlalchemy as sa
sa.func.vector_similarity(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swapping in.
sqlalchemy.func.vector_similarity(
self.EmbeddingStore.embedding, embedding
).label("_score"),
Not there yet, the original query is hitting an edge case.
However, when adjusting the query slightly to work around that edge case, it seems to start working well in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed per 476d718, effectively making this patch functionally complete/sound, with improvements pending for a later iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry but I don't understand the problem, what is it that we want? vector distance? if so, we do no directly have it, our vector_similarity
gives us similarity based on euclidian distance, similarity is calculated like sim = 1 / euc_distance
If what we want is similarity, normalized between (0, 1]
then yes, just returning our vector_similarity
should be good enough, just bare in mind that per our docs, 0
is not included, not sure if this is in any way relevant for langchain,(maybe 0 can be used to flag total dissimilarity?, don't know)
46750b6
to
ffda5c8
Compare
Before, the adapter used CrateDB's built-in `_score` field for ranking. Now, it uses the dedicated `vector_similarity()` function to compute the similarity between two vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
About
Status
Sandbox
A little walkthrough how to exercise the software tests on your workstation.