[PGVector] The scores returned by 'similarity_search_with_score' are NOT proportional to the similarity #13437
Comments
🤖 Hello @huantt, I'm Dosu, a bot designed to assist in this repository. I can provide initial responses to your issues, answer some questions, and help guide you to become a more effective contributor. While you're waiting for a human maintainer to assist, I'm here to help. I'm currently reviewing the issue you've posted about the scores returned by 'similarity_search_with_score' in the PGVector component. I'll be back with a more detailed response shortly.
Related issue: |
🤖 I understand the issue you're facing with the 'similarity_search_with_score' function in the PGVector component of the LangChain Python framework. The scores returned by this function are proportional to the distance, not the similarity, which is causing the most similar documents to be filtered out when using the This issue seems to be similar to a few previously solved issues in the LangChain repository:
The solutions suggested in these issues involve changing the distance metric when creating a collection in Chroma and submitting a pull request with proposed changes to the ClickHouse VectorStore.

In your case, you can modify the 'similarity_search_with_score' function to return similarity scores instead of distance scores. One common way to convert distance to similarity is the formula 1 / (1 + distance), as used in the code below.

Here's how you can modify the 'similarity_search_with_score' function:

def similarity_search_with_score_by_vector(
    self,
    embedding: List[float],
    k: int = 4,
    filter: Optional[dict] = None,
) -> List[Tuple[Document, float]]:
    results = self.__query_collection(embedding=embedding, k=k, filter=filter)
    # Convert distances to similarities: a smaller distance gives a larger score
    docs_and_scores = [
        (doc, 1 / (1 + distance))
        for doc, distance in self._results_to_docs_and_scores(results)
    ]
    return docs_and_scores

Please note that this solution assumes that the '_results_to_docs_and_scores' function returns a list of tuples, where each tuple contains a Document and its corresponding distance to the query. If this is not the case, you may need to adjust the code accordingly.

I hope this helps! If you have any further questions or need more clarification, feel free to ask.
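As a standalone illustration of the 1 / (1 + distance) conversion used in the snippet above, here is a minimal sketch that runs without LangChain or a database. The function name `distances_to_similarities` and the sample data are hypothetical and exist only for this example; plain strings stand in for Document objects.

```python
from typing import List, Tuple


def distances_to_similarities(
    docs_and_distances: List[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Map each non-negative distance d to 1 / (1 + d).

    Smaller distances give larger scores, and a distance of 0 maps to 1.0.
    """
    return [(doc, 1.0 / (1.0 + distance)) for doc, distance in docs_and_distances]


# Made-up (document, distance) pairs, ordered from most to least similar.
sample = [("doc_a", 0.0), ("doc_b", 0.25), ("doc_c", 1.0)]
converted = distances_to_similarities(sample)
print(converted)
```

The mapping is monotonically decreasing in the distance, so the ranking of results is preserved while the meaning of "higher score" flips to "more similar". Note that the resulting values are not cosine similarities, so a threshold chosen for one scale would probably need re-tuning for the other.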
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Oh, I think I now understand the flow of PGVector. Before describing it, I want to clarify some concepts in Document:
How PGVector in Langchain works
@staticmethod
def _cosine_relevance_score_fn(distance: float) -> float:
    """Normalize the distance to a score on a scale [0, 1]."""
    return 1.0 - distance

It converts the distance to the score.
So what's wrong?
The one that confuses me and should be improved is the
It should be renamed to
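To make the distance-to-score step concrete, here is a small sketch that computes a cosine distance in plain Python and then applies the same 1.0 - distance arithmetic as the snippet above. The helper names `cosine_distance` and `cosine_relevance_score` and the sample vectors are hypothetical; the sketch assumes cosine distance is defined as 1 minus cosine similarity, which is how pgvector's cosine distance operator is documented.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, taken here as 1 - cosine similarity (lower is more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cosine_relevance_score(distance: float) -> float:
    """Same arithmetic as the snippet above (higher is more similar)."""
    return 1.0 - distance


query = [1.0, 0.0]
candidates = {"same_direction": [2.0, 0.0], "orthogonal": [0.0, 3.0]}

scores = {}
for name, vector in candidates.items():
    distance = cosine_distance(query, vector)
    scores[name] = (distance, cosine_relevance_score(distance))
    print(name, scores[name])
```

Under this definition the distance lies in [0, 2], so 1.0 - distance can go below 0 for vectors pointing in opposite directions; the [0, 1] range in the docstring would only hold when the cosine similarity is non-negative.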
Hi, @huantt, I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the issue you opened regarding the incorrect behavior of the
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
System Info
langchain==0.0.266
Who can help?
@hwchase17
@eyurtsev
Information
Related Components
Reproduction
Behavior
When I call the vector_store.similarity_search_with_score function:
Problem
When the as_retriever function is used with score_threshold, the behavior is wrong. When score_threshold is declared, only documents with a score greater than or equal to the score_threshold value are kept, so the top documents found by pgvector are filtered out even though they are in fact the most similar.
Expected behavior
The scores returned from PGVector queries are proportional to the similarity.
In other words, the higher the score, the higher the similarity.
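The filtering problem described above can be reproduced without a database. The sketch below is only an illustration: `filter_by_threshold` is a hypothetical stand-in for the retriever's threshold step (assumed to keep results whose score is at least score_threshold), and the distances are made up.

```python
from typing import List, Tuple


def filter_by_threshold(
    docs_and_scores: List[Tuple[str, float]], score_threshold: float
) -> List[Tuple[str, float]]:
    """Keep only results whose score is >= score_threshold."""
    return [(doc, score) for doc, score in docs_and_scores if score >= score_threshold]


# Made-up cosine distances: "closest" is the best match for the query.
distances = [("closest", 0.1), ("middle", 0.4), ("farthest", 0.9)]

# Treating raw distances as scores drops the best match and keeps the worst one.
kept_with_distances = filter_by_threshold(distances, score_threshold=0.5)

# Converting to a similarity first (1 - distance) keeps the best matches instead.
similarities = [(doc, 1.0 - d) for doc, d in distances]
kept_with_similarities = filter_by_threshold(similarities, score_threshold=0.5)

print([doc for doc, _ in kept_with_distances])     # ['farthest']
print([doc for doc, _ in kept_with_similarities])  # ['closest', 'middle']
```

With distances used as scores, the threshold selects exactly the documents the user least wants, which matches the behavior reported in this issue.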