[PGVector] The scores returned by 'similarity_search_with_score' are NOT proportional to the similarity #13437

huantt · 2023-11-16T03:10:54Z

dosubot · 2023-11-16T03:11:20Z

🤖

I'm Dosu, a bot designed to assist in this repository. I can provide initial responses to your issues, answer some questions, and help guide you to become a more effective contributor. While you're waiting for a human maintainer to assist, I'm here to help.

I'm currently reviewing the issue you've posted about the scores returned by 'similarity_search_with_score' in the PGVector component. I'll be back with a more detailed response shortly.

huantt · 2023-11-16T03:11:45Z

Related issue:
#2845

dosubot · 2023-11-16T03:21:20Z

🤖

I understand the issue you're facing with the 'similarity_search_with_score' function in the PGVector component of the LangChain Python framework. The scores returned by this function are proportional to the distance, not the similarity, which is causing the most similar documents to be filtered out when using the as_retriever function with the score_threshold.

This issue seems to be similar to a few previously solved issues in the LangChain repository:

get_relevant_documents of Chroma retriever uses cosine distance instead of cosine similarity as similarity score
ClickHouse VectorStore score_threshold not working.
When search_type="similarity_score_threshold, retriever returns negative scores

The solutions suggested in these issues involve changing the distance metric when creating a collection in Chroma, submitting a pull request with proposed changes to the ClickHouse VectorStore's score_threshold parameter in the similarity_search_with_relevance_scores function, and setting collection_metadata={"hnsw:space": "cosine"} when instantiating langchain.vectorstores.Chroma.

In your case, you can modify the 'similarity_search_with_score' function to return similarity scores instead of distance scores. One common way to convert distance to similarity is to use the formula similarity = 1 / (1 + distance). This will ensure that the most similar documents (i.e., those with the smallest distances) have the highest similarity scores.

Here's how you can modify the 'similarity_search_with_score' function:

def similarity_search_with_score_by_vector(
    self,
    embedding: List[float],
    k: int = 4,
    filter: Optional[dict] = None,
) -> List[Tuple[Document, float]]:
    # Query the collection for the k nearest results.
    results = self.__query_collection(embedding=embedding, k=k, filter=filter)
    # Convert distances to similarities: smaller distance -> larger similarity.
    docs_and_scores = [
        (doc, 1 / (1 + distance))
        for doc, distance in self._results_to_docs_and_scores(results)
    ]
    return docs_and_scores

Please note that this solution assumes that the '_results_to_docs_and_scores' function returns a list of tuples, where each tuple contains a Document and its corresponding distance to the query. If this is not the case, you may need to adjust the code accordingly.
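
As a rough illustration of the conversion suggested above, the following sketch applies similarity = 1 / (1 + distance) to a few made-up distances. The helper name `distance_to_similarity`, the document ids, and the sample values are assumptions for illustration only; none of them are part of LangChain.

```python
def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance to a similarity in (0, 1]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


# Hypothetical (document id, distance) pairs, shaped like a vector store result.
docs_and_distances = [("doc_a", 0.0), ("doc_b", 1.0), ("doc_c", 3.0)]

docs_and_similarities = [
    (doc, distance_to_similarity(distance)) for doc, distance in docs_and_distances
]

# Ranking by similarity (descending) matches ranking by distance (ascending).
by_similarity = [
    doc for doc, _ in sorted(docs_and_similarities, key=lambda p: p[1], reverse=True)
]
by_distance = [doc for doc, _ in sorted(docs_and_distances, key=lambda p: p[1])]
```

The mapping is monotonic, so it only changes the scale and not the ranking. Any score_threshold would have to be chosen on this new scale.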

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

huantt · 2023-11-16T04:36:55Z

Oh, I think I now understand the flow of PGVector.

Before describing it, I want to clarify some concepts used in this document:

Distance: the higher the distance, the less similar
Score: the higher the score, the more similar
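
To make the two concepts concrete, here is a minimal sketch in plain Python that computes both quantities for the cosine case. The function names and the sample vectors are assumptions for illustration; this is not the code PGVector or pgvector runs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; higher means less similar."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical 2-D embeddings.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # distance close to 0, similarity close to 1
orthogonal = [0.0, 3.0]       # distance close to 1, similarity close to 0
opposite = [-1.0, 0.0]        # distance close to 2, similarity close to -1
```

The two values always move in opposite directions, which is why a threshold written for one cannot be applied to the other unchanged.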

How PGVector in Langchain works

In the database layer: it does not return a score; it returns the distance instead.
In the LangChain library layer: the distance is converted to a similarity score according to distance_strategy. When you declare a PGVector instance, distance_strategy defaults to DistanceStrategy.COSINE. Let's look at this function:

    @staticmethod
    def _cosine_relevance_score_fn(distance: float) -> float:
        """Normalize the distance to a score on a scale [0, 1]."""

        return 1.0 - distance

It converts the distance to a score.
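
A small standalone sketch of that conversion, using made-up distances, could look like this. The function `cosine_relevance_score` is a hypothetical copy of the one-line formula quoted above, and the labels and values are assumptions.

```python
from typing import List, Tuple


def cosine_relevance_score(distance: float) -> float:
    """Standalone copy of the quoted conversion: score = 1 - distance."""
    return 1.0 - distance


# Hypothetical (label, cosine distance) pairs.
distances: List[Tuple[str, float]] = [("close", 0.25), ("medium", 0.5), ("far", 1.5)]

# The closest item gets the highest score; the farthest gets the lowest.
scores = [(label, cosine_relevance_score(d)) for label, d in distances]
```

Note that cosine distance can be as large as 2, so under this formula the result lands in [0, 1] only when the distance is at most 1; the "far" entry here comes out negative.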

So what's wrong?

What confuses me, and what should be improved, is the score concept in pgvector.py.
In some functions, such as similarity_search_with_score, the returned value is a distance, not a score.

It should be renamed to similarity_search_with_distance.
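
The following sketch shows why the naming matters, under the assumption that a caller reads the returned value as a similarity and applies a threshold to it directly. The helper `filter_by_threshold`, the labels, and the numbers are all hypothetical.

```python
from typing import List, Tuple


def filter_by_threshold(pairs: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep the labels whose value is at least the threshold."""
    return [label for label, value in pairs if value >= threshold]


# Hypothetical (label, cosine distance) pairs, i.e. what the caller actually receives.
raw_results = [("close", 0.125), ("medium", 0.5), ("far", 0.875)]

# Misreading distances as scores keeps the least similar item.
kept_when_misread = filter_by_threshold(raw_results, 0.8)

# Converting with score = 1 - distance first keeps the most similar item.
converted = [(label, 1.0 - distance) for label, distance in raw_results]
kept_when_converted = filter_by_threshold(converted, 0.8)
```

With the same threshold of 0.8, the first filter keeps only "far" and the second keeps only "close".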

dosubot · 2024-02-15T16:01:46Z

Hi, @huantt,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the issue you opened regarding the incorrect behavior of the similarity_search_with_score function in the PGVector component has been resolved. The function has been modified to return similarity scores instead of distance scores and has been renamed to similarity_search_with_distance to better reflect its behavior.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

dosubot bot added the Ɑ: vector store and 🤖:bug labels Nov 16, 2023

dangoldbj mentioned this issue Nov 17, 2023

Vectorstore relevance fn fix #13515

Closed

dosubot bot added the stale label Feb 15, 2024

dosubot bot closed this as not planned Feb 22, 2024

dosubot bot removed the stale label Feb 22, 2024

Jayaprakash8887 mentioned this issue Jun 27, 2024

[Feature Request]: Document Retrieval control OpenNyAI/Jugalbandi-Manager#115

Open

AnthonyDasse mentioned this issue Jul 30, 2024

bug: SimilaritySearch with scoreThrehold not works with PgVector. tmc/langchaingo#974

Closed
