
feat: Add SimilarityRanker to Haystack 2.0 #5923

Merged. 6 commits merged into main from similarity_ranker_new on Oct 6, 2023.

Conversation

vblagoje (Member)

Why:

Ranking documents by their similarity to a query is an essential retrieval step; SimilarityRanker adds this capability.

What:

Added SimilarityRanker to the rankers package. It scores documents by their similarity to the query and returns them ordered by relevance.

How can it be used:

Here's a quick dive into its usage:

from haystack.preview import Document
from haystack.preview.components.rankers import SimilarityRanker

# Rank two documents by their similarity to the query.
ranker = SimilarityRanker()
docs = [Document(text="Sarajevo"), Document(text="Berlin")]
query = "City in Bosnia and Herzegovina"
output = ranker.run(query=query, documents=docs)
docs = output["documents"]

# The most relevant document is returned first.
assert len(docs) == 2
assert docs[0].text == "Sarajevo"
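
For context, here is a rough, self-contained sketch of how a cross-encoder ranker like the one backing this component typically scores documents. It uses plain transformers for illustration only and is not the component's actual implementation:

# Rough illustration of cross-encoder similarity scoring; NOT the component's code.
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def rank(query: str, texts: List[str]) -> List[str]:
    # Each (query, document) pair is scored jointly by the cross-encoder.
    features = tokenizer([query] * len(texts), texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**features).logits.squeeze(-1)  # one relevance score per pair
    # Return the documents sorted from most to least relevant.
    order = torch.argsort(scores, descending=True).tolist()
    return [texts[i] for i in order]

print(rank("City in Bosnia and Herzegovina", ["Berlin", "Sarajevo"]))
# With this model, the expected order is ['Sarajevo', 'Berlin'].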

How did you test it:

Unit tests are in place. I also ran the snippet above, and it worked as expected.

Notes For Reviewer:

Reviewer, please do a deep dive into the similarity metrics and integration. Let's ensure this fits seamlessly into Haystack's architecture.

@vblagoje vblagoje requested review from a team as code owners September 29, 2023 13:04
@vblagoje vblagoje requested review from dfokina and silvanocerza and removed request for a team September 29, 2023 13:04
@vblagoje vblagoje added the 2.x Related to Haystack v2.0 label Sep 29, 2023
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Sep 29, 2023
@ZanSara (Contributor) left a comment:

Just some comments, nothing worrying 😊

        :param device: torch device (for example, cuda:0, cpu, mps) to limit model inference to a specific device.
        """
        torch_and_transformers_import.check()
        super().__init__()
Contributor: No need to call super()

Two review comments on test/preview/components/rankers/test_similarity.py were marked as outdated and resolved.
@vblagoje (Member, Author)

cc @sjrl


def __init__(
    self,
    model_name_or_path: Union[str, Path] = "cross-encoder/ms-marco-MiniLM-L-6-v2",
Contributor: I would just go with model, to be fair; type hints and documentation can do the rest.

@sjrl (Contributor) commented on Oct 4, 2023:

Thanks for the work on this! I have a few more comments about potential additional features.

  1. The addition of embed_meta_fields, as shown here from V1:

         docs_with_meta_fields = self._add_meta_fields_to_docs(
             documents=documents, embed_meta_fields=self.embed_meta_fields
         )

     Sol has found this to be immensely useful, just as we have for EmbeddingRetrievers in Haystack V1, and I think it would be a great addition here as well; we actively use this feature in client projects. You can see how this is implemented for embedding models in V2 here:

         metadata_fields_to_embed: Optional[List[str]] = None,

  2. Being able to optionally specify a top_k here, like we do in V1, would also be helpful, since we will often pipe the output of this node directly into PromptNode. Keeping it optional would be good in case we use nodes like the TopPSampler instead. (A rough sketch of both ideas follows this list.)
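
A rough, hypothetical sketch of what both suggestions could look like around this component; the names meta_fields_to_embed, texts_with_meta, keep_top_k, and top_k are illustrative rather than a final API, and doc.text / doc.metadata follow this PR's preview Document:

# Illustrative sketch only: meta_fields_to_embed and top_k are hypothetical names;
# doc.text and doc.metadata follow this PR's preview Document snapshot.
from typing import List, Optional

from haystack.preview import Document

def texts_with_meta(documents: List[Document], meta_fields_to_embed: Optional[List[str]] = None) -> List[str]:
    # Prepend selected metadata values (for example a title) to each document's
    # text so the cross-encoder can match the query against them too.
    texts = []
    for doc in documents:
        fields = [str(doc.metadata[name]) for name in (meta_fields_to_embed or []) if name in doc.metadata]
        texts.append("\n".join(fields + [doc.text or ""]))
    return texts

def keep_top_k(ranked: List[Document], top_k: Optional[int] = None) -> List[Document]:
    # An optional top_k keeps only the k best-ranked documents, which is handy
    # when the ranker's output is piped straight into a prompt.
    return ranked if top_k is None else ranked[:top_k]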

@vblagoje (Member, Author) commented on Oct 4, 2023:

@sjrl let's keep track of these enhancement suggestions (actually existing V1 features :-) ), but for now, I propose we integrate the simplest implementation we currently have so we can experiment and build demos.

@vblagoje (Member, Author) commented on Oct 4, 2023:

The PR is awaiting #5945, more specifically Document unfreezing. The current test failures are expected.

@vblagoje vblagoje force-pushed the similarity_ranker_new branch from 6a410f9 to f278786 Compare October 5, 2023 15:57
@vblagoje (Member, Author) commented on Oct 5, 2023:

@sjrl and @silvanocerza would you give it one last pass please?

@sjrl (Contributor) left a comment:

Looks great! Only one minor comment about an additional test.

@vblagoje (Member, Author) commented on Oct 6, 2023:

Should be good to go now; please have another look, @sjrl.

@sjrl (Contributor) left a comment:

I'm glad that helped catch a bug!

@vblagoje vblagoje merged commit 1cdff64 into main Oct 6, 2023
20 checks passed
@vblagoje vblagoje deleted the similarity_ranker_new branch October 6, 2023 14:01
Labels: 2.x (Related to Haystack v2.0), topic:tests, type:documentation (Improvements on the docs)
4 participants