
feat(FastEmbed): Support for SPLADE Sparse Embedder #579

Merged
merged 37 commits into from
Apr 10, 2024

Conversation

lambda-science
Contributor

@lambda-science lambda-science commented Mar 13, 2024

Related to this PR: #578 and this issue: #549

Today FastEmbed released support for sparse embedders, which are useful for hybrid search (supported by some DocumentStores such as Qdrant and Pinecone).
The goal of this PR is to bring support for the FastEmbed sparse encoder into Haystack, even though Haystack does not currently support sparse embeddings in its Document structure/dataclass. EDIT: Haystack supports sparse embeddings as of 2.1.0

@lambda-science lambda-science requested a review from a team as a code owner March 13, 2024 15:52
@lambda-science lambda-science requested review from anakin87 and removed request for a team March 13, 2024 15:52
@github-actions github-actions bot added type:documentation Improvements or additions to documentation integration:fastembed and removed type:documentation Improvements or additions to documentation labels Mar 13, 2024
@lambda-science lambda-science marked this pull request as draft March 13, 2024 15:52
@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Mar 13, 2024
@lambda-science
Contributor Author

lambda-science commented Mar 13, 2024

Fixed one test, but two others are failing in test_run with a weird error:

>       result = embedder.run(documents=[doc])

tests/test_fastembed_document_SPLADE_embedder.py:279:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/haystack_integrations/components/embedders/fastembed/fastembed_document_SPLADE_embedder.py:165: in run
    embeddings = self.embedding_backend.embed(
src/haystack_integrations/components/embedders/fastembed/embedding_backend/fastembed_backend.py:86: in embed
    sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/sparse/sparse_text_embedding.py:82: in embed
    yield from self.model.embed(documents, batch_size, parallel, **kwargs)
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/sparse/splade_pp.py:101: in embed
    yield from self._embed_documents(
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/common/onnx_model.py:93: in _embed_documents
    yield from self._post_process_onnx_output(self.onnx_embed(batch))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'fastembed.sparse.splade_pp.SpladePP'>
output = (array([[[ -3.0453873,  -2.9302917,  -3.0086668, ...,  -2.6947184,
          -3.2767653,  -3.9241323],
        [-10.48...42873, ...,  -4.9139   ,
          -5.25412  ,  -5.801072 ]]], dtype=float32), array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))

    @classmethod
    def _post_process_onnx_output(cls, output: Tuple[np.ndarray, np.ndarray]) -> Iterable[SparseEmbedding]:
        logits, attention_mask = output
        relu_log = np.log(1 + np.maximum(logits, 0))

        weighted_log = relu_log * np.expand_dims(attention_mask, axis=-1)

        max_val = np.max(weighted_log, axis=1)

        # Score matrix of shape (batch_size, vocab_size)
        # Most of the values are 0, only a few are non-zero
        scores = np.squeeze(max_val)
        for row_scores in scores:
            indices = row_scores.nonzero()[0]
>           scores = row_scores[indices]
E           IndexError: invalid index to scalar variable.

FAILED tests/test_fastembed_document_SPLADE_embedder.py::TestFastembedDocumentSPLADEEmbedderDoc::test_run - IndexError: invalid index to scalar variable.
FAILED tests/test_fastembed_text_SPLADE_embedder.py::TestFastembedTextSPLADEEmbedder::test_run - IndexError: invalid index to scalar variable.

Should investigate tomorrow. This is failing:

    def embed(self, data: List[List[str]], **kwargs) -> List[Dict[str, np.ndarray]]:
        # The embed method returns an Iterable[SparseEmbedding], so we convert it to a list of dictionaries
        sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
        return sparse_embeddings
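For context, the IndexError appears to come from the `np.squeeze(max_val)` call in the quoted `_post_process_onnx_output`: with a batch of one document it drops the batch axis entirely. A minimal sketch (my own illustration, not fastembed code) of the shape collapse and a shape-safe alternative:

```python
import numpy as np

# With batch_size == 1, np.squeeze drops the batch axis, so iterating over
# "rows" of the result yields numpy scalars, which cannot be indexed like
# 1-D arrays -- hence "invalid index to scalar variable."
max_val = np.array([[0.0, 1.2, 0.0, 0.7, 0.0]])   # shape (1, vocab_size)

squeezed = np.squeeze(max_val)
print(squeezed.shape)                              # (5,) -- batch axis is gone

# A shape-safe alternative keeps the batch dimension explicit:
scores = max_val.reshape(max_val.shape[0], -1)     # always (batch, vocab)
for row_scores in scores:
    indices = row_scores.nonzero()[0]
    values = row_scores[indices]                   # works for any batch size
print(indices.tolist(), values.tolist())           # [1, 3] [1.2, 0.7]
```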

@lambda-science
Contributor Author

Ok, well, I guess I fixed all the basic tests with the latest commits. What remains are the two real tests of component.run() failing with the error from _post_process_onnx_output listed above. I currently have no idea why. I will check tomorrow @Anush008 🌞

@Anush008
Contributor

Anush008 commented Mar 13, 2024

@lambda-science, I think we ran into a bug. Will update you on this.

@Anush008
Contributor

FastEmbed v0.2.4 fixed the issue.

@lambda-science lambda-science marked this pull request as ready for review March 13, 2024 20:39
@lambda-science
Contributor Author

All tests are now working, thanks! Docs have been written as well.
Ready for final review, I guess @Anush008! :)

For @anakin87, here is an example of a first working sparse embedder. Two things to note:

  • Compared to dense embeddings (a list of floats), the type of a sparse embedding here is a dict like {"indices": List[int], "values": List[float]}
  • Text embedding works the same way, but for document embedding, the sparse embedding result is not placed in the embedding field of the Document dataclass but inside the meta dict as doc["meta"]["_sparse_vector"] :)
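The dict format above is just a compact encoding of a mostly-zero, vocabulary-length vector. A small round-trip sketch (my own illustration, not code from this PR) of converting between the two representations:

```python
import numpy as np

def to_sparse(dense: np.ndarray) -> dict:
    """Compact {"indices": ..., "values": ...} form of a mostly-zero vector."""
    indices = dense.nonzero()[0]
    return {"indices": indices.tolist(), "values": dense[indices].tolist()}

def to_dense(sparse: dict, vocab_size: int) -> np.ndarray:
    """Expand the compact form back to a full vocabulary-length vector."""
    dense = np.zeros(vocab_size, dtype=np.float64)
    dense[sparse["indices"]] = sparse["values"]
    return dense

vec = np.zeros(30522)                    # BERT-sized vocab, as an example
vec[[101, 2054, 999]] = [0.8, 1.5, 0.3]  # a few non-zero SPLADE-style weights
sparse = to_sparse(vec)
print(sparse)  # {'indices': [101, 999, 2054], 'values': [0.8, 0.3, 1.5]}
```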

Maybe some optimization can be done. The first thing would be deciding where to put these damn sparse embeddings in docs, ahah. For a first complete round trip, these sparse embedders can be installed with:

fastembed-haystack @ git+https://github.com/lambda-science/haystack-core-integrations/@69129c8cac1771814ce167c76a43348600b1d27e#subdirectory=integrations/fastembed

in a requirements.txt 👍

Member

@anakin87 anakin87 left a comment

@lambda-science you did great work!

I only found a few opportunities to improve this PR.

Corentin and others added 6 commits March 21, 2024 13:27
…bedders/fastembed/fastembed_sparse_document_embedder.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…bedders/fastembed/fastembed_sparse_document_embedder.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…bedders/fastembed/fastembed_sparse_document_embedder.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…bedders/fastembed/fastembed_sparse_document_embedder.py

Co-authored-by: Stefano Fiorucci <[email protected]>
…bedders/fastembed/fastembed_sparse_text_embedder.py

Co-authored-by: Stefano Fiorucci <[email protected]>
@lambda-science
Contributor Author

lambda-science commented Mar 21, 2024

@anakin87 I'm not sure why you changed the model name; changing "prithvida/SPLADE_PP_en_v1" to "prithvida/Splade_PP_en_v1" makes the tests crash on my side with:

FAILED tests/test_fastembed_sparse_document_embedder.py::TestFastembedSparseDocumentEmbedderDoc::test_run - ValueError: Model prithvida/Splade_PP_en_v1 is not supported in SparseTextEmbedding.Please check the supported models using `SparseTextEmbedding.list_supported_models()`
FAILED tests/test_fastembed_sparse_text_embedder.py::TestFastembedSparseTextEmbedder::test_run - ValueError: Model prithvida/Splade_PP_en_v1 is not supported in SparseTextEmbedding.Please check the supported models using `SparseTextEmbedding.list_supported_models()`

EDIT: never mind, it's fastembed 0.2.5, released yesterday!

@Anush008
Contributor

Also, with FastEmbed v0.2.5, model names are now case-insensitive.

@lambda-science
Contributor Author

PR ready, I guess. The last things to do are removing the build-from-source of Haystack
and waiting for Haystack 2.1.0

Member

@anakin87 anakin87 left a comment

Two last comments, but this looks really good!

Corentin and others added 2 commits March 22, 2024 13:51
…bedders/fastembed/fastembed_sparse_text_embedder.py

Co-authored-by: Stefano Fiorucci <[email protected]>
@lambda-science
Contributor Author

Resolved your comment @anakin87 and fixed the test; ready to ship, I guess, ahah

Contributor

@Anush008 Anush008 left a comment

Thank you @lambda-science and @anakin87.

@anakin87 anakin87 self-requested a review April 10, 2024 09:34
Member

@anakin87 anakin87 left a comment

Merging!

@anakin87 anakin87 merged commit 363c7b5 into deepset-ai:main Apr 10, 2024
7 checks passed
Labels
integration:fastembed topic:CI type:documentation Improvements or additions to documentation