feat(FastEmbed): Support for SPLADE Sparse Embedder #579

lambda-science · 2024-03-13T15:52:26Z

Related to this PR: #578 and this issue: #549

Today FastEmbed released support for sparse embedders, which are useful for hybrid search (supported by some DocumentStores such as Qdrant and Pinecone).
The goal of this PR is to bring support for the FastEmbed sparse encoder into Haystack, even though Haystack currently doesn't support sparse embeddings in its Document structure/dataclass. EDIT: Haystack supports sparse embeddings in 2.1.0.

This reverts commit afc8e79.

lambda-science · 2024-03-13T16:10:53Z

Fixed one test, but two others are failing in test_run with a weird error:

>       result = embedder.run(documents=[doc])

tests/test_fastembed_document_SPLADE_embedder.py:279:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/haystack_integrations/components/embedders/fastembed/fastembed_document_SPLADE_embedder.py:165: in run
    embeddings = self.embedding_backend.embed(
src/haystack_integrations/components/embedders/fastembed/embedding_backend/fastembed_backend.py:86: in embed
    sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/sparse/sparse_text_embedding.py:82: in embed
    yield from self.model.embed(documents, batch_size, parallel, **kwargs)
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/sparse/splade_pp.py:101: in embed
    yield from self._embed_documents(
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/common/onnx_model.py:93: in _embed_documents
    yield from self._post_process_onnx_output(self.onnx_embed(batch))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'fastembed.sparse.splade_pp.SpladePP'>
output = (array([[[ -3.0453873,  -2.9302917,  -3.0086668, ...,  -2.6947184,
          -3.2767653,  -3.9241323],
        [-10.48...42873, ...,  -4.9139   ,
          -5.25412  ,  -5.801072 ]]], dtype=float32), array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))

    @classmethod
    def _post_process_onnx_output(cls, output: Tuple[np.ndarray, np.ndarray]) -> Iterable[SparseEmbedding]:
        logits, attention_mask = output
        relu_log = np.log(1 + np.maximum(logits, 0))

        weighted_log = relu_log * np.expand_dims(attention_mask, axis=-1)

        max_val = np.max(weighted_log, axis=1)

        # Score matrix of shape (batch_size, vocab_size)
        # Most of the values are 0, only a few are non-zero
        scores = np.squeeze(max_val)
        for row_scores in scores:
            indices = row_scores.nonzero()[0]
>           scores = row_scores[indices]
E           IndexError: invalid index to scalar variable.

FAILED tests/test_fastembed_document_SPLADE_embedder.py::TestFastembedDocumentSPLADEEmbedderDoc::test_run - IndexError: invalid index to scalar variable.
FAILED tests/test_fastembed_text_SPLADE_embedder.py::TestFastembedTextSPLADEEmbedder::test_run - IndexError: invalid index to scalar variable.

Should investigate tomorrow. This is failing:

    def embed(self, data: List[List[str]], **kwargs) -> List[Dict[str, np.ndarray]]:
        # The embed method returns an Iterable[SparseEmbedding], so we convert it to a list of dictionaries
        sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
        return sparse_embeddings
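
One plausible reading of the traceback above, not confirmed anywhere in this thread: both failing tests embed a single document, so `max_val` has shape (1, vocab_size), and `np.squeeze` without an `axis` argument removes the batch axis as well. Iterating over the result then yields scalars instead of rows, which would match "invalid index to scalar variable". The sketch below reproduces only the shape behaviour with toy sizes; the `score_matrix` helper, the `squeeze_all` flag and the vocabulary size of 50 are made up for illustration and say nothing about how FastEmbed v0.2.4 actually fixed it.

```python
import numpy as np


def score_matrix(logits, attention_mask, squeeze_all):
    """Repeat the pooling steps shown in the traceback (illustrative helper)."""
    relu_log = np.log(1 + np.maximum(logits, 0))
    weighted_log = relu_log * np.expand_dims(attention_mask, axis=-1)
    max_val = np.max(weighted_log, axis=1)  # shape: (batch_size, vocab_size)
    # np.squeeze without an axis drops EVERY length-1 axis, including the batch axis
    return np.squeeze(max_val) if squeeze_all else max_val


rng = np.random.default_rng(0)
# One document, 10 tokens, toy vocabulary of 50 entries (assumed sizes)
logits = rng.normal(size=(1, 10, 50)).astype(np.float32)
attention_mask = np.ones((1, 10), dtype=np.int64)

squeezed = score_matrix(logits, attention_mask, squeeze_all=True)
kept = score_matrix(logits, attention_mask, squeeze_all=False)

print(squeezed.shape)  # (50,)   -> iterating yields scalars
print(kept.shape)      # (1, 50) -> iterating yields one row per document

# With the batch axis kept, the per-row extraction from the traceback works
rows = []
for row_scores in kept:
    indices = row_scores.nonzero()[0]
    rows.append({"indices": indices.tolist(), "values": row_scores[indices].tolist()})
```

Keeping the batch axis means a batch of one document behaves like any other batch size.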

lambda-science · 2024-03-13T16:43:13Z

OK, I guess I fixed all the basic tests with the latest commits. What remains are the two real tests of component.run(), which fail with the error from _post_process_onnx_output listed above. I currently have no idea why. I will check tomorrow @Anush008 🌞

Anush008 · 2024-03-13T17:16:55Z

@lambda-science, we ran into a bug I think. Will update you on this.

Anush008 · 2024-03-13T18:28:38Z

FastEmbed v0.2.4 fixed the issue.

lambda-science · 2024-03-13T20:48:57Z

All tests are now passing, thanks! Docs have been written too.
Ready for final review I guess @Anush008! :)

For @anakin87, here is an example of a first working sparse embedder. Two things to note:

Compared to a dense embedding (a list of floats), the sparse embedding here is a dict like {"indices": List[int], "values": List[float]}.
For text embedding it works the same way, but for document embedding the sparse embedding result is not placed in the embedding field of the Document dataclass but under the meta key, as doc["meta"]["_sparse_vector"] :)
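
To make the two notes above concrete, here is a minimal pure-Python sketch of that dict layout. The helpers `dense_to_sparse` and `sparse_dot` are hypothetical names used only for illustration, and a plain dict stands in for Haystack's Document; none of this is the integration's actual code.

```python
from typing import Dict, List, Union

# The layout described above: {"indices": List[int], "values": List[float]}
SparseDict = Dict[str, Union[List[int], List[float]]]


def dense_to_sparse(dense: List[float]) -> SparseDict:
    """Keep only the non-zero positions of a dense vector (illustrative helper)."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    return {"indices": indices, "values": [dense[i] for i in indices]}


def sparse_dot(a: SparseDict, b: SparseDict) -> float:
    """Dot product of two sparse vectors over their shared indices."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


query = dense_to_sparse([0.0, 1.5, 0.0, 0.0, 2.0])
doc_vector = dense_to_sparse([0.5, 2.0, 0.0, 0.0, 0.0])

# Plain dict standing in for a Document: the sparse vector sits under meta
doc = {"content": "example text", "meta": {"_sparse_vector": doc_vector}}

score = sparse_dot(query, doc["meta"]["_sparse_vector"])
print(query)  # {'indices': [1, 4], 'values': [1.5, 2.0]}
print(score)  # 3.0
```

Only index 1 is shared between the two vectors, so the score is 1.5 * 2.0.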

Maybe some optimization can be done. The first thing would be deciding where to put these damn sparse embeddings in docs ahah. For a first complete round trip, these sparse embedders can be installed with:

fastembed-haystack @ git+https://github.com/lambda-science/haystack-core-integrations/@69129c8cac1771814ce167c76a43348600b1d27e#subdirectory=integrations/fastembed

in a requirements.txt 👍

...c/haystack_integrations/components/embedders/fastembed/fastembed_document_SPLADE_embedder.py

anakin87

@lambda-science you did great work!

I only found a few opportunities to improve this PR.

.../haystack_integrations/components/embedders/fastembed/embedding_backend/fastembed_backend.py

...c/haystack_integrations/components/embedders/fastembed/fastembed_sparse_document_embedder.py

...d/src/haystack_integrations/components/embedders/fastembed/fastembed_sparse_text_embedder.py

lambda-science · 2024-03-21T12:47:15Z

@anakin87 not sure why you changed the model name: changing "prithvida/SPLADE_PP_en_v1" to "prithvida/Splade_PP_en_v1" makes the tests crash on my side with:

FAILED tests/test_fastembed_sparse_document_embedder.py::TestFastembedSparseDocumentEmbedderDoc::test_run - ValueError: Model prithvida/Splade_PP_en_v1 is not supported in SparseTextEmbedding.Please check the supported models using `SparseTextEmbedding.list_supported_models()`
FAILED tests/test_fastembed_sparse_text_embedder.py::TestFastembedSparseTextEmbedder::test_run - ValueError: Model prithvida/Splade_PP_en_v1 is not supported in SparseTextEmbedding.Please check the supported models using `SparseTextEmbedding.list_supported_models()`

EDIT: never mind, it's fastembed 0.2.5, released yesterday!

Anush008 · 2024-03-21T13:04:06Z

Also, with FastEmbed v0.2.5, model names are now case-insensitive.
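
As a rough illustration of what case-insensitive matching means for the failure above, the sketch below resolves a requested name against a supported list by comparing lowercased strings. This is an assumption-laden stand-in, not FastEmbed's implementation: `resolve_model_name` is a hypothetical helper and `SUPPORTED` is a one-entry illustrative list.

```python
from typing import Iterable


def resolve_model_name(requested: str, supported: Iterable[str]) -> str:
    """Return the canonical spelling of a model name, ignoring case (hypothetical helper)."""
    by_lower = {name.lower(): name for name in supported}
    try:
        return by_lower[requested.lower()]
    except KeyError:
        raise ValueError(f"Model {requested} is not supported") from None


# Illustrative list only; the real one comes from the library itself
SUPPORTED = ["prithvida/Splade_PP_en_v1"]

resolved = resolve_model_name("prithvida/SPLADE_PP_en_v1", SUPPORTED)
print(resolved)  # prithvida/Splade_PP_en_v1
```

With an exact, case-sensitive comparison the two spellings would not match, which is the kind of mismatch the ValueError above reports.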

lambda-science · 2024-03-22T10:54:11Z

PR ready I guess. The last things to do are removing the build of Haystack from source
and waiting for Haystack 2.1.0.

anakin87

Two last comments, but it looks really good!

...d/src/haystack_integrations/components/embedders/fastembed/fastembed_sparse_text_embedder.py

integrations/fastembed/tests/test_fastembed_sparse_text_embedder.py

lambda-science · 2024-03-22T12:54:33Z

Resolved your comments @anakin87 and fixed the test, ready to ship I guess ahah

Anush008

Thank you @lambda-science and @anakin87.

anakin87

Merging!

lambda-science and others added 5 commits March 6, 2024 13:37

fix(opensearch): bulk error without create key

afc8e79

Merge branch 'deepset-ai:main' into main

9800f2b

Merge branch 'deepset-ai:main' into main

bf0221c

feat(FastEmbed): Scaffold for SPLADE Sparse Embedding Support

aa95f13

Revert "fix(opensearch): bulk error without create key"

4b1d8f9

This reverts commit afc8e79.

lambda-science requested a review from a team as a code owner March 13, 2024 15:52

lambda-science requested review from anakin87 and removed request for a team March 13, 2024 15:52

github-actions bot added type:documentation Improvements or additions to documentation integration:fastembed and removed type:documentation Improvements or additions to documentation labels Mar 13, 2024

lambda-science marked this pull request as draft March 13, 2024 15:52

feat(FastEmbed): __all__ fix

62d8478

github-actions bot added the type:documentation Improvements or additions to documentation label Mar 13, 2024

feat(FastEmbed): fix one test

0e0968a

feat(FastEmbed): fix one test

1feea08

anakin87 mentioned this pull request Mar 13, 2024

Support Sparse Embedding Retrieval deepset-ai/haystack#7355

Closed

feat(FastEmbed): fix a second test

e1c5602

feat(FastEmbed): removed old TODO (fixed)

a9b3827

feat(FastEmbed): fixing all test + doc

69129c8

lambda-science marked this pull request as ready for review March 13, 2024 20:39

lambda-science mentioned this pull request Mar 13, 2024

Qdrant: Support Sparse Vectors #549

Closed

Corentin added 2 commits March 13, 2024 23:38

fix output typing

10ea129

Fix output component

8e20cee

Anush008 reviewed Mar 14, 2024

...c/haystack_integrations/components/embedders/fastembed/fastembed_document_SPLADE_embedder.py

update model name

0050a6b

anakin87 reviewed Mar 21, 2024

Corentin and others added 6 commits March 21, 2024 13:27

Update integrations/fastembed/src/haystack_integrations/components/em…

5ea12b5

…bedders/fastembed/fastembed_sparse_document_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>

Update integrations/fastembed/src/haystack_integrations/components/em…

14a8c2d

…bedders/fastembed/fastembed_sparse_document_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>

Update integrations/fastembed/src/haystack_integrations/components/em…

709ac12

…bedders/fastembed/fastembed_sparse_document_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>

Update integrations/fastembed/src/haystack_integrations/components/em…

11f8584

…bedders/fastembed/fastembed_sparse_document_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>

Update integrations/fastembed/src/haystack_integrations/components/em…

727b5ab

…bedders/fastembed/fastembed_sparse_text_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>

feat(FastEmbed): remove prefix/suffix

40cb5b6

feat(FastEmbed): fix linting

e7e1666

lambda-science added 6 commits March 21, 2024 16:03

feat(FastEmbed): suggestion for progress bar

89f857d

Merge branch 'main' into fastembed-sparse

c956ee2

feat(FastEmbed): return Haystack's SparseEmbedding instead of Dict

66bc952

feat(FastEmbed): fix lint

97dd121

feat(Fastembed): run output type from dict to haystack sparseembeddin…

bc3f555

…g class

feat(FastEmbed): reduce default sparse batch size

9261122

anakin87 reviewed Mar 22, 2024

...d/src/haystack_integrations/components/embedders/fastembed/fastembed_sparse_text_embedder.py

integrations/fastembed/tests/test_fastembed_sparse_text_embedder.py

Corentin and others added 2 commits March 22, 2024 13:51

Update integrations/fastembed/src/haystack_integrations/components/em…

a697433

…bedders/fastembed/fastembed_sparse_text_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>

feat(FastEmbed): fix test

a16fc9d

Anush008 approved these changes Mar 22, 2024

anakin87 added 3 commits April 10, 2024 10:58

Merge branch 'main' into fastembed-sparse

d064cc5

updates after 2.0.1 release

a97c4ed

small fixes; naive example

1a8c707

anakin87 self-requested a review April 10, 2024 09:34

anakin87 approved these changes Apr 10, 2024

anakin87 merged commit 363c7b5 into deepset-ai:main Apr 10, 2024
7 checks passed

anakin87 mentioned this pull request Apr 12, 2024

Sparse Embeddings support: create docs/material #660

Closed
