feat(FastEmbed): Support for SPLADE Sparse Embedder #579
This reverts commit afc8e79.
Fixed one test, but two others are still failing:
> result = embedder.run(documents=[doc])
tests/test_fastembed_document_SPLADE_embedder.py:279:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/haystack_integrations/components/embedders/fastembed/fastembed_document_SPLADE_embedder.py:165: in run
embeddings = self.embedding_backend.embed(
src/haystack_integrations/components/embedders/fastembed/embedding_backend/fastembed_backend.py:86: in embed
sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/sparse/sparse_text_embedding.py:82: in embed
yield from self.model.embed(documents, batch_size, parallel, **kwargs)
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/sparse/splade_pp.py:101: in embed
yield from self._embed_documents(
/home/cmeyer/.local/share/hatch/env/virtual/fastembed-haystack/HquJQBa6/fastembed-haystack/lib/python3.12/site-packages/fastembed/common/onnx_model.py:93: in _embed_documents
yield from self._post_process_onnx_output(self.onnx_embed(batch))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = <class 'fastembed.sparse.splade_pp.SpladePP'>
output = (array([[[ -3.0453873, -2.9302917, -3.0086668, ..., -2.6947184,
-3.2767653, -3.9241323],
[-10.48...42873, ..., -4.9139 ,
-5.25412 , -5.801072 ]]], dtype=float32), array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
@classmethod
def _post_process_onnx_output(cls, output: Tuple[np.ndarray, np.ndarray]) -> Iterable[SparseEmbedding]:
logits, attention_mask = output
relu_log = np.log(1 + np.maximum(logits, 0))
weighted_log = relu_log * np.expand_dims(attention_mask, axis=-1)
max_val = np.max(weighted_log, axis=1)
# Score matrix of shape (batch_size, vocab_size)
# Most of the values are 0, only a few are non-zero
scores = np.squeeze(max_val)
for row_scores in scores:
indices = row_scores.nonzero()[0]
> scores = row_scores[indices]
E IndexError: invalid index to scalar variable.
FAILED tests/test_fastembed_document_SPLADE_embedder.py::TestFastembedDocumentSPLADEEmbedderDoc::test_run - IndexError: invalid index to scalar variable.
FAILED tests/test_fastembed_text_SPLADE_embedder.py::TestFastembedTextSPLADEEmbedder::test_run - IndexError: invalid index to scalar variable.
Should investigate tomorrow. This is the code that is failing:
def embed(self, data: List[List[str]], **kwargs) -> List[Dict[str, np.ndarray]]:
    # The embed method returns an Iterable[SparseEmbedding], so we convert it to a list of dictionaries
    sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
    return sparse_embeddings
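The traceback above is consistent with `np.squeeze` collapsing the batch axis when only one document is embedded: the attention mask in the dump has shape (1, 10), so the pooled scores have shape (1, vocab_size), and squeezing them leaves a 1-D array whose "rows" are scalars. Below is a minimal NumPy-only sketch of that behaviour and of one way to avoid it. It is an illustration under stated assumptions, not FastEmbed's actual fix; the function names `splade_scores` and `to_sparse` and the toy vocabulary size are hypothetical.

```python
import numpy as np


def splade_scores(logits: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """SPLADE-style pooling: log(1 + relu(logits)), masked, max over tokens."""
    relu_log = np.log1p(np.maximum(logits, 0))
    weighted_log = relu_log * np.expand_dims(attention_mask, axis=-1)
    # Result has shape (batch_size, vocab_size). No squeeze is applied,
    # so a batch of one document stays two-dimensional.
    return np.max(weighted_log, axis=1)


def to_sparse(scores: np.ndarray) -> list:
    """Turn each row of scores into an (indices, values) pair."""
    result = []
    for row_scores in scores:
        indices = row_scores.nonzero()[0]
        result.append((indices, row_scores[indices]))
    return result


# Toy input mimicking the dump: 1 document, 10 tokens, hypothetical vocab of 32.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 10, 32)).astype(np.float32)
mask = np.ones((1, 10), dtype=np.int64)

scores = splade_scores(logits, mask)
squeezed = np.squeeze(scores)  # what the failing code iterated over
sparse = to_sparse(scores)

print(scores.shape)    # batch axis preserved
print(squeezed.shape)  # batch axis lost, so iterating yields scalars
```

Iterating over `squeezed` would hand a scalar to the loop body, which is where indexing fails; iterating over `scores` yields one 1-D row per document.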
OK, I think I fixed all the basic tests with the latest commits. The rest are the two real test of
@lambda-science, I think we ran into a bug. Will update you on this.
FastEmbed
All tests are now passing, thanks! The docs have been written as well. @anakin87, here is an example of a first working sparse embedder. Two things to note:
Some optimization may still be possible. The first step would be to decide where to put these damn sparse embeddings in docs, haha. For a first complete round trip, these sparse embedders can be installed with: fastembed-haystack @ git+https://github.com/lambda-science/haystack-core-integrations/@69129c8cac1771814ce167c76a43348600b1d27e#subdirectory=integrations/fastembed in a
@lambda-science you did great work!
I only found a few opportunities to improve this PR.
…bedders/fastembed/fastembed_sparse_document_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>
…bedders/fastembed/fastembed_sparse_text_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>
@anakin87 not sure why you changed the model name: changing "prithvida/SPLADE_PP_en_v1" to "prithvida/Splade_PP_en_v1" makes the tests crash on my side with:
EDIT: never mind, it's because fastembed 0.2.5 was released yesterday!
Also, with FastEmbed
The PR is ready, I guess. The last thing to do is to remove the build of Haystack from source.
Two last comments, but it looks really good!
…bedders/fastembed/fastembed_sparse_text_embedder.py Co-authored-by: Stefano Fiorucci <[email protected]>
Resolved your comments @anakin87 and fixed the test; ready to ship, I guess, haha.
Thank you @lambda-science and @anakin87.
Merging!
Related to this PR: #578 and this issue: #549
Today FastEmbed released support for sparse embedders, which are useful for hybrid search (supported by some document stores such as Qdrant and Pinecone).
The goal of this PR is to bring support for FastEmbed sparse encoders into Haystack, even though Haystack does not currently support sparse embeddings in its Document dataclass. EDIT: Haystack supports sparse embeddings as of 2.1.0.
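To make the idea of a sparse embedding concrete, here is a small sketch of how two sparse vectors can be scored against each other with a dot product over their shared indices. It assumes a plain dictionary layout with "indices" and "values" arrays and unique indices per vector; the helper `sparse_dot` and the toy numbers are hypothetical and are not part of Haystack, FastEmbed, or any document store.

```python
import numpy as np


def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {"indices", "values"} dicts.

    Assumes the indices within each vector are unique.
    """
    # Positions (in a and in b) of the vocabulary indices both vectors share.
    _, pos_a, pos_b = np.intersect1d(a["indices"], b["indices"], return_indices=True)
    return float(np.dot(a["values"][pos_a], b["values"][pos_b]))


# Toy vectors: only indices 17 and 42 overlap.
query = {"indices": np.array([3, 17, 42]), "values": np.array([0.5, 1.0, 2.0])}
doc = {"indices": np.array([17, 42, 99]), "values": np.array([2.0, 0.5, 1.0])}

score = sparse_dot(query, doc)
print(score)  # 1.0 * 2.0 + 2.0 * 0.5 = 3.0
```

In a hybrid search setup, a score like this would typically be combined with a dense-embedding similarity; how the two are weighted depends on the document store.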