Qdrant: Support Sparse Vectors #549

Closed
lambda-science opened this issue Mar 6, 2024 · 4 comments
Labels: feature request (Ideas to improve an integration), integration:qdrant

Comments

@lambda-science
Contributor

lambda-science commented Mar 6, 2024

Is your feature request related to a problem? Please describe.
Qdrant v1.7.0 introduced sparse vectors (for example from SPLADE) and hybrid retrieval.
It would be cool to support them. https://qdrant.tech/articles/sparse-vectors/

Describe the solution you'd like
Allow creating a collection with an optional sparse vector, and add a retriever for hybrid search (and maybe a SPLADE-only one?)
Current:

    def _recreate_collection(self, collection_name: str, distance, embedding_dim: int):
        self.client.recreate_collection(
            collection_name=collection_name,
            vectors_config=rest.VectorParams(
                size=embedding_dim,
                distance=distance,
            ),
            shard_number=self.shard_number,
            replication_factor=self.replication_factor,
            write_consistency_factor=self.write_consistency_factor,
            on_disk_payload=self.on_disk_payload,
            hnsw_config=self.hnsw_config,
            optimizers_config=self.optimizers_config,
            wal_config=self.wal_config,
            quantization_config=self.quantization_config,
            init_from=self.init_from,
        )

could become, following the example in the article above:

    def _recreate_collection(self, collection_name: str, distance, embedding_dim: int):
        self.client.recreate_collection(
            collection_name=collection_name,
            # Named dense vector, so it can coexist with the sparse one.
            vectors_config={
                "text-dense": rest.VectorParams(
                    size=embedding_dim,
                    distance=distance,
                )
            },
            # Sparse vectors are declared separately and need no fixed size.
            sparse_vectors_config={
                "text-sparse": rest.SparseVectorParams(
                    index=rest.SparseIndexParams(
                        on_disk=False,
                    )
                )
            },
            shard_number=self.shard_number,
            replication_factor=self.replication_factor,
            write_consistency_factor=self.write_consistency_factor,
            on_disk_payload=self.on_disk_payload,
            hnsw_config=self.hnsw_config,
            optimizers_config=self.optimizers_config,
            wal_config=self.wal_config,
            quantization_config=self.quantization_config,
            init_from=self.init_from,
        )

However, this requires a number of components:

  • a SPLADE query embedder to embed the user question at query time => ISSUE if integrated into this integration package, as it probably requires some big machine-learning libraries?
  • a SPLADE encoder to write document sparse vectors at indexing time => ISSUE if integrated into this integration package, as it probably requires some big machine-learning libraries?
  • a new hybrid retriever that can do a hybrid search such as:
client.search_batch(
    collection_name=collection_name,
    requests=[
        rest.SearchRequest(
            vector=rest.NamedVector(
                name="text-dense",
                vector=query_embedding,
            ),
            limit=top_k,
        ),
        rest.SearchRequest(
            vector=rest.NamedSparseVector(
                name="text-sparse",
                vector=rest.SparseVector(
                    indices=query_indices,
                    values=query_values,
                ),
            ),
            limit=top_k,
        ),
    ],
)

Here query_embedding is the result of the classic query embedder,
and query_indices and query_values are the results of the new SPLADE encoder.
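
The search_batch call above returns two separate ranked lists, one per request, so a hybrid retriever would still have to merge them. A minimal sketch of one possible way to do that, reciprocal rank fusion, is below. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the plain ID lists are assumptions for illustration; none of this is part of the Qdrant client or of this integration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60, top_k: int = 10
) -> List[Hashable]:
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical point IDs, as they might come back from the dense and sparse requests.
dense_ids = ["a", "b", "c"]
sparse_ids = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ids, sparse_ids], top_k=3)
```

Rank-based fusion sidesteps the fact that dense and sparse scores live on different scales, which is why it is sketched here instead of a weighted sum of raw scores.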

EDIT: Also Qdrant 1.8 is out 👀 https://qdrant.tech/articles/qdrant-1.8.x/ But I don't think it breaks anything with current implementation :)

@lambda-science lambda-science added the feature request Ideas to improve an integration label Mar 6, 2024
@lambda-science lambda-science changed the title Support Qdrant Sparse Vectors and Hybrid Retrival Qdrant: Support Sparse Vectors and Hybrid Retrival Mar 6, 2024
@lambda-science
Contributor Author

@Anush008 maybe you would be interested in this.
I think I could open a PR in the coming days or weeks; I haven't started looking into the implementation yet :)

@Anush008
Contributor

We could have these if the sparse vector generation can be abstracted away by other Haystack embedding integrations.

Since Qdrant's implementation will have to stay agnostic to the vectors.
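
A rough sketch of what that separation could look like: the embedder lives in another package and hands over only indices and values. `SparseEmbedding`, `SparseTextEmbedder`, `ToyTermCountEmbedder` and `to_store_payload` are hypothetical names made up for this illustration, and the toy embedder just counts words instead of running a real model such as SPLADE.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class SparseEmbedding:
    """Hypothetical container: parallel lists of non-zero positions and weights."""

    indices: List[int]
    values: List[float]


class SparseTextEmbedder(Protocol):
    """Hypothetical interface an embedding integration could satisfy."""

    def embed(self, text: str) -> SparseEmbedding: ...


class ToyTermCountEmbedder:
    """Stand-in for a real model: counts words from a tiny fixed vocabulary."""

    def __init__(self, vocabulary: Dict[str, int]):
        self.vocabulary = vocabulary

    def embed(self, text: str) -> SparseEmbedding:
        counts: Dict[int, float] = {}
        for token in text.lower().split():
            if token in self.vocabulary:
                index = self.vocabulary[token]
                counts[index] = counts.get(index, 0.0) + 1.0
        indices = sorted(counts)
        return SparseEmbedding(indices=indices, values=[counts[i] for i in indices])


def to_store_payload(embedder: SparseTextEmbedder, text: str) -> Dict[str, list]:
    # The store only ever sees indices and values, never the model behind them.
    embedding = embedder.embed(text)
    return {"indices": embedding.indices, "values": embedding.values}


embedder = ToyTermCountEmbedder({"qdrant": 0, "sparse": 7, "vectors": 42})
payload = to_store_payload(embedder, "Sparse vectors in Qdrant use sparse indices")
```

Any embedder with a matching `embed` method could be swapped in without the store side changing.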

@lambda-science
Contributor Author

lambda-science commented Mar 10, 2024

> We could have these if the sparse vector generation can be abstracted away by other Haystack embedding integrations.
>
> Since Qdrant's implementation will have to stay agnostic to the vectors.

Got it!

It would still need a small modification to accept sparse vectors in run(), set up the collection with a sparse vector, and run sparse queries.
For example:

  • NEW query_by_sparse() in DocumentStore
  • Modify _recreate_collection() in DocumentStore to add a sparse vector config (sparse_vectors_config) next to vectors_config
  • NEW QdrantSparseRetriever in Retrievers that just calls self._document_store.query_by_sparse()

But yeah, maybe I could work on a general component (outside of Qdrant) that computes the sparse embedding and passes it to the Qdrant object :)
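
The changes listed above could fit together roughly as in this sketch. It is not the integration's code: `InMemorySparseStore` is a hypothetical stand-in that scores documents with a plain dot product so the example runs without a Qdrant server, and `SparseRetriever` plays the role of the proposed QdrantSparseRetriever. Only the name query_by_sparse() comes from the proposal; its signature here is an assumption.

```python
from typing import Dict, List, Tuple


class InMemorySparseStore:
    """Hypothetical stand-in for the document store; not the Qdrant integration."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[int, float]] = {}

    def write(self, doc_id: str, indices: List[int], values: List[float]) -> None:
        self._docs[doc_id] = dict(zip(indices, values))

    def query_by_sparse(
        self, indices: List[int], values: List[float], top_k: int = 10
    ) -> List[Tuple[str, float]]:
        query = dict(zip(indices, values))
        scored = []
        for doc_id, doc in self._docs.items():
            # Dot product over the dimensions both vectors share.
            score = sum(weight * doc[i] for i, weight in query.items() if i in doc)
            if score > 0:
                scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


class SparseRetriever:
    """Thin wrapper: all the work is delegated to the store."""

    def __init__(self, document_store: InMemorySparseStore, top_k: int = 10):
        self._document_store = document_store
        self._top_k = top_k

    def run(self, query_indices: List[int], query_values: List[float]) -> dict:
        hits = self._document_store.query_by_sparse(
            query_indices, query_values, self._top_k
        )
        return {"documents": hits}


store = InMemorySparseStore()
store.write("doc-1", [0, 7], [1.0, 2.0])
store.write("doc-2", [7, 42], [0.5, 3.0])
store.write("doc-3", [99], [1.0])
result = SparseRetriever(store, top_k=2).run([7, 42], [1.0, 1.0])
```

Keeping the retriever this thin means the sparse query logic stays in one place, the store, and the retriever never needs to know which model produced the vectors.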

And unrelated to this implementation (Qdrant-agnostic):

@anakin87 anakin87 changed the title Qdrant: Support Sparse Vectors and Hybrid Retrival Qdrant: Support Sparse Vectors Apr 12, 2024
@anakin87
Member

Let's close this issue.

In case we are interested in introducing a hybrid Retriever in the future, I suggest we open another one and discuss it there.

Development

No branches or pull requests

4 participants