feat: Qdrant support #646
Merged
Commits
- bc5ff2e feat: Qdrant support (Anush008)
- 090bfc9 Merge branch 'main' into qdrant-load (Anush008)
- b3e8ffa chore: format fondant_component.yaml (Anush008)
- 8a6786e test: index_qdrant test (Anush008)
- b072580 docs: client param README.md (Anush008)
- 7624525 chore: remove client param (Anush008)
- 4096c5c fix: attribute assignment (Anush008)
- 38acad6 docs: README autogen (Anush008)
- c5f1c90 Merge branch 'main' into qdrant-load (Anush008)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
FROM --platform=linux/amd64 python:3.8-slim as base | ||
|
||
# System dependencies | ||
RUN apt-get update && \ | ||
apt-get upgrade -y && \ | ||
apt-get install git -y | ||
|
||
# Install requirements | ||
COPY requirements.txt / | ||
RUN pip3 install --no-cache-dir -r requirements.txt | ||
|
||
# Install Fondant | ||
# This is split from other requirements to leverage caching | ||
ARG FONDANT_VERSION=main | ||
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION} | ||
|
||
# Set the working directory to the component folder | ||
WORKDIR /component/src | ||
|
||
# Copy over src-files | ||
COPY src/ . | ||
|
||
ENTRYPOINT ["fondant", "execute", "main"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# Index Qdrant | ||
|
||
### Description | ||
A Fondant component to load textual data and embeddings into a [Qdrant](https://qdrant.tech/) database. | ||
|
||
### Inputs / outputs | ||
|
||
**This component consumes:** | ||
|
||
- text | ||
    - data: string | ||
    - embedding: list<item: float> | ||
|
||
**This component produces no data.** | ||
|
||
> [!IMPORTANT] | ||
> A Qdrant collection has to be created in advance with the appropriate vector configuration. See the [Qdrant collections documentation](https://qdrant.tech/documentation/concepts/collections/) for how to do this. | ||
|
||
### Arguments | ||
|
||
The component takes the following arguments to alter its behavior: | ||
|
||
| argument | type | description | default | | ||
| -------- | ---- | ----------- | ------- | | ||
| collection_name | str | The name of the Qdrant collection to upsert data into. | / | | ||
| location | str | If `:memory:` - use in-memory Qdrant instance else use it as a url parameter. | None | | ||
| batch_size | int | The batch size to use when uploading points to Qdrant. | 64 | | ||
| parallelism | int | The number of parallel workers to use when uploading points to Qdrant. | 1 | | ||
| url | str | Either host or str of 'Optional[scheme], host, Optional[port], Optional[prefix]'. E.g. `http://localhost:6333` | None | | ||
| port | int | Port of the REST API interface. | 6333 | | ||
| grpc_port | int | Port of the gRPC interface. | 6334 | | ||
| prefer_grpc | bool | If `true` - use gRPC interface whenever possible in custom methods. | False | | ||
| https | bool | If `true` - use HTTPS(SSL) protocol. | False | | ||
| api_key | str | API key for authentication in Qdrant Cloud. | None | | ||
| prefix | str | If set, add `prefix` to the REST URL path. Example: `service/v1` will result in `http://localhost:6333/service/v1/{qdrant-endpoint}` for REST API. | None | | ||
| timeout | int | Timeout for REST and gRPC API requests. | 5 for REST, Unlimited for GRPC | | ||
| host | str | Host name of Qdrant service. If url and host are not set, defaults to 'localhost'. | None | | ||
| path | str | Persistence path for QdrantLocal. Eg. `local_data/qdrant` | None | | ||
| force_disable_check_same_thread | bool | Force disable check_same_thread for QdrantLocal sqlite connection. | False | | ||
|
||
|
||
### Usage | ||
|
||
You can add this component to your pipeline using the following code: | ||
|
||
```python | ||
from fondant.pipeline import ComponentOp | ||
|
||
index_qdrant_op = ComponentOp.from_registry( | ||
    name="index_qdrant", | ||
    # Add arguments | ||
    arguments={ | ||
        "collection_name": "fondant_loaded_data", | ||
        # "location": "http://localhost:6333", | ||
        # "batch_size": 100, | ||
    } | ||
) | ||
pipeline.add_op(index_qdrant_op, dependencies=[...]) | ||
``` |
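The `host`, `port`, `https`, and `prefix` arguments in the table above jointly determine the REST base URL. The sketch below illustrates that composition with a hypothetical helper, `build_rest_url`; it is not part of `qdrant_client`, which assembles the URL internally, and it only aims to reproduce the `service/v1` example from the `prefix` row.

```python
from typing import Optional


def build_rest_url(
    host: str = "localhost",
    port: Optional[int] = 6333,
    https: bool = False,
    prefix: Optional[str] = None,
) -> str:
    """Hypothetical helper: compose a REST base URL from the component's arguments."""
    scheme = "https" if https else "http"
    url = f"{scheme}://{host}"
    if port is not None:
        url += f":{port}"
    if prefix:
        # Normalise so that "service/v1" and "/service/v1/" give the same result.
        url += "/" + prefix.strip("/")
    return url


print(build_rest_url(prefix="service/v1"))
```

Individual Qdrant endpoints would then be appended to this base URL, as in `http://localhost:6333/service/v1/{qdrant-endpoint}`.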
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
name: Index Qdrant | ||
description: >- | ||
  A Fondant component to load textual data and embeddings into a Qdrant | ||
  database. | ||
image: 'fndnt/index_qdrant:dev' | ||
tags: | ||
  - Data writing | ||
consumes: | ||
  text: | ||
    fields: | ||
      data: | ||
        type: string | ||
      embedding: | ||
        type: array | ||
        items: | ||
          type: float32 | ||
args: | ||
  collection_name: | ||
    description: The name of the Qdrant collection to upsert data into. | ||
    type: str | ||
  location: | ||
    description: The location of the Qdrant instance. | ||
    type: str | ||
    default: None | ||
  batch_size: | ||
    description: The batch size to use when uploading points to Qdrant. | ||
    type: int | ||
    default: 64 | ||
  parallelism: | ||
    description: The number of parallel workers to use when uploading points to Qdrant. | ||
    type: int | ||
    default: 1 | ||
  url: | ||
    description: >- | ||
      Either host or str of 'Optional[scheme], host, Optional[port], | ||
      Optional[prefix]'. | ||
    type: str | ||
    default: None | ||
  port: | ||
    description: Port of the REST API interface. | ||
    type: int | ||
    default: 6333 | ||
  grpc_port: | ||
    description: Port of the gRPC interface. | ||
    type: int | ||
    default: 6334 | ||
  prefer_grpc: | ||
    description: If `true` - use gRPC interface whenever possible in custom methods. | ||
    type: bool | ||
    default: false | ||
  https: | ||
    description: If `true` - use HTTPS(SSL) protocol. | ||
    type: bool | ||
    default: false | ||
  api_key: | ||
    description: API key for authentication in Qdrant Cloud. | ||
    type: str | ||
    default: None | ||
  prefix: | ||
    description: 'If set, add `prefix` to the REST URL path.' | ||
    type: str | ||
    default: None | ||
  timeout: | ||
    description: Timeout for API requests. | ||
    type: int | ||
    default: None | ||
  host: | ||
    description: >- | ||
      Host name of Qdrant service. If url and host are not set, defaults to | ||
      'localhost'. | ||
    type: str | ||
    default: None | ||
  path: | ||
    description: Persistence path for QdrantLocal. Eg. `local_data/qdrant` | ||
    type: str | ||
    default: None | ||
  force_disable_check_same_thread: | ||
    description: Force disable check_same_thread for QdrantLocal sqlite connection. | ||
    type: bool | ||
    default: false |
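The component code in this PR reads `row.text_data` and `row.text_embedding`, which suggests that each consumed field reaches the component as a `<subset>_<field>` column. The sketch below illustrates that naming on a plain dict mirroring the `consumes` section; `expected_columns` is a hypothetical helper for illustration, not a Fondant API.

```python
from typing import Any, Dict, List

# Plain-dict mirror of the `consumes` section of the component spec.
CONSUMES: Dict[str, Dict[str, Any]] = {
    "text": {
        "fields": {
            "data": {"type": "string"},
            "embedding": {"type": "array", "items": {"type": "float32"}},
        }
    }
}


def expected_columns(consumes: Dict[str, Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: list the flattened `<subset>_<field>` column names."""
    return [
        f"{subset}_{field}"
        for subset, spec in consumes.items()
        for field in spec["fields"]
    ]


print(expected_columns(CONSUMES))
```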
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
qdrant_client==1.6.9 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
import uuid | ||
from typing import List, Optional | ||
|
||
import dask.dataframe as dd | ||
from fondant.component import DaskWriteComponent | ||
from qdrant_client import QdrantClient, models | ||
|
||
|
||
class IndexQdrantComponent(DaskWriteComponent): | ||
    def __init__( | ||
        self, | ||
        *_, | ||
        collection_name: str, | ||
        location: Optional[str] = None, | ||
        batch_size: int = 64, | ||
        parallelism: int = 1, | ||
        url: Optional[str] = None, | ||
        port: Optional[int] = 6333, | ||
        grpc_port: int = 6334, | ||
        prefer_grpc: bool = False, | ||
        https: Optional[bool] = None, | ||
        api_key: Optional[str] = None, | ||
        prefix: Optional[str] = None, | ||
        timeout: Optional[float] = None, | ||
        host: Optional[str] = None, | ||
        path: Optional[str] = None, | ||
        force_disable_check_same_thread: bool = False, | ||
    ): | ||
        self.client = QdrantClient( | ||
            location=location, | ||
            url=url, | ||
            port=port, | ||
            grpc_port=grpc_port, | ||
            prefer_grpc=prefer_grpc, | ||
            https=https, | ||
            api_key=api_key, | ||
            prefix=prefix, | ||
            timeout=timeout, | ||
            host=host, | ||
            path=path, | ||
            force_disable_check_same_thread=force_disable_check_same_thread, | ||
        ) | ||
        self.collection_name = collection_name | ||
        self.batch_size = batch_size | ||
        self.parallelism = parallelism | ||
|
||
    def write(self, dataframe: dd.DataFrame) -> None: | ||
        records: List[models.Record] = [] | ||
        for part in dataframe.partitions: | ||
            df = part.compute() | ||
            for row in df.itertuples(): | ||
                payload = { | ||
                    "id_": str(row.Index), | ||
                    "passage": row.text_data, | ||
                } | ||
                id = str(uuid.uuid4()) | ||
                embedding = row.text_embedding | ||
                records.append(models.Record(id=id, payload=payload, vector=embedding)) | ||
|
||
        # Upload the collected records once, so no record is sent twice | ||
        self.client.upload_records( | ||
            collection_name=self.collection_name, | ||
            records=records, | ||
            batch_size=self.batch_size, | ||
            parallel=self.parallelism, | ||
        ) |
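The `write` method turns each row into a record with a fresh UUID, a payload holding the original index and passage, and the embedding as the vector, then hands the records to the client in batches of `batch_size`. The sketch below mimics that flow using only the standard library. `Point` is a stand-in for `qdrant_client.models.Record`, and `rows_to_points` and `batched` are hypothetical helpers; the actual batching happens inside the client and may differ.

```python
import uuid
from dataclasses import dataclass
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple


@dataclass
class Point:
    """Stand-in for `qdrant_client.models.Record` (an assumption for this sketch)."""

    id: str
    payload: Dict[str, Any]
    vector: List[float]


def rows_to_points(rows: Iterable[Tuple[Any, str, List[float]]]) -> Iterator[Point]:
    """Build one point per (index, text, embedding) row, as `write` does."""
    for index, text, embedding in rows:
        yield Point(
            id=str(uuid.uuid4()),  # random point ID, independent of the row index
            payload={"id_": str(index), "passage": text},
            vector=list(embedding),
        )


def batched(points: Iterable[Point], batch_size: int) -> Iterator[List[Point]]:
    """Yield consecutive lists of at most `batch_size` points."""
    iterator = iter(points)
    while batch := list(islice(iterator, batch_size)):
        yield batch


rows = [(i, f"passage {i}", [0.1, 0.2]) for i in range(5)]
batches = list(batched(rows_to_points(rows), batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Because the point ID is a random UUID, the original row index is only recoverable from the `id_` payload field, and re-running the component on the same data would add new points rather than overwrite existing ones.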
This file is autogenerated by our pre-commit hooks. You can add any custom information (like the "important" note) to the `description` field in the `fondant_component.yaml` file. It supports markdown.
Should be resolved now.
38acad6