-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: remove vector points when unassociating datasource from collection #322
Conversation
backend/server/routers/collection.py
Outdated
VECTOR_STORE_CLIENT.delete_data_point_vectors(collection_name=request.collection_name, | ||
data_point_vectors=VECTOR_STORE_CLIENT.list_data_point_vectors( | ||
collection_name=request.collection_name, | ||
data_source_fqn=request.data_source_fqn | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change itself is fine I guess
Ideally,
- delete should be implemented directly instead of having to fetch the data points first. E.g. https://qdrant.tech/documentation/concepts/points/#delete-points
- Our vector store implementations have a hidden limit of 1e6 on points scroll. We should not have any limits, rather pagination with a batch size
cognita/backend/modules/vector_db/qdrant.py
Lines 235 to 237 in 783634c
if len(data_point_vectors) > MAX_SCROLL_LIMIT: stop = True break - This deletion should also be async worker job because the operation might take much more longer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said this change might still be useful for collections with less than 10k data points
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This deletion should also be async worker job because the operation might take much more longer
can use the same process_pool here used in #321 , with the same caveats.
If we already have 2 use cases for this, better to think about queue & worker based approach
…re/remove_vector_points
Changes:
Caveats:
Ready for review @chiragjn |
@kunwar31 Apologies for the delay, can you look into rebasing this PR? |
I am closing this PR since many changes have been made to collections and data sources APIs since this PR was creted. |
Solves #317