
Feature: remove vector points when unassociating datasource from collection #322

Closed

Conversation

Contributor

@kunwar31 kunwar31 commented Sep 5, 2024

Solves #317

Comment on lines 143 to 148
VECTOR_STORE_CLIENT.delete_data_point_vectors(
    collection_name=request.collection_name,
    data_point_vectors=VECTOR_STORE_CLIENT.list_data_point_vectors(
        collection_name=request.collection_name,
        data_source_fqn=request.data_source_fqn,
    ),
)
Member

@chiragjn chiragjn Sep 6, 2024


This change itself is fine, I guess.

Ideally, this deletion should also be an async worker job because the operation might take much longer.

Member


That being said, this change might still be useful for collections with fewer than 10k data points.

Contributor Author


  • This deletion should also be an async worker job because the operation might take much longer

We can use the same process_pool here that was used in #321, with the same caveats.
If we already have two use cases for this, it would be better to think about a queue- and worker-based approach (rough sketch below).
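Roughly what that offload would look like (just a sketch: `unassociate_data_source` and the helper below are illustrative names, `process_pool` mirrors the name referenced from #321, and `VECTOR_STORE_CLIENT` is the same client used in the snippet above):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

# Illustrative shared pool, analogous to the process_pool referenced from #321 (assumed).
process_pool = ProcessPoolExecutor(max_workers=2)


def _delete_vectors_for_data_source(collection_name: str, data_source_fqn: str) -> None:
    # Blocking work: list the vectors belonging to the data source, then delete them.
    vectors = VECTOR_STORE_CLIENT.list_data_point_vectors(
        collection_name=collection_name,
        data_source_fqn=data_source_fqn,
    )
    VECTOR_STORE_CLIENT.delete_data_point_vectors(
        collection_name=collection_name,
        data_point_vectors=vectors,
    )


async def unassociate_data_source(collection_name: str, data_source_fqn: str) -> None:
    loop = asyncio.get_running_loop()
    # Run the (potentially long) deletion off the event loop so the API stays responsive.
    await loop.run_in_executor(
        process_pool, _delete_vectors_for_data_source, collection_name, data_source_fqn
    )
```

Since the deletion is mostly I/O-bound network calls, the default thread-pool executor would work just as well here; the process pool is only suggested because #321 already introduces one.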

@kunwar31 kunwar31 marked this pull request as draft September 7, 2024 07:09
@kunwar31 kunwar31 marked this pull request as ready for review September 7, 2024 12:44
Contributor Author

kunwar31 commented Sep 7, 2024

Changes:

  1. Deletion uses run_in_executor, so it does not block the event loop
  2. Added yield_data_point_vector_batches; list_data_point_vectors is now implemented on top of it
  3. Used yield_data_point_vector_batches and deleted per batch, to reduce the memory footprint
  4. Created a delete_data_point_vectors_by_data_source method directly in the base vectorDB class, so it can be used for all DBs (rough sketch of points 2-4 below)
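A simplified sketch of what points 2-4 look like in the base class (the real signatures in the PR may differ, and `DataPointVector` stands in for whatever record type the codebase uses):

```python
from typing import Iterator, List


class BaseVectorDB:
    def delete_data_point_vectors(
        self, collection_name: str, data_point_vectors: List["DataPointVector"]
    ) -> None:
        # Existing per-DB delete; implemented by each concrete vector DB.
        raise NotImplementedError

    def yield_data_point_vector_batches(
        self, collection_name: str, data_source_fqn: str, batch_size: int = 1000
    ) -> Iterator[List["DataPointVector"]]:
        # Each concrete vector DB yields the data source's vectors page by page,
        # so we never hold every vector of a collection in memory at once.
        raise NotImplementedError

    def list_data_point_vectors(
        self, collection_name: str, data_source_fqn: str
    ) -> List["DataPointVector"]:
        # Now just a convenience wrapper over the batched generator.
        vectors: List["DataPointVector"] = []
        for batch in self.yield_data_point_vector_batches(collection_name, data_source_fqn):
            vectors.extend(batch)
        return vectors

    def delete_data_point_vectors_by_data_source(
        self, collection_name: str, data_source_fqn: str
    ) -> None:
        # Shared implementation usable by all DBs: fetch a batch, delete it, repeat.
        for batch in self.yield_data_point_vector_batches(collection_name, data_source_fqn):
            self.delete_data_point_vectors(
                collection_name=collection_name, data_point_vectors=batch
            )
```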

Caveats:

  1. The delete should be implemented directly instead of having to fetch the data points first, at least for the vector DBs that support it (see the sketch after this list)
    • This will need overriding the delete_data_point_vectors_by_data_source method for each vector DB
    • Create a new issue for this? There is already a lot going on in this PR
  2. SingleStore does not have a batched yield method, so the 1e6 scroll limit still applies
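For caveat 1, a per-DB override could push the delete down to the store itself instead of scrolling first. A sketch of what that might look like for Qdrant (the payload key `data_source_fqn` is an assumption about how the vectors are tagged; adjust to the actual payload schema):

```python
from qdrant_client import QdrantClient, models


class QdrantVectorDB:  # would override BaseVectorDB.delete_data_point_vectors_by_data_source
    def __init__(self, url: str):
        self.client = QdrantClient(url=url)

    def delete_data_point_vectors_by_data_source(
        self, collection_name: str, data_source_fqn: str
    ) -> None:
        # One server-side delete-by-filter call; no need to fetch the points first.
        self.client.delete(
            collection_name=collection_name,
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[
                        models.FieldCondition(
                            key="data_source_fqn",  # assumed payload key
                            match=models.MatchValue(value=data_source_fqn),
                        )
                    ]
                )
            ),
        )
```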

Ready for review @chiragjn

@Blakeinstein
Contributor

@kunwar31 Apologies for the delay, can you look into rebasing this PR?

Contributor

mnvsk97 commented Oct 12, 2024

> @kunwar31 Apologies for the delay, can you look into rebasing this PR?

Hi @kunwar31, I've rebased this branch onto the main branch. Please validate again with all the vector databases that we support and let us know.

Contributor

mnvsk97 commented Oct 24, 2024

I am closing this PR since many changes have been made to the collections and data sources APIs since it was created.

@mnvsk97 mnvsk97 closed this Oct 24, 2024