Issue: Chromadb document deletion not working #4880
Comments
🤖 AI-generated response by Steercode - chat with Langchain codebase
It seems like you are trying to delete a document from the Chroma collection using the _collection.delete() method. However, the proper method to delete a document from the Chroma collection is delete_document(). You can use this method as follows:
Make sure to replace list_of_ids with the list of document IDs you want to delete. This should properly delete the documents from the Chroma collection. You can continue the conversation at https://steercode.com/ |
I have the same question. @anant-patankar, did you solve the problem? |
Ensure that the IDs in the list are strings. |
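A minimal sketch of the advice above, assuming the IDs may arrive as integers or other non-string values. The helper name normalize_ids is hypothetical and is not part of Chroma or LangChain; it only converts each ID to a string and drops duplicates before the list is handed to a delete call.

```python
from typing import Iterable, List


def normalize_ids(raw_ids: Iterable[object]) -> List[str]:
    """Convert every ID to a string and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for raw in raw_ids:
        text = str(raw)
        if text not in seen:
            seen.add(text)
            result.append(text)
    return result


# Mixed integer and string IDs become a clean list of strings.
print(normalize_ids([1, "2", 3, 1]))
```

The result can then be passed as the ids argument of the delete call discussed in this thread.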
The steercode solution does not work. It gives me the following error: |
Just replace the function name delete_document with delete; there is no delete_document function in the Chroma class.
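One way to check whether a delete actually took effect is to ask the collection for the same IDs right afterwards. The sketch below assumes a collection object that exposes delete(ids=...) and get(ids=...) returning a dict with an "ids" key, as the chromadb collection used in this thread does. The function name delete_and_verify and the FakeCollection class are hypothetical; the fake exists only so the example runs without a Chroma installation.

```python
from typing import List, Sequence


def delete_and_verify(collection, ids: Sequence[object]) -> List[str]:
    """Delete `ids` from a Chroma-style collection; return the IDs still present."""
    str_ids = [str(i) for i in ids]
    if not str_ids:
        return []
    collection.delete(ids=str_ids)
    # Ask for the same IDs again; anything returned was not deleted.
    return list(collection.get(ids=str_ids)["ids"])


class FakeCollection:
    """Hypothetical in-memory stand-in, used only to exercise the helper."""

    def __init__(self, ids):
        self._ids = set(ids)

    def delete(self, ids):
        self._ids -= set(ids)

    def get(self, ids=None):
        wanted = self._ids if ids is None else self._ids & set(ids)
        return {"ids": sorted(wanted)}


fake = FakeCollection(["a", "b", "c"])
leftover = delete_and_verify(fake, ["a", "b"])
print(leftover)           # IDs that survived the delete
print(fake.get()["ids"])  # IDs remaining in the collection
```

With the LangChain wrapper from the original question, the call would presumably look like delete_and_verify(chroma_db._collection, list_of_ids). A non-empty return value would indicate that the delete did not take effect in the current process.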
|
Hi everyone, chiming in on this. I tried what you suggested and used I also tried: Is there a way to completely remove IDs and the corresponding data from the database, or to completely remove an entire collection? Thank you, |
Hi - has anyone found a solution yet? I'm facing the same issue. |
Same issue here. I'm calling the endpoint api/v1/collections/<collection_name> with method DELETE, but it only deletes the entry in the collections table; all the documents, metadata and embedding_fulltext_search* rows are still in the SQLite database. |
Reference the below code. This works after the update to chroma, where now it uses SQLite instead of duckdb. Based on the
|
I strongly suspect that this will not delete the documents stored in the embedding_fulltext_search* tables, because these tables contain no IDs that would allow filtering by collection ID or document ID. For example, embedding_fulltext_search has just one column, the document itself, without any IDs. |
If you want to delete documents by IDs, consider the following code:
|
Hello, I am worried about this bug as well. I have followed the above to remove my documents. I am noticing embedding data in Everything else in the db seems to be removed successfully except these two things. |
Any update? It seems that Chroma DB still includes deleted data, or keeps it as None values! |
@ALIYoussef same problem here. After I delete a document and then retrieve relevant documents, some results come back as None, and they correspond to the old deleted documents. |
This is a serious issue for us. We are trying to delete outdated documents and replace them with updated documents in an active vector store. The following script:
#!/usr/bin/env python
import chromadb
import numpy as np
import subprocess
import time
NUM_DOCS = 5_000
EMBEDDING_SIZE = 1000
VS_PATH = "./vs_test"
# disk usage in human readable format (e.g. '2,1GB')
du = lambda path: subprocess.check_output(["du", "-sh", path]).split()[0].decode("utf-8")
# create/open the vector store
client = chromadb.PersistentClient(VS_PATH)
collection = client.get_or_create_collection(name="test")
# delete existing documents
ids = collection.get()["ids"]
print(f"Deleting {len(ids)} existing documents...")
start_time = time.time()
if ids:
    collection.delete(ids=ids)
print(f"{collection.count()} documents after deletion.")
end_time = time.time()
print(f"Document deletion runtime: {round(end_time - start_time)} seconds")
# add new documents
ids = [str(id) for id in range(NUM_DOCS)]
embeddings = [list(np.random.normal(size=EMBEDDING_SIZE)) for id in ids]
print(f"Adding {len(ids)} documents...")
start_time = time.time()
collection.upsert(ids=ids, documents=ids, embeddings=embeddings)
end_time = time.time()
print(f"{collection.count()} documents after addition.")
print(f"Document addition runtime: {round(end_time - start_time)} seconds")
# print on-disk size
print(f"Vector store size: {du('./vs_test')}")
print("")
When I run this five times:
rm -rf ./vs_test/ && \
./bug.py && \
./bug.py && \
./bug.py && \
./bug.py && \
./bug.py
I get increasing addition runtimes and on-disk vector store sizes:
On our production vector store with ~55k documents, the document addition time grew to 11 minutes and the on-disk size grew to 4.2 GB after several deletion/addition cycles. We're using Chroma 0.5.4 and SQLite3 3.39.4. |
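One thing that may contribute to the growing on-disk size, though this thread does not confirm it as the cause, is ordinary SQLite behavior: deleting rows puts the freed pages on an internal free list instead of returning them to the filesystem, and the file only shrinks after a VACUUM (or with auto_vacuum enabled). The sketch below demonstrates that behavior on a throwaway database with Python's standard sqlite3 module. It does not touch Chroma, and whether running VACUUM on a live Chroma store is safe is an open question that is not answered here. Chroma's vector index files are separate and are not covered by this example.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.sqlite3")
    conn = sqlite3.connect(path)
    # Make the default explicit: freed pages are kept in the file.
    conn.execute("PRAGMA auto_vacuum = NONE")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs (body) VALUES (?)",
        [("x" * 1000,) for _ in range(2000)],
    )
    conn.commit()
    size_full = os.path.getsize(path)

    # Delete every row; the pages are freed internally but stay in the file.
    conn.execute("DELETE FROM docs")
    conn.commit()
    size_after_delete = os.path.getsize(path)

    # VACUUM rebuilds the file and releases the free pages.
    conn.execute("VACUUM")
    size_after_vacuum = os.path.getsize(path)
    conn.close()

print(f"full: {size_full} bytes")
print(f"after DELETE: {size_after_delete} bytes")
print(f"after VACUUM: {size_after_vacuum} bytes")
```

If the same pattern shows up in a copy of the vector store's SQLite file, that would point at free pages rather than at live data as the source of the extra size.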
Hello all! I had the same problem in production and it was very serious for our company! We add collections with many vectors/documents and update them very often. I found a strange, and temporary, solution by testing numerous solutions:
ids = chromaColl.get()['ids']
if ids:
    chromaColl.delete(ids)
del chromaColl
_chromadb.delete_collection(collectionName)
Why is it absolutely necessary to call these 2 deletions in order to empty the data correctly? Thank you! |
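The two-step workaround above can be wrapped in a small function. This is only a sketch of what the comment describes, assuming a client that exposes get_collection(name) and delete_collection(name) and a collection that exposes get() and delete(ids=...), as chromadb does. The name purge_collection and the Fake* classes are hypothetical; the fakes exist only so the example runs without Chroma installed, and they say nothing about why both deletions are needed.

```python
def purge_collection(client, name: str) -> int:
    """Delete all documents in a collection, then drop the collection itself.

    Returns the number of document IDs that were deleted first.
    """
    collection = client.get_collection(name)
    ids = collection.get()["ids"]
    if ids:
        collection.delete(ids=ids)
    del collection  # drop the local reference, as in the workaround
    client.delete_collection(name)
    return len(ids)


class FakeCollection:
    """Hypothetical in-memory collection for demonstration only."""

    def __init__(self, ids):
        self._ids = list(ids)

    def get(self):
        return {"ids": list(self._ids)}

    def delete(self, ids):
        self._ids = [i for i in self._ids if i not in set(ids)]


class FakeClient:
    """Hypothetical in-memory client for demonstration only."""

    def __init__(self):
        self.collections = {"test": FakeCollection(["1", "2", "3"])}

    def get_collection(self, name):
        return self.collections[name]

    def delete_collection(self, name):
        del self.collections[name]


fake_client = FakeClient()
deleted = purge_collection(fake_client, "test")
print(deleted)
print(fake_client.collections)
```

After the call, the collection has to be recreated (for example with get_or_create_collection, as in the reproduction script earlier in the thread) before new documents are added.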
@jczic - thanks very much for sharing this! Does this allow the documents to be deleted and refreshed while there are active connections (with the understanding that those connections have a brief window of reduced data)? |
@chrispy-snps You can also try opening the |
This also seems to stem from the way Chroma is used, particularly in multithreaded/asynchronous mode. For my part, I don't use |
Hello all, I was having the same problem: I deleted some documents, and when I queried my collection I got "None" values. This helped:
from chromadb.api.client import SharedSystemClient
client._system.stop()
SharedSystemClient._identifer_to_system.pop(client._identifier, None)
Adding this after deleting the documents solved my problem 🙂 |
Issue you'd like to raise.
I am trying to delete a single document from Chroma db using the following code:
chroma_db = Chroma(
    persist_directory=embeddings_save_path,
    embedding_function=OpenAIEmbeddings(
        model=os.getenv("EMBEDDING_MODEL_NAME"),
        chunk_size=1,
        max_retries=5,
    ),
)
chroma_db._collection.delete(ids=list_of_ids)
chroma_db.persist()
However, the document is not actually being deleted. After loading/re-loading the chroma db from local, it is still showing the document in it.
I have tried the following things to fix the issue:
I have made sure that the list of ids is correct.
I have tried deleting the document multiple times.
I have tried restarting the Chroma db server.
None of these things have worked.
I am not sure why the document is not being deleted. I would appreciate any help in resolving this issue.
Thanks,
Anant Patankar
Suggestion:
No response