[Bug]: Indexing a large number of Documents resulted in ValueError with ChromaDB #9525
Comments
🤖 Hello @hieuv! I'm Dosu, a friendly bot here to assist while we wait for a human maintainer. I can help you troubleshoot issues, answer questions, and guide you on how to contribute to this project. Let's tackle this together!

It seems you're encountering an issue similar to one addressed in an earlier pull request, but the batch size limit in your case differs from the one that was previously fixed. The ValueError is likely due to ChromaDB's batch size limit, which is currently 5461, while LlamaIndex is attempting to insert in batches of 41665, which exceeds that limit.

One possible solution is to adjust the batch size used by LlamaIndex to match the ChromaDB limit; however, that would require a change in the LlamaIndex codebase. Alternatively, you could split your documents into smaller batches before feeding them into LlamaIndex. This should avoid the ValueError, though it might affect the performance of your application.

I hope this helps! If you have any further questions or need more clarification, please don't hesitate to ask.
Workaround for now is to just index the data in batches. We need to update the vector index code: it's generating all embeddings in memory and then doing one giant insert into the db.
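The batch-indexing workaround amounts to slicing the node list before insertion. A minimal sketch (the 5461 limit is the value from the traceback; how each slice is fed back into LlamaIndex, e.g. via `index.insert_nodes`, depends on your setup and is an assumption here):

```python
def batched(items, batch_size):
    """Yield successive slices of items, each no longer than batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 229k nodes, as in this issue, split under ChromaDB's reported max of 5461
sizes = [len(b) for b in batched(list(range(229_000)), 5461)]
print(len(sizes), max(sizes))  # → 42 5461
```

Inserting each slice separately (rather than one 229k-node call) keeps every underlying ChromaDB add under the limit, at the cost of more round trips.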
@logan-markewich The logic in LlamaIndex is already there, provided by #8019. It could be that ChromaDB updated their maximum again, but I have no way of confirming it.
Hi, @hieuv, I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale.

From what I understand, you reported a ValueError when indexing a large number of documents with ChromaDB; the error occurs when the batch size exceeds the maximum allowed batch size of 5461. A workaround was suggested to index the data in smaller batches, and the logic in LlamaIndex has since been updated to handle this issue as well. It was also mentioned that the issue might be related to a potential update in ChromaDB's maximum batch size.

Could you please confirm whether this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Bug Description
Indexing a large number of Documents resulted in ValueError with ChromaDB.
In my case, I had 34k files, loaded into 191k+ Documents, and parsed into 229k Nodes.
I found #7648, and chroma-core/chroma#1049.
From the log, it seems that LlamaIndex correctly runs in batches of 41665, as the fix for #7648 intended. However, ChromaDB complains that the maximum batch size is 5461.
Log available below.
Using LlamaIndex 0.9.15 and ChromaDB 0.4.19
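The mismatch between the two limits can be seen in miniature with a stand-in for ChromaDB's batch-size guard (this is a simulation, not ChromaDB code; the constants come from the log above, and the error wording only approximates the real traceback):

```python
CHROMA_MAX_BATCH_SIZE = 5461   # limit reported by ChromaDB 0.4.19
LLAMA_CHUNK_SIZE = 41665       # batch size LlamaIndex used, per the log

def chroma_add(batch):
    """Stand-in for a collection.add that enforces the batch-size cap."""
    if len(batch) > CHROMA_MAX_BATCH_SIZE:
        raise ValueError(
            f"Batch size {len(batch)} exceeds maximum batch size "
            f"{CHROMA_MAX_BATCH_SIZE}"
        )

try:
    chroma_add(range(LLAMA_CHUNK_SIZE))  # the chunk size the #7648 fix produces
except ValueError as err:
    print(err)  # the kind of error this issue reports
```

Any batch at or below 5461 passes the guard, which is why splitting the 41665-item chunks further avoids the error.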
Version
0.9.15
Steps to Reproduce
Run this code against a large set of documents. I suspect >5500 will do.
Relevant Logs/Tracebacks