Cannot submit more than 41,666 embeddings at once. #1049
@stofarius, what version did you upgrade from? In v0.4.x, Chroma started using SQLite, which has some limitations regarding the length of a SQL query. People started running into this problem around 0.4.5, so I think a limit was introduced in 0.4.6.
@tazarov, I upgraded from 0.3.26. I used chromadb-migrate to migrate my data, and inference was running fine. But then I cleaned the local database, wanted to start the ingestion process again, and ran into this issue. I had a long pause from playing with LLMs and only just started again.
Is there a way to break things into smaller chunks?
Basically I am using privateGPT, to which I made some minor adjustments, such as switching to newer library versions. I am trying to ingest around 19,000 txt documents (244 MB in total). Maybe I should try submitting each document on its own?
Are you using langchain to do the chunking? I would just submit the documents in batches, as the error message suggests. If you chunk first and then embed, the max batch size from the error message applies to the number of chunks you submit per call.
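A minimal sketch of that kind of manual batching, assuming a Chroma `collection` and parallel lists of pre-chunked `documents`, `metadatas`, and `ids` (the helper name and batch size below are placeholders, not part of this thread):

```python
# Hypothetical helper: submit pre-chunked documents to a Chroma collection
# in fixed-size batches instead of one oversized add() call.
MAX_BATCH_SIZE = 41_666  # the limit reported in the error message; yours may differ


def add_in_batches(collection, documents, metadatas, ids, batch_size=MAX_BATCH_SIZE):
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
```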
@HammadB yes, using langchain for doing the chunking. I'll try as you suggested. Thank you!
Hey @stofarius, let us know if it worked. I'll also give it a try. Thanks!
Hello,
I believe I am running into the same issue, but with one large PDF.
I printed a chapter's worth of the text, re-ran ingest.py, and it was able to complete.
Hello,
It worked, but it would be nice to at least send a warning to the user so that their credits aren't wasted (that's what happened to me).
@ilisparrow, glad you figured out a solution. There is a PR in the works for warning users about too-large docs: #1008. I also have a gist about splitting texts using LangChain (not an ideal solution, but it can give you an idea about chunking large texts): https://gist.github.com/tazarov/e66c1d3ae298c424dc4ffc8a9a916a4a
These are two separate problems.
The max batch size was introduced in PR #995. The max size depends on the local environment, so it is different per user. You can inspect it using the max_batch_size attribute of the producer. I'll be adding a check and an automatic split of batches to privateGPT to prevent this error.
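For reference, inspecting that value looks roughly like this (a sketch only: the path is a placeholder, and `_producer` is an internal attribute that may change between releases):

```python
import chromadb

# Placeholder path; _producer is internal API and may differ across versions
client = chromadb.PersistentClient(path="./chroma_db")
print("max batch size for this environment:", client._producer.max_batch_size)
```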
@stofarius, could you take a look? It's a temporary solution that should solve your case in privateGPT. cc @ilisparrow
- Including only CIP for review. Refs: chroma-core#1049
- Updated CIP
- Implementation done
- Added a new test in test_add
Refs: chroma-core#1049
@imartinez apologies for my delayed answer, I was busy with family matters and have been away from the computer these past days. I still have the same issue; I just tried a run a few hours ago.
@stofarius, we have a PR going: #1077. Please have a look. The gist is that the Chroma client will take care of splitting large batches into smaller ones to avoid this kind of "expensive" error :)
@tazarov just tried again right now, it works. Thank you, great job :)
@stofarius, an important point that @HammadB raised was about failures of individual batches with this approach; while it can save developers a lot of money, especially on large batches, it offers no guarantee of success across all batches, i.e. a lack of ACID-like behaviour. In light of that, I recognize this is not an ideal implementation, but we can build on it. For one, I feel we can use some reasonable retries.
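A rough sketch of what per-batch retries could look like (purely illustrative and not part of any Chroma PR; the helper name, backoff values, and the shape of the `batch` dict are assumptions):

```python
import time


# Hypothetical retry wrapper: each batch is retried independently, since there
# is no transactional guarantee across batches.
def add_batch_with_retry(collection, batch, retries=3, backoff_seconds=2.0):
    for attempt in range(1, retries + 1):
        try:
            # batch is assumed to be a dict like
            # {"ids": [...], "documents": [...], "metadatas": [...]}
            collection.add(**batch)
            return
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)
```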
@tazarov I agree with you. For now, at least in my case, I'm trying to run some experiments and see some results. For sure there is room for improvement, maybe even for some performance tuning :) But one thing at a time.
My primary motivation was DWM (don't waste money), which, if you are submitting 40k+ embeddings with OpenAI, you will be doing for sure :D I just realized that we might add a new method for batch submissions (e.g.
Very good motivation. I also don't like to waste my money, so no, I'm not using OpenAI :)))
- Refactored the code to no longer do the splitting in _add, _update, _upsert
- Batch utils library remains as a utility method for users to split batches
- Batch size is now validated in SegmentAPI and FastAPI (client-side) to ensure no large batches can be sent through that would result in an internal Chroma error
- Improved tests: new max_batch_size test for API, pre-flight-check test, negative test for large batches
Refs: chroma-core#1049
- Minor improvement suggested by @imartinez to pass API to create_batches utility method. Refs: chroma-core#1049
I tried on a large folder of PDFs and am still getting the same error. I tried a smaller chunk_size, but it still doesn't work. Any other workaround?
@MarcoV10 You can get the actual max batch size from a Chroma client (create, for example, `chroma_client = chromadb.PersistentClient()`). The max batch size depends on your setup: `max_batch_size = chroma_client._producer.max_batch_size`. Note that the Chroma team is about to make max_batch_size a public property of the API, but this should work in the meantime.
This is a pretty bad regression in chromadb's behavior. I shouldn't have to use langchain to fix this; chromadb itself should batch if required. The end user shouldn't have to do anything special. Now, with this failing, I'll have to introduce my own ad hoc code in h2oGPT to work around this new limitation, or find an alternative to chromadb.
The reason we didn't want Chroma to batch internally is that it makes partial failures much harder to reason about. Unfortunately, this is a strict limitation of SQLite that we cannot work around. #1077 addresses this by exposing max_batch_size and providing utilities to batch for you.
I am closing this out for now as we have a fix that is being released today. But we can continue to discuss whether Chroma should perform the batching internally - we are open to feedback!
The workaround was simple enough to batch myself, so it's ok. Are there any other limits like this since the switch to SQLite that we should be aware of?
- Including only CIP for review. Refs: #1049

## Description of changes
*Summarize the changes made by this PR.*
- Improvements & Bug fixes
- New proposal to handle large batches of embeddings gracefully

## Test plan
*How are these changes tested?*
- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
TBD

Signed-off-by: sunilkumardash9 <[email protected]>
Co-authored-by: Sunil Kumar Dash <[email protected]>
Hi,
I'm sorry to revive this, but I'm getting this exact message while trying to delete my 26 MB csv file.
Hey @luchontandil, you don't have to reinstall anything, but can you chunk or batch your data? We have a small utility function that can help you with that; have a look here: https://github.com/chroma-core/chroma/blob/main/chromadb/utils/batch_utils.py
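As a rough illustration of how that utility could be used (a sketch only: the exact signature and return shape live in batch_utils.py and may differ between releases; the client path, collection name, and the `ids`/`documents`/`metadatas` variables are placeholders):

```python
import chromadb
from chromadb.utils.batch_utils import create_batches

client = chromadb.PersistentClient(path="./chroma_db")  # placeholder path
collection = client.get_or_create_collection("docs")    # placeholder collection name

# ids, documents, metadatas stand in for your own data.
# Assumed tuple order per batch: (ids, embeddings, metadatas, documents).
batches = create_batches(client, ids=ids, documents=documents, metadatas=metadatas)
for batch_ids, batch_embeddings, batch_metadatas, batch_documents in batches:
    collection.add(
        ids=batch_ids,
        embeddings=batch_embeddings,
        metadatas=batch_metadatas,
        documents=batch_documents,
    )
```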
Hi,
I am using chromadb 0.4.7 and langchain 0.0.274 and am trying to save some embeddings into a Chroma database.
After upgrading to the versions mentioned above, I started to get the following error when trying to ingest some offline documents:

/chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError: Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less.

Could you please let me know what changed and why I am getting this error?
Thank you!