
Cannot submit more than 41,666 embeddings at once. #1049

Closed · stofarius opened this issue Aug 27, 2023 · 30 comments

@stofarius

Hi,

I am using chromadb 0.4.7 and langchain 0.0.274 while trying to save some embeddings into a Chroma database.

After upgrading to the versions mentioned above, I started to get the following error when trying to ingest some offline documents:

/chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError: Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less.

Could you please let me know what changed and why I am getting this error?

Thank you!

@tazarov (Contributor) commented Aug 27, 2023

@stofarius, what version did you upgrade from? In v0.4.x, Chroma started using SQLite, which limits how long a single SQL query can be. People started running into this problem around 0.4.5, so I think a limit was introduced in 0.4.6.

@stofarius (Author)

@tazarov, I upgraded from 0.3.26. I used chromadb-migrate to migrate my data, and inference was running fine.

But then I cleaned the local database and wanted to start the ingestion process again, and now I've run into this issue.

I had a long pause from playing with LLMs and I just started again.

@tazarov (Contributor) commented Aug 27, 2023

Is there a way to break things into smaller chunks?

@stofarius (Author)

Basically I am using privateGPT, to which I made some minor adjustments, like changes to use newer versions.

I am trying to ingest around 19,000 txt documents (244 MB in total); maybe I should try submitting each document separately?

@HammadB (Collaborator) commented Aug 27, 2023

Are you using langchain to do the chunking? I would just submit the documents in batches, as the error message suggests. If you do the chunking first and then embed, you can use the max batch size from the error message, since your batch is made up of chunks.

@stofarius (Author)

@HammadB yes, using langchain for doing the chunking. I'll try as you suggested. Thank you!

@imartinez

Hey @stofarius let us know if it worked. I'll also give it a try. Thanks!

@ilisparrow

Hello,
I have the same problem. I am using:

embedding = OpenAIEmbeddings(show_progress_bar=True, chunk_size=3)

If my understanding is right, even though the embeddings are calculated chunk by chunk, they are still persisted all together. There should be a splitter; I will post another message if I come up with a solution.

@Jawn78 commented Aug 30, 2023

I believe I am running into the same issue, but with one large PDF.

Appending to existing vectorstore at db
Loading documents from source_documents
Loading new documents: 100%|██████████████████████| 2/2 [00:05<00:00,  2.88s/it]
Loaded 1160 new documents from source_documents
Split into 9394 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Traceback (most recent call last):
  File "C:\Users\Documents\PrivateGPT\privateGPT\ingest.py", line 169, in <module>
    main()
  File "C:\Users\Documents\PrivateGPT\privateGPT\ingest.py", line 155, in main     
    db.add_documents(texts)
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\langchain\vectorstores\base.py", line 101, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ectorstores\chroma.py", line 222, in add_texts
    raise e
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\langchain\vectorstores\chroma.py", line 208, in add_texts
    self._collection.upsert(
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\chromadb\api\models\Collection.py", line 298, in upsert
    self._client._upsert(
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\chromadb\api\segment.py", line 290, in _upsert
    self._producer.submit_embeddings(coll["topic"], records_to_submit)
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\chromadb\db\mixins\embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError:
                Cannot submit more than 5,461 embeddings at once.
                Please submit your embeddings in batches of size
                5,461 or less.

I printed a chapter worth of the text and re-ran the ingest.py, and it was able to complete.

@ilisparrow commented Aug 30, 2023

Hello,
In the end, this is what I did:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings(show_progress_bar=True, chunk_size=3)

def split_list(input_list, chunk_size):
    # Yield successive chunk_size-sized slices of the list.
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]

# split_docs and persist_directory are defined earlier in my script.
split_docs_chunked = split_list(split_docs, 41000)

for split_docs_chunk in split_docs_chunked:
    vectordb = Chroma.from_documents(
        documents=split_docs_chunk,
        embedding=embedding,
        persist_directory=persist_directory,
    )
    vectordb.persist()

It worked, but it would be nice to at least send a warning to the user so that their credits aren't wasted (that's what happened to me).
Hope this helps.

@tazarov (Contributor) commented Aug 31, 2023

@ilisparrow, glad you figured out a solution. There is a PR in the works for warning users about too-large docs - #1008

I also have a gist about splitting texts using LangChain (not an ideal solution, but it can give you an idea about chunking large texts) - https://gist.github.com/tazarov/e66c1d3ae298c424dc4ffc8a9a916a4a
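
For reference, a minimal sketch of that kind of chunking with LangChain's RecursiveCharacterTextSplitter (a sketch only: the chunk_size/chunk_overlap values are illustrative, not from the gist, and documents is assumed to be a list of already-loaded LangChain documents):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative values; tune chunk_size/chunk_overlap to your model's token limit.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(documents)  # documents loaded elsewhere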

@HammadB (Collaborator) commented Aug 31, 2023

These are two separate problems:

  1. The token limit of the model
  2. The batch size chroma can accept

@imartinez

Max batch size was introduced in PR #995.

The max size depends on the local environment, so it differs per user. You can inspect it using the max_batch_size attribute of the producer.

I'll be adding a check and an automatic split of batches to privateGPT to prevent this error.
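
A minimal sketch of what such a check-and-split could look like (a sketch only: ids and texts are placeholder names, and _producer is the protected attribute discussed in this thread):

import chromadb

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("docs")

texts = ["document %d" % i for i in range(100000)]  # placeholder data
ids = ["id%d" % i for i in range(len(texts))]

# Read the environment-dependent limit, then add in batches no larger than it.
max_batch_size = client._producer.max_batch_size
for start in range(0, len(texts), max_batch_size):
    end = start + max_batch_size
    collection.add(ids=ids[start:end], documents=texts[start:end])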

@imartinez commented Aug 31, 2023

@stofarius could you take a look? This is a temporary solution that should solve your case in privateGPT. cc @ilisparrow
Instead of hard-coding the batch size, I'm accessing a protected attribute that contains the actual limit. I'm already working with @tazarov on the best way to expose that value in the public API.

zylon-ai/private-gpt#999

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 1, 2023
- Updated CIP
- Implementation done
- Added a new test in test_add

Refs: chroma-core#1049
@stofarius (Author)

@imartinez apologies for my delayed answer; I was busy with family matters and have been away from the computer for the past few days.

I still have the same issue - I tried a run a few hours ago.

@tazarov (Contributor) commented Sep 3, 2023

@stofarius, we have a PR in progress: #1077. Please have a look.

The gist is that the Chroma client will take care of splitting large batches into smaller ones to avoid this kind of "expensive" error :)

@stofarius (Author)

@tazarov just tried again right now, it works. Thank you, great job :)

@tazarov (Contributor) commented Sep 4, 2023

@stofarius, an important point that @HammadB raised was about failures of individual batches with this approach: while it can save developers a lot of money, especially on large batches, it offers no guarantee of succeeding across all batches - i.e. there is no ACID-like behaviour.

In light of that, I recognize this is not an ideal implementation, but we can build upon it. For one, I feel we can add some reasonable retries.
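
As an illustration of the retry idea (a sketch only - this is not Chroma's implementation, and the function name and back-off values are made up):

import time

def add_batch_with_retries(collection, ids, documents, retries=3, backoff=1.0):
    # Retry one batch a few times with exponential back-off before giving up.
    for attempt in range(retries):
        try:
            collection.add(ids=ids, documents=documents)
            return
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)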

@stofarius (Author)

@tazarov I agree with you. For now, at least in my case, I'm just trying to run some experiments and see some results. For sure there is room for improvement, maybe even for some performance tuning :) But one thing at a time.

@tazarov (Contributor) commented Sep 4, 2023

My primary motivation was DWM (don't waste money), which you will certainly care about if you are submitting 40k+ embeddings with OpenAI :D

I just realized that we might add a new method for batch submissions (e.g. batch_add()), which would just be syntactic sugar on top of the usual add(): batching things for the developer, implementing some retries, and saving a state file with your embeddings so you can retry them later. But then again ... baby steps :).
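
A rough sketch of what that hypothetical batch_add() could look like (nothing below exists in Chroma; every name here is illustrative):

def batch_add(collection, ids, documents, max_batch_size):
    # Hypothetical sugar over add(): split into batches and collect the ids
    # of failed batches so the caller can persist them and retry later.
    failed_ids = []
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        try:
            collection.add(ids=ids[start:end], documents=documents[start:end])
        except Exception:
            failed_ids.extend(ids[start:end])
    return failed_ids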

@stofarius (Author)

Very good motivation. I also don't like to waste my money, so no, I'm not using OpenAI :)))

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 5, 2023
- Refactored the code to no longer do the splitting in _add,_update,_upsert
- Batch utils library remains as a utility method for users to split batches
- Batch size is now validated in SegmentAPI and FastAPI (client-side) to ensure no large batches can be sent through that will result in an internal Chroma error
- Improved tests - new max_batch_size test for API, pre-flight-check test, negative test for large batches.

Refs: chroma-core#1049
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 5, 2023
- Minor improvement suggested by @imartinez to pass API to create_batches utility method.

Refs: chroma-core#1049
@MarcoV10

(quoting @ilisparrow's solution above in full)

I tried it on a large folder of PDFs and I'm still getting the same error; I tried a smaller chunk_size, but it still doesn't work. Any other workaround?

@imartinez

(quoting @MarcoV10's reply above in full)

@MarcoV10 You can get the actual max batch size from a Chroma client (create, for example, chroma_client = chromadb.PersistentClient()). The max batch size depends on your setup:

max_batch_size = chroma_client._producer.max_batch_size

Note that the Chroma team is about to make max_batch_size a public property of the API, but this should work in the meantime.

@pseudotensor commented Sep 16, 2023

This is a pretty bad regression in chromadb's behavior.

One shouldn't have to use langchain to fix this; chromadb itself should batch if required. The end user shouldn't have to do anything special.

Now, with this failing, I'll have to introduce my own ad hoc code in h2oGPT to work around this new limitation, or find an alternative to chromadb.

@HammadB (Collaborator) commented Sep 18, 2023

The reason we didn't want Chroma to batch internally is that it makes partial failures much harder to reason about. Unfortunately, this is a strict limitation of SQLite that we cannot work around. #1077 addresses this by exposing max_batch_size and providing utilities to do the batching for you.

@HammadB (Collaborator) commented Sep 18, 2023

I am closing this out for now, as we have a fix being released today. But we can continue to discuss whether Chroma should perform the batching internally - we are open to feedback!

HammadB closed this as completed Sep 18, 2023
@pseudotensor

The workaround was simple enough - batching myself - so it's OK. Are there any other limits like this, since the switch to SQLite, that we should be aware of?

HammadB pushed a commit that referenced this issue Sep 18, 2023
- Including only CIP for review.

Refs: #1049

Improvements & Bug fixes:
- New proposal to handle large batches of embeddings gracefully

Signed-off-by: sunilkumardash9 <[email protected]>
Co-authored-by: Sunil Kumar Dash <[email protected]>
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 21, 2023 (#1077)
@OriginalGoku commented Oct 26, 2023

Hi,
Then please fix the documentation on the Chroma website (API documentation), which says the add function can handle 100K+ items at a time.
This was misleading:

# add new items to a collection
# either one at a time
collection.add(
    embeddings=[1.5, 2.9, 3.4],
    metadatas={"uri": "img9.png", "style": "style1"},
    documents="doc1000101",
    ids="uri9",
)
# or many, up to 100k+!
collection.add(
    embeddings=[[1.5, 2.9, 3.4], [9.8, 2.3, 2.9]],
    metadatas=[{"style": "style1"}, {"style": "style2"}],
    ids=["uri9", "uri10"],
)

@luchontandil

I'm sorry to revive this, but I'm getting this exact message while trying to delete my 26 MB CSV file:
"Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less."
Can I access the DB manually, or should I reinstall everything?

@tazarov (Contributor) commented Mar 20, 2024

Hey @luchontandil, you don't have to reinstall anything, but can you chunk or batch your data? We have a small utility function that can help you with that; have a look here - https://github.com/chroma-core/chroma/blob/main/chromadb/utils/batch_utils.py
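
A hedged sketch of using that utility (check the linked file for the current signature; ids and documents are assumed to be your full lists, defined elsewhere):

import chromadb
from chromadb.utils.batch_utils import create_batches

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("docs")

# create_batches splits the inputs into batches no larger than the client's
# max batch size; each batch is an (ids, embeddings, metadatas, documents) tuple.
for batch_ids, batch_embeddings, batch_metadatas, batch_documents in create_batches(
    api=client, ids=ids, documents=documents
):
    collection.add(
        ids=batch_ids,
        embeddings=batch_embeddings,
        metadatas=batch_metadatas,
        documents=batch_documents,
    )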
