
Cannot submit more than 41,666 embeddings at once. #1049

Closed · stofarius opened this issue Aug 27, 2023 · 30 comments

@stofarius

Hi,

I am using chromadb 0.4.7 and langchain 0.0.274 while trying to save some embeddings into a Chroma database.

After upgrading to the versions mentioned above, I started to get the following error when trying to ingest some offline documents:

/chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError: Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less.

Could you please let me know what changed and why I am getting this error?

Thank you!

@tazarov (Contributor) commented Aug 27, 2023

@stofarius, what version did you upgrade from? In v0.4.x, Chroma started using SQLite, which limits how long a single SQL query can be. People started running into this problem around 0.4.5, so I think a limit was introduced in 0.4.6.

@stofarius (Author)

@tazarov, I upgraded from 0.3.26. I used chromadb-migrate to migrate my data, and inference was running fine.

But then I cleaned the local database and wanted to start the ingestion process again, and now I've run into this issue.

I had a long pause from playing with LLMs and I just started again.

@tazarov (Contributor) commented Aug 27, 2023

Is there a way to break things into smaller chunks?

@stofarius (Author)

Basically I am using privateGPT, to which I made some minor adjustments, like changes to use newer versions.

I am trying to ingest around 19,000 txt documents (244 MB in total); maybe I should try submitting each document separately?

@HammadB (Collaborator) commented Aug 27, 2023

Are you using langchain to do the chunking? I would just submit the documents in batches, as the error message suggests. If you do the chunking first and then embed, you can use the max batch size from the error message, since your batch is made up of chunks.

@stofarius (Author)

@HammadB yes, using langchain for doing the chunking. I'll try as you suggested. Thank you!

@imartinez

Hey @stofarius let us know if it worked. I'll also give it a try. Thanks!

@ilisparrow

Hello,
I have the same problem. I am using:

embedding = OpenAIEmbeddings(show_progress_bar=True, chunk_size=3)

If my understanding is right, even though the embeddings are calculated chunk by chunk, they are still persisted all together. There should be a splitter; I will post another message if I come up with a solution.

@Jawn78 commented Aug 30, 2023

I believe I am running into the same issue, but with one large PDF.

Appending to existing vectorstore at db
Loading documents from source_documents
Loading new documents: 100%|██████████████████████| 2/2 [00:05<00:00,  2.88s/it]
Loaded 1160 new documents from source_documents
Split into 9394 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Traceback (most recent call last):
  File "C:\Users\Documents\PrivateGPT\privateGPT\ingest.py", line 169, in <module>
    main()
  File "C:\Users\Documents\PrivateGPT\privateGPT\ingest.py", line 155, in main     
    db.add_documents(texts)
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\langchain\vectorstores\base.py", line 101, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ectorstores\chroma.py", line 222, in add_texts
    raise e
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\langchain\vectorstores\chroma.py", line 208, in add_texts
    self._collection.upsert(
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\chromadb\api\models\Collection.py", line 298, in upsert
    self._client._upsert(
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\chromadb\api\segment.py", line 290, in _upsert
    self._producer.submit_embeddings(coll["topic"], records_to_submit)
  File "C:\Users\Documents\PrivateGPT\privateGPT\venv\Lib\site-packages\chromadb\db\mixins\embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError:
                Cannot submit more than 5,461 embeddings at once.
                Please submit your embeddings in batches of size
                5,461 or less.

I printed a chapter worth of the text and re-ran the ingest.py, and it was able to complete.

@ilisparrow commented Aug 30, 2023

Hello,
In the end, this is what I did:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings(show_progress_bar=True, chunk_size=3)

def split_list(input_list, chunk_size):
    # Yield successive chunk_size-sized slices of the list.
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]

# split_docs and persist_directory are defined earlier in my script.
split_docs_chunked = split_list(split_docs, 41000)

for split_docs_chunk in split_docs_chunked:
    vectordb = Chroma.from_documents(
        documents=split_docs_chunk,
        embedding=embedding,
        persist_directory=persist_directory,
    )
    vectordb.persist()

It worked, but it would be nice to at least send a warning to the user so that their credits aren't wasted (that's what happened to me).
Hope this helps.

@tazarov (Contributor) commented Aug 31, 2023

@ilisparrow, glad you figured out a solution. There is a PR in the works for warning users about too-large docs - #1008

I also have a gist about splitting texts using LangChain (not an ideal solution, but it can give you an idea about chunking large texts) - https://gist.github.com/tazarov/e66c1d3ae298c424dc4ffc8a9a916a4a
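
For reference, a minimal sketch of that kind of chunking with LangChain's RecursiveCharacterTextSplitter (a sketch only: the chunk_size/chunk_overlap values are illustrative, not from the gist, and documents is assumed to be a list of already-loaded LangChain documents):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative values; tune chunk_size/chunk_overlap to your model's token limit.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(documents)  # documents loaded elsewhere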

@HammadB (Collaborator) commented Aug 31, 2023

These are two separate problems:

  1. The token limit of the model
  2. The batch size chroma can accept

@imartinez

Max batch size was introduced in PR #995.

The max size depends on the local environment, so it differs per user. You can inspect it using the max_batch_size attribute of the producer.

I'll be adding a check and an automatic split of batches to privateGPT to prevent this error.
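
A minimal sketch of what such a check-and-split could look like (a sketch only: ids and texts are placeholder names, and _producer is the protected attribute discussed in this thread):

import chromadb

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("docs")

texts = ["document %d" % i for i in range(100000)]  # placeholder data
ids = ["id%d" % i for i in range(len(texts))]

# Read the environment-dependent limit, then add in batches no larger than it.
max_batch_size = client._producer.max_batch_size
for start in range(0, len(texts), max_batch_size):
    end = start + max_batch_size
    collection.add(ids=ids[start:end], documents=texts[start:end])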

@imartinez commented Aug 31, 2023

@stofarius could you take a look? This is a temporary solution that should solve your case in privateGPT. cc @ilisparrow
Instead of hard-coding the batch size, I'm accessing a protected attribute that contains the actual limit. I'm already working with @tazarov on the best way to expose that value in the public API.

zylon-ai/private-gpt#999

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 1, 2023
- Updated CIP
- Implementation done
- Added a new test in test_add

Refs: chroma-core#1049
@stofarius (Author)

@imartinez apologies for my delayed answer; I was busy with family matters and have been away from the computer for the past few days.

I still have the same issue - I tried a run a few hours ago.

@tazarov (Contributor) commented Sep 3, 2023

@stofarius, we have a PR in progress: #1077. Please have a look.

The gist is that the Chroma client will take care of splitting large batches into smaller ones to avoid this kind of "expensive" error :)

@stofarius (Author)

@tazarov just tried again right now, it works. Thank you, great job :)

@tazarov (Contributor) commented Sep 4, 2023

@stofarius, an important point that @HammadB raised was about failures of individual batches with this approach: while it can save developers a lot of money, especially on large batches, it offers no guarantee of succeeding across all batches - i.e. there is no ACID-like behaviour.

In light of that, I recognize this is not an ideal implementation, but we can build upon it. For one, I feel we can add some reasonable retries.
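
As an illustration of the retry idea (a sketch only - this is not Chroma's implementation, and the function name and back-off values are made up):

import time

def add_batch_with_retries(collection, ids, documents, retries=3, backoff=1.0):
    # Retry one batch a few times with exponential back-off before giving up.
    for attempt in range(retries):
        try:
            collection.add(ids=ids, documents=documents)
            return
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)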

@stofarius (Author)

@tazarov I agree with you. For now, at least in my case, I'm just trying to run some experiments and see some results. For sure there is room for improvement, maybe even for some performance tuning :) But one thing at a time.

@tazarov (Contributor) commented Sep 4, 2023

My primary motivation was DWM (don't waste money), which you will certainly care about if you are submitting 40k+ embeddings with OpenAI :D

I just realized that we might add a new method for batch submissions (e.g. batch_add()), which would just be syntactic sugar on top of the usual add(): batching things for the developer, implementing some retries, and saving a state file with your embeddings so you can retry them later. But then again ... baby steps :).
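
A rough sketch of what that hypothetical batch_add() could look like (nothing below exists in Chroma; every name here is illustrative):

def batch_add(collection, ids, documents, max_batch_size):
    # Hypothetical sugar over add(): split into batches and collect the ids
    # of failed batches so the caller can persist them and retry later.
    failed_ids = []
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        try:
            collection.add(ids=ids[start:end], documents=documents[start:end])
        except Exception:
            failed_ids.extend(ids[start:end])
    return failed_ids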

@stofarius (Author)

Very good motivation. I also don't like to waste my money, so no, I'm not using OpenAI :)))

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 5, 2023
- Refactored the code to no longer do the splitting in _add,_update,_upsert
- Batch utils library remains as a utility method for users to split batches
- Batch size is now validated in SegmentAPI and FastAPI (client-side) to ensure no large batches can be sent through that will result in an internal Chroma error
- Improved tests - new max_batch_size test for API, pre-flight-check test, negative test for large batches.

Refs: chroma-core#1049
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 5, 2023
- Minor improvement suggested by @imartinez to pass API to create_batches utility method.

Refs: chroma-core#1049
@MarcoV10

(quoting @ilisparrow's solution above in full)

I tried it on a large folder of PDFs and I'm still getting the same error; I tried a smaller chunk_size, but it still doesn't work. Any other workaround?

@imartinez

(quoting @MarcoV10's reply above in full)

@MarcoV10 You can get the actual max batch size from a Chroma client (create, for example, chroma_client = chromadb.PersistentClient()). The max batch size depends on your setup:

max_batch_size = chroma_client._producer.max_batch_size

Note that the Chroma team is about to make max_batch_size a public property of the API, but this should work in the meantime.

@pseudotensor commented Sep 16, 2023

This is a pretty bad regression in chromadb's behavior.

One shouldn't have to use langchain to fix this; chromadb itself should batch if required. The end user shouldn't have to do anything special.

Now, with this failing, I'll have to introduce my own ad hoc code in h2oGPT to work around this new limitation, or find an alternative to chromadb.

@HammadB (Collaborator) commented Sep 18, 2023

The reason we didn't want Chroma to batch internally is that it makes partial failures much harder to reason about. Unfortunately, this is a strict limitation of SQLite that we cannot work around. #1077 addresses this by exposing max_batch_size and providing utilities to do the batching for you.

@HammadB (Collaborator) commented Sep 18, 2023

I am closing this out for now, as we have a fix being released today. But we can continue to discuss whether Chroma should perform the batching internally - we are open to feedback!

HammadB closed this as completed Sep 18, 2023
@pseudotensor

The workaround was simple enough - batching myself - so it's OK. Are there any other limits like this, since the switch to SQLite, that we should be aware of?

HammadB pushed a commit that referenced this issue Sep 18, 2023
- Including only CIP for review.

Refs: #1049

Improvements & Bug fixes:
- New proposal to handle large batches of embeddings gracefully

Signed-off-by: sunilkumardash9 <[email protected]>
Co-authored-by: Sunil Kumar Dash <[email protected]>
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 21, 2023 (#1077)
@OriginalGoku commented Oct 26, 2023

Hi,
Then please fix the documentation on the Chroma website (API documentation), which says the add function can handle 100K+ items at a time.
This was misleading:

# add new items to a collection
# either one at a time
collection.add(
    embeddings=[1.5, 2.9, 3.4],
    metadatas={"uri": "img9.png", "style": "style1"},
    documents="doc1000101",
    ids="uri9",
)
# or many, up to 100k+!
collection.add(
    embeddings=[[1.5, 2.9, 3.4], [9.8, 2.3, 2.9]],
    metadatas=[{"style": "style1"}, {"style": "style2"}],
    ids=["uri9", "uri10"],
)

@luchontandil

I'm sorry to revive this, but I'm getting this exact message while trying to delete my 26 MB CSV file:
"Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less."
Can I access the DB manually, or should I reinstall everything?

@tazarov (Contributor) commented Mar 20, 2024

Hey @luchontandil, you don't have to reinstall anything, but can you chunk or batch your data? We have a small utility function that can help you with that; have a look here - https://github.com/chroma-core/chroma/blob/main/chromadb/utils/batch_utils.py
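
A hedged sketch of using that utility (check the linked file for the current signature; ids and documents are assumed to be your full lists, defined elsewhere):

import chromadb
from chromadb.utils.batch_utils import create_batches

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("docs")

# create_batches splits the inputs into batches no larger than the client's
# max batch size; each batch is an (ids, embeddings, metadatas, documents) tuple.
for batch_ids, batch_embeddings, batch_metadatas, batch_documents in create_batches(
    api=client, ids=ids, documents=documents
):
    collection.add(
        ids=batch_ids,
        embeddings=batch_embeddings,
        metadatas=batch_metadatas,
        documents=batch_documents,
    )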
