
[Bug]: Indexing a large number of Documents resulted in ValueError with ChromaDB #9525

Closed
hieuv opened this issue Dec 14, 2023 · 4 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments


hieuv commented Dec 14, 2023

Bug Description

Indexing a large number of Documents resulted in ValueError with ChromaDB.
In my case, I had 34k files, loaded into 191k+ Documents, and parsed into 229k Nodes.

I found #7648, and chroma-core/chroma#1049.
From the log, it seems that LlamaIndex correctly ran in a batch of 41,665, as the fix for #7648 intended. However, ChromaDB complains that the maximum batch size is 5,461.
Log available below.

Using LlamaIndex 0.9.15 and ChromaDB 0.4.19

Version

0.9.15

Steps to Reproduce

Run this code against a large set of documents; I suspect anything over 5,500 nodes will do.

# assumes llama_index 0.9.x import paths
import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("collection_name")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# `documents` is a large list of Documents loaded elsewhere
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
    storage_context=storage_context,
)

Relevant Logs/Tracebacks

[REDACTED project folder]\venv\Scripts\python.exe [REDACTED project folder]\main.py
2023-12-14 13:46:57,044 I main.py L62: <module>(): Loading starts
2023-12-14 13:46:58,055 I [REDACTED custom loader module].py L74: _add_files(): Total files added: 34155
Loading files: 100%|██████████| 34155/34155 [02:20<00:00, 243.86file/s]
2023-12-14 13:49:18,132 I main.py L71: <module>(): Loading ends
2023-12-14 13:49:18,132 I main.py L72: <module>(): Indexing starts
Parsing nodes: 100%|██████████| 191025/191025 [02:54<00:00, 1094.34it/s]
Generating embeddings: 100%|██████████| 229210/229210 [1:09:27<00:00, 54.99it/s]
Traceback (most recent call last):
  File "[REDACTED project folder]\main.py", line 84, in <module>
    index = VectorStoreIndex.from_documents(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\indices\base.py", line 106, in from_documents
    return cls(
           ^^^^
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\indices\vector_store\base.py", line 49, in __init__
    super().__init__(
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\indices\base.py", line 71, in __init__
    index_struct = self.build_index_from_nodes(nodes)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\indices\vector_store\base.py", line 255, in build_index_from_nodes
    return self._build_index_from_nodes(nodes, **insert_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\indices\vector_store\base.py", line 236, in _build_index_from_nodes
    self._add_nodes_to_index(
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\indices\vector_store\base.py", line 190, in _add_nodes_to_index
    new_ids = self._vector_store.add(nodes, **insert_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED project folder]\venv\Lib\site-packages\llama_index\vector_stores\chroma.py", line 243, in add
    self._collection.add(
  File "[REDACTED project folder]\venv\Lib\site-packages\chromadb\api\models\Collection.py", line 168, in add
    self._client._add(ids, self.id, embeddings, metadatas, documents, uris)
  File "[REDACTED project folder]\venv\Lib\site-packages\chromadb\telemetry\opentelemetry\__init__.py", line 127, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "[REDACTED project folder]\venv\Lib\site-packages\chromadb\api\segment.py", line 344, in _add
    validate_batch(
  File "[REDACTED project folder]\venv\Lib\site-packages\chromadb\api\types.py", line 505, in validate_batch
    raise ValueError(
ValueError: Batch size 41665 exceeds maximum batch size 5461

Process finished with exit code 1
@hieuv hieuv added bug Something isn't working triage Issue needs to be triaged/prioritized labels Dec 14, 2023

dosubot bot commented Dec 14, 2023

🤖

Hello @hieuv! I'm Dosu, a friendly bot here to assist while we wait for a human maintainer. I can help you troubleshoot issues, answer questions and guide you on how to contribute to this project. Let's tackle this together!

It seems like you're encountering a similar issue to the one that was addressed in this pull request. However, the batch size limit in your case appears to be different from the one that was previously fixed.

The ValueError you're encountering is likely due to ChromaDB's batch size limit, which is currently 5,461; LlamaIndex is attempting to insert a batch of 41,665, which exceeds that limit.

One possible solution could be to adjust the batch size in your LlamaIndex configuration to match the ChromaDB limit. However, this would require a change in the LlamaIndex codebase, which is beyond my capabilities as I cannot create pull requests or issues.

Alternatively, you could try to split your documents into smaller batches before feeding them into the LlamaIndex. This could potentially avoid the ValueError, but it might also affect the performance of your application.

I hope this helps! If you have any further questions or need more clarification, please don't hesitate to ask.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@hieuv hieuv changed the title [Bug]: [Bug]: Indexing a large number of documents resulted in ValueError with ChromaDB Dec 14, 2023
@hieuv hieuv changed the title [Bug]: Indexing a large number of documents resulted in ValueError with ChromaDB [Bug]: Indexing a large number of Documents resulted in ValueError with ChromaDB Dec 14, 2023
@logan-markewich
Collaborator

Workaround for now is to just index the data in batches.

We need to update the vector index code; it's generating all embeddings in memory and then doing one giant insert into the DB.
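The batching workaround can be sketched as follows. The chunking helper is plain Python; the commented llama_index calls assume the 0.9.x API (`insert_nodes` on an existing index) and are a hypothetical usage, not a tested recipe.

```python
def batched(seq, size):
    """Yield successive slices of `seq`, each at most `size` items long."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Hypothetical usage with llama_index 0.9.x (not run here):
#   index = VectorStoreIndex.from_documents([], storage_context=storage_context)
#   for chunk in batched(nodes, 5000):  # stay under Chroma's 5461 limit
#       index.insert_nodes(chunk)
```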


hieuv commented Dec 15, 2023

@logan-markewich The logic in LlamaIndex is already there, provided by #8019.
The error log also reflects that the fix is working: the insert is for exactly 41,665 items instead of everything (~229k).

It could be that ChromaDB updated their maximum again, but I have no way of confirming it.
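Since ChromaDB's limit may vary between versions and backends (it derives from SQLite's maximum variable count), one portable approach is to read the limit at runtime and size inserts from it. Note that `client.max_batch_size` is an assumption about the chromadb 0.4.x client API; verify it against your installed version. The ceiling-division helper below is plain Python:

```python
# Reading the limit at runtime (assumption: chromadb 0.4.x exposes a
# `max_batch_size` property on the client -- verify on your version):
#   import chromadb
#   client = chromadb.PersistentClient(path="./chroma_db")
#   limit = client.max_batch_size  # reported as 5461 in this issue

def num_inserts(total, limit):
    """Number of batched inserts needed to add `total` items within `limit`."""
    return -(-total // limit)  # ceiling division
```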

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 15, 2024

dosubot bot commented Mar 15, 2024

Hi, @hieuv,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you reported a ValueError when indexing a large number of documents with ChromaDB, and the error occurs when the batch size exceeds the maximum allowed batch size of 5461. A workaround was suggested to index the data in smaller batches, and it seems that the logic in LlamaIndex has been updated to handle this issue as well. It was also mentioned that the issue might be related to a potential update in ChromaDB's maximum batch size.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 22, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 22, 2024