[Bug]: Batch size exceeds maximum batch size using collection.add function #1298

Closed
OriginalGoku opened this issue Oct 26, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@OriginalGoku

What happened?

I am trying to add a large number of items (354127) in a single collection.add call, but I receive this error message:
Batch size 354127 exceeds maximum batch size 41666

The client is as follows:
client = chromadb.PersistentClient(path=vectorDBlocation, settings = Settings(anonymized_telemetry=False))
I am using the following embedder:

large_embedding = 'sentence-transformers/all-mpnet-base-v2'
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2", device=processing_device)

And this is how I make the collection:

collection = client.get_or_create_collection(name=collection_name,
                                             embedding_function=sentence_transformer_ef,
                                             metadata={"hnsw:space": distance})  # l2 is the default

Versions

I am using the free GPU on Google Colab.

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-0766164d5dd0> in <cell line: 1>()
----> 1 collection.add(
      2     ids=[str(i) for i in tqdm(range(len(text_chunks)))],  # IDs are just strings
      3     documents=text_chunks,
      4     metadatas= extended_meta_data
      5 )

3 frames
/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py in add(self, ids, embeddings, metadatas, documents)
     98         )
     99 
--> 100         self._client._add(ids, self.id, embeddings, metadatas, documents)
    101 
    102     def get(

/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
    125             global tracer, granularity
    126             if trace_granularity < granularity:
--> 127                 return f(*args, **kwargs)
    128             if not tracer:
    129                 return f(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/chromadb/api/segment.py in _add(self, ids, collection_id, embeddings, metadatas, documents)
    314         coll = self._get_collection(collection_id)
    315         self._manager.hint_use_collection(collection_id, t.Operation.ADD)
--> 316         validate_batch(
    317             (ids, embeddings, metadatas, documents),
    318             {"max_batch_size": self.max_batch_size},

/usr/local/lib/python3.10/dist-packages/chromadb/api/types.py in validate_batch(batch, limits)
    375 ) -> None:
    376     if len(batch[0]) > limits["max_batch_size"]:
--> 377         raise ValueError(
    378             f"Batch size {len(batch[0])} exceeds maximum batch size {limits['max_batch_size']}"
    379         )

ValueError: Batch size 354127 exceeds maximum batch size 41666
OriginalGoku added the bug label Oct 26, 2023
OriginalGoku reopened this Oct 26, 2023
@OriginalGoku
Author

I realized that the issue is already being handled here:
[issue 1049](https://github.com//issues/1049)
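
In the meantime, one possible workaround is to split the data into chunks no larger than the limit reported in the error and call collection.add once per chunk. This is a minimal sketch, not the fix from the linked issue: `add_in_batches` is a hypothetical helper name, it is not part of chromadb, and the default of 41666 is simply the number from my error message (the limit may differ on other setups). It only relies on `collection.add` accepting `ids`, `documents`, and `metadatas`, as in the call above.

```python
from typing import Any, Optional, Sequence


def add_in_batches(
    collection: Any,
    ids: Sequence[str],
    documents: Sequence[str],
    metadatas: Optional[Sequence[dict]] = None,
    max_batch_size: int = 41666,  # assumption: value copied from the error message
) -> int:
    """Call collection.add repeatedly with slices of at most max_batch_size items.

    Hypothetical helper, not a chromadb API. Returns the number of add calls made.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError("metadatas must have the same length as ids")

    calls = 0
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size  # slicing past the end is safe in Python
        collection.add(
            ids=list(ids[start:end]),
            documents=list(documents[start:end]),
            metadatas=list(metadatas[start:end]) if metadatas is not None else None,
        )
        calls += 1
    return calls
```

With the variables from my notebook this would be called as `add_in_batches(collection, [str(i) for i in range(len(text_chunks))], text_chunks, extended_meta_data)`. Embeddings are still computed by the collection's embedding function, just one chunk at a time.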
