Wanted to embed 41k txt files and got this error:
Cannot submit more than 83,333 embeddings at once. Please submit your embeddings in batches of size 83,333 or less.
In the past it was working fine.
```
➜ h2ogpt git:(main) python src/make_db.py --user_path=/Users/slava/Documents/Development/private/ZendDeskTickets -collection_name=ZenDeskTickets
100%|████████████| 41013/41013 [00:10<00:00, 4044.64it/s]
0it [00:00, ?it/s]
Exceptions: 0/484630 []
Traceback (most recent call last):
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/make_db.py", line 305, in <module>
    H2O_Fire(make_db_main)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/utils.py", line 59, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(component, remaining_args, component_trace, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/make_db.py", line 292, in make_db_main
    db = create_or_update_db(db_type, persist_directory, collection_name, user_path, langchain_type, sources, use_openai_embedding, add_if_exists, verbose, hf_embedding_model, migrate_embedding_model, auto_migrate_d…)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 310, in create_or_update_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding, db_type=db_type, persist_directory=persist_directory, …)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 138, in get_db
    db = Chroma.from_documents(documents=sources, embedding=embedding, persist_directory=persist_directory, collection_name=collection_name, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 603, in from_documents
    return cls.from_texts(texts=texts, embedding=embedding, metadatas=metadatas, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 567, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 208, in add_texts
    self._collection.upsert(metadatas=metadatas, embeddings=embeddings_with_metadatas, documents=texts_with_metadatas, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 298, in upsert
    self._client._upsert(collection_id=self.id, ids=ids, embeddings=embeddings, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/api/segment.py", line 290, in _upsert
    self._producer.submit_embeddings(coll["topic"], records_to_submit)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(…)
ValueError: Cannot submit more than 83,333 embeddings at once. Please submit your embeddings in batches of size 83,333 or less.
```
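For scale: the run produced 484,630 text chunks from the ~41k files (the `Exceptions: 0/484630` line above), so a single upsert exceeds Chroma's 83,333-record limit several times over. The number of upsert calls a batched insert would need is simple ceiling division:

```python
import math

total_records = 484630  # chunk count from the run above
max_batch = 83333       # limit reported in Chroma's error message

# Number of upsert calls needed if the insert is split into batches
batches = math.ceil(total_records / max_batch)
print(batches)  # → 6
```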
chroma-core/chroma#1049
I wasn't aware of this limitation of chromadb after sqlite was introduced.
45dc134
Should be worked around now with batching in h2oGPT. A test confirms it.
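The batching idea can be sketched as follows — `batched` is an illustrative helper, not h2oGPT's actual code, and the `db.add_documents` usage assumes an already-created LangChain `Chroma` store:

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk

# Usage sketch (hypothetical names): stay under Chroma's 83,333 limit
# for chunk in batched(sources, 80000):
#     db.add_documents(chunk)
```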
@pseudotensor Working fine, thanks. By the way, it seems the new ChromaDB takes twice as much space as the old one.