
chroma 0.4 issue with too many embeddings #860

Closed
slavag opened this issue Sep 16, 2023 · 3 comments

slavag commented Sep 16, 2023

I wanted to embed 41k txt files and got this error:

Cannot submit more than 83,333 embeddings at once.
                Please submit your embeddings in batches of size
                83,333 or less.

In the past it was working fine.

➜  h2ogpt git:(main) python src/make_db.py --user_path=/Users/slava/Documents/Development/private/ZendDeskTickets -collection_name=ZenDeskTickets
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41013/41013 [00:10<00:00, 4044.64it/s]
0it [00:00, ?it/s]
Exceptions: 0/484630 []
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/slava/Documents/Development/private/AI/h2ogpt/src/make_db.py:305 in <module>              │
│                                                                                                  │
│   302                                                                                            │
│   303                                                                                            │
│   304 if __name__ == "__main__":                                                                 │
│ ❱ 305 │   H2O_Fire(make_db_main)                                                                 │
│   306                                                                                            │
│                                                                                                  │
│ /Users/slava/Documents/Development/private/AI/h2ogpt/src/utils.py:59 in H2O_Fire                 │
│                                                                                                  │
│     56 │   │                                                                                     │
│     57 │   │   args.append(f"--{new_key}={value}")                                               │
│     58 │                                                                                         │
│ ❱   59 │   fire.Fire(component=component, command=args)                                          │
│     60                                                                                           │
│     61                                                                                           │
│     62 def set_seed(seed: int):                                                                  │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py:141 in Fire        │
│                                                                                                  │
│   138 │   context.update(caller_globals)                                                         │
│   139 │   context.update(caller_locals)                                                          │
│   140                                                                                            │
│ ❱ 141   component_trace = _Fire(component, args, parsed_flag_args, context, name)                │
│   142                                                                                            │
│   143   if component_trace.HasError():                                                           │
│   144 │   _DisplayError(component_trace)                                                         │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py:475 in _Fire       │
│                                                                                                  │
│   472 │     is_class = inspect.isclass(component)                                                │
│   473 │                                                                                          │
│   474 │     try:                                                                                 │
│ ❱ 475 │   │   component, remaining_args = _CallAndUpdateTrace(                                   │
│   476 │   │   │   component,                                                                     │
│   477 │   │   │   remaining_args,                                                                │
│   478 │   │   │   component_trace,                                                               │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py:691 in             │
│ _CallAndUpdateTrace                                                                              │
│                                                                                                  │
│   688 │   loop = asyncio.get_event_loop()                                                        │
│   689 │   component = loop.run_until_complete(fn(*varargs, **kwargs))                            │
│   690   else:                                                                                    │
│ ❱ 691 │   component = fn(*varargs, **kwargs)                                                     │
│   692                                                                                            │
│   693   if treatment == 'class':                                                                 │
│   694 │   action = trace.INSTANTIATED_CLASS                                                      │
│                                                                                                  │
│ /Users/slava/Documents/Development/private/AI/h2ogpt/src/make_db.py:292 in make_db_main          │
│                                                                                                  │
│   289 │   sources = [x for x in sources if 'exception' not in x.metadata]                        │
│   290 │                                                                                          │
│   291 │   assert len(sources) > 0 or not fail_if_no_sources, "No sources found"                  │
│ ❱ 292 │   db = create_or_update_db(db_type, persist_directory,                                   │
│   293 │   │   │   │   │   │   │    collection_name, user_path, langchain_type,                   │
│   294 │   │   │   │   │   │   │    sources, use_openai_embedding, add_if_exists, verbose,        │
│   295 │   │   │   │   │   │   │    hf_embedding_model, migrate_embedding_model, auto_migrate_d   │
│                                                                                                  │
│ /Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py:310 in                 │
│ create_or_update_db                                                                              │
│                                                                                                  │
│    307 │   │   if verbose:                                                                       │
│    308 │   │   │   print("Loading and updating db", flush=True)                                  │
│    309 │                                                                                         │
│ ❱  310 │   db = get_db(sources,                                                                  │
│    311 │   │   │   │   use_openai_embedding=use_openai_embedding,                                │
│    312 │   │   │   │   db_type=db_type,                                                          │
│    313 │   │   │   │   persist_directory=persist_directory,                                      │
│                                                                                                  │
│ /Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py:138 in get_db          │
│                                                                                                  │
│    135 │   │   │   else:                                                                         │
│    136 │   │   │   │   num_threads = max(1, n_jobs)                                              │
│    137 │   │   │   collection_metadata = {"hnsw:num_threads": num_threads}                       │
│ ❱  138 │   │   │   db = Chroma.from_documents(documents=sources,                                 │
│    139 │   │   │   │   │   │   │   │   │      embedding=embedding,                               │
│    140 │   │   │   │   │   │   │   │   │      persist_directory=persist_directory,               │
│    141 │   │   │   │   │   │   │   │   │      collection_name=collection_name,                   │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.p │
│ y:603 in from_documents                                                                          │
│                                                                                                  │
│   600 │   │   """                                                                                │
│   601 │   │   texts = [doc.page_content for doc in documents]                                    │
│   602 │   │   metadatas = [doc.metadata for doc in documents]                                    │
│ ❱ 603 │   │   return cls.from_texts(                                                             │
│   604 │   │   │   texts=texts,                                                                   │
│   605 │   │   │   embedding=embedding,                                                           │
│   606 │   │   │   metadatas=metadatas,                                                           │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.p │
│ y:567 in from_texts                                                                              │
│                                                                                                  │
│   564 │   │   │   collection_metadata=collection_metadata,                                       │
│   565 │   │   │   **kwargs,                                                                      │
│   566 │   │   )                                                                                  │
│ ❱ 567 │   │   chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)             │
│   568 │   │   return chroma_collection                                                           │
│   569 │                                                                                          │
│   570 │   @classmethod                                                                           │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.p │
│ y:208 in add_texts                                                                               │
│                                                                                                  │
│   205 │   │   │   │   │   [embeddings[idx] for idx in non_empty_ids] if embeddings else None     │
│   206 │   │   │   │   )                                                                          │
│   207 │   │   │   │   ids_with_metadata = [ids[idx] for idx in non_empty_ids]                    │
│ ❱ 208 │   │   │   │   self._collection.upsert(                                                   │
│   209 │   │   │   │   │   metadatas=metadatas,                                                   │
│   210 │   │   │   │   │   embeddings=embeddings_with_metadatas,                                  │
│   211 │   │   │   │   │   documents=texts_with_metadatas,                                        │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/api/models/Collection. │
│ py:298 in upsert                                                                                 │
│                                                                                                  │
│   295 │   │   │   ids, embeddings, metadatas, documents                                          │
│   296 │   │   )                                                                                  │
│   297 │   │                                                                                      │
│ ❱ 298 │   │   self._client._upsert(                                                              │
│   299 │   │   │   collection_id=self.id,                                                         │
│   300 │   │   │   ids=ids,                                                                       │
│   301 │   │   │   embeddings=embeddings,                                                         │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/api/segment.py:290 in  │
│ _upsert                                                                                          │
│                                                                                                  │
│   287 │   │   for r in _records(t.Operation.UPSERT, ids, embeddings, metadatas, documents):      │
│   288 │   │   │   self._validate_embedding_record(coll, r)                                       │
│   289 │   │   │   records_to_submit.append(r)                                                    │
│ ❱ 290 │   │   self._producer.submit_embeddings(coll["topic"], records_to_submit)                 │
│   291 │   │                                                                                      │
│   292 │   │   return True                                                                        │
│   293                                                                                            │
│                                                                                                  │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/db/mixins/embeddings_q │
│ ueue.py:127 in submit_embeddings                                                                 │
│                                                                                                  │
│   124 │   │   │   return []                                                                      │
│   125 │   │                                                                                      │
│   126 │   │   if len(embeddings) > self.max_batch_size:                                          │
│ ❱ 127 │   │   │   raise ValueError(                                                              │
│   128 │   │   │   │   f"""                                                                       │
│   129 │   │   │   │   Cannot submit more than {self.max_batch_size:,} embeddings at once.        │
│   130 │   │   │   │   Please submit your embeddings in batches of size                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: 
                Cannot submit more than 83,333 embeddings at once.
                Please submit your embeddings in batches of size
                83,333 or less.
pseudotensor (Collaborator) commented:

chroma-core/chroma#1049

I wasn't aware of this limitation of chromadb, which was introduced after the move to sqlite.
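
For context, the cap the traceback hits lives in chromadb's embeddings queue. A minimal sketch for checking what your local client reports (the `./db` path is a placeholder, and the `max_batch_size` attribute is an assumption about recent 0.4.x clients, so it is read defensively):

```python
import chromadb

# Placeholder path; the cap is enforced by the embeddings queue shown in
# the traceback above when records are submitted to the producer.
client = chromadb.PersistentClient(path="./db")

# Recent 0.4.x clients expose the per-submit cap directly; older builds may
# not, so fall back gracefully (assumption about attribute availability).
print(getattr(client, "max_batch_size", "not exposed in this version"))
```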

pseudotensor (Collaborator) commented:

This should be worked around now by batching in h2oGPT. The test confirms it.
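
For anyone hitting the same limit outside h2oGPT, here is a minimal sketch of the batching workaround at the LangChain layer. The names `sources`, `embedding`, and `persist_directory` mirror the traceback above and are assumed to be defined; the batch size is an arbitrary value below the limit:

```python
from langchain.vectorstores import Chroma

BATCH_SIZE = 10_000  # arbitrary, safely below chroma's 83,333-per-submit cap

# Build the collection once, then add documents in chunks instead of a
# single Chroma.from_documents() call over all ~485k chunks at once.
db = Chroma(
    collection_name="ZenDeskTickets",
    embedding_function=embedding,
    persist_directory=persist_directory,
)
for i in range(0, len(sources), BATCH_SIZE):
    db.add_documents(sources[i:i + BATCH_SIZE])
db.persist()
```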


slavag commented Sep 17, 2023

@pseudotensor It's working fine now, thanks.
Btw, it seems the new ChromaDB database takes twice as much disk space as the old one.
