Wanted to embed 41k txt files and got this error:
Cannot submit more than 83,333 embeddings at once. Please submit your embeddings in batches of size 83,333 or less.
In the past it was working fine.
```
➜ h2ogpt git:(main) python src/make_db.py --user_path=/Users/slava/Documents/Development/private/ZendDeskTickets -collection_name=ZenDeskTickets
100%|████████████| 41013/41013 [00:10<00:00, 4044.64it/s]
0it [00:00, ?it/s]
Exceptions: 0/484630 []
Traceback (most recent call last):
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/make_db.py", line 305, in <module>
    H2O_Fire(make_db_main)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/utils.py", line 59, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(component, remaining_args, component_trace, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/make_db.py", line 292, in make_db_main
    db = create_or_update_db(db_type, persist_directory, collection_name, user_path, langchain_type, sources, use_openai_embedding, add_if_exists, verbose, hf_embedding_model, migrate_embedding_model, auto_migrate_d…)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 310, in create_or_update_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding, db_type=db_type, persist_directory=persist_directory, …)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 138, in get_db
    db = Chroma.from_documents(documents=sources, embedding=embedding, persist_directory=persist_directory, collection_name=collection_name, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 603, in from_documents
    return cls.from_texts(texts=texts, embedding=embedding, metadatas=metadatas, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 567, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 208, in add_texts
    self._collection.upsert(metadatas=metadatas, embeddings=embeddings_with_metadatas, documents=texts_with_metadatas, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 298, in upsert
    self._client._upsert(collection_id=self.id, ids=ids, embeddings=embeddings, …)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/api/segment.py", line 290, in _upsert
    self._producer.submit_embeddings(coll["topic"], records_to_submit)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(…)
ValueError: Cannot submit more than 83,333 embeddings at once. Please submit your embeddings in batches of size 83,333 or less.
```
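For scale: the run produced 484,630 text chunks from the ~41k files (the `Exceptions: 0/484630` line above), so a single upsert exceeds Chroma's 83,333-record limit several times over. The number of upsert calls a batched insert would need is simple ceiling division:

```python
import math

total_records = 484630  # chunk count from the run above
max_batch = 83333       # limit reported in Chroma's error message

# Number of upsert calls needed if the insert is split into batches
batches = math.ceil(total_records / max_batch)
print(batches)  # → 6
```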
chroma-core/chroma#1049
I wasn't aware of this limitation of chromadb after sqlite was introduced.
45dc134
Should be worked around now with batching in h2oGPT. A test confirms it.
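The batching idea can be sketched as follows — `batched` is an illustrative helper, not h2oGPT's actual code, and the `db.add_documents` usage assumes an already-created LangChain `Chroma` store:

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk

# Usage sketch (hypothetical names): stay under Chroma's 83,333 limit
# for chunk in batched(sources, 80000):
#     db.add_documents(chunk)
```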
@pseudotensor Working fine, thanks. By the way, it seems the new ChromaDB takes twice as much space as the old one.