feat: Updated FAQ section with a few more commonly encountered issues

- Added ULIDs to document ids
- Fixed some code inconsistencies in document ids docs
- Improved the normalization in bring-your-own-embeddings.md
amikos-tech · May 18, 2024 · 62f563c · 62f563c
1 parent d7cc059
commit 62f563c
Showing 3 changed files with 124 additions and 20 deletions.
diff --git a/docs/core/document-ids.md b/docs/core/document-ids.md
@@ -1,16 +1,20 @@
 # Document IDs
 
-Chroma is unopinionated about document IDs and delegates those decisions to the user. This frees users to build semantics around their IDs.
+Chroma is unopinionated about document IDs and delegates those decisions to the user. This frees users to build
+semantics around their IDs.
 
 ## Note on Compound IDs
 
-While you can choose to use IDs that are composed of multiple sub-IDs (e.g. `user_id` + `document_id`), it is important to highlight that Chroma does not support querying by partial ID.
+While you can choose to use IDs that are composed of multiple sub-IDs (e.g. `user_id` + `document_id`), it is important
+to highlight that Chroma does not support querying by partial ID.
 
 ## Common Practices
 
 ### UUIDs
 
-UUIDs are a common choice for document IDs. They are unique, and can be generated in a distributed fashion. They are also opaque, which means that they do not contain any information about the document itself. This can be a good thing, as it allows you to change the document without changing the ID.
+UUIDs are a common choice for document IDs. They are unique, and can be generated in a distributed fashion. They are
+also opaque, which means that they do not contain any information about the document itself. This can be a good thing,
+as it allows you to change the document without changing the ID.
 
 ```python
 import uuid
@@ -22,13 +26,13 @@ my_documents = [
 ]
 
 client = chromadb.Client()
-
-collection.add(ids=[uuid.uuid4() for _ in range(len(documents))], documents=my_documents)
+collection = client.get_or_create_collection("collection")
+collection.add(ids=[f"{uuid.uuid4()}" for _ in range(len(my_documents))], documents=my_documents)
 ```
 
 #### Caveats
 
-!!! tip "Predictable Ordering" 
+!!! tip "Predictable Ordering"
 
     UUIDs, especially v4, are not lexicographically sortable. In its current version (0.4.x-0.5.0) Chroma orders responses
     of `get()` by the ID of the documents. Therefore, if you need predictable ordering, you may want to consider a different ID strategy.
@@ -38,16 +42,54 @@ collection.add(ids=[uuid.uuid4() for _ in range(len(documents))], documents=my_d
     UUIDs are 128 bits long, which can be a lot of overhead if you have a large number of documents. If you are concerned 
     about storage overhead, you may want to consider a different ID strategy.
 
+### ULIDs
+
+ULIDs are an alternative to UUIDs that is lexicographically sortable. They are also 128 bits long, like UUIDs, but they
+start with a timestamp, so they sort by creation time. This can be useful if you need predictable ordering of your documents.
+
+ULIDs are also shorter than UUIDs in string form (26 characters versus 36), which can save you some storage space. Like
+UUIDs, they do not contain any information about the document itself, although they do expose their creation time.
+
+Install the `py-ulid` package to generate ULIDs.
+
+```bash
+pip install py-ulid
+```
+
+```python
+from ulid import ULID
+import chromadb
+
+my_documents = [
+    "Hello, world!",
+    "Hello, Chroma!"
+]
+_ulid = ULID()
+
+client = chromadb.Client()
+
+collection = client.get_or_create_collection("name")
+
+collection.add(ids=[f"{_ulid.generate()}" for _ in range(len(my_documents))], documents=my_documents)
+```
+
+### NanoIDs
+
+Coming soon.
+
 ### Hashes
 
-Hashes are another common choice for document IDs. They are unique, and can be generated in a distributed fashion. They are also opaque, which means that they do not contain any information about the document itself. This can be a good thing, as it allows you to change the document without changing the ID.
+Hashes are another common choice for document IDs. They are unique, and can be generated in a distributed fashion. They
+are also opaque, which means that they do not contain any information about the document itself. This can be a good
+thing, as it allows you to change the document without changing the ID.
 
 ```python
 import hashlib
 import os
 import chromadb
 
-def generate_sha256_hash():
+
+def generate_sha256_hash() -> str:
     # Generate a random number
     random_data = os.urandom(16)
     # Create a SHA256 hash object
@@ -64,31 +106,36 @@ my_documents = [
 ]
 
 client = chromadb.Client()
-
-collection.add(ids=[generate_sha256_hash() for _ in range(len(documents))], documents=my_documents)
+collection = client.get_or_create_collection("collection")
+collection.add(ids=[generate_sha256_hash() for _ in range(len(my_documents))], documents=my_documents)
 ```
 
-It is also possible to use the document as basis for the hash, the downside of that is that when the document changes and you have a semantic around the text as relating to the hash, you may need to update the hash.
+It is also possible to use the document as the basis for the hash. The downside is that when the document changes
+and your semantics tie the hash to the text, you may need to update the hash (and therefore the ID).
 
 ```python
 import hashlib
 import chromadb
 
-def generate_sha256_hash_from_text(text):
+
+def generate_sha256_hash_from_text(text) -> str:
     # Create a SHA256 hash object
     sha256_hash = hashlib.sha256()
     # Update the hash object with the text encoded to bytes
     sha256_hash.update(text.encode('utf-8'))
     # Return the hexadecimal representation of the hash
     return sha256_hash.hexdigest()
+
+
 my_documents = [
     "Hello, world!",
     "Hello, Chroma!"
 ]
 
 client = chromadb.Client()
-
-collection.add(ids=[generate_sha256_hash_from_text(documents[i]) for i in range(len(documents))], documents=my_documents)
+collection = client.get_or_create_collection("collection")
+collection.add(ids=[generate_sha256_hash_from_text(my_documents[i]) for i in range(len(my_documents))],
+               documents=my_documents)
 ```
 
 ## Semantic Strategies

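As a rough illustration of why the ULIDs described in `document-ids.md` sort by creation time, here is a minimal standard-library sketch of the encoding: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters. `make_ulid` and `CROCKFORD` are hypothetical names introduced for this example. This is not how `py-ulid` is implemented, and in practice the package used in the diff above is the better choice.

```python
import os
import time
from typing import Optional

# Crockford base32 alphabet (no I, L, O, U); its characters are in ascending ASCII order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def make_ulid(timestamp_ms: Optional[int] = None) -> str:
    """Hypothetical helper: builds a ULID-shaped string for illustration only."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness  # timestamp occupies the high bits
    chars = []
    for _ in range(26):  # 26 characters x 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = make_ulid(1_700_000_000_000)
later = make_ulid(1_700_000_000_001)
print(earlier < later)  # string comparison follows the timestamps
```

Because the alphabet is in ascending ASCII order and every ID has the same length, comparing two such strings compares their timestamps first. IDs generated within the same millisecond fall back to the random part, so their relative order is not guaranteed in this sketch.
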
diff --git a/docs/embeddings/bring-your-own-embeddings.md b/docs/embeddings/bring-your-own-embeddings.md
@@ -66,10 +66,12 @@ class TransformerEmbeddingFunction(EmbeddingFunction[Documents]):
             )
 
     @staticmethod
-    def _normalize(v: npt.NDArray) -> npt.NDArray:
-        norm = np.linalg.norm(v, axis=1)
-        norm[norm == 0] = 1e-12
-        return cast(npt.NDArray, v / norm[:, np.newaxis])
+    def _normalize(vector: npt.NDArray) -> npt.NDArray:
+        """Normalizes a vector to unit length using L2 norm."""
+        norm = np.linalg.norm(vector)
+        if norm == 0:
+            return vector
+        return vector / norm
 
     def __call__(self, input: Documents) -> Embeddings:
         inputs = self._tokenizer(

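A note on the `_normalize` change above: without an `axis` argument, `np.linalg.norm` returns a single scalar for a 2-D array (the Frobenius norm), so the new version produces unit-length output only when it receives one 1-D vector. The call site is not shown in this hunk, so whether a batch is ever passed in is an assumption. The sketch below, with the hypothetical helper `normalize_rows`, shows the difference between the two approaches on a small batch.

```python
import numpy as np


def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    """Hypothetical helper: scales each row of a 2-D array to unit L2 length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged instead of dividing by zero
    return vectors / norms


batch = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])

row_wise = normalize_rows(batch)           # each non-zero row has length 1
whole_array = batch / np.linalg.norm(batch)  # one scalar norm shared by all rows

print(np.linalg.norm(row_wise, axis=1))
print(np.linalg.norm(whole_array, axis=1))
```

If the embedding function only ever normalizes one vector at a time, the simpler version in the diff is sufficient.
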
diff --git a/docs/faq/index.md b/docs/faq/index.md
@@ -54,14 +54,13 @@ ef = SentenceTransformerEmbeddingFunction(model_name="FacebookAI/xlm-roberta-lar
 print(ef(["test"]))
 ```
 
-!!! warn "Warning" 
+!!! warn "Warning"
 
     Not all models will work with the above method. Also, mean pooling may not be the best strategy for the model.
     Read the model card and try to understand what pooling, if any, the creators recommend. You may also want to normalize
     the embeddings before adding them to Chroma (pass `normalize_embeddings=True` to the `SentenceTransformerEmbeddingFunction` 
     EF constructor).
 
-
 ## Commonly Encountered Problems
 
 ### Collection Dimensionality Mismatch
@@ -97,4 +96,60 @@ use the same EmbeddingFunction when adding or querying a collection.
     If you do not specify an `embedding_function` when creating (`client.create_collection`) or getting
     (`client.get_or_create_collection`) a collection, Chroma will use its default [embedding function](https://docs.trychroma.com/embeddings#default-all-minilm-l6-v2).
 
+### Large Distances in Search Results
+
+**Symptoms:**
+
+When querying a collection, the distances in the results are in the 10s or 100s.
+
+**Context:**
+
+Frequently when using your own embedding function.
+
+**Cause:**
+
+The embeddings are not normalized.
+
+**Explanation/Solution:**
+
+`L2` (Euclidean distance) and `IP` (inner product) distance metrics are sensitive to the magnitude of the vectors.
+Chroma uses `L2` by default. Therefore, it is recommended to normalize the embeddings to unit length before adding
+them to Chroma.
+
+Here is an example of how to normalize embeddings using the L2 norm:
+
+```python
+import numpy as np
+
+
+def normalize_L2(vector):
+    """Normalizes a vector to unit length using L2 norm."""
+    norm = np.linalg.norm(vector)
+    if norm == 0:
+        return vector
+    return vector / norm
+```
+
+### `OperationalError: no such column: collections.topic`
+
+**Symptoms:**
+
+The error `OperationalError: no such column: collections.topic` is raised when trying to access Chroma locally or
+remotely.
+
+**Context:**
+
+After upgrading to Chroma `0.5.0` or accessing your Chroma persistent data with Chroma client version `0.5.0`.
+
+**Cause:**
+
+In version `0.5.x` Chroma has made some SQLite3 schema changes that are not backwards compatible with the previous
+versions. Once you access your persistent data on the server or locally with the new Chroma version it will
+automatically migrate to the new schema. This operation is not reversible.
+
+**Explanation/Solution:**
+
+To resolve this issue you will need to upgrade all your clients accessing the Chroma data to version `0.5.x`.
 
+Here's a link to the migration performed by
+Chroma - https://github.com/chroma-core/chroma/blob/main/chromadb/migrations/sysdb/00005-remove-topic.sqlite.sql
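
To make the "Large Distances in Search Results" entry above more concrete, here is a small pure-Python sketch of how vector magnitude drives the distance. It assumes the reported `L2` distance is the squared Euclidean distance; the helpers `squared_l2` and `normalize` and the two sample vectors are made up for illustration and are not part of Chroma.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(vector):
    """Scales a vector to unit L2 length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


# Made-up, unnormalized "embeddings" with large components.
a = [12.0, -7.5, 3.0, 9.0]
b = [10.0, -9.0, 4.5, 2.0]

raw_distance = squared_l2(a, b)
unit_distance = squared_l2(normalize(a), normalize(b))

print(raw_distance)   # in the 10s for these vectors
print(unit_distance)  # bounded: unit vectors are at most 2 apart, so at most 4 squared
```

With unit-length vectors the distance depends only on the angle between them, which is why normalized embeddings give small, comparable values.
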