Python: feat: add chroma memory store #449

Closed
wants to merge 28 commits into from
28 commits
aec227a
feat: add chroma data store
joowon-dm-snu Apr 14, 2023
254a3b7
feat: add chroma memory store
joowon-dm-snu Apr 14, 2023
4ff3614
test: add simple e2e test
joowon-dm-snu Apr 14, 2023
c4afdd6
Merge branch 'main' into feat--add-chroma-memory
dluc Apr 14, 2023
41c6ad6
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 Apr 17, 2023
8f68a39
fix: update pre-commit black version
joowon-dm-snu Apr 21, 2023
a635775
Merge remote-tracking branch 'origin/main' into feat--add-chroma-memory
joowon-dm-snu Apr 22, 2023
de5d7f7
feat: update default behaviors
joowon-dm-snu Apr 22, 2023
0fc7c97
test: add test for chroma
joowon-dm-snu Apr 22, 2023
8b2e231
feat: update chroma
joowon-dm-snu Apr 22, 2023
0a483f6
Merge branch 'main' into feat--add-chroma-memory
joowon-dm-snu Apr 22, 2023
971c8a4
docs: update doc
joowon-dm-snu Apr 22, 2023
04abc91
Merge branch 'feat--add-chroma-memory' of https://github.com/joowon-d…
joowon-dm-snu Apr 22, 2023
d00df0d
test: add e2e test for chroma db
joowon-dm-snu Apr 22, 2023
e2d87eb
doc: update chroma memory store doc
joowon-dm-snu Apr 22, 2023
829bd8b
run pre-commit
joowon-dm-snu Apr 22, 2023
fb5135d
doc: update
joowon-dm-snu Apr 22, 2023
3aa3c57
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 Apr 24, 2023
cdf83e3
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 Apr 26, 2023
6785b20
add dependencies
joowon-dm-snu Apr 26, 2023
0b7f18a
Merge branch 'main' into feat--add-chroma-memory
joowon-dm-snu Apr 26, 2023
7d16b99
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 Apr 26, 2023
19e2a7e
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 Apr 26, 2023
096decc
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 Apr 26, 2023
8ae56ed
Merge remote-tracking branch 'origin/main' into feat--add-chroma-memory
joowon-dm-snu May 4, 2023
113bc41
Merge branch 'feat--add-chroma-memory' of https://github.com/joowon-d…
joowon-dm-snu May 11, 2023
94e07b5
Merge branch 'main' into feat--add-chroma-memory
awharrison-28 May 11, 2023
7346fb0
added poetry group for chromadb
awharrison-28 May 11, 2023
2 changes: 1 addition & 1 deletion .github/workflows/python-integration-tests.yml
Original file line number Diff line number Diff line change
@@ -36,7 +36,7 @@ jobs:
run: |
python -m pip install --upgrade pip
python -m pip install poetry pytest
- cd python && poetry install --with hugging_face
+ cd python && poetry install --with hugging_face --with chromadb
- name: Run Integration Tests
shell: bash
env: # Set Azure credentials secret as an input
832 changes: 831 additions & 1 deletion python/poetry.lock

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions python/pyproject.toml
@@ -26,6 +26,10 @@ torch = "^2.0.0"
transformers = "^4.28.1"
sentence-transformers = "^2.2.2"


[tool.poetry.group.chromadb.dependencies]
chromadb = "^0.3.22"

[tool.isort]
profile = "black"

2 changes: 1 addition & 1 deletion python/requirements.txt
@@ -1,3 +1,3 @@
openai==0.27.0
numpy==1.24.2
- aiofiles==23.1.0
+ aiofiles==23.1.0
3 changes: 2 additions & 1 deletion python/semantic_kernel/memory/__init__.py
@@ -1,4 +1,5 @@
# Copyright (c) Microsoft. All rights reserved.
+ from semantic_kernel.memory.chroma_memory_store import ChromaMemoryStore
from semantic_kernel.memory.volatile_memory_store import VolatileMemoryStore

- __all__ = ["VolatileMemoryStore"]
+ __all__ = ["VolatileMemoryStore", "ChromaMemoryStore"]
203 changes: 203 additions & 0 deletions python/semantic_kernel/memory/chroma_memory_store.py
@@ -0,0 +1,203 @@
# Copyright (c) Microsoft. All rights reserved.

"""
ChromaMemoryStore finds the nearest matches based on embedding similarity.
By inheriting from ChromaDataStore, ChromaMemoryStore stores and retrieves data for SemanticTextMemory.
For information about the connection to ChromaDB and persistence settings, see the ChromaDataStore class.

The similarity_compute_func parameter can affect the behavior of the get_nearest_matches_async method.
similarity_compute_func can be one of the following:

1) "sk-default" (default): Use Semantic Kernel's default compute similarity function. It computes cosine similarity
between the query embedding and the embeddings in the collection.

2) "chroma": Use ChromaDB's default distance as the similarity score. In this case, lower values are considered better
matches, and min_relevance_score acts as a maximum distance, so it should be adjusted accordingly.

3) Custom function: Provide a custom function that computes similarity scores between the query embedding and the
embeddings in the collection. The custom function should have the signature Callable[[ndarray, ndarray], ndarray]
and return a numpy array of similarity scores.

Example:

# Create a ChromaMemoryStore with the default similarity computation function
chroma_memory_store = ChromaMemoryStore()

# Or use ChromaDB's default distance as the similarity score
chroma_memory_store_chroma = ChromaMemoryStore(similarity_compute_func="chroma")

# Or provide a custom similarity computation function
def custom_similarity(embedding: ndarray, embedding_array: ndarray) -> ndarray:
    # custom implementation
    pass

chroma_memory_store_custom = ChromaMemoryStore(similarity_compute_func=custom_similarity)
"""

import inspect
from logging import Logger
from typing import TYPE_CHECKING, Callable, List, Optional, Tuple, Union

from numpy import array, linalg, ndarray
from semantic_kernel.memory.memory_record import MemoryRecord
from semantic_kernel.memory.memory_store_base import MemoryStoreBase
from semantic_kernel.memory.storage.chroma_data_store import ChromaDataStore
from semantic_kernel.utils.null_logger import NullLogger

if TYPE_CHECKING:
    import chromadb.config

AvailableComputeSimilarityFunction = Union[str, Callable[[ndarray, ndarray], ndarray]]
DEFAULT_COMPUTE_SIMILARITY_FUNCTIONS = ["sk-default", "chroma"]


def validate_similarity_function(func) -> bool:
    if isinstance(func, str):
        return func in DEFAULT_COMPUTE_SIMILARITY_FUNCTIONS
    if not callable(func):
        return False

    # Validate typing for a custom compute_similarity function:
    # Callable[[ndarray, ndarray], ndarray]
    # Compare annotations only: inspect.Signature equality also compares
    # parameter names, which would reject a correctly typed function whose
    # parameters are not literally named "a" and "b".
    function_signature = inspect.signature(func)
    annotations = [
        param.annotation for param in function_signature.parameters.values()
    ]

    return (
        annotations == [ndarray, ndarray]
        and function_signature.return_annotation is ndarray
    )


class ChromaMemoryStore(ChromaDataStore, MemoryStoreBase):
    def __init__(
        self,
        logger: Optional[Logger] = None,
        similarity_fetch_limit: int = 5,
        similarity_compute_func: AvailableComputeSimilarityFunction = "sk-default",
        persist_directory: Optional[str] = None,
        client_settings: Optional["chromadb.config.Settings"] = None,
    ) -> None:
        assert validate_similarity_function(similarity_compute_func)
        self._similarity_compute_func = similarity_compute_func
        self._similarity_fetch_limit = similarity_fetch_limit

        super().__init__(
            persist_directory=persist_directory, client_settings=client_settings
        )
        self._logger = logger or NullLogger()

    async def get_nearest_matches_async(
        self,
        collection: str,
        embedding: ndarray,
        limit: int = 1,
        min_relevance_score: float = 0.7,
    ) -> List[Tuple[MemoryRecord, float]]:
        collection = await self.get_collection_async(collection)
        if collection is None:
            return []

        query_results = collection.query(
            query_embeddings=embedding.tolist(),
            n_results=self._similarity_fetch_limit,
            include=["embeddings", "metadatas", "documents", "distances"],
        )

        # Convert the collection of embeddings into a numpy array (stacked)
        embedding_array = array(query_results["embeddings"][0])
        embedding_array = embedding_array.reshape(embedding_array.shape[0], -1)

        # If the query embedding has shape (1, embedding_size),
        # reshape it to (embedding_size,)
        if len(embedding.shape) == 2:
            embedding = embedding.reshape(
                embedding.shape[1],
            )

        # Compute similarity scores
        if self._similarity_compute_func == "sk-default":
            # Case 1) use semantic kernel's default compute similarity function
            similarity_score = self.compute_similarity_scores(
                embedding, embedding_array
            )
        elif self._similarity_compute_func == "chroma":
            # Case 2) use chroma's default distance
            similarity_score = query_results["distances"][0]
        else:
            # Case 3) use custom similarity function
            similarity_score = self._similarity_compute_func(embedding, embedding_array)

        # Convert query results into memory records
        record_list = [
            (record, distance)
            for record, distance in zip(
                self.query_results_to_memory_records(query_results),
                similarity_score,
            )
        ]

        if self._similarity_compute_func == "chroma":
            # default chroma uses distance as similarity score (lower is better)
            filtered_results = [x for x in record_list if x[1] <= min_relevance_score]
            top_results = filtered_results[:limit]
        else:
            sorted_results = sorted(
                record_list,
                key=lambda x: x[1],
                reverse=True,
            )

            filtered_results = [
                x for x in sorted_results if x[1] >= min_relevance_score
            ]
            top_results = filtered_results[:limit]

        return top_results

    def compute_similarity_scores(
        self, embedding: ndarray, embedding_array: ndarray
    ) -> ndarray:
        """
        Semantic kernel's default compute similarity function.
        (Code from VolatileMemoryStore)

        Compute the similarity scores between the
        query embedding and all the embeddings in the collection.
        Ignore the corresponding operation if zero vectors
        are involved (in query embedding or the embedding collection)

        :param embedding: The query embedding.
        :param embedding_array: The collection of embeddings.
        :return: similarity_scores: The similarity scores between the query embedding
            and all the embeddings in the collection.
        """

        query_norm = linalg.norm(embedding)
        collection_norm = linalg.norm(embedding_array, axis=1)

        # Compute indices for which the similarity scores can be computed
        valid_indices = (query_norm != 0) & (collection_norm != 0)

        # Initialize the similarity scores with -1 to distinguish the cases
        # between zero similarity from orthogonal vectors and invalid similarity
        similarity_scores = array([-1.0] * embedding_array.shape[0])

        if valid_indices.any():
            similarity_scores[valid_indices] = embedding.dot(
                embedding_array[valid_indices].T
            ) / (query_norm * collection_norm[valid_indices])
            if not valid_indices.all():
                self._logger.warning(
                    "Some vectors in the embedding collection are zero vectors. "
                    "Ignoring cosine similarity score computation for those vectors."
                )
        else:
            raise ValueError(
                "Invalid vectors, cannot compute cosine similarity scores "
                "for zero vectors: "
                f"{embedding_array} or {embedding}"
            )
        return similarity_scores
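
The default scoring path in the file above (cosine similarity, a `-1` sentinel for zero vectors, then a threshold and a limit) can be illustrated without numpy or ChromaDB. What follows is a minimal, dependency-free sketch and not the PR's code: `cosine_scores`, `top_matches`, `INVALID_SCORE`, and the `lower_is_better` flag are hypothetical names introduced here. Unlike `compute_similarity_scores`, this sketch raises only when the query itself is a zero vector.

```python
import math
from typing import List, Sequence, Tuple

# Sentinel for rows whose cosine similarity is undefined (zero vectors).
# The value mirrors the -1 used by the method above; the name is hypothetical.
INVALID_SCORE = -1.0


def cosine_scores(
    query: Sequence[float], rows: Sequence[Sequence[float]]
) -> List[float]:
    """Cosine similarity of `query` against each row of `rows`."""
    query_norm = math.sqrt(sum(x * x for x in query))
    if query_norm == 0:
        raise ValueError("query is a zero vector")

    scores = []
    for row in rows:
        row_norm = math.sqrt(sum(x * x for x in row))
        if row_norm == 0:
            # Undefined similarity: mark it instead of dividing by zero.
            scores.append(INVALID_SCORE)
            continue
        dot = sum(q * r for q, r in zip(query, row))
        scores.append(dot / (query_norm * row_norm))
    return scores


def top_matches(
    scores: Sequence[float],
    limit: int,
    threshold: float,
    lower_is_better: bool = False,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs that pass the threshold, best first."""
    indexed = list(enumerate(scores))
    if lower_is_better:
        # Distance-style scores: the threshold is a maximum.
        kept = [pair for pair in indexed if pair[1] <= threshold]
    else:
        # Similarity-style scores: the threshold is a minimum.
        kept = [pair for pair in indexed if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=not lower_is_better)
    return kept[:limit]


scores = cosine_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
best = top_matches(scores, limit=2, threshold=0.7)
```

The sentinel sits below any usable similarity threshold, so zero-vector rows drop out at the filtering step. The `lower_is_better` branch corresponds to the `"chroma"` mode, where the same threshold parameter is read as a maximum distance.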
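
A note on validating a custom similarity function by its signature: `inspect.Signature` equality compares parameter names as well as annotations. A function annotated correctly but with parameters named `embedding` and `embedding_array`, like the docstring's `custom_similarity`, therefore does not compare equal to an expected signature built with parameters named `a` and `b`. The sketch below shows the difference and one possible annotation-only check. `Vector` is a hypothetical stand-in for `numpy.ndarray` so the example has no dependencies, and `has_expected_annotations` is an illustrative helper, not part of the PR.

```python
import inspect
from typing import Callable


class Vector:
    """Hypothetical stand-in for numpy.ndarray (keeps the sketch dependency-free)."""


def has_expected_annotations(func: Callable) -> bool:
    """True if func takes exactly two Vector parameters and returns a Vector."""
    signature = inspect.signature(func)
    annotations = [param.annotation for param in signature.parameters.values()]
    return annotations == [Vector, Vector] and signature.return_annotation is Vector


def custom_similarity(embedding: Vector, embedding_array: Vector) -> Vector:
    # Placeholder body; only the signature matters for this sketch.
    return embedding_array


# Expected signature built with fixed parameter names "a" and "b".
expected = inspect.Signature(
    [
        inspect.Parameter(
            "a", inspect.Parameter.POSITIONAL_OR_KEYWORD, annotation=Vector
        ),
        inspect.Parameter(
            "b", inspect.Parameter.POSITIONAL_OR_KEYWORD, annotation=Vector
        ),
    ],
    return_annotation=Vector,
)

# Whole-signature comparison fails because the parameter names differ.
names_differ = inspect.signature(custom_similarity) != expected
# The annotation-only check accepts the same function.
annotations_match = has_expected_annotations(custom_similarity)
```

Comparing annotations only is a looser check: it ignores parameter names and kinds, and it still rejects unannotated callables such as a bare lambda.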