Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: type errors in chromadb/utils/embedding_functions.py #1169

Open
beggers opened this issue Sep 21, 2023 · 5 comments
Open

[Bug]: type errors in chromadb/utils/embedding_functions.py #1169

beggers opened this issue Sep 21, 2023 · 5 comments
Assignees
Labels
bug Something isn't working by-chroma good first issue Good for newcomers

Comments

@beggers
Copy link
Contributor

beggers commented Sep 21, 2023

What happened?

mypy complains about type errors in chromadb/utils/embedding_functions.py, specifically with some __call__ and _normalize methods. See some of the #type: ignores in #1163 for specifics.

It's worth figuring out the correct type annotations for these methods and the methods they call into so that our type checking works as intended.

Versions

Chroma latest, python 3.9.6, macOS

Relevant log output

No response

@beggers beggers added bug Something isn't working good first issue Good for newcomers labels Sep 21, 2023
@Shubhrajit007
Copy link

Sir, I want to go ahead with this issue and want to give it try. I will give my best, please allow mw to do so.

@beggers
Copy link
Contributor Author

beggers commented Sep 21, 2023

Thanks Shubhrajit! Feel free to assign me as a reviewer on the PR whenever it's ready. Thank you.

beggers added a commit that referenced this issue Sep 22, 2023
## Description of changes
This PR accomplishes two things:
- Adds batching to metrics to decrease load to Posthog
- Adds more metric instrumentation

Each `TelemetryEvent` type now has a `batch_size` member defining how
many of that Event to include in a batch. `TelemetryEvent`s with
`batch_size > 1` must also define `can_batch()` and `batch()` methods to
do the actual batching -- our posthog client can't do this itself since
different `TelemetryEvent`s use different count fields. The Posthog
client combines events until they hit their `batch_size` then fires them
off as one event.

NB: this means we can drop up to `batch_size` events -- since we only
batch `add()` calls right now this seems fine, though we may want to
address it in the future.

As for the additional telemetry, I pretty much copied Anton's draft
#859 with some minor changes.

Other considerations: Maybe we should implement `can_batch()` and
`batch()` on all events, even those which don't currently use them? I'd
prefer not to leave dead code hanging around but happy to go either way.

I created a ticket for the type ignores:
#1169

## Test plan
pytest passes modulo a couple unrelated failures

With `print(event.properties)` in posthog client's `_direct_capture()`:
```
>>> import chromadb
>>> client = chromadb.Client()
{'batch_size': 1}
>>> collection = client.create_collection("sample_collection")
{'batch_size': 1, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'embedding_function': 'ONNXMiniLM_L6_V2'}
>>> collection.add(
...     documents=["This is document1", "This is document2"], # we embed for you, or bring your own
...     metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on arbitrary metadata!
...     ids=["doc1", "doc2"], # must be unique for each doc 
... )
{'batch_size': 1, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'add_amount': 2, 'with_documents': 2, 'with_metadata': 2}
>>> for i in range(50):
...   collection.add(documents=[str(i)], ids=[str(i)])
... 
{'batch_size': 20, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'add_amount': 20, 'with_documents': 20, 'with_metadata': 0}
{'batch_size': 20, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'add_amount': 20, 'with_documents': 20, 'with_metadata': 0}
>>> for i in range(50):
...   collection.add(documents=[str(i) + ' ' + str(n) for n in range(20)], ids=[str(i) + ' ' + str(n) for n in range(20)])
... 
{'batch_size': 20, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'add_amount': 210, 'with_documents': 210, 'with_metadata': 0}
{'batch_size': 20, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'add_amount': 400, 'with_documents': 400, 'with_metadata': 0}
{'batch_size': 20, 'collection_uuid': 'bb19d790-4ec7-436c-b781-46dab047625d', 'add_amount': 400, 'with_documents': 400, 'with_metadata': 0}
```

## Documentation Changes
chroma-core/docs#139

chroma-core/docs@a4fd57d
@DevMadhav13
Copy link
Contributor

if this issue still open, i would like to work on it

@beggers
Copy link
Contributor Author

beggers commented Dec 12, 2023

@DevMadhav13 I believe it's still open!

@DevMadhav13
Copy link
Contributor

@beggers PR Raised for this issue. Please review.

beggers added a commit that referenced this issue Jan 11, 2024
## Description of changes

- Added correct type annotations for these methods #1169 

## Test plan
*How are these changes tested?*

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

---------

Co-authored-by: Ran <[email protected]>
Co-authored-by: Ben Eggers <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working by-chroma good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

4 participants