
update repo #10

Merged
merged 30 commits into from
Aug 29, 2023

Conversation

sunilkumardash9
Owner

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

atroyn and others added 30 commits August 3, 2023 22:36
## Description of changes

This PR creates a new starter notebook intended to familiarize people
with the very basic, core functionality of embedding retrieval with
Chroma. It's self-contained, and hopefully straightforward and easy to
understand.

There is also a minor fix to the experimental notebook. 

## Test plan

Ran the notebook, also via Colab. 

## Documentation Changes
None. 

## TODO
- [x] #880 Canonical 'chat with your documents'
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Allow FastAPI to consume `None` for add/update
 - New functionality
   - ...

## Test plan
Manually tested by passing `None` to the API. Added a unit test.

## Documentation Changes
None required.
## Description of changes

Release 0.4.5
## Description of changes

#895

Serve registered API routes regardless of whether they end in "/".

In the linked bug @HammadB and I discussed looping over all the routes
at the end of `__init__`, but the typechecker complains: the only thing
we know about `self._app.routes` is that they're instances of
`BaseRoute`, which may not have the fields we need. Rather than do type
coercion I subclassed `fastapi.APIRouter`.

The logic to set `kwargs["include_in_schema"]` is maybe too clever; should we just do an if-else instead?
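
For illustration, a minimal sketch of the subclassing approach (the class name and the exact kwarg handling are illustrative, not necessarily what landed in this PR):

```python
from typing import Any, Callable

from fastapi import APIRouter


class BothSlashesAPIRouter(APIRouter):
    """Hypothetical router that serves every route with and without a trailing slash."""

    def add_api_route(self, path: str, endpoint: Callable[..., Any], **kwargs: Any) -> None:
        # Register the canonical path (no trailing slash) as documented in the schema.
        if path.endswith("/") and len(path) > 1:
            path = path[:-1]
        super().add_api_route(path, endpoint, **kwargs)

        # Register the "/"-suffixed variant as well, hidden from the OpenAPI schema.
        kwargs["include_in_schema"] = False
        super().add_api_route(path + "/", endpoint, **kwargs)
```

With something like this, both `/api/v1/collections` and `/api/v1/collections/` resolve to the same handler, as the curl output below demonstrates.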

## Test plan

```
0 ~ beggers % curl http://localhost:8000/api/v1/collections/ -v
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /api/v1/collections/ HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.1.2
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Thu, 03 Aug 2023 19:14:01 GMT
< server: uvicorn
< content-length: 2
< content-type: application/json
< 
* Connection #0 to host localhost left intact
[]%                                                                                   
0 ~ beggers % curl http://localhost:8000/api/v1/collections -v 
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /api/v1/collections HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.1.2
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Thu, 03 Aug 2023 19:14:02 GMT
< server: uvicorn
< content-length: 2
< content-type: application/json
< 
* Connection #0 to host localhost left intact
[]%
```

Also, `pytest` locally and CI. I'd like to confirm this doesn't break JS
client generation and doc generation -- how can I do that?

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

No documentation changes needed for this.
This cleans up the JS API:
- removes `increment_index`
- removes `createIndex`
- removes `rawSql`
It also bumps the version to 1.5.6 (not yet released) and fixes a type
that changed.
## Description of changes

After the recent changes in the Web AI structure, the embedder function
became invalid because of different imports and because the browser
version of the library is only ESM, as mentioned
[here](#824 (comment)).

To fix the issue, I updated the import paths and changed the `require`
statements to `import` for the browser version of the function.

**Important note:**

When using the embedder in the browser, you should import
`chromadb/dist/module` because the [browser
version](https://www.npmjs.com/package/@visheratin/web-ai) of Web AI is
ESM-only.
For Node.js, you can use the regular `chromadb` import because the [Node.js
version](https://www.npmjs.com/package/@visheratin/web-ai-node) of the
library is CJS.

Additionally, because Web AI recently added support for multimodal embeddings, I added building multimodal embeddings using CLIP-base.

## Test plan

I tested the changes locally for both browser and Node.js. The built `chromadb` package, in case someone wants to test it in their projects, is available [here](https://drive.google.com/file/d/1cNLsHGd1VmiFiamvsEaMVGA7ng56QQKG/view?usp=sharing).

## Documentation Changes

Documentation changes will likely be needed for the multimodal functionality.
- The project will now clone the Chroma repo and deploy docker-compose from it
- Introduced a new TF var `chroma_release` to be able to deploy any (well, almost any) Chroma version (defaults to 0.4.5)
- Added an `instance_public_ip` output var to print the IP address of the VM

Refs: #950

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Made it possible to deploy any version of Chroma through the TF var `chroma_release` (defaults to 0.4.5)
   - Made it so that the VM clones the repo and checks out the release tag, then runs docker compose from it
   - Added output of the public IP of the VM after TF deployment
   - Slightly improved the README.md

## Test plan
*How are these changes tested?*

```bash
gcloud auth application-default login
cd examples/deployments/google-cloud-compute
terraform init
export TF_VAR_project_id=<your_project_id> #take note of this as it must be present in all of the subsequent steps
export TF_VAR_chroma_release=0.4.5 #set the chroma release to deploy
terraform apply -auto-approve
terraform output instance_public_ip
export instance_public_ip=$(terraform output instance_public_ip)
curl -v http://$instance_public_ip:8000/api/v1
terraform destroy -auto-approve
```

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
## Description of changes

*Summarize the changes made by this PR.*
Closes Issue: #923 

Adds the following types available as a top-level import in addition to
the `Collection` class:
- `CollectionMetadata`,
- `Documents`,
- `EmbeddingFunction`,
- `Embeddings`,
- `IDs`,
- `Include`,
- `Metadata`,
- `Where`,
- `QueryResult`,
- `GetResult`,
- `WhereDocument`,
- `UpdateCollectionMetadata`,

## Test plan
*How are these changes tested?*
- No additional tests were added

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
- It might make sense to make clear these types can be used from the top level.
## Description of changes
Previously we were not using the FTS search index correctly. https://sqlite.org/fts5.html#full_text_query_syntax expects that you query using the table name of the FTS table, not the column name. If you want to query by column name, you have to use column filters as discussed in the link above. We opt to take the path suggested in https://sqlite.org/forum/forumpost/1d45a7f6e17a3460 and match on id in addition to filtering that specific column. The query planner leverages this appropriately, as confirmed with EXPLAIN.

Since we were doing speculative delete queries on the assumption that the index was being used, this was incredibly slow; with this change it is much faster.
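
As a hedged illustration of the query shape (the table and column names here are made up, not the exact ones in the segment code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE VIRTUAL TABLE docs_fts USING fts5(string_value);
    INSERT INTO docs_fts (rowid, string_value) VALUES (1, 'hello world'), (2, 'goodbye world');
    """
)

# FTS5 wants the *table* name on the left-hand side of MATCH; restricting the
# match to one column is done with a column filter inside the query string,
# and the extra rowid constraint lets the planner use the FTS index instead
# of scanning the whole virtual table.
rows = conn.execute(
    "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? AND rowid = ?",
    ("string_value: hello", 1),
).fetchall()
print(rows)  # [(1,)]
```

The actual EXPLAIN output before and after the change is shown below.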

Explain before:
`-- SCAN VIRTUAL TABLE INDEX 0:` -> full table scan.

Explain after:
`-- SCAN VIRTUAL TABLE INDEX 0:M2` -> scans the index itself.

The net effect is a large increase in write speed, and the write path time no longer grows with table size.

### Quick Benchmark Results
N = 100k uniformly random vectors
D = 128
Metadata = one small key: value pair
Document = randomly generated string of length 100

Added with batch size = 1000

**Without Fix, Overall Time = 469s. Time to add a batch grows linearly
to >8000 ms**
<img width="590" alt="Screenshot 2023-08-09 at 5 53 24 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/89dde745-9231-4f3f-b62c-bf8486f7e970">

**With Fix, Overall Time = 102s. Time to add a batch grows sublinearly
to ~1200 ms**
<img width="587" alt="Screenshot 2023-08-09 at 5 43 12 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/2a771788-e5d9-4afe-bacb-dfbfb51b6cd1">

We will also want to make sure that the read path leverages this way of
querying. Will address that in a follow up PR.

## Test plan
Existing tests cover the scope of this change.

## Documentation Changes
None required.
## Description of changes
Adds a batch submit_embeddings() to the producer API. Implements this behavior in the sqlite-based embeddings queue. Refactors tests to add a fixture for both single and batch submit and substitutes all use of produce in tests with the fixture so both methods are uniformly tested.

This PR is stacked on #958, please read that before reading this.
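
A rough sketch of what a batched write looks like (the queue table and column names are hypothetical; the real embeddings queue schema differs):

```python
import sqlite3
from typing import Sequence, Tuple


def submit_embeddings_batch(conn: sqlite3.Connection,
                            records: Sequence[Tuple[str, bytes]]) -> None:
    # One batched statement per call instead of one INSERT per record, so the
    # sqlite round-trip cost is paid once per batch.
    conn.executemany(
        "INSERT INTO embeddings_queue (embedding_id, vector) VALUES (?, ?)",
        records,
    )
    conn.commit()
```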

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - This PR makes the sqlite-based embeddings queue batch its calls to sqlite based on the batch size of calls to DML operations.
### Quick Benchmarks
N = 100k
D = 128
metadata = {simple key: simple value}
Document = random 100 character string

**Before - Overall time: 102s**
<img width="587" alt="Screenshot 2023-08-09 at 5 43 12 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/752809a2-78a0-4d5b-a238-fc41ca2635cc">

**After - Overall time: 36s**
<img width="583" alt="Screenshot 2023-08-09 at 7 37 41 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/eb1759c8-8f62-4ed7-ad88-450b3f72692b">

**_Overall, compared to main, this PR and #958 decrease the time of this benchmark from 469s to 36s._**

Todo:
- [x] Clean up code
- [x] Rethink assumptions about submit_embeddings ordering guarantees
- [x] Add tests
- [x] Add batching to other DML - upsert(), update(), delete()

## Test plan
Existing tests cover a bulk of the functionality.
New tests TODO:

- [x] Producer/Consumer tests for submit_embeddings
- [x] Segment tests with batch embeddings

## Documentation Changes
None required. A misc feature idea: a hyperparameter sweep to find the optimal batch_size for your platform.
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Added enhanced examples of how to use `where` filtering with logical operators, based on community questions (see the example below)
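
For reference, a minimal example of the kind of logical-operator filtering the notebook covers (the collection name and data are made up):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("where_demo")
collection.add(
    ids=["1", "2", "3"],
    embeddings=[[0.0, 0.1], [0.1, 0.0], [0.2, 0.2]],
    metadatas=[{"color": "red", "price": 20},
               {"color": "blue", "price": 15},
               {"color": "red", "price": 5}],
    documents=["red shirt", "blue shirt", "red hat"],
)

# $and / $or combine individual metadata clauses; comparison operators
# like $gte work inside each clause.
results = collection.get(
    where={"$and": [{"color": "red"}, {"price": {"$gte": 10}}]}
)
print(results["ids"])  # ["1"]
```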

## Test plan
Run the jupyter notebook
`examples/basic_functionality/where_filtering.ipynb`

## Documentation Changes
No document updates.
## Description of changes
Closes #927 
*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Fix #927 by adding HNSW params to test_vector segment tests
 - New functionality
   - N/A

## Test plan
These changes are themselves tests.

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - chroma-core/docs#98 landed documentation fixes to address this, but it modified documentation that is generated from the code. This PR amends the code so that subsequent generations are correct.
 - New functionality
   - N/A

## Test plan
Not needed.

## Documentation Changes
In v0.4.6 we should regen API docs and release.
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Improves the performance of the duplicate ID validator. I noticed this while trying to add >100k records to a collection. I had duplicates in my ID array (which I didn't know at the time), and I eventually terminated the operation because it was taking so long. I happened to notice in the traceback that it was stuck in the `validate_ids` function, which looks like it might be O(N^2).
   - This PR currently changes just the implementation. That said, it would be easy to display only e.g. the top 10 most common dupes (so as to avoid printing over 1k IDs to console output, as happened in my case :upside_down_face:). Let me know if you would prefer that. (A sketch of the counting approach is below.)
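
A sketch of the counting-based check described above (the function name and the top-10 cutoff are illustrative):

```python
from collections import Counter
from typing import List


def find_duplicate_ids(ids: List[str], top_n: int = 10) -> List[str]:
    # Count each ID once (O(N)) and keep only the IDs that appear more than
    # once, reporting at most the top_n most frequent offenders.
    counts = Counter(ids)
    return [id_ for id_, count in counts.most_common() if count > 1][:top_n]


print(find_duplicate_ids(["a", "b", "a", "c", "b", "a"]))  # ['a', 'b']
```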

## Test plan
*How are these changes tested?*
1. I tested the implementation within a Jupyter notebook on the same
list of IDs that I had passed to `collection.add`, and it isolated the
dupes in about half a second.

![image](https://github.com/chroma-core/chroma/assets/85196623/d6c3ac89-3c8b-44cd-b01c-59bf6cd01a66)

2. I pip installed this branch, tried to add the same items to the
collection, and confirmed that I received the error message without a
long delay.

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?* No docstrings or
documentation changes are needed.
## Description of changes
Proposed fix for #917.

---------

Co-authored-by: Hammad Bashir <[email protected]>
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Similar to #958, this makes the read path of the metadata segment properly use the index, leading to a >10x speedup in query performance when using get(). Since we want $contains to support substring matches, it must use the trigram tokenizer; LIKE-based substring matching is only supported by the trigram tokenizer. Before, this was doing a full table scan. See https://sqlite.org/fts5.html#the_trigram_tokenizer
   - Adds a migration to use the trigram tokenizer (see the sketch below)
   - Adds validation to disallow an empty where document
   - Changes the FTS index to rely on rowid for uniqueness; instead of deleting speculatively on the write path, rely on integrity checks
   - Removes LIKE escaping since it's not supported by FTS and degrades to full table scans
   - Updates tests to treat _ and % as wildcards
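
A hedged sketch of what the trigram-based index looks like (the table name is illustrative; the trigram tokenizer requires SQLite 3.34+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Rebuild the FTS table with the trigram tokenizer so LIKE-style
    -- substring matching ($contains) can be served from the index.
    CREATE VIRTUAL TABLE embedding_fulltext USING fts5(
        string_value,
        tokenize = 'trigram'
    );
    INSERT INTO embedding_fulltext (rowid, string_value)
        VALUES (1, 'the quick brown fox');
    """
)

# With the trigram tokenizer, a LIKE substring query can use the FTS index
# instead of a full table scan; note that _ and % act as wildcards here.
rows = conn.execute(
    "SELECT rowid FROM embedding_fulltext WHERE string_value LIKE ?",
    ("%ick bro%",),
).fetchall()
print(rows)  # [(1,)]
```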

## Test plan
Existing tests

## Documentation Changes
Clarify that where filtering will ignore _ and %.
## Description of changes
The current function looks for the tar.gz file instead of checking whether the folder already exists, so if the tar.gz gets deleted after extraction, it downloads it again. This PR resolves that by checking for the model in the extracted folder before attempting to download or extract again, as sketched below.
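
A minimal sketch of the check-the-folder-first logic (the paths and URL are placeholders, not the actual model locations):

```python
import os
import tarfile
import urllib.request


def ensure_model(model_dir: str, archive_path: str, url: str) -> str:
    # If the extracted model folder is already there, skip download and extraction.
    if os.path.isdir(model_dir):
        return model_dir
    # Otherwise download the archive only if it is not already on disk.
    if not os.path.isfile(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=os.path.dirname(model_dir) or ".")
    return model_dir
```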

## Test plan
By using it.

## Documentation Changes
I didn't find any documentation about how this does the download.
Release 0.4.6
## Description of changes
Currently not all bundlers will choose the correct CJS/ESM build when importing the JS client because we don't define them using [conditional exports](https://nodejs.org/api/packages.html#conditional-exports). This change adds the required configuration to the package.json to fix that.

 - Improvements & Bug fixes
	 - support conditional exports in js client


## Test plan
I have tested this manually. We have a Next.js setup which throws an error with v1.5.6; we use it to detect which build (CJS/ESM) Next.js tries to load. In both cases we expect an error whose import path shows which build Next.js loads the module from.


### Reproduce the issue
- checkout
jeffchuber/nextjs-chroma@2df1a21
- run yarn && yarn dev
- open http://localhost:3000/
- you should see the following error in the browser
```
Failed to compile

./node_modules/chromadb/dist/main/embeddings/WebAIEmbeddingFunction.js:157:0
Module not found: Can't resolve '@visheratin/web-ai'

https://nextjs.org/docs/messages/module-not-found

Import trace for requested module:
./node_modules/chromadb/dist/main/index.js
./src/app/page.js

This error occurred during the build process and can only be dismissed by fixing the error.
```
Next.js loads the CJS build from `dist/main`.

### Confirm the change resolves the issue
- checkout
jeffchuber/nextjs-chroma@2df1a21
- remove the current chromadb package (1.5.6) using `yarn remove chromadb`
- install the chromadb client with the [change in this PR](450159e) using package linking or Verdaccio
- open http://localhost:3000/
- you should see the following error in the browser
```
Failed to compile

./node_modules/chromadb/dist/module/embeddings/WebAIEmbeddingFunction.js:131:0
Module not found: Can't resolve '@visheratin/web-ai'

https://nextjs.org/docs/messages/module-not-found

Import trace for requested module:
./node_modules/chromadb/dist/module/index.js
./src/app/page.js

This error occurred during the build process and can only be dismissed by fixing the error.
```
Next.js now loads the ESM build from `dist/module`.

## Documentation Changes
I added a reference to
https://nodejs.org/api/packages.html#conditional-exports in the commit
message
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - In v0.4.6 we introduced batching at the sqlite layer for the embeddings_queue to maximize throughput of submit_embeddings calls. SQLite has a maximum number of variables it can bind at once, defined at compile time. This change makes the embeddings_queue introspect its compile-time flags to inform its max_batch_size. max_batch_size is a new property added to the Producer class that indicates the maximum batch size submit_embeddings() will accept. Right now we raise the error internally in submit_embeddings in the sqlite impl, but callers could use this to pre-validate input and surface a message as well. (A sketch of the compile-option lookup is below.)
 - New functionality
   - None
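
A hedged sketch of the compile-option lookup (the fallback value and the per-record parameter count are assumptions for illustration):

```python
import sqlite3


def max_batch_size(conn: sqlite3.Connection, params_per_record: int = 8) -> int:
    # SQLITE_MAX_VARIABLE_NUMBER only shows up in PRAGMA compile_options when it
    # was set explicitly at compile time; otherwise fall back to a conservative
    # default (999 on older SQLite builds).
    max_vars = 999
    for (option,) in conn.execute("PRAGMA compile_options;"):
        if option.startswith("MAX_VARIABLE_NUMBER="):
            max_vars = int(option.split("=", 1)[1])
    return max_vars // params_per_record


conn = sqlite3.connect(":memory:")
print(max_batch_size(conn))
```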

## Test plan
Added tests for the below- and above-batch-size cases.

## Documentation Changes
We should update the usage guide to cover this caveat.
## Description of changes

Addresses #948, #583, #970

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Prevent unrestricted deletes and instead point users to more explicit alternatives (see the validation sketch below)
   - Make embeddings_queue properly log the exception in mock_async mode
   - In tests, reraise from embeddings_queue
   - Fix a related bug where delete() on an empty segment throws a misleading error, by checking whether the index in the segment is initialized (#970)
 - New functionality
   - None
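
A minimal sketch of the guard (the exact message and the suggested alternative are illustrative):

```python
from typing import Any, Dict, Optional, Sequence


def validate_delete_request(ids: Optional[Sequence[str]] = None,
                            where: Optional[Dict[str, Any]] = None,
                            where_document: Optional[Dict[str, Any]] = None) -> None:
    # Refuse a delete with no filter at all; point the caller at an explicit
    # alternative instead of silently wiping the collection.
    if not ids and not where and not where_document:
        raise ValueError(
            "delete() requires ids, where, or where_document; "
            "to remove everything, delete and recreate the collection instead."
        )
```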

## Test plan
Added unit tests validating that error cases properly raise errors. Existing tests should make sure delete still works.

## Documentation Changes
None required.
…template (#987)

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Add the contributing guide to the README and a "tests pass" check to the PR template
 - New functionality
   - None

Addresses #968

## Test plan
These are documentation changes. No tests required.

## Documentation Changes
These are documentation changes.
## Description of changes

*Summarize the changes made by this PR.*
 - New functionality
   - Auth Provider client- and server-side abstractions (a rough client-side sketch is below)
   - Basic Auth provider
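
For flavor, a rough client-side sketch of what a basic auth credentials provider does (the class and method names are made up, not the API added by this PR):

```python
import base64
from typing import Dict


class BasicAuthClientProvider:
    """Turns a username/password pair into the Authorization header sent with each request."""

    def __init__(self, username: str, password: str) -> None:
        credentials = f"{username}:{password}".encode("utf-8")
        self._token = base64.b64encode(credentials).decode("utf-8")

    def authenticate(self) -> Dict[str, str]:
        # The server-side counterpart decodes and verifies this header.
        return {"Authorization": f"Basic {self._token}"}


print(BasicAuthClientProvider("admin", "secret").authenticate())
```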

## Test plan
Unit tests for authorized endpoints

## Documentation Changes
Docs should change to describe how to use auth providers on the client
and server. CIP added in `docs/`
- Updated logger level to avoid confusion

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Changed the log level of the message emitted when registering new Auth providers to debug, to avoid confusing developers
   - Changed the status of the CIP to `Accepted`
   - Added `/docs` and `/openapi.json` to the publicly accessible endpoints (this may or may not be a security concern - TBD)

## Test plan

Local tests are passing—no new tests.

## Documentation Changes
N/A
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - The docker-compose will now pick up ENV vars from a supplied `--env-file`

## Test plan

Manual tests done for both with and without env file scenarios

## Documentation Changes
Docs updated accordingly
Release 0.4.7
Note: Cross-version testing was failing because it was loading the config twice, which caused the issue with pydantic.

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- Fixing `chromadb/test/property/test_cross_version_persist.py` test
failures.

## Test plan
*How are these changes tested?*

- [x] Tests pass locally with `pytest` for python

## Documentation Changes
No docs need updating
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- #799 introduced a max version for pydantic but this change was not
propagated to the thin client. This makes the same change in the thin
client.
 - New functionality
	 - ...

## Test plan
Existing tests for client

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
None
Refs: #1027

## Description of changes

*Summarize the changes made by this PR.*
 - New functionality
   - Baseline functionality and tests implemented
   - Example notebook updated
   - Minor refactor of the client creds provider to allow for user-specific credentials fetching

## Test plan
*How are these changes tested?*

- [x] Tests pass locally with `pytest` for python (regression)
- [x] New fixtures added for token-based auth

## Documentation Changes
Docs should be updated to highlight the new supported auth method.
Release 0.4.8
@sunilkumardash9 sunilkumardash9 merged commit c7ab07f into sunilkumardash9:main Aug 29, 2023