
update repo #10

Merged
merged 30 commits into from
Aug 29, 2023

Conversation

sunilkumardash9
Owner

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

atroyn and others added 30 commits August 3, 2023 22:36
## Description of changes

This PR creates a new starter notebook intended to familiarize people
with the very basic, core functionality of embedding retrieval with
Chroma. It's self-contained, and hopefully straightforward and easy to
understand.

There is also a minor fix to the experimental notebook. 

## Test plan

Ran the notebook, also via Colab. 

## Documentation Changes
None. 

## TODO
- [x] #880 Canonical 'chat with your documents'
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Allow FastAPI to consume `None` for add/update
 - New functionality
   - ...

## Test plan
Manually tested by passing `None` to the API. Added a unit test.

## Documentation Changes
None required.
## Description of changes

Release 0.4.5
## Description of changes

#895

Serve registered API routes regardless of whether they end in "/".

In the linked bug @HammadB and I discussed looping over all the routes
at the end of `__init__`, but the typechecker complains: the only thing
we know about `self._app.routes` is that they're instances of
`BaseRoute`, which may not have the fields we need. Rather than do type
coercion I subclassed `fastapi.APIRouter`.

The logic to set `kwargs["include_in_schema"]` is maybe too clever; should we just do an if-else instead?
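
For illustration, a minimal sketch of the subclassing approach (the class name and the exact kwarg handling are illustrative, not necessarily what landed in this PR):

```python
from typing import Any, Callable

from fastapi import APIRouter


class BothSlashesAPIRouter(APIRouter):
    """Hypothetical router that serves every route with and without a trailing slash."""

    def add_api_route(self, path: str, endpoint: Callable[..., Any], **kwargs: Any) -> None:
        # Register the canonical path (no trailing slash) as documented in the schema.
        if path.endswith("/") and len(path) > 1:
            path = path[:-1]
        super().add_api_route(path, endpoint, **kwargs)

        # Register the "/"-suffixed variant as well, hidden from the OpenAPI schema.
        kwargs["include_in_schema"] = False
        super().add_api_route(path + "/", endpoint, **kwargs)
```

With something like this, both `/api/v1/collections` and `/api/v1/collections/` resolve to the same handler, as the curl output below demonstrates.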

## Test plan

```
0 ~ beggers % curl http://localhost:8000/api/v1/collections/ -v
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /api/v1/collections/ HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.1.2
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Thu, 03 Aug 2023 19:14:01 GMT
< server: uvicorn
< content-length: 2
< content-type: application/json
< 
* Connection #0 to host localhost left intact
[]%                                                                                   
0 ~ beggers % curl http://localhost:8000/api/v1/collections -v 
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /api/v1/collections HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.1.2
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Thu, 03 Aug 2023 19:14:02 GMT
< server: uvicorn
< content-length: 2
< content-type: application/json
< 
* Connection #0 to host localhost left intact
[]%
```

Also, `pytest` locally and CI. I'd like to confirm this doesn't break JS
client generation and doc generation -- how can I do that?

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

No documentation changes needed for this.
This cleans up the JS API:
- removes `increment_index`
- removes `createIndex`
- removes `rawSql`
It also bumps the version to 1.5.6 (not yet released) and fixes a type
that changed.
## Description of changes

After the recent changes in the Web AI structure, the embedder function
became invalid because of different imports and because the browser
version of the library is only ESM, as mentioned
[here](#824 (comment)).

To fix the issue, I updated the import paths and changed the `require`
statements to `import` for the browser version of the function.

**Important note:**

When using the embedder in the browser, you should import
`chromadb/dist/module` because the [browser
version](https://www.npmjs.com/package/@visheratin/web-ai) of Web AI is
ESM-only.
For Node.js, you can use the regular `chromadb` import because the [Node.js
version](https://www.npmjs.com/package/@visheratin/web-ai-node) of the
library is CJS.

Additionally, because Web AI recently added support for multimodal embeddings, I added building multimodal embeddings using CLIP-base.

## Test plan

I tested the changes locally for both browser and Node.js. The built `chromadb` package, in case someone wants to test it in their projects, is available [here](https://drive.google.com/file/d/1cNLsHGd1VmiFiamvsEaMVGA7ng56QQKG/view?usp=sharing).

## Documentation Changes

Documentation changes will likely be needed for the multimodal functionality.
- The project will now clone the Chroma repo and deploy docker-compose from it
- Introduced a new TF var `chroma_release` to be able to deploy any (well, almost any) Chroma version (defaults to 0.4.5)
- Added an `instance_public_ip` output var to print the IP address of the VM

Refs: #950

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Made it possible to deploy any version of Chroma through the TF var `chroma_release` (defaults to 0.4.5)
   - Made it so that the VM clones the repo and checks out the release tag, then runs docker compose from it
   - Added output of the public IP of the VM after TF deployment
   - Slightly improved the README.md

## Test plan
*How are these changes tested?*

```bash
gcloud auth application-default login
cd examples/deployments/google-cloud-compute
terraform init
export TF_VAR_project_id=<your_project_id> #take note of this as it must be present in all of the subsequent steps
export TF_VAR_chroma_release=0.4.5 #set the chroma release to deploy
terraform apply -auto-approve
terraform output instance_public_ip
export instance_public_ip=$(terraform output instance_public_ip)
curl -v http://$instance_public_ip:8000/api/v1
terraform destroy -auto-approve
```

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
## Description of changes

*Summarize the changes made by this PR.*
Closes Issue: #923 

Adds the following types available as a top-level import in addition to
the `Collection` class:
- `CollectionMetadata`,
- `Documents`,
- `EmbeddingFunction`,
- `Embeddings`,
- `IDs`,
- `Include`,
- `Metadata`,
- `Where`,
- `QueryResult`,
- `GetResult`,
- `WhereDocument`,
- `UpdateCollectionMetadata`,

## Test plan
*How are these changes tested?*
- No additional tests were added

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
- It might make sense to make clear these types can be used from the top level.
## Description of changes
Previously we were not using the FTS search index correctly. https://sqlite.org/fts5.html#full_text_query_syntax expects that you query using the table name of the FTS table, not the column name. If you want to query by column name, you have to use column filters as discussed in the link above. We opt to take the path suggested in https://sqlite.org/forum/forumpost/1d45a7f6e17a3460 and match on id in addition to filtering that specific column. The query planner leverages this appropriately, as confirmed with EXPLAIN.

Since we were doing speculative delete queries on the assumption that the index was being used, this was incredibly slow; with this change it is much faster.
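
As a hedged illustration of the query shape (the table and column names here are made up, not the exact ones in the segment code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE VIRTUAL TABLE docs_fts USING fts5(string_value);
    INSERT INTO docs_fts (rowid, string_value) VALUES (1, 'hello world'), (2, 'goodbye world');
    """
)

# FTS5 wants the *table* name on the left-hand side of MATCH; restricting the
# match to one column is done with a column filter inside the query string,
# and the extra rowid constraint lets the planner use the FTS index instead
# of scanning the whole virtual table.
rows = conn.execute(
    "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? AND rowid = ?",
    ("string_value: hello", 1),
).fetchall()
print(rows)  # [(1,)]
```

The actual EXPLAIN output before and after the change is shown below.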

Explain before:
`-- SCAN VIRTUAL TABLE INDEX 0:` -> full table scan.

Explain after:
`-- SCAN VIRTUAL TABLE INDEX 0:M2` -> scans the index itself.

The net effect is a large increase in write speed, and the write path time no longer grows with table size.

### Quick Benchmark Results
N = 100k uniformly random vectors
D = 128
Metadata = one small key: value pair
Document = randomly generated string of length 100

Added with batch size = 1000

**Without Fix, Overall Time = 469s. Time to add a batch grows linearly
to >8000 ms**
<img width="590" alt="Screenshot 2023-08-09 at 5 53 24 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/89dde745-9231-4f3f-b62c-bf8486f7e970">

**With Fix, Overall Time = 102s. Time to add a batch grows sublinearly
to ~1200 ms**
<img width="587" alt="Screenshot 2023-08-09 at 5 43 12 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/2a771788-e5d9-4afe-bacb-dfbfb51b6cd1">

We will also want to make sure that the read path leverages this way of
querying. Will address that in a follow up PR.

## Test plan
Existing tests cover the scope of this change.

## Documentation Changes
None required.
## Description of changes
Adds a batch submit_embeddings() to the producer API. Implements this behavior in the sqlite-based embeddings queue. Refactors tests to add a fixture for both single and batch submit and substitutes all use of produce in tests with the fixture so both methods are uniformly tested.

This PR is stacked on #958, please read that before reading this.
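
A rough sketch of what a batched write looks like (the queue table and column names are hypothetical; the real embeddings queue schema differs):

```python
import sqlite3
from typing import Sequence, Tuple


def submit_embeddings_batch(conn: sqlite3.Connection,
                            records: Sequence[Tuple[str, bytes]]) -> None:
    # One batched statement per call instead of one INSERT per record, so the
    # sqlite round-trip cost is paid once per batch.
    conn.executemany(
        "INSERT INTO embeddings_queue (embedding_id, vector) VALUES (?, ?)",
        records,
    )
    conn.commit()
```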

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - This PR makes the sqlite-based embeddings queue batch its calls to sqlite based on the batch size of calls to DML operations.
### Quick Benchmarks
N = 100k
D = 128
metadata = {simple key: simple value}
Document = random 100 character string

**Before - Overall time: 102s**
<img width="587" alt="Screenshot 2023-08-09 at 5 43 12 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/752809a2-78a0-4d5b-a238-fc41ca2635cc">

**After - Overall time: 36s**
<img width="583" alt="Screenshot 2023-08-09 at 7 37 41 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/eb1759c8-8f62-4ed7-ad88-450b3f72692b">

**_Overall, compared to main, this PR and #958 decrease the time of this benchmark from 469s to 36s._**

Todo:
- [x] Clean up code
- [x] Rethink assumptions about submit_embeddings ordering guarantees
- [x] Add tests
- [x] Add batching to other DML - upsert(), update(), delete()

## Test plan
Existing tests cover a bulk of the functionality.
New tests TODO:

- [x] Producer/Consumer tests for submit_embeddings
- [x] Segment tests with batch embeddings

## Documentation Changes
None required. A misc feature idea: a hyperparameter sweep to find the optimal batch_size for your platform.
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Added enhanced examples of how to use `where` filtering with logical operators, based on community questions (see the example below)
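
For reference, a minimal example of the kind of logical-operator filtering the notebook covers (the collection name and data are made up):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("where_demo")
collection.add(
    ids=["1", "2", "3"],
    embeddings=[[0.0, 0.1], [0.1, 0.0], [0.2, 0.2]],
    metadatas=[{"color": "red", "price": 20},
               {"color": "blue", "price": 15},
               {"color": "red", "price": 5}],
    documents=["red shirt", "blue shirt", "red hat"],
)

# $and / $or combine individual metadata clauses; comparison operators
# like $gte work inside each clause.
results = collection.get(
    where={"$and": [{"color": "red"}, {"price": {"$gte": 10}}]}
)
print(results["ids"])  # ["1"]
```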

## Test plan
Run the jupyter notebook
`examples/basic_functionality/where_filtering.ipynb`

## Documentation Changes
No document updates.
## Description of changes
Closes #927 
*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Fix #927 by adding HNSW params to test_vector segment tests
 - New functionality
   - N/A

## Test plan
These changes are themselves tests.

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - chroma-core/docs#98 landed documentation fixes to address this, but it modified documentation that is generated from the code. This PR amends the code so that subsequent generations are correct.
 - New functionality
   - N/A

## Test plan
Not needed.

## Documentation Changes
In v0.4.6 we should regen API docs and release.
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Improves the performance of the duplicate ID validator. I noticed this while trying to add >100k records to a collection. I had duplicates in my ID array (which I didn't know at the time), and I eventually terminated the operation because it was taking so long. I happened to notice in the traceback that it was stuck in the `validate_ids` function, which looks like it might be O(N^2).
   - This PR currently changes just the implementation. That said, it would be easy to display only e.g. the top 10 most common dupes (so as to avoid printing over 1k IDs to console output, as happened in my case :upside_down_face:). Let me know if you would prefer that. (A sketch of the counting approach is below.)
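
A sketch of the counting-based check described above (the function name and the top-10 cutoff are illustrative):

```python
from collections import Counter
from typing import List


def find_duplicate_ids(ids: List[str], top_n: int = 10) -> List[str]:
    # Count each ID once (O(N)) and keep only the IDs that appear more than
    # once, reporting at most the top_n most frequent offenders.
    counts = Counter(ids)
    return [id_ for id_, count in counts.most_common() if count > 1][:top_n]


print(find_duplicate_ids(["a", "b", "a", "c", "b", "a"]))  # ['a', 'b']
```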

## Test plan
*How are these changes tested?*
1. I tested the implementation within a Jupyter notebook on the same
list of IDs that I had passed to `collection.add`, and it isolated the
dupes in about half a second.

![image](https://github.com/chroma-core/chroma/assets/85196623/d6c3ac89-3c8b-44cd-b01c-59bf6cd01a66)

2. I pip installed this branch, tried to add the same items to the
collection, and confirmed that I received the error message without a
long delay.

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?* No docstrings or
documentation changes are needed.
## Description of changes
Proposed fix for #917.

---------

Co-authored-by: Hammad Bashir <[email protected]>
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Similar to #958, this makes the read path of the metadata segment properly use the index, leading to a >10x speedup in query performance when using get(). Since we want $contains to support substring matches, it must use the trigram tokenizer; LIKE-based substring matching is only supported by the trigram tokenizer. Before, this was doing a full table scan. See https://sqlite.org/fts5.html#the_trigram_tokenizer
   - Adds a migration to use the trigram tokenizer (see the sketch below)
   - Adds validation to disallow an empty where document
   - Changes the FTS index to rely on rowid for uniqueness; instead of deleting speculatively on the write path, rely on integrity checks
   - Removes LIKE escaping since it's not supported by FTS and degrades to full table scans
   - Updates tests to treat _ and % as wildcards
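
A hedged sketch of what the trigram-based index looks like (the table name is illustrative; the trigram tokenizer requires SQLite 3.34+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Rebuild the FTS table with the trigram tokenizer so LIKE-style
    -- substring matching ($contains) can be served from the index.
    CREATE VIRTUAL TABLE embedding_fulltext USING fts5(
        string_value,
        tokenize = 'trigram'
    );
    INSERT INTO embedding_fulltext (rowid, string_value)
        VALUES (1, 'the quick brown fox');
    """
)

# With the trigram tokenizer, a LIKE substring query can use the FTS index
# instead of a full table scan; note that _ and % act as wildcards here.
rows = conn.execute(
    "SELECT rowid FROM embedding_fulltext WHERE string_value LIKE ?",
    ("%ick bro%",),
).fetchall()
print(rows)  # [(1,)]
```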

## Test plan
Existing tests

## Documentation Changes
Clarify that where filtering will ignore _ and %.
## Description of changes
The current function looks for the tar.gz file instead of checking whether the folder already exists, so if the tar.gz gets deleted after extraction, it downloads it again. This PR resolves that by checking for the model in the extracted folder before attempting to download or extract again, as sketched below.
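
A minimal sketch of the check-the-folder-first logic (the paths and URL are placeholders, not the actual model locations):

```python
import os
import tarfile
import urllib.request


def ensure_model(model_dir: str, archive_path: str, url: str) -> str:
    # If the extracted model folder is already there, skip download and extraction.
    if os.path.isdir(model_dir):
        return model_dir
    # Otherwise download the archive only if it is not already on disk.
    if not os.path.isfile(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=os.path.dirname(model_dir) or ".")
    return model_dir
```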

## Test plan
By using it.

## Documentation Changes
I didn't find any documentation about how this does the download.
Release 0.4.6
## Description of changes
Currently not all bundlers will choose the correct CJS/ESM build when importing the JS client because we don't define them using [conditional exports](https://nodejs.org/api/packages.html#conditional-exports). This change adds the required configuration to the package.json to fix that.

 - Improvements & Bug fixes
	 - support conditional exports in js client


## Test plan
I have tested this manually. We have a Next.js setup which throws an error with v1.5.6; we use it to detect which build (CJS/ESM) Next.js tries to load. In both cases we expect an error whose import path shows which build Next.js loads the module from.


### Reproduce the issue
- checkout
jeffchuber/nextjs-chroma@2df1a21
- run yarn && yarn dev
- open http://localhost:3000/
- you should see the following error in the browser
```
Failed to compile

./node_modules/chromadb/dist/main/embeddings/WebAIEmbeddingFunction.js:157:0
Module not found: Can't resolve '@visheratin/web-ai'

https://nextjs.org/docs/messages/module-not-found

Import trace for requested module:
./node_modules/chromadb/dist/main/index.js
./src/app/page.js

This error occurred during the build process and can only be dismissed by fixing the error.
```
Next.js loads the CJS build from `dist/main`.

### Confirm the change resolves the issue
- checkout
jeffchuber/nextjs-chroma@2df1a21
- remove the current chromadb package (1.5.6) using `yarn remove chromadb`
- install the chromadb client with the [change in this PR](450159e) using package linking or Verdaccio
- open http://localhost:3000/
- you should see the following error in the browser
```
Failed to compile

./node_modules/chromadb/dist/module/embeddings/WebAIEmbeddingFunction.js:131:0
Module not found: Can't resolve '@visheratin/web-ai'

https://nextjs.org/docs/messages/module-not-found

Import trace for requested module:
./node_modules/chromadb/dist/module/index.js
./src/app/page.js

This error occurred during the build process and can only be dismissed by fixing the error.
```
Next.js now loads the ESM build from `dist/module`.

## Documentation Changes
I added a reference to
https://nodejs.org/api/packages.html#conditional-exports in the commit
message
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - In v0.4.6 we introduced batching at the sqlite layer for the embeddings_queue to maximize throughput of submit_embeddings calls. SQLite has a maximum number of variables it can bind at once, defined at compile time. This change makes the embeddings_queue introspect its compile-time flags to inform its max_batch_size. max_batch_size is a new property added to the Producer class that indicates the maximum batch size submit_embeddings() will accept. Right now we raise the error internally in submit_embeddings in the sqlite impl, but callers could use this to pre-validate input and surface a message as well. (A sketch of the compile-option lookup is below.)
 - New functionality
   - None
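
A hedged sketch of the compile-option lookup (the fallback value and the per-record parameter count are assumptions for illustration):

```python
import sqlite3


def max_batch_size(conn: sqlite3.Connection, params_per_record: int = 8) -> int:
    # SQLITE_MAX_VARIABLE_NUMBER only shows up in PRAGMA compile_options when it
    # was set explicitly at compile time; otherwise fall back to a conservative
    # default (999 on older SQLite builds).
    max_vars = 999
    for (option,) in conn.execute("PRAGMA compile_options;"):
        if option.startswith("MAX_VARIABLE_NUMBER="):
            max_vars = int(option.split("=", 1)[1])
    return max_vars // params_per_record


conn = sqlite3.connect(":memory:")
print(max_batch_size(conn))
```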

## Test plan
Added tests for the below- and above-batch-size cases.

## Documentation Changes
We should update the usage guide to cover this caveat.
## Description of changes

Addresses #948, #583, #970

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Prevent unrestricted deletes and instead point users to more explicit alternatives (see the validation sketch below)
   - Make embeddings_queue properly log the exception in mock_async mode
   - In tests, reraise from embeddings_queue
   - Fix a related bug where delete() on an empty segment throws a misleading error, by checking whether the index in the segment is initialized (#970)
 - New functionality
   - None
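
A minimal sketch of the guard (the exact message and the suggested alternative are illustrative):

```python
from typing import Any, Dict, Optional, Sequence


def validate_delete_request(ids: Optional[Sequence[str]] = None,
                            where: Optional[Dict[str, Any]] = None,
                            where_document: Optional[Dict[str, Any]] = None) -> None:
    # Refuse a delete with no filter at all; point the caller at an explicit
    # alternative instead of silently wiping the collection.
    if not ids and not where and not where_document:
        raise ValueError(
            "delete() requires ids, where, or where_document; "
            "to remove everything, delete and recreate the collection instead."
        )
```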

## Test plan
Added unit tests validating that error cases properly raise errors. Existing tests should make sure delete still works.

## Documentation Changes
None required.
…template (#987)

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Add the contributing guide to the README and a "tests pass" check to the PR template
 - New functionality
   - None

Addresses #968

## Test plan
These are documentation changes. No tests required.

## Documentation Changes
These are documentation changes.
## Description of changes

*Summarize the changes made by this PR.*
 - New functionality
   - Auth Provider client- and server-side abstractions (a rough client-side sketch is below)
   - Basic Auth provider
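
For flavor, a rough client-side sketch of what a basic auth credentials provider does (the class and method names are made up, not the API added by this PR):

```python
import base64
from typing import Dict


class BasicAuthClientProvider:
    """Turns a username/password pair into the Authorization header sent with each request."""

    def __init__(self, username: str, password: str) -> None:
        credentials = f"{username}:{password}".encode("utf-8")
        self._token = base64.b64encode(credentials).decode("utf-8")

    def authenticate(self) -> Dict[str, str]:
        # The server-side counterpart decodes and verifies this header.
        return {"Authorization": f"Basic {self._token}"}


print(BasicAuthClientProvider("admin", "secret").authenticate())
```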

## Test plan
Unit tests for authorized endpoints

## Documentation Changes
Docs should change to describe how to use auth providers on the client
and server. CIP added in `docs/`
- Updated logger level to avoid confusion

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - Changed the log level of the message emitted when registering new Auth providers to debug, to avoid confusing developers
   - Changed the status of the CIP to `Accepted`
   - Added `/docs` and `/openapi.json` to the publicly accessible endpoints (this may or may not be a security concern - TBD)

## Test plan

Local tests are passing—no new tests.

## Documentation Changes
N/A
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
   - The docker-compose will now pick up ENV vars from a supplied `--env-file`

## Test plan

Manual tests done for both with and without env file scenarios

## Documentation Changes
Docs updated accordingly
Release 0.4.7
Note: Cross-version testing was failing because it was loading the config twice, which caused the issue with pydantic.

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- Fixing `chromadb/test/property/test_cross_version_persist.py` test
failures.

## Test plan
*How are these changes tested?*

- [x] Tests pass locally with `pytest` for python

## Documentation Changes
No docs need updating
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- #799 introduced a max version for pydantic but this change was not
propagated to the thin client. This makes the same change in the thin
client.
 - New functionality
	 - ...

## Test plan
Existing tests for client

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
None
Refs: #1027

## Description of changes

*Summarize the changes made by this PR.*
 - New functionality
   - Baseline functionality and tests implemented
   - Example notebook updated
   - Minor refactor of the client creds provider to allow for user-specific credentials fetching

## Test plan
*How are these changes tested?*

- [x] Tests pass locally with `pytest` for python (regression)
- [x] New fixtures added for token-based auth

## Documentation Changes
Docs should be updated to highlight the new supported auth method.
Release 0.4.8
@sunilkumardash9 sunilkumardash9 merged commit c7ab07f into sunilkumardash9:main Aug 29, 2023