Add lists to embedding metadata #840

Russell-Pollari · 2023-07-18T15:17:51Z

Description of changes

Improvements & Bug fixes
- closes: Adding lists to the metadata #227
New functionality
- Can add lists of strings, ints, or floats to embedding metadata
- Can use $contains operator on lists in where filters

Examples:

# Adding lists
collection.add(
    embeddings=[[1.0, 1.1], [2.0, 2.2]],
    ids=["id_1", "id_2"],
    metadatas=[
        {
            "strings": ["a", "b", "c"],
            "ints": [1, 2, 3],
            "floats": [1.0, 2.0, 3.0],
        },
        {
            "strings": ["d", "e", "f"],
            "ints": [4, 5, 6],
            "floats": [4.0, 5.0, 6.0],
        },
    ],
)

# Updating lists
collection.update(
    ids=["id_1", "id_2"],
    metadatas=[
        {
            "strings": ["a_1", "b_1", "c_1"],
        },
        {
            "strings": ["d_2", "e_2", "f_2"],
            "ints": [9, 10],
        },
    ],
)

# where filters
items = collection.get(where={"strings": {"$contains": "a_1"}})
items = collection.get(
    where={"$or": [{"strings": {"$contains": "a_1"}}, {"ints": {"$contains": 9}}]}
)
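For readers who want to reason about the filter semantics without a running database, here is a minimal pure-Python sketch of how `$contains` and `$or` could be evaluated against one metadata record. The `matches` function and the type aliases are hypothetical names for illustration. They are not part of chromadb, and the PR's actual evaluation happens in SQL.

```python
from typing import Any, Dict, List, Union

Scalar = Union[str, int, float]
Metadata = Dict[str, Union[Scalar, List[Scalar]]]


def matches(metadata: Metadata, where: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies `where`.

    Hypothetical helper: handles only $contains, $and and $or.
    """
    for key, condition in where.items():
        if key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        else:
            value = metadata.get(key)
            target = condition["$contains"]
            # In this sketch, $contains only matches list-valued keys.
            if not isinstance(value, list) or target not in value:
                return False
    return True


# Two example records, written out by hand for this sketch.
id_1: Metadata = {"strings": ["a_1", "b_1", "c_1"], "ints": [1, 2, 3]}
id_2: Metadata = {"strings": ["d_2", "e_2", "f_2"], "ints": [9, 10]}

where_or = {"$or": [{"strings": {"$contains": "a_1"}}, {"ints": {"$contains": 9}}]}
hits = [name for name, md in (("id_1", id_1), ("id_2", id_2)) if matches(md, where_or)]
```

Here `id_1` matches through its `strings` list and `id_2` through its `ints` list, so both land in `hits`.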

Test plan

Existing tests + new tests passing locally

Documentation Changes

Docs will need an update describing metadata and where filters

github-actions · 2023-07-18T15:18:02Z

Russell-Pollari · 2023-07-18T15:19:48Z

chromadb/migrations/metadb/00002-embedding-metadata.sqlite.sql

+    string_value TEXT,
+    float_value REAL,
+    int_value INTEGER,
+    FOREIGN KEY (id, key) REFERENCES embedding_metadata(id, key) ON DELETE CASCADE


I couldn't get this delete cascade to work. So I manually delete them similar to how embedding_metadata is deleted.

Should I remove the ON DELETE or spend some time figuring it out?
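One possible cause, offered as an assumption since the thread never confirms it: SQLite does not enforce foreign keys (including `ON DELETE CASCADE`) unless `PRAGMA foreign_keys = ON` is set on each connection, and the referenced columns must be covered by a primary key or unique index. The sketch below uses Python's built-in `sqlite3` module. The child table name `embedding_metadata_lists` and the simplified parent schema are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign key constraints unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Simplified stand-in for the real parent table.
    CREATE TABLE embedding_metadata (
        id INTEGER,
        key TEXT NOT NULL,
        PRIMARY KEY (id, key)
    );
    -- Hypothetical child table name, for illustration only.
    CREATE TABLE embedding_metadata_lists (
        id INTEGER,
        key TEXT NOT NULL,
        string_value TEXT,
        float_value REAL,
        int_value INTEGER,
        FOREIGN KEY (id, key) REFERENCES embedding_metadata(id, key) ON DELETE CASCADE
    );
    """
)

conn.execute("INSERT INTO embedding_metadata VALUES (1, 'strings')")
conn.executemany(
    "INSERT INTO embedding_metadata_lists (id, key, string_value) VALUES (?, ?, ?)",
    [(1, "strings", "a"), (1, "strings", "b")],
)

# Deleting the parent row should remove both child rows via the cascade.
conn.execute("DELETE FROM embedding_metadata WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM embedding_metadata_lists").fetchone()[0]
```

If the pragma line is removed, the child rows are left behind, which would look like the cascade "not working".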

Russell-Pollari · 2023-07-18T15:20:54Z

chromadb/segment/impl/metadata/sqlite.py

@@ -235,6 +271,19 @@ def _update_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> No
            sql, params = get_sql(q)
            cur.execute(sql, params)

+        lists_to_update = [k for k, v in metadata.items() if isinstance(v, list)]


if a metadata key for a list is updated, first delete the old list before inserting the new elements
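A minimal sketch of that delete-then-insert step, using plain `sqlite3` rather than the query builder the real segment uses. The table `metadata_lists`, its single untyped `value` column, and the function name `replace_lists` are assumptions made for illustration.

```python
import sqlite3
from typing import Dict, List, Union

Scalar = Union[str, int, float]
UpdateMetadata = Dict[str, Union[Scalar, List[Scalar]]]

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: one row per list element.
conn.execute("CREATE TABLE metadata_lists (id INTEGER, key TEXT, value)")


def replace_lists(cur: sqlite3.Cursor, id: int, metadata: UpdateMetadata) -> None:
    """For each list-valued key, drop the old elements, then insert the new ones."""
    lists_to_update = [k for k, v in metadata.items() if isinstance(v, list)]
    for key in lists_to_update:
        cur.execute(
            "DELETE FROM metadata_lists WHERE id = ? AND key = ?", (id, key)
        )
        cur.executemany(
            "INSERT INTO metadata_lists (id, key, value) VALUES (?, ?, ?)",
            [(id, key, element) for element in metadata[key]],
        )


cur = conn.cursor()
replace_lists(cur, 1, {"strings": ["a", "b", "c"], "ints": [1, 2, 3]})
# Updating only "strings" leaves the stored "ints" elements untouched.
replace_lists(cur, 1, {"strings": ["a_1", "b_1"]})
stored = cur.execute(
    "SELECT key, value FROM metadata_lists WHERE id = 1 ORDER BY key, value"
).fetchall()
```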

chromadb/test/segment/test_metadata.py

Russell-Pollari · 2023-07-26T16:37:06Z

Update on the way to address conflicts

I think I should also add property/hypothesis tests?

Just want to confirm that my solution here is directionally correct. @HammadB ?

Russell-Pollari · 2023-07-26T16:40:30Z

chromadb/migrations/metadb/00003-embedding-metadata.sqlite.sql

+    string_value TEXT,
+    float_value REAL,
+    int_value INTEGER,
+    bool_value BOOLEAN,


would be odd to have a list of bools, but the implementation is simpler if this schema matches embedding_metadata
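One detail worth keeping in mind when routing values to typed columns like these: in Python, `bool` is a subclass of `int`, so the `bool` check has to come first or `True` ends up in the integer column. The helper below is a hypothetical sketch of that routing, not the PR's code.

```python
from typing import Dict, Optional, Union

Scalar = Union[str, int, float, bool]


def typed_columns(value: Scalar) -> Dict[str, Optional[Scalar]]:
    """Map one list element to the typed columns (hypothetical helper)."""
    row: Dict[str, Optional[Scalar]] = {
        "string_value": None,
        "float_value": None,
        "int_value": None,
        "bool_value": None,
    }
    # bool must be tested before int because isinstance(True, int) is True.
    if isinstance(value, bool):
        row["bool_value"] = value
    elif isinstance(value, int):
        row["int_value"] = value
    elif isinstance(value, float):
        row["float_value"] = value
    else:
        row["string_value"] = value
    return row
```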

LytixDev · 2023-07-31T14:16:46Z

🤖 🚀 LGTM 🚀 🤖

Russell-Pollari · 2023-08-07T19:03:56Z

Working on getting the property tests passing

Russell-Pollari · 2023-08-09T00:34:58Z

Fixed the failing test, was a result of a bad merge

Working on adding hypothesis strategies to property tests for adding and filtering metadata lists. Will push this week

sebastiannberg · 2023-08-09T10:39:37Z

LGTM

Russell-Pollari · 2023-08-11T00:34:24Z

Totally forgot about tests for the JS client. A couple of tests expected an error when using $contains in a where filter. Just pushed a fix

Russell-Pollari · 2023-08-11T14:14:11Z

Have some updated property tests for lists running (and passing) locally. Currently blocked by some mypy errors from the precommit hook

Russell-Pollari · 2023-08-11T15:09:09Z

chromadb/api/types.py

+    "WhereOperator",
+    "LogicalOperator",


to silence some mypy errors in strategies.py (e.g. types.WhereOperator is not defined)

Russell-Pollari · 2023-08-11T15:09:50Z

chromadb/test/property/strategies.py

@@ -308,12 +314,15 @@ def metadata(draw: st.DrawFn, collection: Collection) -> types.Metadata:
    if collection.known_metadata_keys:
        for key in collection.known_metadata_keys.keys():
            if key in metadata:
-                del metadata[key]
+                del metadata[key]  # type: ignore


to silence: "Mapping[str, Union[str, int, float, bool, List[Union[str, int, float, bool]]]]" has no attribute "__delitem__"

Russell-Pollari · 2023-08-11T15:10:09Z

chromadb/test/property/strategies.py

+                Union[str, int, float, bool, List[Union[str, int, float, bool]]]
+            ],
+        ] = {k: st.just(v) for k, v in collection.known_metadata_keys.items()}
+        metadata.update(draw(st.fixed_dictionaries({}, optional=sampling_dict)))  # type: ignore


to silence: "Mapping[str, Union[str, int, float, bool, List[Union[str, int, float, bool]]]]" has no attribute "update"
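Both errors come from `Mapping` being a read-only protocol: it declares no `__delitem__` or `update`. As an alternative to `# type: ignore`, one option is to copy into a concrete `Dict` before mutating. This is a sketch under the assumption that a copy is acceptable at that point in the strategy. `drop_known_keys` is a hypothetical helper name.

```python
from typing import Dict, List, Mapping, Union

Value = Union[str, int, float, bool, List[Union[str, int, float, bool]]]
Metadata = Mapping[str, Value]  # read-only view, as in the mypy message


def drop_known_keys(
    metadata: Metadata, known_keys: Mapping[str, Value]
) -> Dict[str, Value]:
    """Return a mutable copy of `metadata` without any key in `known_keys`."""
    mutable: Dict[str, Value] = dict(metadata)  # Dict supports pop/update
    for key in known_keys:
        mutable.pop(key, None)
    return mutable


original: Metadata = {"a": 1, "b": ["x", "y"], "c": True}
trimmed = drop_known_keys(original, {"b": ["x", "y"]})
trimmed.update({"d": 2.5})  # type-checks because `trimmed` is a Dict
```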

Russell-Pollari · 2023-08-11T15:12:33Z

@HammadB can you try running the tests again?

levand · 2023-08-11T16:12:43Z

Hi @Russell-Pollari , thanks so much for the PR. We definitely want to implement this feature so appreciate the contribution.

And, thanks for working on the property tests. Having those rock solid with this new feature is definitely necessary before we merge, so we can make sure we've handled edge cases and have feature parity across all implementations. Much appreciated.

I have one higher-level question before reviewing this code in detail. The PR talks about adding "list" support, but in your schema, I don't see any notion of ordering at the storage layer. Which means we're dealing with logical sets, rather than lists. I think that's ok. Sets fully allow us to answer contains? queries, and then we don't have to worry about the complexities of list ordering.

But in that case, I'd want to make sure that the semantics are precise and tested:

- The code and docs should say "set" instead of "list" consistently.
- eq and neq should follow set semantics and ignore order.
- Adding an item to a set where it's already present should be a no-op. (And, specifically, if I "add" an item multiple times then "remove" it once, the set should not end up containing the item.)
- We should return appropriate set data structures in the relevant languages wherever possible (accepting lists as inputs is fine though).

Does that make sense?
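The set semantics listed above can be sketched in a few lines of Python. The function names are hypothetical and the example only illustrates the expected behaviour. It says nothing about how the storage layer would implement it.

```python
from typing import FrozenSet, Iterable, Union

Scalar = Union[str, int, float]


def as_set(values: Iterable[Scalar]) -> FrozenSet[Scalar]:
    """Accept a list as input, drop duplicates and ordering."""
    return frozenset(values)


def set_eq(a: Iterable[Scalar], b: Iterable[Scalar]) -> bool:
    """eq ignores order and duplicates."""
    return as_set(a) == as_set(b)


def set_add(values: Iterable[Scalar], item: Scalar) -> FrozenSet[Scalar]:
    """Adding an item that is already present is a no-op."""
    return as_set(values) | {item}


def set_remove(values: Iterable[Scalar], item: Scalar) -> FrozenSet[Scalar]:
    """Removing once removes the item, however many times it was added."""
    return as_set(values) - {item}


added_twice = set_add(set_add([], "x"), "x")
after_remove = set_remove(added_twice, "x")
```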

andrewmurraydavid · 2023-08-12T18:40:16Z

@Russell-Pollari this is great work!
This might be a bit of a tangent question, but if we need to query where a value is not in a metadata array property, would this still work? Happy to provide more details for use cases, but thought I might pop the question since there currently is support for $neq (in the context of single values).

Russell-Pollari · 2023-08-13T14:19:18Z

@andrewmurraydavid
This PR only introduces $contains for lists. But if this gets merged, would like to follow up with more operators for lists (e.g. $not_contains, $all)

andrewmurraydavid · 2023-08-13T17:54:43Z

@Russell-Pollari got it. I'm trying to understand why this PR wouldn't add $not_contains functionality since that would create consistency between single values and array values operations. If/when the PR is merged, there will be an inconsistency between query operators since only the contains would be available on array values.
Not trying to be pushy, just curious.

Russell-Pollari · 2023-08-13T19:26:04Z

@andrewmurraydavid There would be some uphill work involved with other operators, and I don't want to add more complexity. It's best to keep PRs to change one thing at a time.

This PR is already sizeable and still needs review by the Chroma team. But if they approve of my approach, I will definitely get to work on more operators

levand · 2023-08-15T18:02:27Z

@Russell-Pollari

Since sets are not JSON serializable, there has to be a lot of conversion from lists to sets and vice-versa. The result, IMO, is a worse UX for developers using the API

I think it's ok to use a list in json if we document and treat it like a set (never do anything that depends on order, and assume it has no duplicates.) Developers shouldn't need to be exposed to the conversion, right? And they can still pass in lists (we'll handle the conversion for them, we just need to document that dups will be removed and order won't be maintained.)

I might be missing something though... how do you think this would affect the UX for developers? Are there specific operations that you think would be confusing or ambiguous?

Your approach to adding the list index to the table will certainly work (albeit at the cost of backend complexity/performance)... curious to think through the UX implications more thoroughly before we make that decision.
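To make the trade-off concrete, here is a minimal sketch of what storing a list index alongside each element could look like, so that order and duplicates survive a round trip. The table name, the `list_index` column name, and the helper functions are assumptions for illustration. The PR's actual schema is not shown in this thread.

```python
import sqlite3
from typing import List, Union

Scalar = Union[str, int, float]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per element, with its position in the list.
conn.execute(
    """
    CREATE TABLE metadata_lists (
        id INTEGER NOT NULL,
        key TEXT NOT NULL,
        list_index INTEGER NOT NULL,
        value,
        PRIMARY KEY (id, key, list_index)
    )
    """
)


def write_list(id: int, key: str, values: List[Scalar]) -> None:
    """Store each element together with its position."""
    conn.executemany(
        "INSERT INTO metadata_lists (id, key, list_index, value) VALUES (?, ?, ?, ?)",
        [(id, key, index, value) for index, value in enumerate(values)],
    )


def read_list(id: int, key: str) -> List[Scalar]:
    """Read the elements back in their original order."""
    rows = conn.execute(
        "SELECT value FROM metadata_lists WHERE id = ? AND key = ? ORDER BY list_index",
        (id, key),
    )
    return [row[0] for row in rows]


write_list(1, "strings", ["c", "a", "b", "a"])
round_trip = read_list(1, "strings")
```

The extra column and the `ORDER BY` on every read are the backend cost mentioned above. A set-based design would drop both.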

Russell-Pollari · 2023-08-15T19:23:53Z

@levand I'm probably overstating the UX issue. But wouldn't this require maintaining separate types for user input and return values? Sending a List and getting a Set in return.

I suppose good documentation can get around this, but sticking to lists would let developers grok all they need just from looking at the Metadata type.

From a code perspective, I also think sticking with lists would be less complex overall (as the changes touch considerably fewer parts of the code base). Though I don't have a good sense of the relative performance costs of each.

levand · 2023-08-15T20:11:20Z

@Russell-Pollari ok. I'm going to defer to @HammadB on the DevX issue since he has a way better intuition than I do about what our users need/want.

I can live with either sets or lists provided we are consistent with the semantics of whichever we choose.

HammadB · 2023-08-15T22:55:41Z

I agree with @levand that we should preserve the semantics of what we choose (i.e., if we do lists, they should preserve order).

I am not convinced the overhead of users expecting list semantics is worth the implementation cost (however the cost itself is unclear to me). What are the use cases where order matters?

The main use case I think we want to support is this: a user has a bag of ints/floats/strings and wants to check whether a value is in the target bag. This pushed me in favor of set semantics, but I agree that passing around set() objects is odd. One off-the-cuff solution here is to use neither lists nor sets and instead use our own type that wraps these objects and provides set semantics.

What I'd like to see is an analysis of the set impl vs the list impl from a performance, use-case and schema perspective. Also I need to think about the distributed case.
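As a rough illustration of the "own type" idea, the sketch below wraps a `frozenset` to get set semantics while still serializing to a plain JSON list. `MetadataSet` and its `to_json` method are hypothetical names. Nothing like this exists in chromadb as far as this thread shows.

```python
import json
from typing import Iterable, List, Union

Scalar = Union[str, int, float]


class MetadataSet:
    """Hypothetical wrapper: set semantics, JSON-friendly output."""

    def __init__(self, items: Iterable[Scalar] = ()) -> None:
        self._items = frozenset(items)  # drops duplicates and order

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, MetadataSet):
            return NotImplemented
        return self._items == other._items

    def __hash__(self) -> int:
        return hash(self._items)

    def to_json(self) -> List[Scalar]:
        # Sort by repr so output is deterministic even for mixed types.
        return sorted(self._items, key=repr)


tags = MetadataSet(["b", "a", "a"])
payload = json.dumps({"tags": tags.to_json()})
```

Users would pass ordinary lists in and get a `MetadataSet` back, which keeps `set()` objects out of the public API at the cost of one more type to document.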

Russell-Pollari · 2023-08-16T13:46:31Z

What are the use cases where order matters?

Honestly, I can't think of any. But my thinking, which may be wrong, is that sticking with Lists simplifies the implementation and the required mental models for users while still solving the main use case—which I agree is to check if a value is in the target bag/list.

What I'd like to see is an analysis of the set impl vs the list impl from a performance, use-case and schema perspective. Also I need to think about the distributed case.

Roger that. I have the lists impl 95% complete. Will confirm all tests pass locally and push to this branch and mark it as ready to review. Will also try and push up another branch with a sets implementation.

Do you have a recommended method(s) to test performance? @HammadB @levand

tazarov · 2023-08-19T19:39:31Z

@HammadB, a bit of a tangential question related to this. Have you considered using the JSON1 extension (delivered by default with most SQLite distros since 3.9)?

The distinct benefits are:

- Arbitrary objects as metadata
- JSONPath expressions
- Will cover a more extensive set of requirements from end-users with a single change
- Less technical debt when it comes to multiple implementations for various set/list/dict etc. semantics
- Potentially extend the supported datatypes without further implementation, e.g. dates and timestamps (will need further consideration about the possible use of operators in Where)

Maybe there are things I am missing at this late hour.

Russell-Pollari · 2023-08-21T13:34:37Z

@tazarov
I think the issue with that approach is that, with arbitrary JSON objects, you lose the ability to create indexes to speed up queries.

tazarov · 2023-08-21T13:50:23Z

@tazarov I think the issue with that approach is that, with arbitrary JSON objects, you lose the ability to create indexes to speed up queries.

@Russell-Pollari, you might be right, and I don't think most people will be needing indexes over arbitrary data.

Still, SQLite seems to offer a way to create indexes based on JSON data:

CREATE INDEX idx_json_key ON <tbl> (json_extract(<col>, '$.key'));

Perhaps JSON indexes are something that can be added to Chroma further down the road.
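A small sketch of the JSON-based idea using Python's `sqlite3`, assuming the bundled SQLite has the JSON functions available (true for most recent builds). The `embeddings` table and its `metadata` column are hypothetical stand-ins for `<tbl>` and `<col>`. The second query shows how a `$contains`-style check on a JSON array could be expressed with `json_each`.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: metadata stored as one JSON document per row.
conn.execute("CREATE TABLE embeddings (id TEXT PRIMARY KEY, metadata TEXT)")
# Expression index on a single JSON key, as in the comment above.
conn.execute(
    "CREATE INDEX idx_json_key ON embeddings (json_extract(metadata, '$.key'))"
)
conn.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [
        ("id_1", json.dumps({"key": "x", "tags": ["a", "b"]})),
        ("id_2", json.dumps({"key": "y", "tags": ["c"]})),
    ],
)

# Equality filter on a scalar JSON key.
by_key = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM embeddings WHERE json_extract(metadata, '$.key') = ?",
        ("x",),
    )
]

# Membership test on a JSON array, one way to express a $contains filter.
contains_a = [
    row[0]
    for row in conn.execute(
        """
        SELECT e.id FROM embeddings AS e
        WHERE EXISTS (
            SELECT 1 FROM json_each(e.metadata, '$.tags')
            WHERE json_each.value = ?
        )
        """,
        ("a",),
    )
]
```

Note that the expression index only helps the scalar lookup. The `json_each` membership test still has to expand each array, which is the indexing concern raised earlier in the thread.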

Buckler89 · 2023-08-31T11:36:52Z

@levand @Russell-Pollari Any update on that?

LazyAIEnjoyer · 2023-09-04T10:47:14Z

What is the status on the implementation?

Russell-Pollari · 2023-09-06T00:52:18Z

Have not had time to explore an alternative implementation with sets.

Will aim to get this branch up to date with main tomorrow

gururise · 2023-09-18T23:34:15Z

Any updates on this? Definitely a very useful feature. Noticed a PR for JS adding $in and $nin support

gururise · 2023-10-06T16:33:45Z

Anything we can do to get this PR merged?

Russell-Pollari · 2023-10-10T15:25:28Z

sorry folks, I've been awfully busy and this branch has fallen way behind. Last I heard the chroma team was thinking about adding a custom metadata indexing feature, and this would have to fit in with the design choices around that. Going to close this for now.

russell-pollari added 4 commits July 18, 2023 10:31

Add migration for embedding metadata lists table

26f9ce4

Update metadata types and validators to allow for lists

5352711

Update sqlite metadata segment to handle lists

5f21db7

Add tests

a0d6dce

Merge branch 'main' into metadata-lists

1a78ff0

Support bools in metadata lists

b853407

jeffchuber mentioned this pull request Aug 4, 2023

support lists in metadatas #754

Closed

gururise mentioned this pull request Aug 6, 2023

[Feature Request]: add new filter options for the retriever like $contains or $in like in Pinecone for list metadata #936

Closed

Fix boolean saving as int

3211d49

Fix failing JS tests

a6a4ea5

russell-pollari added 2 commits August 11, 2023 09:22

Fixup handling of bool lists

ac5fc5f

Fix value criterion

111ff6a

Update hypothesis strategies and tests for metadata lists

84d837d

Preserve list order

0c40ef6

Handle empty lists

dbdf984

russell-pollari added 5 commits August 13, 2023 15:30

Merge branch 'main' into metadata-lists

dd43e68

Fixup handling of empty list

e298a2c

Update types and tests for js client

9385a63

Merge branch 'main' into metadata-lists

31c9556

Rename migrations file

7f14db6

Russell-Pollari marked this pull request as draft August 15, 2023 19:24

Fix property strategies for collection metadata

8ce72e7

Add patch for cross version tests

35f24b2

Russell-Pollari marked this pull request as ready for review August 17, 2023 01:27

Russell-Pollari closed this Oct 10, 2023
