Add lists to embedding metadata #840

Closed

Conversation

Russell-Pollari
Contributor

Description of changes

  • Improvements & Bug fixes
  • New functionality
    • Can add lists of strings, ints, or floats to embedding metadata
    • Can use $contains operator on lists in where filters

Examples:

# Adding lists
collection.add(
    embeddings=[[1.0, 1.1], [2.0, 2.2]],
    ids=["id_1", "id_2"],
    metadatas=[
        {
            "strings": ["a", "b", "c"],
            "ints": [1, 2, 3],
            "floats": [1.0, 2.0, 3.0],
        },
        {
            "strings": ["d", "e", "f"],
            "ints": [4, 5, 6],
            "floats": [4.0, 5.0, 6.0],
        },
    ],
)

# Updating lists
collection.update(
    ids=["id_1", "id_2"],
    metadatas=[
        {
            "strings": ["a_1", "b_1", "c_1"],
        },
        {
            "strings": ["d_2", "e_2", "f_2"],
            "ints": [9, 10],
        },
    ],
)

# where filters
items = collection.get(where={"strings": {"$contains": "a_1"}})
items = collection.get(
    where={"$or": [{"strings": {"$contains": "a_1"}}, {"ints": {"$contains": 9}}]}
)
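To make the intended filter semantics concrete, here is a minimal in-memory sketch of how `$contains` and `$or` could be evaluated against one metadata dict. The `matches` helper is hypothetical and is not Chroma's implementation (the PR itself translates filters to SQL); it only illustrates the behaviour the examples above rely on.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Hypothetical reference evaluator for a small subset of where filters."""
    for key, cond in where.items():
        if key == "$or":
            # At least one sub-filter must match.
            if not any(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$and":
            # Every sub-filter must match.
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif isinstance(cond, dict) and "$contains" in cond:
            # $contains only matches when the stored value is a list.
            stored = metadata.get(key)
            if not (isinstance(stored, list) and cond["$contains"] in stored):
                return False
        else:
            # Plain equality on a scalar value.
            if metadata.get(key) != cond:
                return False
    return True


example = {"strings": ["a_1", "b_1", "c_1"], "ints": [1, 2, 3]}
hit = matches(example, {"strings": {"$contains": "a_1"}})
miss = matches(
    example,
    {"$or": [{"strings": {"$contains": "zzz"}}, {"ints": {"$contains": 9}}]},
)
```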

Test plan

Existing tests + new tests passing locally

Documentation Changes

Docs will need an update describing metadata and where filters

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

string_value TEXT,
float_value REAL,
int_value INTEGER,
FOREIGN KEY (id, key) REFERENCES embedding_metadata(id, key) ON DELETE CASCADE
Contributor Author

I couldn't get this delete cascade to work. So I manually delete them similar to how embedding_metadata is deleted.

Should I remove the ON DELETE or spend some time figuring it out?
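One possible cause, which this thread does not confirm: SQLite leaves foreign key enforcement off unless each connection runs `PRAGMA foreign_keys = ON`, and a composite foreign key also needs a matching PRIMARY KEY or UNIQUE constraint on the parent columns. The sketch below shows the difference with Python's `sqlite3` module. The child table name `embedding_metadata_lists` and the `cascade_demo` helper are assumptions made for illustration.

```python
import sqlite3


def cascade_demo(enable_foreign_keys: bool) -> int:
    """Return how many child rows survive deleting their parent row."""
    conn = sqlite3.connect(":memory:")
    if enable_foreign_keys:
        # Must be issued per connection, outside of a transaction.
        conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE embedding_metadata (
            id INTEGER,
            key TEXT,
            PRIMARY KEY (id, key)
        );
        -- Hypothetical child table holding one row per list element.
        CREATE TABLE embedding_metadata_lists (
            id INTEGER,
            key TEXT,
            string_value TEXT,
            float_value REAL,
            int_value INTEGER,
            FOREIGN KEY (id, key)
                REFERENCES embedding_metadata(id, key) ON DELETE CASCADE
        );
        INSERT INTO embedding_metadata VALUES (1, 'strings');
        INSERT INTO embedding_metadata_lists (id, key, string_value)
            VALUES (1, 'strings', 'a'), (1, 'strings', 'b');
        """
    )
    conn.execute("DELETE FROM embedding_metadata WHERE id = 1")
    remaining = conn.execute(
        "SELECT COUNT(*) FROM embedding_metadata_lists"
    ).fetchone()[0]
    conn.close()
    return remaining


rows_with_pragma = cascade_demo(True)      # cascade removes the child rows
rows_without_pragma = cascade_demo(False)  # child rows are left behind
```

If the pragma turns out to be the issue, the manual delete could stay as a fallback for connections that do not enable it.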

@@ -235,6 +271,19 @@ def _update_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> No
sql, params = get_sql(q)
cur.execute(sql, params)

lists_to_update = [k for k, v in metadata.items() if isinstance(v, list)]
Contributor Author

@Russell-Pollari Russell-Pollari Jul 18, 2023

if a metadata key for a list is updated, first delete the old list before inserting the new elements
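The delete-then-insert step described here can be sketched roughly as follows. This is not the PR's code (which builds its SQL through a query builder); the table name `embedding_metadata_lists` and the `replace_list` helper are assumptions, and only the three typed value columns from the schema above are used.

```python
import sqlite3
from typing import List, Union

ListValue = Union[str, int, float]


def replace_list(
    conn: sqlite3.Connection, id: int, key: str, values: List[ListValue]
) -> None:
    """Replace the stored list for (id, key) with `values`."""
    rows = [
        (
            id,
            key,
            v if isinstance(v, str) else None,
            v if isinstance(v, float) else None,
            # bool is a subclass of int in Python, so exclude it explicitly.
            v if isinstance(v, int) and not isinstance(v, bool) else None,
        )
        for v in values
    ]
    with conn:  # one transaction: old elements go, new elements arrive
        conn.execute(
            "DELETE FROM embedding_metadata_lists WHERE id = ? AND key = ?",
            (id, key),
        )
        conn.executemany(
            "INSERT INTO embedding_metadata_lists "
            "(id, key, string_value, float_value, int_value) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embedding_metadata_lists ("
    "id INTEGER, key TEXT, string_value TEXT, float_value REAL, int_value INTEGER)"
)
replace_list(conn, 1, "ints", [1, 2, 3])
replace_list(conn, 1, "ints", [9, 10])  # the update replaces, not appends
stored = [
    row[0]
    for row in conn.execute(
        "SELECT int_value FROM embedding_metadata_lists "
        "WHERE id = 1 AND key = 'ints' ORDER BY int_value"
    )
]
```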

@Russell-Pollari
Contributor Author

Update on the way to address conflicts

I think I should also add property/hypothesis tests?

Just want to confirm that my solution here is directionally correct. @HammadB ?

string_value TEXT,
float_value REAL,
int_value INTEGER,
bool_value BOOLEAN,
Contributor Author

would be odd to have a list of bools, but the implementation is simpler if this schema matches embedding_metadata

@LytixDev

🤖 🚀 LGTM 🚀 🤖

@Russell-Pollari
Contributor Author

Working on getting the property tests passing

@Russell-Pollari
Contributor Author

Fixed the failing test, was a result of a bad merge

Working on adding hypothesis strategies to property tests for adding and filtering metadata lists. Will push this week

@sebastiannberg

LGTM

@Russell-Pollari
Contributor Author

Totally forgot about tests for the JS client. A couple of tests expected an error when using $contains in a where filter. Just pushed a fix

@Russell-Pollari
Contributor Author

Have some updated property tests for lists running (and passing) locally. Currently blocked by some mypy errors from the precommit hook

Comment on lines +23 to +24
"WhereOperator",
"LogicalOperator",
Contributor Author

to silence some mypy errors in strategies.py (e.g. types.WhereOperator is not defined)

@@ -308,12 +314,15 @@ def metadata(draw: st.DrawFn, collection: Collection) -> types.Metadata:
if collection.known_metadata_keys:
for key in collection.known_metadata_keys.keys():
if key in metadata:
- del metadata[key]
+ del metadata[key] # type: ignore
Contributor Author

to silence "Mapping[str, Union[str, int, float, bool, List[Union[str, int, float, bool]]]]" has no attribute "__delitem__"

Union[str, int, float, bool, List[Union[str, int, float, bool]]]
],
] = {k: st.just(v) for k, v in collection.known_metadata_keys.items()}
metadata.update(draw(st.fixed_dictionaries({}, optional=sampling_dict))) # type: ignore
Contributor Author

"Mapping[str, Union[str, int, float, bool, List[Union[str, int, float, bool]]]]" has no attribute "update"
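Both errors come from `Mapping` being a read-only interface in the type system. One alternative to `# type: ignore`, assuming an extra copy is acceptable in test code, is to copy into a plain `dict` before mutating. The `Metadata` alias and the `without_keys` helper below are illustrative stand-ins, not the project's actual definitions.

```python
from typing import Dict, Iterable, List, Mapping, Union

Scalar = Union[str, int, float, bool]
MetadataValue = Union[Scalar, List[Scalar]]
Metadata = Mapping[str, MetadataValue]  # read-only view, as mypy sees it


def without_keys(metadata: Metadata, keys: Iterable[str]) -> Dict[str, MetadataValue]:
    """Return a mutable copy of `metadata` with `keys` removed."""
    mutable: Dict[str, MetadataValue] = dict(metadata)  # dict supports del/update
    for key in keys:
        mutable.pop(key, None)  # no error if the key is absent
    return mutable


original: Metadata = {"strings": ["a", "b"], "count": 3}
trimmed = without_keys(original, ["count", "missing"])
trimmed.update({"ints": [1, 2]})  # type-checks because `trimmed` is a Dict
```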

@Russell-Pollari
Contributor Author

@HammadB can you try running the tests again?

@levand
Contributor

levand commented Aug 11, 2023

Hi @Russell-Pollari , thanks so much for the PR. We definitely want to implement this feature so appreciate the contribution.

And, thanks for working on the property tests. Having those rock solid with this new feature is definitely necessary before we merge, so we can make sure we've handled edge cases and have feature parity across all implementations. Much appreciated.

I have one higher-level question before reviewing this code in detail. The PR talks about adding "list" support, but in your schema, I don't see any notion of ordering at the storage layer. Which means we're dealing with logical sets, rather than lists. I think that's ok. Sets fully allow us to answer contains? queries, and then we don't have to worry about the complexities of list ordering.

But in that case, I'd want to make sure that the semantics are precise and tested:

  1. The code and docs should say "set" instead of list consistently.
  2. eq and neq should follow set semantics and ignore order.
  3. Adding an item to a set where it's already present should be a no-op. (And, specifically, if I "add" an item multiple times then "remove" it once, the set should not end up containing the item.)
  4. We should return appropriate set data structures in the relevant languages wherever possible (accepting lists as inputs is fine though.)

Does that make sense?
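Point 3 in the list above can be illustrated at the storage layer. The sketch below assumes a simplified, hypothetical table `metadata_set` with a single `string_value` column, and gets idempotent adds from a UNIQUE constraint plus `INSERT OR IGNORE`. It is not the schema from this PR, whose typed columns are nullable, and SQLite treats NULLs as distinct inside UNIQUE constraints.

```python
import sqlite3
from typing import Set

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metadata_set (
        id INTEGER NOT NULL,
        key TEXT NOT NULL,
        string_value TEXT NOT NULL,
        UNIQUE (id, key, string_value)
    )
    """
)


def add_member(id: int, key: str, value: str) -> None:
    # A second add of the same member violates UNIQUE and is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO metadata_set VALUES (?, ?, ?)", (id, key, value)
    )


def remove_member(id: int, key: str, value: str) -> None:
    conn.execute(
        "DELETE FROM metadata_set WHERE id = ? AND key = ? AND string_value = ?",
        (id, key, value),
    )


def members(id: int, key: str) -> Set[str]:
    # Returned as a Python set, so callers cannot depend on order.
    cursor = conn.execute(
        "SELECT string_value FROM metadata_set WHERE id = ? AND key = ?", (id, key)
    )
    return {row[0] for row in cursor}


add_member(1, "tags", "a")
add_member(1, "tags", "a")  # no-op
add_member(1, "tags", "b")
after_adds = members(1, "tags")
remove_member(1, "tags", "a")  # one remove is enough
after_remove = members(1, "tags")
```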

@andrewmurraydavid

andrewmurraydavid commented Aug 12, 2023

@Russell-Pollari this is great work!
This might be a bit of a tangent question, but if we needed to query where a value is not in a metadata array property, would this still work? Happy to provide more details for use cases, but thought I might pop the question since there currently is support for $neq (in the context of single values).

@Russell-Pollari
Contributor Author

Russell-Pollari commented Aug 13, 2023

@andrewmurraydavid
This PR only introduces $contains for lists. But if this gets merged, would like to follow up with more operators for lists (e.g. $not_contains, $all)

@andrewmurraydavid

andrewmurraydavid commented Aug 13, 2023

@Russell-Pollari got it. I'm trying to understand why this PR wouldn't add $not_contains functionality since that would create consistency between single values and array values operations. If/when the PR is merged, there will be an inconsistency between query operators since only the contains would be available on array values.
Not trying to be pushy, just curious.

@Russell-Pollari
Contributor Author

@andrewmurraydavid There would be some uphill work involved with other operators, and I don't want to add more complexity. It's best to keep PRs to change one thing at a time.

This PR is already sizeable and still needs review by the Chroma team. But if they approve of my approach, I will definitely get to work on more operators

@levand
Contributor

levand commented Aug 15, 2023

@Russell-Pollari

> Since sets are not JSON serializable, there has to be a lot of conversion from lists to sets and vice-versa. The result, IMO, is a worse UX for developers using the API

I think it's ok to use a list in json if we document and treat it like a set (never do anything that depends on order, and assume it has no duplicates.) Developers shouldn't need to be exposed to the conversion, right? And they can still pass in lists (we'll handle the conversion for them, we just need to document that dups will be removed and order won't be maintained.)

I might be missing something though... how do you think this would affect the UX for developers? Are there specific operations that you think would be confusing or ambiguous?

Your approach to adding the list index to the table will certainly work (albeit at the cost of backend complexity/performance)... curious to think through the UX implications more thoroughly before we make that decision.

@Russell-Pollari
Contributor Author

@levand I'm probably overstating the UX issue. But wouldn't this require maintaining separate types for user input and return values? Sending a List and getting a Set in return.

I suppose good documentation can get around this, but sticking to lists would let developers grok all they need just from looking at the Metadata type.

From a code perspective, I also think sticking with lists would be less complex overall (as the changes touch considerably fewer parts of the code base). Though I don't have a good sense of the relative performance costs of each.

@Russell-Pollari Russell-Pollari marked this pull request as draft August 15, 2023 19:24
@levand
Contributor

levand commented Aug 15, 2023

@Russell-Pollari ok. I'm going to defer to @HammadB on the DevX issue since he has a way better intuition than I do about what our users need/want.

I can live with either sets or lists provided we are consistent with the semantics of whichever we choose.

@HammadB
Collaborator

HammadB commented Aug 15, 2023

I agree with @levand that we should preserve the semantics of what we choose (i.e. if we do lists, they should preserve order).

I am not convinced the overhead of users expecting list semantics is worth the implementation cost (however the cost itself is unclear to me). What are the use cases where order matters?

The main use case I think we want to support is this: a user has a bag of ints/floats/strings and wants to check if a value is in the target bag. This pushed me in favor of set semantics but I agree that passing around set() objects is odd. One off-the-cuff solution here is to not use lists OR sets and just use our own type that wraps these objects and provides set semantics.

What I'd like to see is an analysis of the set impl vs the list impl from a performance, use-case and schema perspective. Also I need to think about the distributed case.
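As a rough illustration of the wrapper-type idea, here is a minimal sketch. The `MetadataBag` name and its methods are hypothetical and nothing like it exists in this PR. It accepts any iterable, drops duplicates, compares without regard to order, and serializes to a JSON array so callers never handle `set()` objects directly.

```python
import json
from typing import Dict, Iterable, Iterator, Union

Scalar = Union[str, int, float]


class MetadataBag:
    """Hypothetical wrapper: list-like input, set-like behaviour."""

    def __init__(self, values: Iterable[Scalar] = ()) -> None:
        # A dict keeps first-seen order while dropping duplicates.
        # Caveat: Python treats 1 and 1.0 as the same key.
        self._items: Dict[Scalar, None] = dict.fromkeys(values)

    def __contains__(self, value: object) -> bool:
        return value in self._items

    def __iter__(self) -> Iterator[Scalar]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, MetadataBag):
            return NotImplemented
        # Key views compare like sets, so order is ignored.
        return self._items.keys() == other._items.keys()

    def add(self, value: Scalar) -> None:
        self._items[value] = None  # no-op if already present

    def discard(self, value: Scalar) -> None:
        self._items.pop(value, None)

    def to_json(self) -> str:
        # Sets are not JSON serializable, so emit a plain array.
        return json.dumps(list(self._items))


bag = MetadataBag(["a", "b", "a"])
same = bag == MetadataBag(["b", "a"])
bag.add("a")
bag.discard("b")
```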

@Russell-Pollari
Contributor Author

> What are the use cases where order matters?

Honestly, I can't think of any. But my thinking, which may be wrong, is that sticking with Lists simplifies the implementation and the required mental models for users while still solving the main use case—which I agree is to check if a value is in the target bag/list.

> What I'd like to see is an analysis of the set impl vs the list impl from a performance, use-case and schema perspective. Also I need to think about the distributed case.

Roger that. I have the lists impl 95% complete. Will confirm all tests pass locally and push to this branch and mark it as ready to review. Will also try and push up another branch with a sets implementation.

Do you have a recommended method(s) to test performance? @HammadB @levand

@Russell-Pollari Russell-Pollari marked this pull request as ready for review August 17, 2023 01:27
@tazarov
Contributor

tazarov commented Aug 19, 2023

@HammadB, a bit of a tangential question related to this. Have you considered using the JSON1 extension (delivered by default with most SQLite distros since 3.9)?

The distinct benefits are:

  • Arbitrary objects as metadata
  • JSONPath expressions
  • Will cover a more extensive set of requirements from end-users with a single change
  • Less technical debt when it comes to multiple implementations for various set/list/dict etc. semantics
  • Potentially extend the supported datatypes without further implementation e.g. dates and timestamps (will need further consideration about the possible use of operators in Where)

Maybe there are things I am missing at this late hour.

@Russell-Pollari
Contributor Author

@tazarov
I think the issue with that approach is that, with arbitrary JSON objects, you lose the ability to create indexes to speed up queries.

@tazarov
Contributor

tazarov commented Aug 21, 2023

> @tazarov I think the issue with that approach is that, with arbitrary JSON objects, you lose the ability to create indexes to speed up queries.

@Russell-Pollari, you might be right, and I don't think most people will be needing indexes over arbitrary data.

Still, SQLite seems to offer a way to create indexes based on JSON data:

CREATE INDEX idx_json_key ON <tbl> (json_extract(<col>, '$.key'));

Perhaps JSON indexes are something that can be added to Chroma further down the road.
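For reference, a small sketch of what the JSON-based approach might look like from Python's `sqlite3` module, assuming the linked SQLite build includes the JSON functions. The `docs` table, the `source` key, and the `ids_containing` helper are made up for illustration. `json_each` expands a JSON array into rows, which gives a `$contains`-style lookup, and the expression index follows the `CREATE INDEX` form above.

```python
import json
import sqlite3
from typing import List, Union

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, metadata TEXT)")
rows = [
    ("id_1", {"source": "a.txt", "strings": ["a_1", "b_1", "c_1"]}),
    ("id_2", {"source": "b.txt", "strings": ["d_2", "e_2", "f_2"], "ints": [9, 10]}),
]
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [(doc_id, json.dumps(meta)) for doc_id, meta in rows],
)

# Expression index on one scalar key inside the JSON document.
conn.execute(
    "CREATE INDEX idx_docs_source ON docs (json_extract(metadata, '$.source'))"
)


def ids_containing(key: str, value: Union[str, int, float]) -> List[str]:
    """Ids whose JSON array at `key` contains `value`."""
    # Assumes `key` is a plain identifier; other keys need JSON path quoting.
    sql = (
        "SELECT DISTINCT docs.id "
        "FROM docs, json_each(docs.metadata, ?) AS item "
        "WHERE item.value = ? ORDER BY docs.id"
    )
    return [row[0] for row in conn.execute(sql, (f"$.{key}", value))]


string_hits = ids_containing("strings", "a_1")
int_hits = ids_containing("ints", 9)
```

Note that the expression index helps equality lookups on `$.source`; the `json_each` containment query still scans every row, which is the indexing concern raised earlier in the thread.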

@Buckler89

@levand @Russell-Pollari Any update on that?

@LazyAIEnjoyer

What is the status on the implementation?

@Russell-Pollari
Contributor Author

Have not had time to explore an alternative implementation with sets.

Will aim to get this branch up to date with main tomorrow

@gururise

gururise commented Sep 18, 2023

Any updates on this? Definitely a very useful feature. Noticed a PR for JS adding $in and $nin support

@gururise

gururise commented Oct 6, 2023

Anything we can do to get this PR merged?

@Russell-Pollari
Contributor Author

Sorry folks, I've been awfully busy and this branch has fallen way behind. Last I heard, the Chroma team was thinking about adding a custom metadata indexing feature, and this would have to fit in with the design choices around that. Going to close this for now.

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding lists to the metadata
10 participants