-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add lists to embedding metadata #840
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
string_value TEXT, | ||
float_value REAL, | ||
int_value INTEGER, | ||
FOREIGN KEY (id, key) REFERENCES embedding_metadata(id, key) ON DELETE CASCADE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't get this delete cascade to work. So I manually delete them similar to how embedding_metadata
is deleted.
Should I remove the ON DELETE
or spend some time figuring it out?
@@ -235,6 +271,19 @@ def _update_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> No | |||
sql, params = get_sql(q) | |||
cur.execute(sql, params) | |||
|
|||
lists_to_update = [k for k, v in metadata.items() if isinstance(v, list)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a metadata key for a list is updated, first delete the old list before inserting the new elements
Update on the way to address conflicts I think I should also add property/hypothesis tests? Just want to confirm that my solution here is directionally correct. @HammadB ? |
string_value TEXT, | ||
float_value REAL, | ||
int_value INTEGER, | ||
bool_value BOOLEAN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be odd to have a list of bools, but the implementation is simpler if this schema matches embedding_metadata
🤖 🚀 LGTM 🚀 🤖 |
Working on getting the property tests passing |
Fixed the failing test, was a result of a bad merge Working on adding hypothesis strategies to property tests for adding and filtering metadata lists. Will push this week |
LGTM |
Totally forgot about tests for the JS client. A couple of tests expected an error when using |
Have some updated property tests for lists running (and passing) locally. Currently blocked by some mypy errors from the precommit hook |
"WhereOperator", | ||
"LogicalOperator", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to silence some mypy errors in strategies.py (e.g types.WhereOperator is not defined
)
@@ -308,12 +314,15 @@ def metadata(draw: st.DrawFn, collection: Collection) -> types.Metadata: | |||
if collection.known_metadata_keys: | |||
for key in collection.known_metadata_keys.keys(): | |||
if key in metadata: | |||
del metadata[key] | |||
del metadata[key] # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to silence Mapping[str, Union[str, int, float, bool, List[Union[str, int, float, bool]]]]" has no attribute "__delitem__"
Union[str, int, float, bool, List[Union[str, int, float, bool]]] | ||
], | ||
] = {k: st.just(v) for k, v in collection.known_metadata_keys.items()} | ||
metadata.update(draw(st.fixed_dictionaries({}, optional=sampling_dict))) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mapping[str, Union[str, int, float, bool, List[Union[str, int, float, bool]]]]" has no attribute "update"
@HammadB can you try running the tests again? |
Hi @Russell-Pollari , thanks so much for the PR. We definitely want to implement this feature so appreciate the contribution. And, thanks for working on the property tests. Having those rock solid with this new feature are definitely necessary before we merge, so we can make sure we've handled edge cases and have feature parity across all implementations. Much appreciated. I have one higher-level question before reviewing this code in detail. The PR talks about adding "list" support, but in your schema, I don't see any notion of ordering at the storage layer. Which means we're dealing with logical sets, rather than lists. I think that's ok. Sets fully allow us to answer But in that case, I'd want to make sure that the semantics are precise and tested:
Does that make sense? |
@Russell-Pollari this is great work! |
@andrewmurraydavid |
@Russell-Pollari got it. I'm trying to understand why this PR wouldn't add |
@andrewmurraydavid There would be some uphill work involved with other operators, and I don't want to add more complexity. It's best to keep PRs to change one thing at a time. This PR is already sizeable and still needs review by the Chroma team. But if they approve of my approach, I will definitely get to work on more operators |
I think it's ok to use a list in json if we document and treat it like a set (never do anything that depends on order, and assume it has no duplicates.) Developers shouldn't need to be exposed to the conversion, right? And they can still pass in lists (we'll handle the conversion for them, we just need to document that dups will be removed and order won't be maintained.) I might be missing something though... how do you think this would affect the UX for developers? Are there specific operations that you think would be confusing or ambiguous? Your approach to adding the list index to the table will certainly work (albeit at the cost of backend complexity/performance)... curious to think through the UX implications more thoroughly before we make that decision. |
@levand I'm probably overstating the UX issue. But wouldn't this would require maintaining separate types for user input and return values? Sending a I suppose good documentation can get around this, but sticking to lists would let developers grok all they need just from looking at the From a code perspective, I also think sticking with lists would be less complex overall (as the changes touch considerably less parts of the code base). Though I don't have a good sense of the relative performance costs of each. |
@Russell-Pollari ok. I'm going to defer to @HammadB on the DevX issue since he has a way better intuition than I do about what our users need/want. I can live with with either sets or lists provided we are consistent with the semantics of whichever we choose. |
I agree with @levand that we should preserve the semantics of what we choose (I.E If we do lists, they should preserve order). I am not convinced the overhead of users expecting list semantics is worth the implementation cost (however the cost itself is unclear to me). What are the use cases where order matters? The main use case I think we want to support is I have a bag-of-ints/floats/strings and users want to check if a value is in the target bag. This pushed me in favor of set semantics but I agree that passing around set() objects is odd. One off-the-cuff solution here is to not use lists OR sets and just use our own type that wraps these objects and provides set semantics. What I'd like to see is an analysis of the set impl vs the list impl from a performance, use-case and schema perspective. Also I need to think about the distributed case. |
Honestly, I can't think of any. But my thinking, which may be wrong, is that sticking with
Roger that. I have the lists impl 95% complete. Will confirm all tests pass locally and push to this branch and mark it as ready to review. Will also try and push up another branch with a sets implementation. Do you have a recommended method(s) to test performance? @HammadB @levand |
@HammadB, a bit of a tangential question related to this. Have you considered using The distinct benefits are:
|
@tazarov |
@Russell-Pollari, you might be right, and I don't think most people will be needing indexes over arbitrary data. Still, SQLite seems to offer a way to create indexes based on JSON data: CREATE INDEX idx_json_key ON <tbl> (json_extract(<col>, '$.key')); Perhaps JSON indexes are something that can be added to Chroma further down the road. |
@levand @Russell-Pollari Any update on that? |
What is the status on the implementation? |
Have not had time to explore an alternative implementation with sets. Will aim to get this branch up to date with |
Any updates on this? Definitely a very useful feature. Noticed a PR for JS adding $in and $nin support |
Anything we can do to get this PR merged? |
sorry folks, I've been awfully busy and this branch has fallen way behind. Last I heard the chroma team was thinking about adding a custom metadata indexing feature, and this would have to fit in with the design choices around that. Going to close this for now. |
Description of changes
$contains
operator on lists in where filtersExamples:
Test plan
Existing tests + new tests passing locally
Documentation Changes
Docs will need an update describing metadata and where filters