[FEA] Support lists as groupby keys #8039
Comments
Related to #8039 and #10181
Contributes to #10186

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded while the current behavior treats inner nulls as **unequal**.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Devavret Makkar (https://github.com/devavret)

URL: #10770
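To make the breaking change concrete, here is a minimal pure-Python sketch of the two inner-null semantics described above. It is not libcudf code: `group_sum`, the `inner_nulls_equal` flag, and the sample data are all hypothetical, `None` stands in for an inner null, and the exclusion of top-level null rows is not modelled.

```python
from collections import defaultdict

# Hypothetical data: list-typed keys, with None standing in for an inner null.
keys = [[1, None], [1, None], [2, 3], [2, 3], [1, 4]]
vals = [10, 20, 30, 40, 50]


def group_sum(keys, vals, inner_nulls_equal=True):
    """Sum vals per list key (illustrative helper, not a cuDF/libcudf API)."""
    groups = defaultdict(int)
    for row, (key, val) in enumerate(zip(keys, vals)):
        has_inner_null = any(k is None for k in key)
        if has_inner_null and not inner_nulls_equal:
            # Old behaviour: a key with an inner null never matches another
            # row, so tag it with its row index to keep it in its own group.
            group_key = (tuple(key), row)
        else:
            # New behaviour: inner nulls compare equal, so identical lists
            # (nulls included) share a group.
            group_key = (tuple(key), None)
        groups[group_key] += val
    return [(list(k), total) for (k, _), total in groups.items()]


equal = group_sum(keys, vals, inner_nulls_equal=True)
unequal = group_sum(keys, vals, inner_nulls_equal=False)
```

With inner nulls treated as equal, the two `[1, None]` rows collapse into one group; with the old semantics, each stays in its own group.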
This is partially resolved by #10770. The remaining work to close this issue is to implement lexicographical comparators |
This looks closer after merging #11129, but it is not quite working yet. Sorting on list columns works:
but groupby does not, due to a cuDF error. Perhaps list types aren't supported in the index, as groupby would require
|
Correct -- there's no
For now I think (2) would be the simpler/faster option. |
Note: Implementing the |
Understood. I'd be hesitant to expand our API surface area by introducing a new |
@randerzander, is this request still a high priority for you? @PointKernel and I have been experimenting with exposing list-groupby in Python and have run into a few different issues. Notably, even Pandas does not support the example snippet first posted, so we have nothing to test against:

import pandas as pd
df = pd.DataFrame({
'id': [0, 1],
'id_lst': [[0, 0], [1, 1]],
'val': [0, 1]
})
df.groupby(['id', 'id_lst']).val.sum()
# TypeError: unhashable type: 'list'

Supporting grouping by lists in Python will need us to define and document semantics different from Pandas, and involves some non-trivial refactoring of our code. I'm trying to understand if there's a pressing need to do that. |
It's not high priority, no |
I'd like to be able to use lists as keys in a groupby:
Result:
One workaround, once string list concatenation merges, is to convert `id_lst` to a tokenized string and use the string representation as the grouping key.
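The workaround can be sketched roughly as follows, using pandas as a stand-in for cuDF. The `id_lst_key` column name, the `","` separator, and the extra third row are assumptions made for illustration; the separator is only safe if it cannot appear inside an element.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 1],
    "id_lst": [[0, 0], [1, 1], [1, 1]],
    "val": [0, 1, 2],
})

# Encode each list as a delimited string so it becomes a hashable key.
# Assumption: "," never occurs inside an element, so the encoding is unambiguous.
df["id_lst_key"] = df["id_lst"].map(lambda xs: ",".join(str(x) for x in xs))

# Group on the string representation instead of the list column itself.
result = df.groupby(["id", "id_lst_key"])["val"].sum()
```

The grouped keys are strings, so the original lists would have to be recovered separately (for example by splitting on the separator) if they are needed downstream.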