[FEA] Story - Supporting row operators on nested types #10186
Comments
This may be achieved by specializing |
This issue has been labeled |
This PR implements an equality comparator for LIST columns. It only supports "self" comparison for now, meaning the two rows to be compared must belong to the same table. A comparator that works on rows of two different tables will be implemented in another PR.

This works only on "sanitized" list columns. See #10291 for details.

This will partially support #10186.

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vyas Ramasubramani (https://github.com/vyasr)
- Mike Wilson (https://github.com/hyperbolic2346)
- Jake Hemstad (https://github.com/jrhemstad)
- Jordan Jacobelli (https://github.com/Ethyling)

URL: #10289
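To make the "self" comparison idea concrete, here is a minimal host-side sketch, not the libcudf implementation. `ListColumn`, `SelfListEquality`, and `make_example_column` are hypothetical names, and the offsets-plus-child layout is an assumption borrowed from Arrow-style list columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a LIST<int> column:
// row i holds values[offsets[i] .. offsets[i + 1]).
struct ListColumn {
  std::vector<int> offsets;  // size = number of rows + 1
  std::vector<int> values;   // flattened child elements
  std::vector<bool> valid;   // per-row validity (false = null row)
};

// "Self" comparator: both row indices refer to the same column.
struct SelfListEquality {
  const ListColumn& col;
  bool nulls_are_equal;

  bool operator()(std::size_t lhs, std::size_t rhs) const {
    if (!col.valid[lhs] || !col.valid[rhs]) {
      // Two null rows compare equal only when nulls are treated as equal.
      return !col.valid[lhs] && !col.valid[rhs] && nulls_are_equal;
    }
    int lhs_begin = col.offsets[lhs], lhs_end = col.offsets[lhs + 1];
    int rhs_begin = col.offsets[rhs], rhs_end = col.offsets[rhs + 1];
    // Lists of different lengths can never be equal.
    if (lhs_end - lhs_begin != rhs_end - rhs_begin) { return false; }
    for (int k = 0; k < lhs_end - lhs_begin; ++k) {
      if (col.values[lhs_begin + k] != col.values[rhs_begin + k]) { return false; }
    }
    return true;
  }
};

// Builds the example column [[1, 2], [3], null, [1, 2], null].
inline ListColumn make_example_column() {
  return ListColumn{{0, 2, 3, 3, 5, 5}, {1, 2, 3, 1, 2}, {true, true, false, true, false}};
}
```

With this column, rows 0 and 3 compare equal, rows 0 and 1 do not, and the two null rows compare equal only when `nulls_are_equal` is set.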
Contributes to #10186

Authors:
- Devavret Makkar (https://github.com/devavret)
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)

URL: #10641
Related to #8039 and #10181
Contributes to #10186

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded while the current behavior treats inner nulls as **unequal**.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Devavret Makkar (https://github.com/devavret)

URL: #10770
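The effect of treating inner nulls as equal in a hash-based groupby can be illustrated with a small host-side sketch. This is not how `groupby::hash` is implemented; `ListKey`, `ListKeyHash`, `ListKeyEqual`, and `groupby_sum` are hypothetical names, and `std::unordered_map` stands in for the real hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical LIST<int> key whose elements may be null (std::nullopt).
using ListKey = std::vector<std::optional<int>>;

// Row hasher: combines element hashes; every null element hashes to the same constant,
// so keys that differ only in "which null" land in the same bucket.
struct ListKeyHash {
  std::size_t operator()(const ListKey& key) const {
    std::size_t seed = key.size();
    for (const auto& element : key) {
      std::size_t h = element ? std::hash<int>{}(*element) : 0x9e3779b9u;
      seed ^= h + 0x9e3779b9u + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

// Row equality: two inner nulls compare equal.
struct ListKeyEqual {
  bool operator()(const ListKey& lhs, const ListKey& rhs) const {
    return lhs == rhs;  // two empty std::optional values compare equal
  }
};

// Sums `values` per distinct key, in the spirit of a hash-based groupby sum.
inline std::unordered_map<ListKey, int, ListKeyHash, ListKeyEqual> groupby_sum(
  const std::vector<ListKey>& keys, const std::vector<int>& values)
{
  std::unordered_map<ListKey, int, ListKeyHash, ListKeyEqual> sums;
  for (std::size_t i = 0; i < keys.size(); ++i) { sums[keys[i]] += values[i]; }
  return sums;
}
```

Under this sketch, two rows keyed by `[1, null]` fall into one group. Under the earlier behavior described above, where inner nulls are unequal, they would form separate groups.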
Still needed |
Contributes to #10186

This PR enables lexicographic comparisons between list columns. The comparison is robust to arbitrary levels of nesting, but does not yet support lists of (lists of...) structs. The comparison is based on the Dremel encoding already in use in the Parquet file format.

To assist reviewers, here is a reasonably complete list of the changes in this PR:
1. A helper function to get per-column Dremel data (for list columns) when constructing a preprocessed table, which now owns the Dremel data.
2. Updating the set of lexicographically compatible columns to now include list columns as long as they do not have any nested structs within.
3. An implementation of `lexicographic::device_row_comparator::operator()` for lists. **This function is the heart of the change to enable comparisons between list columns.**
4. A new benchmark for sorting that uses list data.
5. An update to a preexisting rolling collect set test that previously failed (because it requires list comparison) but now works.
6. New tests for list comparison.

Authors:
- Devavret Makkar (https://github.com/devavret)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Mark Harris (https://github.com/harrism)
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)

URL: #11129
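As a reference for what lexicographic ordering of arbitrarily nested lists means, here is a small recursive host-side sketch. It deliberately does not use the Dremel encoding the PR describes; it walks a tree instead. `Nested`, `make_leaf`, `make_list`, and `lexicographic_less` are hypothetical names. Nulls are omitted, and a list that is a strict prefix of another sorts first, which is the `std::lexicographical_compare` convention and an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical nested value: either a leaf integer or a list of nested values.
struct Nested {
  bool is_leaf;
  int leaf;                      // meaningful only when is_leaf is true
  std::vector<Nested> children;  // meaningful only when is_leaf is false
};

inline Nested make_leaf(int value) { return Nested{true, value, {}}; }
inline Nested make_list(std::vector<Nested> children)
{
  return Nested{false, 0, std::move(children)};
}

// Returns true when lhs sorts strictly before rhs.
// Assumes both values have the same nesting shape (leaf vs. list at each level).
inline bool lexicographic_less(const Nested& lhs, const Nested& rhs)
{
  if (lhs.is_leaf) { return lhs.leaf < rhs.leaf; }
  // Compare element by element, recursing into nested lists.
  return std::lexicographical_compare(lhs.children.begin(),
                                      lhs.children.end(),
                                      rhs.children.begin(),
                                      rhs.children.end(),
                                      lexicographic_less);
}
```

For example, `[[1, 2], [3]]` sorts before `[[1, 2], [4]]`, and `[[1]]` sorts before `[[1, 2], [3]]` because `[1]` is a prefix of `[1, 2]`.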
This story issue has served us well! I'm closing it in favor of #11844 where we will continue this work.
There have been several requests to enable row operators on nested types. This issue is to track all related issues as a story.
There are three types of row operators we need to support (equality comparison `==`, lexicographic comparison `<`, and hashing `#`) on two different nested types (`LIST` and `STRUCT`).

Required operators and notes:
- `==`: `NULL_MIN` and `NULL_MAX` still outstanding, see #11520
- `==`, `<`
- (`==` + `#`) / (`==` + `<`): `==` + `#` for hash groupby or `==` + `<` for sort groupby. Also a Spark req #10181
- `<`
- `<`
- (`==` + `#`) / (`==` + `<`): `drop_duplicates` uses (`==` + `<`) right now but will be optimized to use (`==` + `#`) in #10030
- `#`: `struct_device_view`
- `#`
- `<`: `list<struct>` column is values, not keys

The plan
This will be supported using multiple PRs, first covering 1-table row comparators and hashing for nested types, then extending the row comparators with 2-table versions:
- `<`
- `==`
- `#`
- `==`
- `<`
- `<`
- `#`
- `<`
- `<`
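The difference between the 1-table and 2-table comparators in the plan can be pictured with a short host-side sketch. It is an illustration under assumptions, not the libcudf API: `Row`, `Table`, `SelfEquality`, and `TwoTableEquality` are hypothetical names, and a `STRUCT` row is modeled as a plain C++ struct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical STRUCT<int, string> row; a table is modeled as a vector of rows.
struct Row {
  int id;
  std::string name;
};
using Table = std::vector<Row>;

// Field-by-field equality of two struct rows.
inline bool rows_equal(const Row& a, const Row& b) { return a.id == b.id && a.name == b.name; }

// 1-table ("self") comparator: both indices refer to rows of the same table.
struct SelfEquality {
  const Table& table;
  bool operator()(std::size_t i, std::size_t j) const { return rows_equal(table[i], table[j]); }
};

// 2-table comparator: the first index refers to lhs, the second to rhs.
struct TwoTableEquality {
  const Table& lhs;
  const Table& rhs;
  bool operator()(std::size_t i, std::size_t j) const { return rows_equal(lhs[i], rhs[j]); }
};
```

The per-row logic is shared; only the source of each row index changes, which is why the 2-table versions can be layered on after the 1-table versions.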