Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Story - Supporting row operators on nested types #10186

Closed
devavret opened this issue Feb 1, 2022 · 4 comments
Closed

[FEA] Story - Supporting row operators on nested types #10186

devavret opened this issue Feb 1, 2022 · 4 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@devavret
Copy link
Contributor

devavret commented Feb 1, 2022

There have been several requests to enable row operators on nested types. This issue is to track all related issues as a story.

There are three types of row operators we need to support (equality comparison ==, lexicographic comparison <, and hashing #) on two different nested types (LIST and STRUCT).

Status Issue Type Operation Notes
✔️ #8683 STRUCT == Solved by flattening
NULL_MIN and NULL_MAX still outstanding, see #11520 #8964 STRUCT ==, < Proposed solution by flattening #9452
waiting on cuDF list index support, see #8039 #8039 LIST (== + #) / (== + <) Requires == + # for hash groupby or == + < for sort groupby. Also a Spark req #10181
waiting on min/max struct in hash-groupby #8974 STRUCT < Individual action items being solved with flattening
✔️ #5890 LIST < Plain list sorting. Spark req #10184
awaiting cuDF troubleshooting, see #6784 #6784 LIST (== + #) / (== + <) drop_duplicates uses (== + <) right now but will be optimized to use (== + #) in #10030
✔️ #9119 STRUCT # Needs a struct_device_view
✔️ #10378 LIST # We have list hashing, Spark-compatible Murmur3 hashing for lists
Proposal in #11222 #10408 LIST of STRUCT < Different from groupby on list because here the list<struct> column is values, not keys

The plan

This will be supported using multiple PRs, first covering 1-table row comparators and hashing for nested types, then extending the row comparators with 2-table versions:

PR Column Type Operation Number of tables Dependencies Notes
#10164 Struct < 1 Refactor of existing functionality. Introduces new owning operator API
#10289 List + struct (arbitrary nesting) == 1 #10164, #10291 Works only for "sanitized list" and structs with nulls pushed down
#10641 List + struct (arbitrary nesting) # 1 #10289 Same requirements as #10289
#10883 List + struct (arbitrary nesting) == 2 #10289, Same requirements as #10289, also see #10508
#10730 Struct < 2 #10164 #10508
#11129 List < 1 #10164, #10291 Should work for struct of list but won't work for list of structs
#11292 List Spark-# 1
List of struct < 1 Proposal in #11222
Struct of list < 1 We expected #11129 would cover these types, but there is an unresolved issue
@devavret devavret added feature request New feature or request Needs Triage Need team to review and classify labels Feb 1, 2022
@devavret
Copy link
Contributor Author

devavret commented Feb 1, 2022

We would need to figure out how to pass column_device_view information as part of the object.

This may be achieved by specializing element_hasher for list_view and passing list_device_view to hash_function. Then instead of defining MurmurHash3_32<cudf::list_view> we can define MurmurHash3_32<cudf::list_device_view>

@github-actions
Copy link

github-actions bot commented Mar 5, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Apr 13, 2022
This PR implements equality comparator for LIST columns. This only supports "self" comparison for now, meaning the two rows to be compared should belong to the same table. A comparator that works on rows of two different tables will be implemented in another PR.

This works only on "sanitized" list columns. See #10291 for details. 

This will partially support #10186.

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Jordan Jacobelli (https://github.com/Ethyling)

URL: #10289
rapids-bot bot pushed a commit that referenced this issue Apr 29, 2022
rapids-bot bot pushed a commit that referenced this issue May 25, 2022
Related to #8039 and #10181 

Contributes to #10186 

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded 
 while the current behavior treats inner nulls as **unequal**.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)

URL: #10770
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 24, 2022
@sameerz
Copy link
Contributor

sameerz commented Aug 26, 2022

Still needed

rapids-bot bot pushed a commit that referenced this issue Sep 12, 2022
Contributes to #10186

This PR enables lexicographic comparisons between list columns. The comparison is robust to arbitrary levels of nesting, but does not yet support lists of (lists of...) structs. The comparison is based on the Dremel encoding already in use in the Parquet file format. To assist reviewers, here is a reasonably complete list of the changes in this PR:
1. A helper function to get per-column Dremel data (for list columns) when constructing a preprocessed table, which now owns the Dremel data.
2. Updating the set of lexicographically compatible columns to now include list columns as long as they do not have any nested structs within.
3. An implementation of `lexicographic::device_row_comparator::operator()` for lists. **This function is the heart of the change to enable comparisons between list columns.**
4. A new benchmark for sorting that uses list data.
5. An update to a preexisting rolling collect set test that previously failed (because it requires list comparison) but now works.
6. New tests for list comparison.

Authors:
  - Devavret Makkar (https://github.com/devavret)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11129
@GregoryKimball
Copy link
Contributor

This story issue has served us well! I'm closing it in favor of #11844 where we will continue this work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

3 participants