[FEA] Remaining dictionary column work in libcudf #5963

Closed
17 of 18 tasks
davidwendt opened this issue Aug 13, 2020 · 8 comments

@davidwendt
Contributor

davidwendt commented Aug 13, 2020

This issue attempts to outline the known remaining work to enable the DICTIONARY column type. These items are meant to be fulfilled over multiple PRs.

References to other open dictionary issues:

I did a search for dictionary32 to find all the places where specialization was added to libcudf to simply throw an exception as a place-holder. Here is a list of those APIs (in no particular order):

There are also other libcudf APIs that do not have the place-holder code that will likely need dictionary columns specialization in some way:

  • binaryops
  • unaryops
  • groupby
  • join (keys match)
  • minmax
  • quantile
  • replace
  • replace_nulls
  • rolling
  • scan (not supported)
  • upper_bound, lower_bound (keys match)

The '(keys match)' items mean these could just work with the indices if the two input columns had matching keys.
Most of the other libcudf APIs use gather() which is already implemented for dictionary.

Also, cuIO in general does not support the current dictionary type. The writers, at the very least, could call cudf::dictionary::decode() to convert dictionary columns to non-dictionary columns before writing. Likewise, the readers could call cudf::dictionary::encode() after the initial column is created.
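
A minimal sketch of that writer-side workaround, using the public `cudf::dictionary::decode` API (the pass-through deep copy for non-dictionary columns is just for illustration):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/dictionary/encode.hpp>

#include <memory>

// Convert a DICTIONARY32 column back to its key type so a writer that
// does not understand dictionary columns can consume it.
std::unique_ptr<cudf::column> decode_for_writer(cudf::column_view const& col)
{
  if (col.type().id() == cudf::type_id::DICTIONARY32) {
    return cudf::dictionary::decode(cudf::dictionary_column_view{col});
  }
  return std::make_unique<cudf::column>(col);  // non-dictionary: deep copy as-is
}
```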

Lastly, I have a prototype of an index-type normalizing iterator which I hope could be useful for managing the indices. This would allow the child indices column to be any (unsigned) integer type without requiring adding DICTIONARY8 or DICTIONARY16 types to libcudf.
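
To make the idea concrete, here is a hypothetical sketch of such a normalizing iterator (not the actual prototype): a thrust transform iterator that widens whatever the physical index type is to cudf::size_type, so downstream device code never has to care about the index width.

```cpp
#include <cudf/types.hpp>

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Reads indices of any physical integer width and presents them uniformly
// as cudf::size_type (for device-side use).
template <typename IndexType>
struct normalize_index {
  IndexType const* indices;
  __device__ cudf::size_type operator()(cudf::size_type i) const
  {
    return static_cast<cudf::size_type>(indices[i]);
  }
};

template <typename IndexType>
auto make_normalizing_iterator(IndexType const* indices)
{
  return thrust::make_transform_iterator(thrust::counting_iterator<cudf::size_type>{0},
                                         normalize_index<IndexType>{indices});
}
```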

@davidwendt davidwendt added feature request New feature or request Needs Triage Need team to review and classify labels Aug 13, 2020
@davidwendt davidwendt added the libcudf Affects libcudf (C++/CUDA) code. label Aug 13, 2020
@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 18, 2020
@kkraus14
Collaborator

> Lastly, I have a prototype of an index-type normalizing iterator which I hope could be useful for managing the indices. This would allow the child indices column to be any (unsigned) integer type without requiring adding DICTIONARY8 or DICTIONARY16 types to libcudf.

This is super important for cuDF Python to be able to adopt libcudf dictionary types. Thank you for continuing to think about this problem 😄.

@davidwendt
Contributor Author

I'm starting to see some interesting patterns here that were not covered in the requirements discussion #3535.

When a cudf API accepts two columns, as in copy_range, merge, contains, replace, etc., both columns are required to have the same type. This makes sense since the values are combined to make a new column. For dictionaries, this strictly means both columns must have the same key type.

Furthermore, if both dictionary columns have identical keys, then APIs like contains, lower_bound and upper_bound can work strictly with the indices since their outputs are bool (or integer) columns regardless of the input types. So much of the specialization logic for scatter, replace, contains, etc. has been to first rebuild both input dictionary columns so they have matching keys. Then the resulting dictionary column is built from the new matched keys along with the output of running the original operation on just the indices of the two matched columns. This rebuild is a waste if the dictionary keys already matched. We could add a check to see if the keys match first, but that check would be wasteful too whenever the keys do not match. Should we just make it a requirement for the caller to ensure the keys match in these cases?
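
For reference, the rebuild step can be expressed with existing public APIs by adding each column's keys to the other; a sketch, assuming the documented `cudf::dictionary::add_keys` behavior of merging new keys into the sorted, unique key set:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/dictionary/update_keys.hpp>

#include <memory>
#include <utility>

// Rebuild both dictionary columns over the union of their key sets so that
// equal index values refer to equal key values in either column.
std::pair<std::unique_ptr<cudf::column>, std::unique_ptr<cudf::column>> match_keys(
  cudf::dictionary_column_view const& lhs, cudf::dictionary_column_view const& rhs)
{
  auto matched_lhs = cudf::dictionary::add_keys(lhs, rhs.keys());
  auto matched_rhs = cudf::dictionary::add_keys(rhs, lhs.keys());
  return {std::move(matched_lhs), std::move(matched_rhs)};
}
```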

Also, should these APIs actually require both columns be DICTIONARY type? For example, fill and scatter-scalar only require the scalar argument to match the keys type. It seems reasonable that I would want to scatter a strings column into a dictionary column, for example, or perform a search within a dictionary column using values from a non-dictionary column (or vice versa). That said, it may be more efficient to convert the non-dictionary column to a dictionary with matching keys before performing the operation since the overall code logic may be equivalent.

@kkraus14 @harrism @jrhemstad

@kkraus14
Collaborator

> We could add a check to see if the keys match first, but that check would be wasteful too whenever the keys do not match.

Does a dictionary column allow using a shared pointer or something similar to the keys? Could we do a pointer equivalence check instead of having to do a full equality check?
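
A sketch of such a cheap pre-check: same device buffer, offset, and size implies equal keys, but the converse does not hold, so a negative result would still fall back to a full comparison.

```cpp
#include <cudf/dictionary/dictionary_column_view.hpp>

// Returns true only when both dictionaries view the exact same keys buffer;
// false means "unknown", not "different".
bool keys_trivially_equal(cudf::dictionary_column_view const& lhs,
                          cudf::dictionary_column_view const& rhs)
{
  auto const lkeys = lhs.keys();
  auto const rkeys = rhs.keys();
  return lkeys.head() == rkeys.head() && lkeys.offset() == rkeys.offset() &&
         lkeys.size() == rkeys.size();
}
```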

> Should we just make it a requirement for the caller to ensure the keys match in these cases?

That sounds like a huge footgun from my perspective. Users will end up calling APIs with mismatched dictionaries and getting incorrect results back instead of an error, which isn't acceptable in my mind.

> Also, should these APIs actually require both columns be DICTIONARY type? For example, fill and scatter-scalar only require the scalar argument to match the keys type. It seems reasonable that I would want to scatter a strings column into a dictionary column, for example, or perform a search within a dictionary column using values from a non-dictionary column (or vice versa). That said, it may be more efficient to convert the non-dictionary column to a dictionary with matching keys before performing the operation since the overall code logic may be equivalent.

Yes, the natural user experience would be to provide a scalar or column of the type of the keys, but ideally we can support the paths of key-type, equal dictionary-type, and unequal dictionary-type columns and scalars.

harrism pushed a commit that referenced this issue Oct 20, 2020
Reference #5963

This PR adds dictionary specialization logic to the cudf::merge API.
The code first ensures corresponding dictionary columns have matched keys. The main merge logic itself is unchanged. The specialization code occurs after the row-order index values are created. The specialized dictionary logic uses the row-order vector to merge the indices of the two input dictionary columns and constructs an output dictionary column using the matched keys along with the merged indices.

This PR also includes gtests in the new merge_dictionary_tests.cpp.
harrism pushed a commit that referenced this issue Oct 29, 2020
Reference #5963

This PR adds dictionary specialization logic to the cudf::unary_operation API.
All math and bitwise operations return new dictionary columns. Logical operations return BOOL8 type columns.
This includes a reworked/simplified math_ops.cu and the removal of unary_ops.cuh, which is no longer needed.
Also, the math/logical operation gtests in unary_ops_test.cpp were split out into a new math_ops_test.cpp to simplify updating tests.
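
A usage sketch of the behavior described above (ABS stands in here for any of the math ops):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/unary.hpp>

#include <memory>

// A math op on a dictionary column returns a new dictionary column;
// a logical op such as NOT would instead return a BOOL8 column.
std::unique_ptr<cudf::column> abs_of_dictionary(cudf::column_view const& dict_col)
{
  return cudf::unary_operation(dict_col, cudf::unary_operator::ABS);
}
```
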
harrism pushed a commit that referenced this issue Nov 20, 2020
Reference #5963

Add support for dictionary column types in cudf::minmax. This PR adds support for strings column types as well. The code was refactored to re-use the internal reduce_device function for all supported types. The strings specialization logic just creates a string_scalar result, which requires a copy step to build a new string_scalar object from a string_view (the output of the reduce_device function). The dictionary support just substitutes the keys() column in the detail function right before the type-dispatcher call.
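
That substitution amounts to something like the following sketch: when the input is a dictionary, the keys() child is handed to the type dispatcher so it dispatches on the key type rather than on DICTIONARY32.

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/utilities/traits.hpp>

// Choose the view the type-dispatcher should see: the keys child for a
// dictionary column, the column itself otherwise.
cudf::column_view dispatch_view(cudf::column_view const& input)
{
  return cudf::is_dictionary(input.type()) ? cudf::dictionary_column_view(input).keys()
                                           : input;
}
```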

Added gtests for strings as well as for dictionary with strings keys.
rapids-bot bot pushed a commit that referenced this issue Dec 1, 2020
Reference #5963 

This PR adds dictionary column type support to the set of `cudf::reduce` functions.
This PR depends on utilities added in PR #6651 

Here are the reduce operations that will be included in this PR.
- [x] all
- [x] any
- [x] max
- [x] mean
- [x] median
- [x] min
- [x] nth_element
- [x] product
- [x] quantile
- [x] std
- [x] sum_of_squares
- [x] sum
- [x] unique_count
- [x] var

Authors:
  - davidwendt <[email protected]>

Approvers:
  - Mike Wendt
  - AJ Schmidt
  - Ram (Ramakrishna Prabhu)
  - Karthikeyan

URL: #6666
rapids-bot bot pushed a commit that referenced this issue Jan 5, 2021
Reference #5963 Add dictionary support to groupby.

- [x] argmax
- [x] argmin
- [x] collect
- [x] count
- [x] max
- [x] mean* 
- [x] median
- [x] min
- [x] nth element
- [x] nunique
- [x] quantile
- [x] std*
- [x] sum* 
- [x] var* 

* _not supported due to a CUDA 10.2 compiler segfault_

Authors:
  - davidwendt <[email protected]>

Approvers:
  - Jake Hemstad
  - Karthikeyan

URL: #6585
rapids-bot bot pushed a commit that referenced this issue Jan 28, 2021
Reference #5963 

Add support for dictionary columns to `cudf::rolling_window` (non-UDF)

Rolling aggregations
- [x] min/max
- [x] lead/lag
- [x] counting, row-number

These only require aggregating the dictionary indices and do not need to access the keys.

Authors:
  - David (@davidwendt)

Approvers:
  - Mark Harris (@harrism)
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: #7186
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@harrism
Member

harrism commented Mar 16, 2021

@davidwendt are the checklists up to date in the description?

@davidwendt
Contributor Author

> @davidwendt are the checklists up to date in the description?

Yes.
I'm not sure binary-ops makes sense for dictionary, though maybe there are a few ops that do.
And I was waiting for the rolling-window implementation to settle.
There may have been some new APIs added that probably need to consider dictionary as well.
And I'm sure that whenever the Python integration starts, it will find issues in the checked items above as well.

@davidwendt
Contributor Author

After some discussion with @kkraus14 (late last year), binary-ops for dictionaries may not make sense since the output would not have any correlation with the input -- the keys of the output would likely not match either input dictionary's keys. Perhaps the min/max operations would make the most sense since these could operate on just the indices and the output could be a dictionary with matching keys. The other difficulty is making sure the keys match between the two input columns.

@davidwendt
Contributor Author

Since only binary-ops is left and dictionary support may not make sense there, I'm going to close this issue. We can open separate feature requests if the cudf team thinks dictionary column support should be added for specific binary-ops.
