[FEA] Remaining dictionary column work in libcudf #5963

Closed
17 of 18 tasks
davidwendt opened this issue Aug 13, 2020 · 8 comments

@davidwendt
Contributor

davidwendt commented Aug 13, 2020

This issue attempts to outline the known remaining work to enable the DICTIONARY column type. These items are meant to be fulfilled over multiple PRs.

References to other open dictionary issues:

I did a search for dictionary32 to find all the places where specialization was added to libcudf to simply throw an exception as a place-holder. Here is a list of those APIs (in no particular order):

There are also other libcudf APIs that do not have the place-holder code that will likely need dictionary columns specialization in some way:

  • binaryops
  • unaryops
  • groupby
  • join (keys match)
  • minmax
  • quantile
  • replace
  • replace_nulls
  • rolling
  • scan (not supported)
  • upper_bound, lower_bound (keys match)

The '(keys match)' items mean these could just work with the indices if the two input columns had matching keys.
Most of the other libcudf APIs use gather() which is already implemented for dictionary.

Also, cuIO in general does not support the current dictionary type. The writers, at the very least, could call cudf::dictionary::decode() to convert dictionary columns to non-dictionary columns before writing. Likewise, the readers could call cudf::dictionary::encode() after the initial column is created.
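
A minimal sketch of that writer-side workaround, using the public `cudf::dictionary::decode` API (the pass-through deep copy for non-dictionary columns is just for illustration):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/dictionary/encode.hpp>

#include <memory>

// Convert a DICTIONARY32 column back to its key type so a writer that
// does not understand dictionary columns can consume it.
std::unique_ptr<cudf::column> decode_for_writer(cudf::column_view const& col)
{
  if (col.type().id() == cudf::type_id::DICTIONARY32) {
    return cudf::dictionary::decode(cudf::dictionary_column_view{col});
  }
  return std::make_unique<cudf::column>(col);  // non-dictionary: deep copy as-is
}
```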

Lastly, I have a prototype of an index-type normalizing iterator which I hope could be useful for managing the indices. This would allow the child indices column to be any (unsigned) integer type without requiring adding DICTIONARY8 or DICTIONARY16 types to libcudf.
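
To make the idea concrete, here is a hypothetical sketch of such a normalizing iterator (not the actual prototype): a thrust transform iterator that widens whatever the physical index type is to cudf::size_type, so downstream device code never has to care about the index width.

```cpp
#include <cudf/types.hpp>

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Reads indices of any physical integer width and presents them uniformly
// as cudf::size_type (for device-side use).
template <typename IndexType>
struct normalize_index {
  IndexType const* indices;
  __device__ cudf::size_type operator()(cudf::size_type i) const
  {
    return static_cast<cudf::size_type>(indices[i]);
  }
};

template <typename IndexType>
auto make_normalizing_iterator(IndexType const* indices)
{
  return thrust::make_transform_iterator(thrust::counting_iterator<cudf::size_type>{0},
                                         normalize_index<IndexType>{indices});
}
```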

@davidwendt davidwendt added feature request New feature or request Needs Triage Need team to review and classify labels Aug 13, 2020
@davidwendt davidwendt added the libcudf Affects libcudf (C++/CUDA) code. label Aug 13, 2020
@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 18, 2020
@kkraus14
Collaborator

> Lastly, I have a prototype of an index-type normalizing iterator which I hope could be useful for managing the indices. This would allow the child indices column to be any (unsigned) integer type without requiring adding DICTIONARY8 or DICTIONARY16 types to libcudf.

This is super important for cuDF Python to be able to adopt libcudf dictionary types. Thank you for continuing to think about this problem 😄.

@davidwendt
Contributor Author

I'm starting to see some interesting patterns here that were not covered in the requirements discussion #3535.

When a cudf API accepts two columns, as in copy_range, merge, contains, replace, etc., both columns are required to have the same type. This makes sense since the values are combined to make a new column. For dictionaries, this strictly means both columns must have the same key type.

Furthermore, if both dictionary columns have identical keys, then APIs like contains, lower_bound and upper_bound can work strictly with the indices since their outputs are bool (or integer) columns regardless of the input types. So much of the specialization logic for scatter, replace, contains, etc. has been to first rebuild both input dictionary columns so they have matching keys. Then the resulting dictionary column is built from the new matched keys along with the output of running the original operation on just the indices of the two matched columns. This rebuild is a waste if the dictionary keys already matched. We could add a check to see if the keys match first, but that check would be wasteful too whenever the keys do not match. Should we just make it a requirement for the caller to ensure the keys match in these cases?
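
For reference, the rebuild step can be expressed with existing public APIs by adding each column's keys to the other; a sketch, assuming the documented `cudf::dictionary::add_keys` behavior of merging new keys into the sorted, unique key set:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/dictionary/update_keys.hpp>

#include <memory>
#include <utility>

// Rebuild both dictionary columns over the union of their key sets so that
// equal index values refer to equal key values in either column.
std::pair<std::unique_ptr<cudf::column>, std::unique_ptr<cudf::column>> match_keys(
  cudf::dictionary_column_view const& lhs, cudf::dictionary_column_view const& rhs)
{
  auto matched_lhs = cudf::dictionary::add_keys(lhs, rhs.keys());
  auto matched_rhs = cudf::dictionary::add_keys(rhs, lhs.keys());
  return {std::move(matched_lhs), std::move(matched_rhs)};
}
```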

Also, should these APIs actually require both columns be DICTIONARY type? For example, fill and scatter-scalar only require the scalar argument to match the keys type. It seems reasonable that I would want to scatter a strings column into a dictionary column, for example, or perform a search within a dictionary column using values from a non-dictionary column (or vice versa). That said, it may be more efficient to convert the non-dictionary column to a dictionary with matching keys before performing the operation since the overall code logic may be equivalent.

@kkraus14 @harrism @jrhemstad

@kkraus14
Collaborator

> We could add a check to see if the keys match first, but that check would be wasteful too whenever the keys do not match.

Does a dictionary column allow using a shared pointer or something similar to the keys? Could we do a pointer equivalence check instead of having to do a full equality check?
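
A sketch of such a cheap pre-check: same device buffer, offset, and size implies equal keys, but the converse does not hold, so a negative result would still fall back to a full comparison.

```cpp
#include <cudf/dictionary/dictionary_column_view.hpp>

// Returns true only when both dictionaries view the exact same keys buffer;
// false means "unknown", not "different".
bool keys_trivially_equal(cudf::dictionary_column_view const& lhs,
                          cudf::dictionary_column_view const& rhs)
{
  auto const lkeys = lhs.keys();
  auto const rkeys = rhs.keys();
  return lkeys.head() == rkeys.head() && lkeys.offset() == rkeys.offset() &&
         lkeys.size() == rkeys.size();
}
```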

> Should we just make it a requirement for the caller to ensure the keys match in these cases?

That sounds like a huge footgun from my perspective. Users will end up calling APIs with mismatched dictionaries and getting incorrect results back instead of an error, which isn't acceptable in my mind.

> Also, should these APIs actually require both columns be DICTIONARY type? For example, fill and scatter-scalar only require the scalar argument to match the keys type. It seems reasonable that I would want to scatter a strings column into a dictionary column, for example, or perform a search within a dictionary column using values from a non-dictionary column (or vice versa). That said, it may be more efficient to convert the non-dictionary column to a dictionary with matching keys before performing the operation since the overall code logic may be equivalent.

Yes, the natural user experience would be to provide a scalar or column of the type of the keys, but ideally we can support the paths of key-type, equal dictionary-type, and unequal dictionary-type columns and scalars.

harrism pushed a commit that referenced this issue Oct 20, 2020
Reference #5963

This PR adds dictionary specialization logic to the cudf::merge API.
The code first ensures corresponding dictionary columns have matched keys. The main merge logic itself is unchanged. The specialization code occurs after the row-order index values are created. The specialized dictionary logic uses the row-order vector to merge the indices of the two input dictionary columns and constructs an output dictionary column using the matched keys along with the merged indices.

This PR also includes gtests in the new merge_dictionary_tests.cpp.
harrism pushed a commit that referenced this issue Oct 29, 2020
Reference #5963

This PR adds dictionary specialization logic to the cudf::unary_operation API.
All math and bitwise operations return new dictionary columns. Logical operations return BOOL8 type columns.
This includes a reworked/simplified math_ops.cu and the removal of unary_ops.cuh, which is no longer needed.
Also, the math/logical operation gtests in unary_ops_test.cpp were split out into a new math_ops_test.cpp to simplify updating tests.
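
A usage sketch of the behavior described above (ABS stands in here for any of the math ops):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/unary.hpp>

#include <memory>

// A math op on a dictionary column returns a new dictionary column;
// a logical op such as NOT would instead return a BOOL8 column.
std::unique_ptr<cudf::column> abs_of_dictionary(cudf::column_view const& dict_col)
{
  return cudf::unary_operation(dict_col, cudf::unary_operator::ABS);
}
```
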
harrism pushed a commit that referenced this issue Nov 20, 2020
Reference #5963

Add support for dictionary column types in cudf::minmax. This PR adds support for strings column types as well. The code was refactored to re-use the internal reduce_device function for all supported types. The strings specialization logic just creates a string_scalar result, which requires a copy step to build a new string_scalar object from a string_view (the output of the reduce_device function). The dictionary support just substitutes the keys() column in the detail function right before the type-dispatcher call.
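
That substitution amounts to something like the following sketch: when the input is a dictionary, the keys() child is handed to the type dispatcher so it dispatches on the key type rather than on DICTIONARY32.

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/utilities/traits.hpp>

// Choose the view the type-dispatcher should see: the keys child for a
// dictionary column, the column itself otherwise.
cudf::column_view dispatch_view(cudf::column_view const& input)
{
  return cudf::is_dictionary(input.type()) ? cudf::dictionary_column_view(input).keys()
                                           : input;
}
```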

Added gtests for strings as well as for dictionary with strings keys.
rapids-bot bot pushed a commit that referenced this issue Dec 1, 2020
Reference #5963 

This PR adds dictionary column type support to the set of `cudf::reduce` functions.
This PR depends on utilities added in PR #6651 

Here are the reduce operations that will be included in this PR.
- [x] all
- [x] any
- [x] max
- [x] mean
- [x] median
- [x] min
- [x] nth_element
- [x] product
- [x] quantile
- [x] std
- [x] sum_of_squares
- [x] sum
- [x] unique_count
- [x] var

Authors:
  - davidwendt <[email protected]>

Approvers:
  - Mike Wendt
  - AJ Schmidt
  - Ram (Ramakrishna Prabhu)
  - Karthikeyan

URL: #6666
rapids-bot bot pushed a commit that referenced this issue Jan 5, 2021
Reference #5963 Add dictionary support to groupby.

- [x] argmax
- [x] argmin
- [x] collect
- [x] count
- [x] max
- [x] mean* 
- [x] median
- [x] min
- [x] nth element
- [x] nunique
- [x] quantile
- [x] std*
- [x] sum* 
- [x] var* 

* _not supported due to a CUDA 10.2 compiler segfault_

Authors:
  - davidwendt <[email protected]>

Approvers:
  - Jake Hemstad
  - Karthikeyan

URL: #6585
rapids-bot bot pushed a commit that referenced this issue Jan 28, 2021
Reference #5963 

Add support for dictionary columns to `cudf::rolling_window` (non-UDF)

Rolling aggregations
- [x] min/max
- [x] lead/lag
- [x] counting, row-number

These only require aggregating the dictionary indices and do not need to access the keys.

Authors:
  - David (@davidwendt)

Approvers:
  - Mark Harris (@harrism)
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: #7186
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@harrism
Member

harrism commented Mar 16, 2021

@davidwendt are the checklists up to date in the description?

@davidwendt
Contributor Author

> @davidwendt are the checklists up to date in the description?

Yes.
I'm not sure binary-ops makes sense for dictionary, though maybe there are a few ops that do.
And I was waiting for the rolling-window implementation to settle.
There may have been some new APIs added that probably need to consider dictionary as well.
And I'm sure that whenever the Python integration starts, it will find issues in the checked items above as well.

@davidwendt
Contributor Author

After some discussion with @kkraus14 (late last year), binary-ops for dictionaries may not make sense since the output would not have any correlation with the input -- the keys of the output would likely not match either input dictionary's keys. Perhaps the min/max operations would make the most sense since these could operate on just the indices and the output could be a dictionary with matching keys. The other difficulty is making sure the keys match between the two input columns.

@davidwendt
Contributor Author

Since only binary-ops is left and dictionary support may not make sense there, I'm going to close this issue. We can open separate feature requests if the cudf team thinks dictionary column support should be added for specific binary-ops.
