[FEA] Remaining dictionary column work in libcudf #5963
Comments
This is super important for cuDF Python to be able to adopt libcudf dictionary types. Thank you for continuing to think about this problem 😄.
I'm starting to see some interesting patterns here that were not covered in the requirements discussion #3535. When a cudf API accepts two columns like in Furthermore, if both dictionary columns have identical keys then APIs like Also, should these APIs actually require both columns be DICTIONARY type? For example,
Does a dictionary column allow using a shared pointer or something similar to the keys? Could we do a pointer equivalence check instead of having to do a full equality check?
That sounds like a huge footgun from my perspective. Users will end up calling APIs with mismatched dictionaries, and getting incorrect results back instead of an error isn't acceptable in my mind.
Yes, the natural user experience would be to provide a scalar or column of the type of the keys, but ideally we can support the paths of key-type, equal dictionary-type, and unequal dictionary-type columns and scalars.
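To make the pointer-equivalence idea from this exchange concrete, here is a minimal sketch in plain C++. It is a toy model, not libcudf code: `ToyDictionary` and `keys_match` are hypothetical names, and holding the keys in a `std::shared_ptr` is an assumption made only for illustration. The check takes the cheap path when both columns share one keys buffer and falls back to a full comparison otherwise, so mismatched keys are detected rather than silently producing wrong results.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a dictionary column (hypothetical; not the libcudf type).
struct ToyDictionary {
  std::shared_ptr<const std::vector<std::string>> keys;  // sorted, unique keys
  std::vector<std::uint32_t> indices;                    // one key index per row
};

// Returns true when both columns can safely be processed on indices alone.
bool keys_match(const ToyDictionary& a, const ToyDictionary& b) {
  if (a.keys == b.keys) return true;     // same buffer (or both empty): O(1)
  if (!a.keys || !b.keys) return false;  // exactly one side has no keys
  return *a.keys == *b.keys;             // distinct buffers: full O(n) compare
}
```

A caller could throw when `keys_match` returns false, or match the keys first and then proceed on the indices.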
Reference #5963 This PR adds dictionary specialization logic to the cudf::merge API. The code first ensures corresponding dictionary columns have matched keys. The main merge logic itself is unchanged. The specialization code occurs after the row-order index values are created. The specialized dictionary logic uses the row-order vector to merge the indices of the two input dictionary columns and constructs an output dictionary column using the matched keys along with the merged indices. This PR also includes gtests in the new merge_dictionary_tests.cpp
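The two steps described in this PR (match the keys, then merge only the indices) can be sketched in plain C++ as follows. This is a toy model under stated assumptions, not the PR's implementation: `ToyDictionary`, `set_keys`, and `merge_sorted` are hypothetical names, keys are assumed sorted and unique, each input's rows are assumed already sorted by value, and `std::merge` stands in for the row-order vector that the real merge logic produces.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Toy stand-in for a dictionary column (hypothetical; not the libcudf type).
struct ToyDictionary {
  std::vector<std::string> keys;       // sorted, unique
  std::vector<std::uint32_t> indices;  // row i holds keys[indices[i]]
};

// Re-express `col` against `new_keys`, which must be sorted and contain
// every key of `col`.
ToyDictionary set_keys(const ToyDictionary& col, const std::vector<std::string>& new_keys) {
  std::vector<std::uint32_t> remap(col.keys.size());
  for (std::size_t k = 0; k < col.keys.size(); ++k) {
    auto it  = std::lower_bound(new_keys.begin(), new_keys.end(), col.keys[k]);
    remap[k] = static_cast<std::uint32_t>(it - new_keys.begin());
  }
  ToyDictionary out{new_keys, {}};
  out.indices.reserve(col.indices.size());
  for (auto i : col.indices) out.indices.push_back(remap[i]);
  return out;
}

// Merge two dictionary columns whose rows are already sorted by value.
ToyDictionary merge_sorted(const ToyDictionary& a, const ToyDictionary& b) {
  // Step 1: match keys by taking the sorted union of both key sets.
  std::vector<std::string> keys;
  std::set_union(a.keys.begin(), a.keys.end(), b.keys.begin(), b.keys.end(),
                 std::back_inserter(keys));
  ToyDictionary ma = set_keys(a, keys);
  ToyDictionary mb = set_keys(b, keys);
  // Step 2: with shared sorted keys, index order equals value order,
  // so merging the indices is enough; the keys are never touched again.
  ToyDictionary out{keys, {}};
  std::merge(ma.indices.begin(), ma.indices.end(), mb.indices.begin(), mb.indices.end(),
             std::back_inserter(out.indices));
  return out;
}
```

For example, merging rows `a, a, c` (keys `{a, c}`) with rows `b, c` (keys `{b, c}`) yields keys `{a, b, c}` and indices `0, 0, 1, 2, 2`.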
Reference #5963 This PR adds dictionary specialization logic to the cudf::unary_operation API. All math and bitwise operations return new dictionary columns. Logical operations return BOOL8 type columns. This includes a reworked/simplified math_ops.cu and removal of unary_ops.cuh which is not needed. Also, the gtests unary_ops_test.cpp math/logical operations were split out into a new math_ops_test.cpp to simplify updating tests.
Reference #5963 Add support for dictionary column types in cudf::minmax. This PR also adds support for strings column types. The code was refactored to re-use the internal reduce_device function for all supported types. The strings specialization logic just creates a string_scalar result, which requires a copy step to build a new string_scalar object from a string_view (the output of the reduce_device function). The dictionary support just substitutes the keys() column in the detail function right before the type-dispatcher call. Added gtests for strings as well as for dictionary with strings keys.
Reference #5963

This PR adds dictionary column type support to the set of `cudf::reduce` functions. This PR depends on utilities added in PR #6651.

Here are the reduce operations that will be included in this PR:
- [x] all
- [x] any
- [x] max
- [x] mean
- [x] median
- [x] min
- [x] nth_element
- [x] product
- [x] quantile
- [x] std
- [x] sum_of_squares
- [x] sum
- [x] unique_count
- [x] var

Authors:
- davidwendt

Approvers:
- Mike Wendt
- AJ Schmidt
- Ram (Ramakrishna Prabhu)
- Karthikeyan

URL: #6666
Reference #5963

Add dictionary support to groupby.
- [x] argmax
- [x] argmin
- [x] collect
- [x] count
- [x] max
- [x] mean*
- [x] median
- [x] min
- [x] nth element
- [x] nunique
- [x] quantile
- [x] std*
- [x] sum*
- [x] var*

\* _not supported due to 10.2 compile segfault_

Authors:
- davidwendt

Approvers:
- Jake Hemstad
- Karthikeyan

URL: #6585
Reference #5963

Add support for dictionary column to `cudf::rolling_window` (non-udf).

Rolling aggregations:
- [x] min/max
- [x] lead/lag
- [x] counting, row-number

These only require aggregating the dictionary indices and do not need to access the keys.

Authors:
- David (@davidwendt)

Approvers:
- Mark Harris (@harrism)
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: #7186
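As a rough illustration of why a rolling min never needs the keys, here is a small plain C++ sketch. It is not the PR's code: `rolling_min_indices` is a hypothetical name, the window is assumed to be a trailing window of fixed size, nulls are ignored, and the keys are assumed sorted so that the smallest index in a window identifies the smallest value.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rolling minimum over dictionary indices using a trailing window that
// covers the current row and up to `window - 1` rows before it.
// The result is itself a vector of indices into the unchanged keys.
std::vector<std::uint32_t> rolling_min_indices(const std::vector<std::uint32_t>& indices,
                                               std::size_t window) {
  window = std::max<std::size_t>(window, 1);  // treat 0 as a window of one row
  std::vector<std::uint32_t> out(indices.size());
  for (std::size_t i = 0; i < indices.size(); ++i) {
    std::size_t begin = (i + 1 >= window) ? i + 1 - window : 0;
    auto first = indices.begin() + static_cast<std::ptrdiff_t>(begin);
    auto last  = indices.begin() + static_cast<std::ptrdiff_t>(i + 1);
    out[i]     = *std::min_element(first, last);
  }
  return out;
}
```

With indices `2, 0, 1, 3, 1` and a window of 2, the output indices are `2, 0, 0, 1, 1`; pairing them with the input's keys gives the output dictionary column.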
This issue has been labeled
@davidwendt are the checklists up to date in the description?
Yes.
After some discussion with @kkraus14 (late last year), the binary-ops for dictionaries may not make sense since the output would not have any correlation with the input -- keys on the output would not likely match with either input dictionary. Perhaps the min/max operations would make the most sense since these could operate on just the indices and the output could be a dictionary with matching keys. The other difficulty is making sure the keys match between the two input columns.
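The contrast drawn here can be seen in a small plain C++ toy model. Nothing below is libcudf code: `ToyDictionary`, `encode`, `decode`, `add`, and `max_matching` are hypothetical helpers, integer keys are assumed for brevity, both inputs are assumed to have the same number of rows, and `max_matching` additionally assumes the two inputs already share identical sorted keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a dictionary column (hypothetical; not the libcudf type).
struct ToyDictionary {
  std::vector<int> keys;               // sorted, unique
  std::vector<std::uint32_t> indices;  // row i holds keys[indices[i]]
};

// Build sorted unique keys and one index per input value.
ToyDictionary encode(const std::vector<int>& values) {
  ToyDictionary d;
  d.keys = values;
  std::sort(d.keys.begin(), d.keys.end());
  d.keys.erase(std::unique(d.keys.begin(), d.keys.end()), d.keys.end());
  for (int v : values) {
    auto it = std::lower_bound(d.keys.begin(), d.keys.end(), v);
    d.indices.push_back(static_cast<std::uint32_t>(it - d.keys.begin()));
  }
  return d;
}

// Expand a dictionary column back into plain values.
std::vector<int> decode(const ToyDictionary& d) {
  std::vector<int> values;
  for (auto i : d.indices) values.push_back(d.keys[i]);
  return values;
}

// Element-wise add has to go through the values, and the output keys are
// whatever sums occur; they need not appear in either input's keys.
ToyDictionary add(const ToyDictionary& a, const ToyDictionary& b) {
  std::vector<int> va = decode(a);
  std::vector<int> vb = decode(b);
  for (std::size_t i = 0; i < va.size(); ++i) va[i] += vb[i];
  return encode(va);
}

// Element-wise max with matching sorted keys works on the indices alone
// and keeps the input keys.
ToyDictionary max_matching(const ToyDictionary& a, const ToyDictionary& b) {
  ToyDictionary out{a.keys, {}};
  for (std::size_t i = 0; i < a.indices.size(); ++i)
    out.indices.push_back(std::max(a.indices[i], b.indices[i]));
  return out;
}
```

Adding rows `10, 20, 10` to rows `20, 20, 10` (both with keys `{10, 20}`) produces keys `{20, 30, 40}`, whereas the element-wise max keeps keys `{10, 20}` and only computes new indices.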
Since only binary-ops is left and dictionary support may not make sense here, I'm going to close this issue. We can open separate feature request(s) if the cudf team thinks dictionary column support should be added for specific binary-ops.
This issue attempts to outline the known remaining work to enable column type DICTIONARY. These items are meant to be fulfilled with multiple PRs.
References to other open dictionary issues:
- `dictionary` types to support only unsigned types in indices. #5576 This describes converting dictionary to use only unsigned integers for the indices.

I did a search for `dictionary32` to find all the places where specialization was added to libcudf to simply throw an exception as a place-holder. Here is a list of those APIs (in no particular order):
- `cudf::reduction::detail::reduce`
- `cudf::make_column_from_scalar`
- `cudf::copy_range`
- `cudf::scatter` scalar
- `cudf::merge`
- `cudf::clamp`
- `cudf::contains`
There are also other libcudf APIs that do not have the place-holder code that will likely need dictionary column specialization in some way:
- scan (not supported)

The '(keys match)' items mean these could just work with the indices if the two input columns had matching keys.
Most of the other libcudf APIs use `gather()`, which is already implemented for dictionary.

Also, cuIO in general does not support the current dictionary type. The writers, at the very least, could call `cudf::dictionary::decode()` to convert them to a non-dictionary column before writing. Likewise, the readers could call `cudf::dictionary::encode()` after the initial column is created.

Lastly, I have a prototype of an index-type normalizing iterator which I hope could be useful for managing the indices. This would allow the child indices column to be any (unsigned) integer type without requiring adding DICTIONARY8 or DICTIONARY16 types to libcudf.
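One way to picture the index-normalizing idea is the plain C++ sketch below. It is not the prototype mentioned above: `NormalizedIndices` and `decode_with` are hypothetical names, and the sketch is a simple random-access accessor rather than a full iterator. It reads indices stored as 1-, 2-, or 4-byte unsigned integers and always hands back `std::uint32_t`, so the consuming code is written once regardless of the stored width.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Non-owning view over an indices buffer of any supported unsigned width.
// The source vector must outlive the view.
class NormalizedIndices {
 public:
  template <typename T>
  explicit NormalizedIndices(const std::vector<T>& v)
    : data_(reinterpret_cast<const unsigned char*>(v.data())),
      width_(sizeof(T)),
      size_(v.size()) {
    static_assert(std::is_unsigned<T>::value &&
                    (sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4),
                  "indices must be 8-, 16-, or 32-bit unsigned integers");
  }

  std::size_t size() const { return size_; }

  // Widen the stored value to a common 32-bit type on every read.
  std::uint32_t operator[](std::size_t i) const {
    const unsigned char* p = data_ + i * width_;
    switch (width_) {
      case 1: return *p;
      case 2: { std::uint16_t v; std::memcpy(&v, p, sizeof v); return v; }
      default: { std::uint32_t v; std::memcpy(&v, p, sizeof v); return v; }
    }
  }

 private:
  const unsigned char* data_;
  std::size_t width_;
  std::size_t size_;
};

// A consumer written once against the normalized view.
std::vector<std::string> decode_with(const std::vector<std::string>& keys,
                                     const NormalizedIndices& idx) {
  std::vector<std::string> out;
  out.reserve(idx.size());
  for (std::size_t i = 0; i < idx.size(); ++i) out.push_back(keys[idx[i]]);
  return out;
}
```

The same `decode_with` call then works whether the indices were stored as `std::uint8_t`, `std::uint16_t`, or `std::uint32_t`, which is the property that would avoid separate DICTIONARY8 or DICTIONARY16 types.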