Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Unify distinct_count column/table APIs. #10183

Open
bdice opened this issue Feb 1, 2022 · 4 comments
Open

[FEA] Unify distinct_count column/table APIs. #10183

bdice opened this issue Feb 1, 2022 · 4 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@bdice
Copy link
Contributor

bdice commented Feb 1, 2022

Is your feature request related to a problem? Please describe.
While reviewing #10030, I found that the column and table algorithms for distinct_count have completely different flags for null and NaN handling. The column API has null_policy (include/exclude) and nan_policy (NaN is/isn't null), while the table API has null_equality (nulls are equal/unequal).

This also applies to unordered_distinct_count, introduced in #10030.

Describe the solution you'd like
The distinct count APIs for column/table should use the same flags (meaning that all three flags should probably be available to both APIs). This would also allow the column API to be a pass-through implementation of the table API, with a table composed of only that column, rather than having two implementations (table, column).

@bdice bdice added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Feb 1, 2022
@github-actions
Copy link

github-actions bot commented Mar 3, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@bdice bdice self-assigned this Mar 28, 2022
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Nov 26, 2022
@GregoryKimball GregoryKimball assigned PointKernel and divyegala and unassigned bdice Nov 29, 2022
@divyegala
Copy link
Member

Diving into this issue with @PointKernel led us to figure out this is more complex than it appears. In general, distinct and distinct_count should share the same configuration parameters, but as of now, they do not. The configuration parameters that appear in these APIs are:

  1. null_policy : {EXCLUDE, INCLUDE}
  2. nan_policy : {NAN_IS_NULL, NAN_IS_VALID}
  3. null_equality : {EQUAL, UNEQUAL}
  4. nan_equality : {ALL_EQUAL, UNEQUAL}

These 4 knobs all interplay with each other, and the right way to do this would be:

  • Add 1 and 2 to the implementation of distinct, which currently exposes 3 and 4, while also using the new row comparators
  • Write a common function inspired from distinct that builds a hash map, to be shared with distinct_count
  • distinct uses the actual map and post-processes to build the result table, distinct_count just directly returns the size of the map
  • This also helps update distinct_count to the new experimental comparators
  • Start sharing tests/benchmarks between both these functions as they will essentially be sharing the same code path

@bdice
Copy link
Contributor Author

bdice commented Jan 4, 2023

@divyegala Yes, exactly! I agree this is a good plan for implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: Pairing
Development

No branches or pull requests

4 participants