Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add first-class dtype utilities #8308

Merged
merged 31 commits into from
Jun 16, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented May 20, 2021

This PR adds a new cudf.api.types module that aims to match pandas.api.types while providing the necessary compatibility layers for cudf objects that is missing from the corresponding pandas APIs. It also replaces most internal uses of pandas.api.types in an attempt to centralize all typing logic so that we have a single place in which to perform any special dtype handling as needed. This work is intended as a best-effort, first-pass attempt to isolate our dependence on pandas dtype APIs; while it resolves a number of incompatibilities with pandas and other unexpected behaviors, there are still a number of open questions that still need to be addressed to completely wrap this up. I've noted a number of TODOs in the code (for instance relating to our nested types or the different types of time types e.g. timedelta). Since getting all of these correct in a single PR will be almost impossible given the pervasive use of dtype utilities throughout our code, this PR is a good first step in that direction that I think we can merge and then work on incrementally fixing the outstanding issues rather than trying to perfect all our type handling here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label May 20, 2021
@vyasr vyasr added 2 - In Progress Currently a work in progress breaking Breaking change improvement Improvement / enhancement to an existing function tech debt labels May 20, 2021
@shwina
Copy link
Contributor

shwina commented May 21, 2021

cudf.utils.dtypes.is_numerical_dtype has been moved to cudf.api.types.is_numeric_dtype to match pandas. I have also changed this function to return True for decimal types because that's what I would expect as a user. However, this change breaks some internal code that relies on decimals not being classified as numeric. We also have an is_decimal_dtype function. Are people OK with replacing internal usage (either with explicitly checking both functions or defining a third convenience utility is_non_decimal_numeric_dtype), or is there a good reason to leave the behavior as is?

  • Personally, I'm fine with is_numeric_dtype() returning True for decimal types. It's a bit of a mental readjustment for me, but it makes sense from and end-user perspective.

  • Internally we can do something like is_integer_dtype() or is_float_dtype(), or maybe for efficiency, is_integer_or_float_dtype()?

[...] introspecting the data. Should we do that in this case?

I don't think there's a way to do this without doing a pass over the entire data, which I'd be -1 for. Sadly, I think the best thing for us to do is return True for Series of arbitrary objects and let things fail downstream in such cases.

@brandon-b-miller
Copy link
Contributor

👍 on is_numeric_dtype returning true for decimals. As for how to handle the situations internally that break from that change, since is_decimal_dtype excludes numeric non decimal, I imagine we can just check the dtype using that function and do something specific based on if it's true.

For is_scalar, I think it could be a source of bugs if it doesn't return the same thing as the pandas version certainly for pandas a numpy objects. If we rely on it doing that internally, I'd say we should change that. In my mind this function returns the same thing as pandas for everything the pandas function can handle, plus return true for cuDF scalars or other cuDF objects that we need it to. Not sure about 0d cupy arrays.

RE: the string dtype, I agree with @shwina that scanning should be avoided. Maybe since it lives on the host checking the 0'th element wouldn't be terrible? Although I guess that doesn't guarantee they're all strings. I wonder what the plan is for that on the pandas roadmap, if there are any plans? Since they have a true pd.StringDtype now it could be that this api changes in the future.

@vyasr
Copy link
Contributor Author

vyasr commented May 25, 2021

Yeah I think we'd have to check the entire Series to be sure that it's all strings, not like some strings and some other stuff. That being the case, we may be stuck with letting that fail downstream for now.

Note to self, this PR will need to account for changes in #8332 that start introducing the cudf.api.types module for one specific case.

@vyasr vyasr self-assigned this Jun 8, 2021
@vyasr vyasr added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 9, 2021
@vyasr vyasr marked this pull request as ready for review June 9, 2021 23:16
@vyasr vyasr requested a review from a team as a code owner June 9, 2021 23:16
@codecov
Copy link

codecov bot commented Jun 10, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@2606b71). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 8762315 differs from pull request most recent head e61765a. Consider uploading reports for the commit e61765a to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8308   +/-   ##
===============================================
  Coverage                ?   82.59%           
===============================================
  Files                   ?      109           
  Lines                   ?    17865           
  Branches                ?        0           
===============================================
  Hits                    ?    14755           
  Misses                  ?     3110           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2606b71...e61765a. Read the comment docs.

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me. Fantastic work, @vyasr!

@marlenezw marlenezw removed their request for review June 10, 2021 14:52
@vyasr
Copy link
Contributor Author

vyasr commented Jun 14, 2021

@gpucibot merge

@vyasr vyasr added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jun 14, 2021
@vyasr
Copy link
Contributor Author

vyasr commented Jun 14, 2021

rerun tests

@galipremsagar
Copy link
Contributor

rerun tests

@rapids-bot rapids-bot bot merged commit 91c727f into rapidsai:branch-21.08 Jun 16, 2021
@vyasr vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021
rapids-bot bot pushed a commit that referenced this pull request Sep 15, 2021
Continuation of #8308 that moves all imports of standard dtype utilities to use `cudf.api.types` or `cudf.core.dtypes` rather than `cudf.util.dtypes`.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9011
@vyasr vyasr deleted the refactor/type_apis branch January 14, 2022 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants