Add first-class dtype utilities #8308

vyasr · 2021-05-20T23:53:47Z

This PR adds a new cudf.api.types module that aims to match pandas.api.types while providing the compatibility layers for cudf objects that are missing from the corresponding pandas APIs. It also replaces most internal uses of pandas.api.types, centralizing all typing logic so that we have a single place in which to perform any special dtype handling as needed. This work is intended as a best-effort, first-pass attempt to isolate our dependence on pandas dtype APIs; while it resolves a number of incompatibilities with pandas and other unexpected behaviors, a number of open questions still need to be addressed to completely wrap this up. I've noted a number of TODOs in the code (for instance, relating to our nested types or the different kinds of time types, e.g. timedelta). Since getting all of these correct in a single PR would be almost impossible given the pervasive use of dtype utilities throughout our code, I think this PR is a good first step that we can merge, and then work on incrementally fixing the outstanding issues rather than trying to perfect all our type handling here.

shwina · 2021-05-21T18:31:53Z

cudf.utils.dtypes.is_numerical_dtype has been moved to cudf.api.types.is_numeric_dtype to match pandas. I have also changed this function to return True for decimal types because that's what I would expect as a user. However, this change breaks some internal code that relies on decimals not being classified as numeric. We also have an is_decimal_dtype function. Are people OK with replacing internal usage (either with explicitly checking both functions or defining a third convenience utility is_non_decimal_numeric_dtype), or is there a good reason to leave the behavior as is?

Personally, I'm fine with is_numeric_dtype() returning True for decimal types. It's a bit of a mental readjustment for me, but it makes sense from an end-user perspective.
Internally we can do something like is_integer_dtype() or is_float_dtype(), or maybe for efficiency, is_integer_or_float_dtype()?
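
The two options discussed here (checking both predicates at each call site, or adding a convenience helper) can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyDtype` and its one-letter `kind` codes are hypothetical stand-ins, not cudf or pandas classes, and the helper names mirror those floated in this thread rather than any merged API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical stand-in for a dtype (not a cudf or pandas class).

    kind codes used here: "i" signed int, "u" unsigned int, "f" float,
    "d" decimal (made up for this sketch), "O" object.
    """

    kind: str


def is_decimal_dtype(dtype):
    return dtype.kind == "d"


def is_integer_dtype(dtype):
    return dtype.kind in ("i", "u")


def is_float_dtype(dtype):
    return dtype.kind == "f"


def is_numeric_dtype(dtype):
    # User-facing behavior proposed in this thread: decimals count as numeric.
    return is_integer_dtype(dtype) or is_float_dtype(dtype) or is_decimal_dtype(dtype)


def is_integer_or_float_dtype(dtype):
    # Single helper suggested for internal call sites that must exclude decimals.
    return is_integer_dtype(dtype) or is_float_dtype(dtype)


def is_non_decimal_numeric_dtype(dtype):
    # Alternative spelling: compose the two public predicates explicitly.
    return is_numeric_dtype(dtype) and not is_decimal_dtype(dtype)
```

In this toy model the two internal helpers agree on every input; the choice between them is mostly about which name reads better at call sites.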

[...] introspecting the data. Should we do that in this case?

I don't think there's a way to do this without doing a pass over the entire data, which I'd be -1 for. Sadly, I think the best thing for us to do is return True for Series of arbitrary objects and let things fail downstream in such cases.

brandon-b-miller · 2021-05-21T19:05:13Z

👍 on is_numeric_dtype returning True for decimals. As for how to handle the internal code that breaks from that change: since is_decimal_dtype excludes non-decimal numeric types, I imagine we can just check the dtype using that function and do something specific if it returns True.

For is_scalar, I think it could be a source of bugs if it doesn't return the same thing as the pandas version, certainly for pandas and numpy objects. If we rely on it doing that internally, I'd say we should change that. In my mind this function returns the same thing as pandas for everything the pandas function can handle, plus returns True for cuDF scalars or other cuDF objects that we need it to. Not sure about 0d cupy arrays.
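
A minimal sketch of the "same as the reference, plus our own scalar types" contract described above. Everything here is an assumption made for illustration: `reference_is_scalar` is a deliberately simplified stand-in for the pandas check, and `ToyDeviceScalar` is a hypothetical placeholder for a cuDF scalar object.

```python
import numbers


class ToyDeviceScalar:
    """Hypothetical placeholder for a library-specific scalar wrapper."""

    def __init__(self, value):
        self.value = value


def reference_is_scalar(obj):
    # Simplified stand-in for the reference implementation's scalar check.
    return obj is None or isinstance(obj, (str, bytes, numbers.Number))


def is_scalar(obj):
    # Never disagree with the reference on inputs it accepts as scalars;
    # only extend the result to our own scalar type.
    return isinstance(obj, ToyDeviceScalar) or reference_is_scalar(obj)
```

With this shape, any input the reference check treats as a scalar is also a scalar for the wrapper, so the only behavioral difference is the extra type.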

RE: the string dtype, I agree with @shwina that scanning should be avoided. Maybe since it lives on the host, checking the 0th element wouldn't be terrible? Although I guess that doesn't guarantee they're all strings. I wonder what the plan is for that on the pandas roadmap, if there are any plans? Since they have a true pd.StringDtype now, it could be that this API changes in the future.

vyasr · 2021-05-25T23:40:36Z

Yeah I think we'd have to check the entire Series to be sure that it's all strings, not like some strings and some other stuff. That being the case, we may be stuck with letting that fail downstream for now.
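
The trade-off above can be illustrated with plain Python lists standing in for a host-side object Series. This is a sketch, not code from the PR; both function names are hypothetical.

```python
def first_element_is_str(values):
    # Cheap O(1) check: looks only at the 0th element.
    return len(values) > 0 and isinstance(values[0], str)


def all_elements_are_str(values):
    # Exact check, but requires a full pass over the data.
    return all(isinstance(v, str) for v in values)


mixed = ["a", 1, None]
only_strings = ["a", "b", "c"]

# The cheap check is fooled by mixed content; only the full scan catches it.
cheap_on_mixed = first_element_is_str(mixed)
exact_on_mixed = all_elements_are_str(mixed)
```

Here `cheap_on_mixed` is `True` while `exact_on_mixed` is `False`, which is why a first-element check cannot guarantee that every element is a string.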

Note to self, this PR will need to account for changes in #8332 that start introducing the cudf.api.types module for one specific case.

codecov · 2021-06-10T01:13:33Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@2606b71).
The diff coverage is n/a.

❗ Current head 8762315 differs from pull request most recent head e61765a. Consider uploading reports for the commit e61765a to get more accurate results

@@               Coverage Diff               @@
##             branch-21.08    #8308   +/-   ##
===============================================
  Coverage                ?   82.59%           
===============================================
  Files                   ?      109           
  Lines                   ?    17865           
  Branches                ?        0           
===============================================
  Hits                    ?    14755           
  Misses                  ?     3110           
  Partials                ?        0

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2606b71...e61765a.

python/cudf/cudf/api/types/categoricals.py

shwina

Looks really good to me. Fantastic work, @vyasr!

vyasr · 2021-06-14T23:09:53Z

@gpucibot merge

vyasr · 2021-06-14T23:32:02Z

rerun tests

galipremsagar · 2021-06-15T21:38:01Z

rerun tests

Continuation of #8308 that moves all imports of standard dtype utilities to use `cudf.api.types` or `cudf.core.dtypes` rather than `cudf.utils.dtypes`.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9011
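
Downstream code that has to run against versions from both before and after this move could resolve a utility from the first location that provides it. The helper below is a hypothetical sketch, not part of cudf; it is demonstrated with a standard-library module so it runs without cudf installed, and the cudf lookup is shown only in a comment.

```python
import importlib


def import_first(candidates):
    """Return the first attribute that can be imported.

    `candidates` is a list of (module_name, attribute_name) pairs,
    tried in order.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"none of the candidates could be imported: {candidates}")


# Intended use, based on the names mentioned in this thread (untested here;
# note that the two functions may classify decimal types differently):
#   is_numeric_dtype = import_first([
#       ("cudf.api.types", "is_numeric_dtype"),
#       ("cudf.utils.dtypes", "is_numerical_dtype"),
#   ])

# Runnable demonstration using only the standard library.
sqrt = import_first([("no_such_module_for_demo", "sqrt"), ("math", "sqrt")])
```

Preferring the new location first means the fallback is only consulted on older installations.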

github-actions bot added the Python Affects Python cuDF API. label May 20, 2021

vyasr added 2 - In Progress Currently a work in progress breaking Breaking change improvement Improvement / enhancement to an existing function tech debt labels May 20, 2021

vyasr added 18 commits June 8, 2021 13:56

Create new types module that aliases pandas.api.types.

e5ab00c

Move is_categorical_dtype to new location.

5586652

Move a bunch more functions into the new module.

bfe2723

Add docstrings and apply pydocstyle.

8727754

Add tests of is_categorical_dtype.

65e0198

Add tests of is_numeric_dtype and is_integer_dtype and fix bugs.

8b3b822

Test is_integer.

52786bb

Add more systematic list of test cases.

0e852e9

Import IntervalDtype into top-level namespace.

6f6e05e

Test is_string_dtype.

0933c13

Test datetime using simple wrapper of pandas.

bbfdeae

Use wrapper for is_integer_dtype.

eb4298e

Add tests of cudf types.

d5648fc

Add explicit test of pandas agreement and do some cleanup.

a889711

Simplify is_scalar.

e154d4e

Address some obvious test failures, either by fixing or with TODOs where the expected behavior is unclear.

6911063

Clean up some comments.

5032eef

Combine previously introduced types API with current new more comprehensive set.

380f44b

vyasr force-pushed the refactor/type_apis branch from 69d8d0b to 380f44b Compare June 8, 2021 21:29

vyasr self-assigned this Jun 8, 2021

Remove more uses of pd.api.types.

7e36d09

vyasr added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 9, 2021

Remove aliasing.

f63146e

vyasr marked this pull request as ready for review June 9, 2021 23:16

vyasr requested a review from a team as a code owner June 9, 2021 23:16

vyasr requested review from marlenezw and skirui-source June 9, 2021 23:16

shwina reviewed Jun 10, 2021

python/cudf/cudf/api/types/categoricals.py

shwina approved these changes Jun 10, 2021

marlenezw removed their request for review June 10, 2021 14:52

Merge remote-tracking branch 'origin/branch-21.08' into refactor/type_apis

afbdfb2

vyasr added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jun 14, 2021

vyasr added 3 commits June 14, 2021 18:44

Merge remote-tracking branch 'origin/branch-21.08' into refactor/type_apis

55a47ef

Fix circular import issues.

d2b8710

Merge branch 'refactor/type_apis' of github.com:vyasr/cudf into refactor/type_apis

e61765a

rapids-bot bot merged commit 91c727f into rapidsai:branch-21.08 Jun 16, 2021

vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021

vyasr mentioned this pull request Aug 10, 2021

Remove aliases of various api.types APIs from utils.dtypes. #9011

Merged

rjzamora mentioned this pull request Oct 15, 2021

Update cudf type-check imports for 21.10+ NVIDIA-Merlin/NVTabular#1189

Closed

vyasr deleted the refactor/type_apis branch January 14, 2022 17:58
