[REVIEW] Don't identify decimals as strings. #7710

Merged: 2 commits into rapidsai:branch-0.19 on Mar 25, 2021

Conversation

vyasr
Contributor

@vyasr vyasr commented Mar 24, 2021

As documented in this pandas issue, pandas' is_string_dtype is not strict and classifies many non-string dtypes as strings. For our purposes this is problematic because essentially every subclass of ExtensionDtype is classified as a string by that function. That is clearly wrong, so I modified our version of is_string_dtype to explicitly reject all of our extension dtypes (previously it excluded only categorical types). I'm not 100% confident that no other part of the code base relies on the current (erroneous) behavior, but the cudf tests all passed for me locally, and every call to utils.is_string_dtype that I traced looks like a place where the change gives more correct behavior, so I think our best bet is to move forward with this change. Any future problems caused by other code relying on the old behavior should probably be treated as bugs in the calling code and fixed there. The same goes for external code that relied on this behavior: this change is potentially breaking for it as well, but that is something those projects should address.
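To illustrate the difference described above, here is a minimal, self-contained sketch. It does not use pandas or cuDF: every class and function name below is a hypothetical stand-in, and the only assumption carried over from pandas is that an extension dtype reports kind "O" unless it overrides it. It is not the code changed in this PR.

```python
# Hypothetical stand-ins; these are NOT the real pandas or cuDF classes.
class PlainObjectDtype:
    """Stand-in for a plain object dtype, a common way to type string data."""
    kind = "O"


class ExtensionDtype:
    """Stand-in for an extension dtype base class whose default kind is 'O'."""
    kind = "O"


class CategoricalDtype(ExtensionDtype):
    pass


class DecimalDtype(ExtensionDtype):
    pass


class ListDtype(ExtensionDtype):
    pass


STRING_LIKE_KINDS = ("O", "S", "U")


def loose_is_string_dtype(dtype):
    # Kind-based check that only special-cases categoricals, so every other
    # extension dtype with the default kind "O" is reported as a string.
    return dtype.kind in STRING_LIKE_KINDS and not isinstance(dtype, CategoricalDtype)


def strict_is_string_dtype(dtype):
    # Reject every extension dtype up front, then fall back to the kind check.
    if isinstance(dtype, ExtensionDtype):
        return False
    return dtype.kind in STRING_LIKE_KINDS


for dtype in (PlainObjectDtype(), CategoricalDtype(), DecimalDtype(), ListDtype()):
    print(type(dtype).__name__, loose_is_string_dtype(dtype), strict_is_string_dtype(dtype))
```

In this sketch the loose check reports the decimal and list stand-ins as strings, while the strict check accepts only the plain object dtype. Callers that depended on the loose answer would see a behavior change, which is why the change is potentially breaking.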

@vyasr vyasr requested a review from a team as a code owner March 24, 2021 19:34
@vyasr vyasr self-assigned this Mar 24, 2021
@github-actions github-actions bot added the Python label Mar 24, 2021
@vyasr vyasr added the 3 - Ready for Review, 4 - Needs cuDF (Python) Reviewer, bug, and non-breaking labels Mar 24, 2021

codecov bot commented Mar 24, 2021

Codecov Report

Merging #7710 (babcdfc) into branch-0.19 (7871e7a) will increase coverage by 0.64%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.19    #7710      +/-   ##
===============================================
+ Coverage        81.86%   82.51%   +0.64%     
===============================================
  Files              101      101              
  Lines            16884    17450     +566     
===============================================
+ Hits             13822    14398     +576     
+ Misses            3062     3052      -10     
| Impacted Files | Coverage | Δ |
|---|---|---|
| python/cudf/cudf/core/buffer.py | 84.21% <ø> | (+4.96%) ⬆️ |
| python/cudf/cudf/core/column/categorical.py | 91.97% <ø> | (+0.58%) ⬆️ |
| python/cudf/cudf/core/column/column.py | 87.61% <ø> | (-0.15%) ⬇️ |
| python/cudf/cudf/core/column/datetime.py | 89.73% <ø> | (+0.63%) ⬆️ |
| python/cudf/cudf/core/column/decimal.py | 92.95% <ø> | (-1.92%) ⬇️ |
| python/cudf/cudf/core/column/lists.py | 90.00% <ø> | (-1.40%) ⬇️ |
| python/cudf/cudf/core/column/numerical.py | 94.83% <ø> | (-0.20%) ⬇️ |
| python/cudf/cudf/core/column/string.py | 86.79% <ø> | (+0.30%) ⬆️ |
| python/cudf/cudf/core/column/timedelta.py | 88.66% <ø> | (+0.42%) ⬆️ |
| python/cudf/cudf/core/column_accessor.py | 96.13% <ø> | (+0.82%) ⬆️ |

... and 58 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e73fff0...babcdfc.

@vyasr vyasr added the 2 - In Progress label and removed the 3 - Ready for Review and 4 - Needs cuDF (Python) Reviewer labels Mar 24, 2021
@vyasr
Contributor Author

vyasr commented Mar 24, 2021

Looks like this is a slightly more subtle problem. There's more to be done here to actually enable decimal operations: the way they're currently being rejected is problematic (which is what I fixed above), but they are being rejected intentionally, and I'll have to do a bit more work to actually enable them.

@vyasr vyasr changed the title from "[REVIEW] Don't identify decimals as strings." to "[WIP] Don't identify decimals as strings." Mar 24, 2021
@vyasr vyasr changed the title from "[WIP] Don't identify decimals as strings." to "[REVIEW] Don't identify decimals as strings." Mar 25, 2021
@vyasr vyasr added the 3 - Ready for Review and breaking labels and removed the 2 - In Progress and non-breaking labels Mar 25, 2021
@kkraus14 kkraus14 added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Mar 25, 2021
@kkraus14
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1a1bd66 into rapidsai:branch-0.19 Mar 25, 2021
Labels
5 - Ready to Merge, breaking, bug, Python
3 participants