[python] Protect against huge enum-of-strings input #3354
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3354      +/-   ##
==========================================
+ Coverage   85.83%   86.09%   +0.26%
==========================================
  Files          55       55
  Lines        6191     6221      +30
==========================================
+ Hits         5314     5356      +42
+ Misses        877      865      -12
e590d82 to ae613c1
a92b62d to a32f3b1
94623c0 to 695d681
1e83402 to 483e339
483e339 to 7a82682
        pandas.api.types.is_string_dtype(x)
        and len(x.cat.categories) > STRING_DECAT_THRESHOLD
    ):
        return x.astype(str)
Should we add a warning or debug message saying that, even though the column is a categorical type, we are converting it to a string?
What do you think @bkmartinjr ?
I would not - I would instead just add it to the help/docstrings that this is the default behavior.
I don't find these warnings particularly useful, as most of this code runs in production pipelines. Better to document the behavior in the API docs, IMHO.
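As a rough sketch of what "document it in the docstring" could look like, here is a small stand-alone function built around the condition under review. The function name, the threshold value, and the docstring wording are all hypothetical, and the extra categorical check is an assumption since the diff above shows only a fragment.

```python
import pandas as pd

# Hypothetical value for illustration; the real threshold is not shown in this thread.
STRING_DECAT_THRESHOLD = 4


def decategoricalize_large_string_enum(x: pd.Series) -> pd.Series:
    """Return ``x`` unchanged unless it is a high-cardinality string categorical.

    Default behavior: a categorical column whose categories are strings and
    whose category count exceeds ``STRING_DECAT_THRESHOLD`` is converted to a
    plain (non-categorical) string column. No warning is emitted.
    """
    if (
        isinstance(x.dtype, pd.CategoricalDtype)
        and pd.api.types.is_string_dtype(x.cat.categories)
        and len(x.cat.categories) > STRING_DECAT_THRESHOLD
    ):
        return x.astype(str)
    return x
```

Note that `astype(str)` may turn missing values into the literal string `"nan"`, so a real implementation would likely need to handle nulls separately.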
Co-authored-by: nguyenv <[email protected]>
        pandas.api.types.is_string_dtype(x)
        and len(x.cat.categories) > STRING_DECAT_THRESHOLD
    ):
        return x.astype(str)
this is a bug - assumes all categoricals are of type `str`. They can be any primitive type, e.g., `int`, `float`, `bool`, etc.
Thanks @bkmartinjr -- I'll make a follow-on PR -- this one is "Protect against huge enum-of-strings input" -- I'll generalize the follow-on to "Protect against huge enum-of-anything input"
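One way the "enum-of-anything" generalization might look, purely as an illustration and not the code from the follow-on PR: cast a high-cardinality categorical back to the dtype of its own categories instead of hard-coding `str`. The function name and threshold value here are hypothetical.

```python
import pandas as pd

# Hypothetical value for illustration; not taken from the PR.
DECAT_THRESHOLD = 4


def decategoricalize_large_enum(
    x: pd.Series, threshold: int = DECAT_THRESHOLD
) -> pd.Series:
    """Convert a high-cardinality categorical to its underlying value type."""
    if (
        isinstance(x.dtype, pd.CategoricalDtype)
        and len(x.cat.categories) > threshold
    ):
        # Use the categories' own dtype (int, float, bool, str, ...)
        # rather than assuming every categorical holds strings.
        return x.astype(x.cat.categories.dtype)
    return x
```

This sketch assumes the column has no missing values: an integer-valued categorical containing nulls cannot be cast directly to a plain integer dtype, so that case would need separate handling.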
* [python] Extend #3354 to categoricals of arbitrary value type * code-review feedback * code-review feedback * code-review feedback * code-review feedback
…#3423) * [python] Extend #3354 to categoricals of arbitrary value type * code-review feedback * code-review feedback * code-review feedback * code-review feedback Co-authored-by: John Kerl <[email protected]>
Issue and/or context: #3353 [sc-59407]
Changes:
As proposed on #3353 [sc-59407]
[sc-59595]
Notes for Reviewer: