Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Protect against huge enum-of-strings input #3354

Merged
merged 6 commits into from
Dec 9, 2024

Conversation

johnkerl
Copy link
Member

@johnkerl johnkerl commented Nov 20, 2024

Issue and/or context: #3353 [sc-59407]

Changes:

As proposed on #3353 [sc-59407]

[sc-59595]

Notes for Reviewer:

Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 98.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 86.09%. Comparing base (c4fe227) to head (8435095).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3354      +/-   ##
==========================================
+ Coverage   85.83%   86.09%   +0.26%     
==========================================
  Files          55       55              
  Lines        6191     6221      +30     
==========================================
+ Hits         5314     5356      +42     
+ Misses        877      865      -12     
Flag Coverage Δ
python 86.09% <98.00%> (+0.26%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 86.09% <98.00%> (+0.26%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@johnkerl johnkerl force-pushed the kerl/decat-if-huge branch 2 times, most recently from e590d82 to ae613c1 Compare November 21, 2024 00:28
@johnkerl johnkerl force-pushed the kerl/decat-if-huge branch 3 times, most recently from a92b62d to a32f3b1 Compare December 6, 2024 01:30
@johnkerl johnkerl force-pushed the kerl/decat-if-huge branch 2 times, most recently from 94623c0 to 695d681 Compare December 6, 2024 15:47
@johnkerl johnkerl force-pushed the kerl/decat-if-huge branch 2 times, most recently from 1e83402 to 483e339 Compare December 6, 2024 17:27
@johnkerl johnkerl changed the title [python] Protect against huge enum-of-strings input [WIP] [python] Protect against huge enum-of-strings input Dec 6, 2024
@johnkerl johnkerl requested a review from nguyenv December 6, 2024 17:38
@johnkerl johnkerl marked this pull request as ready for review December 6, 2024 17:38
@johnkerl johnkerl requested a review from aaronwolen December 6, 2024 17:38
apis/python/src/tiledbsoma/io/ingest.py Outdated Show resolved Hide resolved
pandas.api.types.is_string_dtype(x)
and len(x.cat.categories) > STRING_DECAT_THRESHOLD
):
return x.astype(str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a warning or debugging message that even though the column is a category type, we are coverting that to a string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think @bkmartinjr ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not - I would instead just add it to the help/docstrings that this is the default behavior.

I don't find these warnings particularly useful, as most of this code runs in production pipelines. Better to document the behavior in the API docs, IMHO.

apis/python/src/tiledbsoma/io/ingest.py Outdated Show resolved Hide resolved
@johnkerl johnkerl merged commit 0127f0d into main Dec 9, 2024
11 checks passed
@johnkerl johnkerl deleted the kerl/decat-if-huge branch December 9, 2024 19:53
pandas.api.types.is_string_dtype(x)
and len(x.cat.categories) > STRING_DECAT_THRESHOLD
):
return x.astype(str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bug - assumes all categoricals are of type str. They can be any primitive type, e.g., int, float, `bool, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bkmartinjr -- I'll make a follow-on PR -- this one is "Protect against huge enum-of-strings input" -- I'll generalize the follow-on to "Protect against huge enum-of-anything input"

johnkerl added a commit that referenced this pull request Dec 11, 2024
* [python] Extend #3354 to categoricals of arbitrary value type

* code-review feedback

* code-review feedback

* code-review feedback

* code-review feedback
github-actions bot pushed a commit that referenced this pull request Dec 11, 2024
* [python] Extend #3354 to categoricals of arbitrary value type

* code-review feedback

* code-review feedback

* code-review feedback

* code-review feedback
johnkerl added a commit that referenced this pull request Dec 11, 2024
…#3423)

* [python] Extend #3354 to categoricals of arbitrary value type

* code-review feedback

* code-review feedback

* code-review feedback

* code-review feedback

Co-authored-by: John Kerl <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants