API: is_string_dtype is not strict #15585
Comments
xref @ResidentMario |
There definitely needs to be a strict implementation somewhere in Maybe also consider leaving this method as-is and throwing in a new How to test performance? Will |
asv is the place for things like this. |
@ResidentMario you shouldn't need this fyi |
Any updates on this? It seems like behavior changed in
import pandas as pd
from pandas.api.types import is_string_dtype
is_string_dtype(pd.Series(pd.Categorical([1,2,3])))
# False in v0.23.4
# True in v0.24.1
# Still:
is_string_dtype(pd.Series([(0,1), (1,1)]))
# True in v0.23.4
# True in v0.24.1
This was causing issues in a project where we were expecting the results of this check not to change between versions. On further investigation, it seems like this function is unreliable (tuples aren't strings, categoricals as strings feels very |
As documented in [this pandas issue](pandas-dev/pandas#15585), `is_string_dtype` for pandas is not strict and will characterize a whole bunch of things as strings that aren't. For our purposes, this is problematic because basically all subclasses of `ExtensionDtype` will be classified as strings by that function. This is definitely not appropriate, so I modified our version of `is_string_dtype` to explicitly reject all of our extension dtypes (previously it was only excluding categorical types).

I'm not 100% confident that no other parts of the code base rely on the current (erroneous) behavior, but the cudf tests all passed for me locally, and all the calls to `utils.is_string_dtype` that I traced look to be places where the change gives more correct behavior, so I think our best bet is to just move forward with this change. Any problems that result from this change in the future due to other code relying on the current behavior should probably be characterized as bugs in the calling code and fixed there. The same goes for external code that relied on this behavior; this change is potentially breaking for it as well, but again is something that its maintainers should be addressing.

Authors:
- Vyas Ramasubramani (@vyasr)

Approvers:
- Keith Kraus (@kkraus14)

URL: #7710
#15533 (comment)
`pandas.types.common.is_string_dtype` is not strict; it just checks whether the dtype is a fixed-width numpy string/unicode type or `object` dtype. It could be made strict via something like this.

This would need performance checking to see if it's a problem. Further, this changes the API a tiny bit (as we allow an object OR a dtype to be passed in), which is probably OK as long as it's documented a bit.
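For illustration only, here is a minimal sketch of what a strict check could look like. It is not the implementation linked from this issue; `is_strict_string` is a hypothetical helper name, and the sketch assumes that inspecting the values with `pandas.api.types.infer_dtype` is acceptable. It only handles array-likes (a bare dtype carries no values to inspect), and because it scans the data it has exactly the performance cost the issue says would need checking.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def is_strict_string(values) -> bool:
    """Hypothetical strict check: True only if the values are inferred as strings.

    Unlike a dtype-only check, this looks at the elements, so object arrays
    holding tuples, and categoricals, are not reported as strings.
    """
    # infer_dtype returns labels such as "string", "mixed", "integer",
    # or "categorical"; only "string" is accepted here.
    return infer_dtype(values, skipna=True) == "string"


strings = pd.Series(["a", "b"])
tuples = pd.Series([(0, 1), (1, 1)])
categorical = pd.Series(pd.Categorical([1, 2, 3]))

print(is_strict_string(strings))
print(is_strict_string(tuples))
print(is_strict_string(categorical))
```

Under these assumptions the tuple and categorical Series from the comment above are rejected, while a Series of actual strings is accepted. Whether an empty or all-missing Series should count as a string is a design choice this sketch leaves as "no".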