BUG: StringArray non-extensible due to inconsistent assertion #34309
Comments
This specific assert could be fixed, but in general: I am not sure we already want to commit to having those classes be "public" for subclassing (they are of course public for using them, but subclassing will typically depend more on implementation details). |
It's clear from the documentation that these classes are still experimental and could change at any time. Motivations for this change could be:
|
I'm a bit confused about the use-case here. What are you gaining by subclassing StringArray / StringDtype, rather than ExtensionArray / ExtensionDtype? |
@TomAugspurger Extending StringArray/StringDtype achieves data abstraction for types that differ semantically but have the same machine representation (in this case string). The code sample shows a case where the programmer may use MyExtensionDtype to denote specific contents, while the machine represents them as strings, without adding unnecessary boilerplate code. For instance, storing a path as a string is reasonable, in contrast to https://github.com/ContinuumIO/cyberpandas, where the machine representation of an IP address can be made more efficient through an integer representation. |
I think this use case is ok; at the very least this could uncover some more lightly tested code paths in EA subclassing (we likely have very limited coverage of Dtype subclassing aside from our internal usage). |
For your use case, is it only the "name" of the dtype you want to change (and not any of its functionality)? |
For this use case it is simply aliasing indeed. The changed dtype is used to indicate more expensive prework (such as validating all items are urls). The most natural extension to this is adding accessors to the alias ( |
But then that is not a simple alias, as the string dtype subclass will no longer keep this guarantee once you do something with the dataframe / dtype. |
Please correct me if I have any of this wrong, but it seems to me like I want a simple mechanism to label dtypes as whatever I want, without changing the underlying storage type. From your perspective, you are looking at it instead as creating new pandas sequence types. A good resource for background on the distinction is the set of design documents for pandas 2.0 (2015-2016), where my newly labeled dtype is what McKinney refers to as a logical dtype and the storage type as a physical type. |
I suppose we are somewhat talking past each other, as I am well aware of those design documents and logical vs storage types. But in general, we attach a meaning / specific behaviour to logical dtypes. For example, having a Series as StringDtype will enable certain string-specific functions. As I understand your use case: you are attaching meaning to a Series to say, for example, "these are all urls". However, just subclassing StringDtype to change the name doesn't provide any specific behaviour (it will just use that name when the dtype is printed in the repr). It will for example not prevent a user from actually putting other, non-url strings in a Series of such a url dtype (the validation you are speaking about, as I would understand it). So it's not really clear to me in what kind of context those StringDtype subclasses would be used. |
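To make the validation point concrete, here is a minimal sketch using only the standard library. The helpers `is_url` and `validate_urls` are hypothetical names, not pandas API, and the check is deliberately loose. The point is that a guarantee like "these are all urls" only holds if something like this runs whenever values enter the Series; a renamed dtype by itself does none of it.

```python
from urllib.parse import urlparse


def is_url(value):
    """Loose check: the string parses with both a scheme and a network location."""
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)


def validate_urls(values):
    """Return the values as a list, or raise if any item is not URL-like."""
    bad = [v for v in values if not is_url(v)]
    if bad:
        raise ValueError(f"not URLs: {bad!r}")
    return list(values)


# Hypothetical usage: validate once, before labelling the data as "urls".
checked = validate_urls(
    ["https://pandas.pydata.org", "https://github.com/pandas-dev/pandas"]
)
```

Any later operation that produces new strings would need to be re-validated the same way, which is why a name-only alias cannot carry the guarantee.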
And just to make it clear, I don't think that a subclass overriding |
This issue and the related PR have been open for a while, and they would probably benefit from an executive decision. The discussion has focused on the hacky way I extended the class as the initial motivation. We can conclude that this is not supported now, and might never be in the future. Let us note that this is discouraged and that the classes are still experimental, which is hereby documented in this discussion (alongside the official documentation). I would not consider accommodating such hacking a justification for changing the code. These are the arguments in favor of merging that remain (purely engineering, I would say):
The PR is here: #34310 |
I think as long as there is not a clear use case for it (and as mentioned in #34309 (comment), I still don't really understand how subclassing string dtype works for your use case), we are not really keen on facilitating subclassing StringDtype.
If we don't see it as a supported use case, we should probably also not add a test for it.
That's indeed correct. |
Code Sample, a copy-pastable example
Results in
assert dtype == "string" AssertionError
Problem description
It should be possible to extend StringDtype/StringArray so that users can design efficient subtypes. I believe that the AssertionError is a bug and not intended, as pandas wants to have extensible dtypes, given that it provides ExtensionDtype.
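For readers who want to see the mechanism in isolation, here is a simplified, pandas-free model of the failure. Every class name in it is hypothetical, and real pandas internals differ in detail. It assumes only what the traceback shows: the array constructor checks `dtype == "string"`, and a dtype compares equal to a string via its name.

```python
class SketchDtype:
    """Hypothetical stand-in for a dtype that compares equal to its name."""

    name = "string"

    def __eq__(self, other):
        if isinstance(other, str):
            # Assumption: comparison against a string goes through the name.
            return other == self.name
        return isinstance(other, type(self))

    def __hash__(self):
        return hash(self.name)


class SketchArray:
    """Hypothetical stand-in for an array whose constructor hard-codes the name."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def from_sequence(cls, scalars, dtype=None):
        if dtype is not None:
            # The hard-coded check: only a dtype named "string" passes.
            assert dtype == "string"
        return cls(scalars)


class MyDtype(SketchDtype):
    """Renamed subclass: same string storage, different label."""

    name = "my_string"


# The base dtype passes the check.
SketchArray.from_sequence(["a", "b"], dtype=SketchDtype())

# The renamed subclass trips the assertion.
try:
    SketchArray.from_sequence(["a", "b"], dtype=MyDtype())
    failed = False
except AssertionError:
    failed = True
```

In this model, an `isinstance`-based check in `from_sequence` would accept the subclass. Whether that is the right fix for pandas is the question this issue raises.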
Expected Output
The code above should pass without errors.
PR with fix on its way.
Output of
pd.show_versions()
pandas v1.0.3