Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: StringArray non-extensible due to inconsisent assertion #34309

Open
sbrugman opened this issue May 22, 2020 · 13 comments
Open

BUG: StringArray non-extensible due to inconsisent assertion #34309

sbrugman opened this issue May 22, 2020 · 13 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data Subclassing Subclassing pandas objects

Comments

@sbrugman
Copy link
Contributor

sbrugman commented May 22, 2020

Code Sample, a copy-pastable example

import pandas as pd
from pandas import StringDtype
from pandas.core.arrays import StringArray
from pandas.core.dtypes.dtypes import register_extension_dtype

@register_extension_dtype
class MyExtensionDtype(StringDtype):
    name = 'my_extension'

    def __repr__(self) -> str:
        return "MyExtensionDtype"

    @classmethod
    def construct_array_type(cls) -> "Type[MyExtensionStringArray]":
        return MyExtensionStringArray

class MyExtensionStringArray(StringArray):
    def __init__(self, values, copy=False):
        super().__init__(values, copy)
        self._dtype = MyExtensionDtype()

series = pd.Series(["test", "test2"], dtype="my_extension")
assert series.dtype == 'my_extension'

Results in
assert dtype == "string" AssertionError

Problem description

It should be possible to extend the StringDtype/StringArray for users to design efficient subtypes. I believe that the the AssertionError is a bug and not intended, as pandas wants to have extensible dtypes, because there is the ExtensionDtype.

Expected Output

The code above should pass without errors.

PR with fix on it's way.

Output of pd.show_versions()

pandas v1.0.3

@jorisvandenbossche
Copy link
Member

This specific assert could be fixed, but in general: I am not sure we already want to commit to have those classes be "public" for subclassing (they are of course public for using them, but subclassing will typically depend more on implementation details).
Those classes are still quite experimental, so I don't think we want to give any guarantees about this.

@sbrugman
Copy link
Contributor Author

It's clear from the documentation that these classes are still experimental and could change at any time.

Motivations for this change could be:

  • The current assertion statement is inconsistent with type checking elsewhere in the same class:
    if isinstance(dtype, StringDtype):
  • The project we're using this extension for is specific to extending types (not the typical public subclassing). Fixing this issue might help us contribute back to pandas. In the project we mitigate the possiblity of sudden changes by having a fall-back method that does not rely on ExtensionDtype and using minimal code (similar to the example) to achieve the effect.

@TomAugspurger
Copy link
Contributor

I'm a bit confused about the use-case here. What are you gaining by subclassing StringArray / StringDtype, rather than ExtensionArray / ExtensionDtype?

@sbrugman
Copy link
Contributor Author

sbrugman commented May 22, 2020

@TomAugspurger Extending StringArray/StringDtype achieves data abstraction for types that semantically differ, but have the same machine representation (in this case string).

The code example provides an example, where the programmer may use the MyExtensionDtype to denote specific contents, while the machine represents this as a string, without adding unnecessary boilerplate code.

For instance, storing a path as a string is reasonable, in contrast to https://github.com/ContinuumIO/cyberpandas, where the machine representation of an IP address can be made more efficient through an integer representation.

(Background: 1, 2)

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 25, 2020
@jreback
Copy link
Contributor

jreback commented May 25, 2020

I think this usecase is ok, at the very least this could unconver some lighter tested codepaths in EA subclassing (we likeley have very limited coverage of Dtype subclassing aside from our internal usage).

@jorisvandenbossche
Copy link
Member

The code example provides an example, where the programmer may use the MyExtensionDtype to denote specific contents, while the machine represents this as a string, without adding unnecessary boilerplate code.

For your use case, is it only the "name" of the dtype you want to change (not any of its functionality?)

@sbrugman
Copy link
Contributor Author

For your use case, is it only the "name" of the dtype you want to change (not any of its functionality?)

For this use case it is simply aliasing indeed. The changed dtype is used to indicate more expensive prework (such as validating all items are urls). The most natural extension to this is adding accessors to the alias (series.url.query, series.url.domain, etc.).

@jorisvandenbossche
Copy link
Member

The changed dtype is used to indicate more expensive prework (such as validating all items are urls).

But so that is not a simple alias. As the string dtype subclass will not further keep this guarantee once you do something with the dataframe / dtype.

@sbrugman
Copy link
Contributor Author

Please correct me if I have any of this wrong but it seems to me like I want a simple mechanism to label dtypes as whatever I want, without changing the underlying storage type. From your perspective, you are looking at it instead as creating new pandas sequence types.

A good resource to provide background on the distinction are the design documents for pandas 2.0 (2015-2016), where my newly labeled dtype is what McKinney refers to as a logical dtype and the storage type as a physical type.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 26, 2020

I suppose we are somewhat talking past each other, as I am well aware of those design documents and logical vs storage types.
StringDtype is in fact already a logical type, since under the hood it's right now just an object array of pointers to python objects (and could be stored as a variable-sized binary array in the future).

But in general, we attach a meaning / specific behaviour to logical dtypes. For example having a Series as StringDtype will enable certain string-specific functions.

As I understand your use case: you are attaching meaning to a Series to say, for example, "this are all urls". However, just subclassing StringDtype to change the name doesn't provide any specific behaviour (it will just use that name when the dtype is printed in the repr). It will for example not prohibit a user to actually put other, non-url strings in a Series of such a url dtype (the validation you are speaking about, as I would understand it).

So I think it's not really clear to me in what kind of context those StringDtype subclasses would be used.

@TomAugspurger
Copy link
Contributor

And just to make it clear, I don't think that a subclass overriding StringArray._validate to do the URL validation would be appropriate. That's an internal method that we can't guarantee the stability of across releases.

@sbrugman
Copy link
Contributor Author

This issue and the related PR have been open for a while, and it probably would benefit from an executive decision.

The discussion has focussed on the hacky way I extended the class as initial motivation. We can conclude that this is not supported now, and might never be in te future. Let us note that this is discouraged and that the classes are still experimental, which is hereby documented in this discussion (alongside the official documentation). I would not find accomodating such hacking a justification to change the code.

These are the arguments in favor of merging that remain (purely engineering I would say):

  • The current assertion statement is inconsistent with type checking elsewhere in the same class:
    if isinstance(dtype, StringDtype):
  • The change adds coverage to the ExtensionArrays, which could be beneficial even for internal use only.

The PR is here: #34310

@jorisvandenbossche
Copy link
Member

I think as long as there is a not a clear use case for it (and as mentioned in #34309 (comment), I still don't really understand how subclassing string dtype works for your use case), we are not really keen on facilitating subclassing StringDtype.

The change adds coverage to the ExtensionArrays, which could be beneficial even for internal use only.

If we don't see it as a supported use case, we should probably also not add a test for it.

The current assertion statement is inconsistent with type checking elsewhere in the same class:

That's indeed correct.

@mroeschke mroeschke added the Subclassing Subclassing pandas objects label Aug 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data Subclassing Subclassing pandas objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants