-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-13: The pandas Logical Type System #58455
base: main
Are you sure you want to change the base?
Conversation
|
||
## Proposal | ||
|
||
Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas supports at least all of the following _logical_ types, excluding any type widths for brevity: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.
In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. With dates/datetimes being a good example, a user would have to explicitly pick from some combination of pd.ArrowDtype(pa.date32())
, pd.ArrowDtype(pa.timestamp(<unit>))
and datetime64[unit]
since we don't offer any logical abstraction.
In the model of this PDEP, we would offer something to the effect pd.date()
and pd.datetime()
, abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315
This PDEP proposes the following backends should be prioritized in the following order (1. is the highest priority): | ||
|
||
1. Arrow | ||
2. pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this "masked"? calling it "pandas" seems to preference it as being canonical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does pandas extension
make more sense? Masked is maybe too specific, since this also refers to things like dictionary-encoded / categorical
... | ||
|
||
@property | ||
def physical_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Other areas where I think it would matter would be:
- Differentiation between decimal128/decimal256
- Any parametrized type (e.g. datetime types with different units / timezones, container types with different value types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, thanks. is there a reason to include the decimalN case but not (int|float|uint)N? is that distinct from itemsize?
its still unclear to me why this is needed in the general case. the way it is presented here suggests this is something i access on my dtype object, so in the DatetimeTZDtype case i would check dtype.physical_type.tz instead of dtype.tz? or i could imagine standardizing the dtype constructors so you would write dtype = pd.datetime(physical_type=something_that_specifices_tz)
instead of dtype=pd.datetime(tz=tz)
. Neither of these seems great, so i feel like I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question about the decimal case. If coming from a database world its not uncommon to have to pick between different integer sizes, but I think rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128 or 256 bit implementation...but yea I'm not sure the best way to approach that
Doing something like pd.datetime(physical_type=something_that_specifies_tz)
is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would be pd.datetime(tz=tz)
and a user should not care if we use pandas or pyarrow underneath that.
Using #58315 as an example again, if we used the ser.dt.date
accessor on a series with a logical type of pd.datetime()
, then the user should expect back a pd.date()
back. That gives us a lot of flexibility as library authors - maybe we still want pd.datetime()
to be backed by our datetime extension by default, but could then use pyarrow's date implementation to back pd.date()
by default rather than having to build that from scratch
In today's world we are either asking users to explicitly use pa.timestamp()
to get back a true date type from ser.dt.date
, or they use the traditional pandas extension type and get back dtype=object
when they use the accessor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PDEP suggests the 99% use case would be pd.datetime(tz=tz)
This seems reasonable, but then im not clear on why physical_type is in the base class.
but could then use pyarrow's date implementation to back pd.date() by default
Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.
Are you referring to the physical_type
keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string()
, pa.large_string()
, object
, etc...The "datetime" type could be physically implemented as a pa.timestamp
or the existing pandas extension datetime
. Right now a decimal would only be backed by either a pa.decimal128
or pa.decimal256
If i have a non-pyarrow pd.datetime Series and i do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then i have mixed semantics.
In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp
or our own datetime type, and a date can only be implemented as a pa.date32
, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type
BTW im nitpicking a bunch, but im generally positive on the central idea here.
The feedback is very helpful and its good to flesh this conversation out so no worries. Much appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit for all cases. For example for our own masked Int64Dtype, what's the physical type? np.int64
? (but that's only for the data array, while there is also a mask array)
And is this physical type then always either a numpy type or pyarrow type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider the physical type in that case to be the pd.core.arrays.integer.Int64Dtype
but maybe we should add terminology to this PDEP?
In the scope of this document a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control how its data buffer and mask are laid out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.
Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256
The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific")
would also be helpful.
If i have a non-pyarrow pd.datetime Series and i do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then i have mixed semantics.
In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type
You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say,
pd.DatetimeTZDtype(unit="ns", tz="US/Pacific")
would also be helpful.
I think it should be as simple as:
>>> ser = pd.Series(dtype=pd.datetime(unit="ns", tz="US/Pacific")
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")
Or
>>> ser = pd.Series(dtype=pd.DatetimeDtype(unit="ns", tz="US/Pacific")
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")
You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.
I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.
When a user says "I want a datetime type" they should just get something that works like a datetime type. When they then say ser.dt.date()
they should get back something that works like a date. Having logical types works as a contract with the end user to promise we will deliver that. Whether we use NumPy, pyarrow, or something else underneath to do that is no concern of the end user
... | ||
|
||
@property | ||
def missing_value_marker -> pd.NA|np.nan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in most places we call this "na_value". Is there a reason to use a different name here? is pd.NA|np.nan
here just for the example's sake or is that a restriction you want to impose? What about None? Could a 3rd party EA author do something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - happy to change the name. I would rather not expand the scope beyond pd.NA
or np.nan
. I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable. With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable.
This is one of the main comments I was going to give: IMO this should not be configurable, even though it is true we live currently in this non-ideal world of multiple missing sentinels.
And it might makes things also more complex / dependent on other discussions, but I think for designing those new logical types we should somewhat assume we will in parallel also do a PDEP on converging on a single NA value for missing sentinel (it will actually makes things simpler for the design in this PDEP, it's just a bit a mess to discuss / implement all that on the short term because of the interdependencies).
Even if on the short term we have logical types that have either NA or NaN or NaT, I don't think that it should be something that the user can specify like in your example below (pd.string(missing_value_marker=np.nan)
). I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker). So we can still have this attribute (i.e. the current na_value
already does this), but it's just the query the information, not to set by the user.
With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.
If we want to move to those new logical types, we will anyway have a period where we have both the current types and the new types. So if we in parallel decide on a NA PDEP, it's also an option that those new logical types just all use NA. Whether that is feasible of course depends on the timeline of both developments (like ideally the string logical dtype for 3.0 would also already have used NA, but it does not because otherwise we could not yet introduce it by default right now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker).
I agree with this comment, but we have already implemented this in a way where we disconnect the "backend" from the NA marker with types like "pyarrow_numpy". I agree users ideally won't be touching this field (maybe it can be a read-only property?).
If we removed this altogether I think we would lose the ability to support a transition period for dtypes like pyarrow_numpy
and the python_numpy
string type you are working on, but I may be missing the bigger picture on that.
I would definitely support another PDEP to unify our missing value handling, but would be wary of trying to tackle two large PDEPs at once. With this field being a part of the logical type proposal, I think it helps us support all of the existing types we have already implemented and give us better flexibility to control / deprecate the various NA markers we allow today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're certainly right that allowing both makes a transition easier, as is what we are doing with the string dtype. And depending on the timelines of both changes (logical types / NA), it might indeed be needed for other dtypes as well.
So I think my main point is that we shouldn't show this as a "parameter" of the dtype construction, but just as a property of a certain dtype instance.
Essentially what we did for the string dtype was to call this a different "backend" (although multiple string dtype instances with different null value might both be using pyarrow or numpy, but the "backend" argument maps to the concrete EA Array class that is being used, and those are actually different), so it is tied to a specific implementation, and not user-specified. And although this is very ugly with the different strange backend names, I think this is a more future-proof option if we think longer term the NA value should not be user configurable.
It might be nicer / clearer to show this in the StringDtype repr as a separate property instead of being merged in the backend name (so eg StringDtype(storage="python", na_value=NA)
/ StringDtype(storage="python", na_value=NaN)
instead of StringDtype(storage="python")
/ StringDtype(storage="python_numpy")
(because what does "python_numpy" mean to a user.. ?).
But at the same time I also wouldn't want to embed too much the idea that na_value
is user configurable, it's just that for the transition period we need a way to be able to specify both versions of the dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that na_value
is not for users of the pandas library, but for developers of the pandas library. I think it should be read-only, and possibly a "hidden" property (prefix with an underscore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely support both of what @jorisvandenbossche and @Dr-Irv are saying here. I think where I am getting tripped up is how we allow users today to swap over to this paradigm.
If someone had a code base where they do ser = pd.Series(..., dtype="string[pyarrow_numpy]")
and elsewhere maybe just ser = pd.Series(..., dtype="string[pyarrow]")
, how do we think they can port to this new design? Or do we not think that is valuable?
To ensure we maintain all of the current functionality of our existing type system(s), a base type structure would need to look something like: | ||
|
||
```python | ||
class BaseType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this then get mixed into ExtensionDtype? if not, what is this used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it depends on whether we as a team want to go down the path of offering constructors like pd.string()
or pd.StringDtype()
; with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I think through it more it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas
Maybe this should be renamed to BaseLogicalType
to clairfy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then im unclear on what we're doing with it? and if EA authors are supposed to implement something for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the
.dtype
that users interact with? That's the logical type?
Yea that's where my head is at - essentially they would see / interact with a subclass of the BaseType
shown here
For external extension dtypes, I think it will be very hard to extend this concept of logical types to multiple external libraries, like your example of two independent external libraries implementing the same logical type.
Yea this is very tricky. I don't know what the best answer is but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things
I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but for the external extension dtypes that I know, those are almost all to be considered as "logical" types (eg geometries, tensors, ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these
That's definitely true that in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type.
So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your point 1) would definitely need to be developed further. Off the top of my head maybe an extension array would need to define an attribute like logical_type
going forward. If the logical type if "foo", we would let that be constructed using pd.foo
or pd.FooDtype
I think the value to 2) is that we could in theory also allow a third party EA to make its way into our logical type system. There are no shortage of string implementations that can be surmised, maybe there is some value in having an EA register yet another "string" logical type?
- Signed Integer | ||
- Unsigned Integer | ||
- Floating Point | ||
- Fixed Point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this correspond to pyarrow's decimal128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either decimal128 or decimal256
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you call it "Decimal", is there possible confusion with the python standard library decimal
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm even if there is I'm not sure that would matter. Do you think its that different today from the distinction between a python built-in int
versus a "int64"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, because the python decimal
type lets you specify any number of digits for the precision, as well as rounding.
So I don't know what the correspondence between pyarrow's decimal128 and decimal256 is. Just concerned that if you name this "decimal", there is potential confusion for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the arrow types also have a user-configurable precision / scale. The decimal128 variant offers up to 38 units of precision and 256 goes up to 76.
Taking a step back from Python/Arrow I think the problem you are hinting at is not new. Databases like postgres also offer Decimal types with different internal representations. So there might be some translation / edge cases to handle but generally I think calling it Decimal as a logical type signals the intent of the type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One can argue the same for "integer" which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the decimal
stdlib module, there a few more differences, but regardless from that, if we would add a decimal / fixed point dtype to pandas in the future, I would personally still call it "decimal" (also because of python using that name, regardless of some subtle differences)
|
||
## String Type Arguments | ||
|
||
This PDEP proposes that we maintain only a small subset of string arguments that can be used to construct logical types. Those string arguments are: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
off the top of my head, this would mean deprecating M8[unit], m8[unit], timedelta64[unit], period[freq], interval, interval[subtype, closed], category. (the first three of which are supported by np.dtype(...)
). Just checking that these are all intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely would want to deprecate using the NumPy types since those would be circumventing the logical type system. I am -0.5 on continuing to allow timedelta64 because that also refers directly to the extension array implementation name rather than a logical type.
Period seems logical to continue supporting. I don't think I've ever constructed an interval that way before so I'll withhold an opinion on that for now, but generally these are offered for legacy support and not the ideal way of creating types going forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely would want to deprecate using the NumPy types
See also some of the discussion in #58243.
But IMO it's a very hard step to actually deprecate specifying numpy dtypes, given how widespread this usage is. And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.
In any case, if we would deprecate that, it would need to go very slowly, and we should only do that a while after introducing the new pandas types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.
The benefit would be having some consistency in the type constructors we allow and it would decouple the user focus on something have to be NumPy.
I agree it would be a very long deprecation, but on the flip side I don't think a user really believes they are subscribing into any particular set of features when they ask for a np.int64
data type. They really just want a 64 bit integer.
Especially with integer types being zero copy there is very little benefit to asking for a np.int64
, and I would argue that is even counterproductive with the possibility of missing values being introduced
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards thinking deprecating these would cause a lot of churn for users without much clear benefit. To the extent that you want to encourage users to use different patterns, I'm fine with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing users to construct physical types like np.int64
alongside logical types like pd.int64
I think defeats the purpose of adding logical types though; in such a case we are just adding more type construction options, which is a growing problem in our current state that this PDEP is trying to solve
There also is a clear benefit to pd.int64
over np.int64
, namely the former has flexibility to better handle missing values whereas the latter locks you into numpy semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though;
In my mind, we wouldn't allow them to create a Series with a "physical type", we would just accept np.int64
as input for the dtype
argument, and specifying np.int64
or pd.int64()
would give the exact same result: a Series with whatever is the default int64 dtype at that point (right now that is the numpy based int64, in the future maybe a numpy masked int64 or pyarrow int64).
It's certainly true that this doesn't help with the "too many ways" to specify a dtype, and this PDEP would add yet another way, but at least a way that is consistent across all dtypes, and which ensures there is a clear mapping of whathever random input to the resulting dtype.
For example, np.int64
is actually not even a dtype but a scalar constructor (np.dtype("int64")
is the dtype), but conveniently accepted as a dtype argument in numpy. Just like the python types are also accepted, in addition to strings.
So with the logical types, you can say that (for now) all of np.int64
, np.dtype("int64")
, "int64"
, int
, "int"
all map to the same as pd.int64()
, and longer term we could discuss reducing the number of aliases being accepted.
(I didn't do that with this comment .. ;), but personally I think we should maybe better focus first on the actual (logical) dtypes (the new part in this PDEP), and we can still later consider which aliases should be allowed or not. Short term we need to keep them all working anyway)
EDIT: numpy has very extensive documentation about which values are exactly accepted as dtype descriptors/aliases at https://numpy.org/devdocs/reference/arrays.dtypes.html#specifying-and-constructing-data-types, we should probably do something similar for pandas at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fair. Providing compat is going to be a fact of life for this. Ultimately I don't think it would be that hard for a user to swap out np.int64
for the logical constructor (in such an instance it would just be find/replace), but yea I agree that can be something cleaned up over time. Its not critical to execution of this PDEP
|
||
## Bridging Type Systems | ||
|
||
An interesting question arises when a user constructs two logical types with differing physical types. If one is backed by NumPy and the other is backed by pyarrow, what should happen? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"constructs" here seems weird. do you mean when two objects with different storages interact? i.e. ser_pyarrow + ser_numpy? If so, I agree with the ordering below, and note that it can be governed by __pandas_priority__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the ordering as well, but I would be supportive of just raising as well
| string | N/A | pd.StringDtype() | pd.ArrowDtype(pa.string()) | | ||
| datetime | N/A | ??? | pd.ArrowDtype(pa.timestamp("us")) | | ||
|
||
It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #24662
- Period | ||
- Binary | ||
- String | ||
- Dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Categorical
instead? Because "dictionary" in Arrow terms is rather the physical type, not logical (like the string column could be dict encoded to save space, but still represent strings). While categorical (like pandas does it now) has specific semantics (i.e. fixed set of allowed values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference, but Dictionary
has the advantage of being more generic. With Categorical
as a logical type I still think we'd end up offering a separate Dictionary
logical type, which may or may not be a bad thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the point of a Dictionary logical type in addition to Categorical? When would one use that?
Suppose for the case of having dict encoded string values, either you just want to see those as strings, and then this is the String logical type, or either they define a strict set of categories and then this is the Categorical logical type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a logical Dictionary type could offer support for things like Python dictionaries or JSON structures that would seem an odd fit through the lens of a Categorical type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, we are talking about completely different things then ;) I was assuming you were using "dictionary" in the Arrow sense, but so you are using it in the python sense (in Arrow that is called a map
type).
I am fine with in the future having a map/dictionary type (I also don't think this list necessarily needs to be exhaustive, we can also focus on those types for which we have somewhat support nowadays).
But then I do think that "Categorical" is missing in the list above (and now using the term as our existing CategoricalDtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK makes sense. Yea I agree let's add Categorical separately then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW i also thought this referred to a Categorical-like dtype.
there is a JSONDtype in the extension tests (im guessing you implemented it long ago) that IIRC has a ton of xfails. id suggest getting that closer to fully-working before committing ourselves to first-class support for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear I don't think we want to offer a first class logical JSON type, but a user can construct that from a combination of the list / dictionary types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote, I think in other systems JSON is typically physically stored as a string (so that's a bit different from what the current test JSONDtype does)
Some big picture thoughts before I call it a day:
|
Cool thanks again for the thorough feedback. Responding to each question...
Noted and yea I was thinking this as well as I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see #58312
This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types but I am not aware of one where you as an end user would be responsible for picking that
Definitely - noted for next revision. But at a high level I am of the opinion that logical types have nothing to do with data buffers, null masks, or algorithm implementations. Continuing with the database analogy, a logical query plan would say things like "Aggregate these types" or "Join these types". A physical query plan by contrast would say "use this algorithm for summation" or "use this algorithm for equality". An extension type acts more like the latter case
I certainly dislike these too, but I'm not sure how our current type system could be ported to a logical type system without them. With our many string implementations being the main motivating case, I think this metadata is required for a logical type instance to expose so other internal methods know what they are dealing with. I would definitely at least love for these to be private, but it comes down to how much control we want to give users to pick "string" versus "string[pyarrow]" versus "string[pyarrow_python]" and whatever other iterations they can use today
To be clear I would sincerely hope a user specifying the backend is an exceptional case; that is only allowed because our current type system allows / quasi-expects it. The logical type is a stepping stone to untangle user requirements from implementation details |
In the database example do the different implementations have different behaviors? It strikes me as |
I agree there are a few parts of the Bridging section that are probably not needed and can be trimmed (eg the fact about following the C standard rules on casting / conversion), but on the other hand the text the will also have to be expanded significantly in other areas. For example all the questions Brock listed in the last item in his comment above (#58455 (comment)) will need answering. Same for capturing a summary of some of the discussion points we have been talking about (we already generated a lot comments in two days, we can't expect everyone to read all that to have a good idea of the discussion)
Setting the different null handling aside for a moment (because we know that is a big difference right now, but one we might want to do something about), I think that most reports about different behaviour should be considered as bug reports / limitations of the current implementation, and not inherent differences. (of course there will always be some differences, like small differences because of numeric stability in floating point algorithms, or different capitalization of the German ß (apache/arrow#34599), and it will always be a somewhat subjective line between implementation details versus clear and intentional behavioural differences. But the list of items linked above are all clear things that could be fixed on our side to ensure we provide a consistent behaviour, not inherent differences. It's our choice whether we want to provide a consistent interface or not) |
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(missing_value_marker=np.nan) | ||
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(physical_type=pa.string_view()) | ||
``` | ||
Note that the class-based approach would reuse existing classes like ``pd.StringDtype`` but change their purpose, whereas the factory function would more explicitly be a new approach. This is an area that requires more discussion amongst the team. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is a dtype, I'd prefer having Dtype
in the name. That favors the class based approach. But the factory function approach could be used as well, but I'd prefer pd.stringDtype()
over pd.string()
to make it clear that it's a dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool - I think @jorisvandenbossche has the same preference. So happy to start focus on that pattern more.
The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype()
gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont have a strong preference, but do think it will make the discussion easier to focus on just one variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I'll stick with that in the next revision. We can always revisit if that doesn't work out for whatever reason. Thanks for the feedback all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, my personal preference would be to go with the shorter functional constructors (pd.string()
, pd.int64()
, etc).
It's easier to use, less boilerplate, generally more similar to other libraries (this was my impression, but looking up a few examples, that is not necessarily true), and ties less into the exact Dtype class (in case two variants of one logical type want to use a different dtype class, although I am not sure this is necessarily something we want, though)
But agreed that for now, I think you can just use one throughout the document to explain everything, and then I would add one section on this topic mentioning some options and why which choice was made (and we can still further bikeshed on the exact syntax, but that feels like an actual bikeshed separate from the core ideas of the PDEP)
(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string()
. I seem to remember this also came up when I discussed topic in Basel last year)
The only concern I have is that we are repurposing something that has long existed.
pd.Int64Dtype()
gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing
Well, we still need to discuss what those logical type constructors would actually return ;) In my mind, it could continue to return our extension types (as I wouldn't necessarily distinguish user-facing extension types and logical types), and then pd.Int64Dtype()
would not necessarily have to change return value .. (except that it might gain additional keyword to e.g. get the pyarrow-backed integer dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like
pd.dtypes.string()
. I seem to remember this also came up when I discussed topic in Basel last year)
That works for me. I'm not a big fan of continually adding things to the top level pandas
namespace.
|
||
The ``BaseType`` proposed above also has a property for the ``missing_value_marker``. Operations that use two logical types with different missing value markers should raise, as there is no clear way to prioritize between the various sentinels. | ||
|
||
## PDEP-11 History |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## PDEP-11 History | |
## PDEP-13 History |
pd.Series(dtype=pd.list(value_type=pd.string())) | ||
``` | ||
|
||
## Bridging Type Systems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to address if people can convert between two physical types that back the same logical type?
below, you discuss where operations between 2 objects of different physical types, but same logical type, cause a priority in the result. But what if users want to convert from one physical type to another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great callout - I assume you have in mind something like adding a logical int + float.
I think the same rules should still apply, i.e. if one is physically a pyarrow int and the other a numpy float you should end up with logical float type backed by a physical pyarrow float
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or if I have a numpy backed string and I want to convert to a pyarrow backed string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those would both be under the logical string type, so if you tried to concatenate them together you would get a pyarrow string per the rules outlined here. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm saying that let's say you know that you have a Series
of logical type string
, and you also know that it is backed by numpy
string, and you want to convert that Series
to have a physical type of arrow
. Don't you need a public API that lets you convert from one physical backing to another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second is that you want an Index with timestamp[pyarrow] dtype to be (or just behave like) a DatetimeIndex.
Yea if we move to a place where the DatetimeIndex
is backed by the logical datetime type, then it becomes an implementation detail for us to control how the two different physical types need to fulfill the requirements of a logical datetime type. I think since the pyarrow / NumPy datetime types are zero-copy save for NaT handling the internal development for that should be a lot less burdensome than having a separate DatetimeIndex[pyarrow]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you would want to implement a new date dtype with numpy semantics?
No - there would be no need at all for that. In this particular case, if the logical requirement was to go from a datetime -> date, and physically our array happened to be backed by a NumPy array, we internally would go NumPy datetime -> pa.timestamp -> pa.date, with the pa.date being exposed as a pd.date
type; if the array was already backed by pyarrow we could skip the first step
This can get tricky if/when PDEP-10 is reverted.
I think even if PDEP-10 got reverted it would limit the logical types we offer (we'd essentially scrap date, list, dictionary, struct, etc...anything where only pyarrow implements something) until pyarrow becomes a hard dependency again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - there would be no need at all for that [...]
We're definitely talking past each other, because AFAICT the suggestion here directly contradicts discussion in #58315. Let's put a pin in this until the call on Wednesday and see if we can sort it out then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summarizing this part of the discussion from the call: the communication failure was in 1) what "mixed semantics" I was concerned about and 2) what Will meant by not needing to implement anything new.
Consider
dti = pd.date_range("2016-01-01", periods=3)
ser = pd.Series(dti)
ser.iloc[1] = pd.NaT
df = ser.to_frame("datetime")
df["date"] = df["datetime"].dt.date
Now suppose that df["date"] is backed (perhaps indirectly) by a pyarrow array. When I say the columns have "mixed semantics", I am referring to the propagation of nulls in comparisons
mask1 = df["date"] == dti[0].date()
mask2 = df["date"] != dti[0].date()
In this scenario mask1
and mask2
are both nullable and their second entry is pd.NA. Using these masks for indexing gives surprising results (raising ATM, different if the Ice Cream Agreement goes into effect) to users expecting numpy semantics. A user who has not opted in to null propagation has some columns that behave this way and some that do not. Avoiding that scenario is why we implemented "string[pyarrow_numpy]" to have arrow storage with numpy semantics.
OK, all of that is what I had failed to communicate. What Will was saying when he said (paraphrasing) "we wouldn't need to implement another dtype", that didn't exclude a thin wrapper to retain numpy propagation semantics with pyarrow storage.
@WillAyd can you confirm that the above is a valid summary of what we got sorted out at the end of the call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's correct and a nice summary. This is why the BaseType in the PDEP has the missing_value_marker
/ na_value
- it explicitly decouples the NA handling from the physical type. It specifically allows the logical type to cover the functionality of string[pyarrow_numpy]
as you described above, and as we talked about on the call theoretically also allows you to have a logical date type which uses pyarrow behind the scenes but can present pd.NaT
as the na_value
to the end user
I think this PDEP could help achieve more consistent NA handling over time since it decouples the user expectations from a "backend", but actually propogating one NA marker consistently is left to a separate PDEP
|
||
It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone). | ||
|
||
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)). | |
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though pandas claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)). |
|
||
``na_marker`` is expected to be read-only (see next section). For advanced users that have a particular need for a storage type, they may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantees that they will respect the storage backend and are free to convert to whichever storage the internals of pandas considers optimal (Arrow will typically be preferred). | ||
|
||
### Missing Value Handling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel wanted to ping this off you. When we talked I said would stick with np.nan as the missing value marker, but since we are reusing the pd.StringDtype()
in this version of the proposal I don't think we should move that away from pd.NA
Its not the default but maybe this pd.na_marker="legacy"
serves as a good middle ground?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there's a global option, it should go through pd.set_option/get_option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why data_manager instead of storage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason I wanted to include the word data is because we are strictly referring to the data buffer and nothing else. I don't have a strong preference on the terminology of this though
| pd.IntXXType() | pd.NA | np.nan | | ||
| pd.FloatXXType() | pd.NA | np.nan | | ||
| pd.StringDtype() | pd.NA | np.nan | | ||
| pd.DatetimeType() | pd.NA | pd.NaT | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Period, timedelta64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can make this a fully exhaustive list. For the Legacy missing value on BinaryDtype
, ListDtype
, StructDtype
and MapDtype
do you think np.nan
or pd.NA
makes more sense?
|
||
Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear. | ||
|
||
Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel like you're not engaging with what joris and i tried to explain about "you need to do it all at once" on Wednesday's call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry missed this before - what do you think is missing in this section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section dramatically expands the scope of this PDEP, to encompass a lot of what would be PDEP-NA. If you want to make PDEP-NA, great, go for it. Until then, this one should either a) be tightly scoped so as to not depend on PDEP-NA, or b) be explicitly "this is on the back-burner until after PDEP-NA". I prefer the first option, but you do you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your objection just to the default of pd.NA or to the more general approach of allowing na_marker to allow different nullability semantics while still using the same logical type? I am indifferent if we felt the need to start with the "legacy" approach of mixed sentinels, but I think the latter na_marker is a critical piece to execute this PDEP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your objection just to the default of pd.NA
Yes, that is a deal-breaker for me until we have a dedicated PDEP about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the na_marker="legacy" makes sense? Curious to know if you think the NA marker for logical types that have no NumPy equivalent (ex: ListDtype) should still be np.nan or pd.NA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bikeshedding naming aside, I'd be fine with that as long as it is the default until we do an all-at-once switchover following PDEP-NA.
I recognize this is more about the underlying primitive field types than collection types, however I just feel like the collections are arguably also primitive, and this issue has bugged me forever: To avoid needing to store critical reminders about array axis names and bounds in comments instead of the type hints, how could we include array shape hints (int bounds), and optional array axis names (str) or ideally axis types (convenient newtype int bounds, great for language server to catch shape bugs)? I've found it quite helpful to separate the axis names from the bounds, but in the case of typed axes, the axis name lives in the type of the bound, so the type checker can also do broadcast alignment. It's a bit tricky in some languages but probably a breeze in python and the only reason we're not doing this is we don't know what we're missing, imho we want the n-dimensional array shapes to be heterogeneous lists of axis bound pointer newtypes because then the type system can calculate the output shapes of our n-dimensional array operations better, as the world is moving rapidly into higher-dimensional data engineering, just seems reasonable to ensure Pandas has great n-dimensional array shape types, that is all |
Hey @bionicles thanks for providing some feedback. I don't quite follow what you are looking for though - is there maybe a small example you can provide to clarify? |
Absolutely, great idea, here is some code A: demonstrates how array shapes are defined by the data passed in, and bugs can be unclear, and the axes have no names Then we could hack out a basic data structure to hold the names and bounds for the axes in the shapes of the arrays. from typing import NewType, Type, Tuple, Protocol, Optional
from dataclasses import dataclass, field
import pytest
@dataclass
class Axis:
"one step along the path of indices to an element within a tensor"
name: str = "B"
bound: Optional[int] = 16
def __post_init__(self):
if self.bound is not None and self.bound < 0:
raise ValueError("Axis size must be non-negative")
def test_axis_nonneg():
with pytest.raises(ValueError):
Axis(bound=-1)
# We want to write the shapes of the arrays in the type system
Shape = Tuple[Axis, ...]
# Example:
# Rows = Axis("R")
# Columns = Axis("C") # oops, this is a value
# MatrixShape = Tuple[Rows, Columns] # TypeError: Tuple[t0, t1, ...]: each t must be a type. Got Axis(name='R', bound=16).
rows = Axis("R")
columns = Axis("C")
matrix_shape = (rows, columns)
def str_from_shape(shape: Shape) -> str:
if not shape:
return "()"
names = []
bounds = []
for a in shape:
names.append(a.name)
bounds.append(a.bound)
return f"Shape({names=}, {bounds=})"
def shape_from_characters_and_bounds(characters: str, bounds: Tuple[Optional[int], ...]) -> Shape:
"build a shape quickly"
characters = tuple(characters)
characters_and_bounds = zip(characters, bounds)
return tuple(Axis(name=c, bound=b) for c, b in characters_and_bounds)
def shape_from_shapes(a: Shape, b: Shape) -> Shape:
c = {}
for axis_in_a in a:
c[axis_in_a.name] = axis_in_a.bound
for axis_in_b in b:
# remember the linear algebra and numpy broadcasting rules
# if either is 1 or none, any value is OK, take largest
# however if they are both scalars greater than 1, (m, n)
# then m == n must hold for compatibility
if axis_in_b.name not in c:
c[axis_in_b.name] = axis_in_b.bound
else:
lhs_bound = c[axis_in_b.name]
rhs_bound = axis_in_b.bound
if (
(lhs_bound is not None and lhs_bound > 1) and
(rhs_bound is not None and rhs_bound > 1) and
(lhs_bound != rhs_bound)
):
message = f"Axis '{axis_in_b.name}' Bounds ({lhs_bound}, {rhs_bound}) Incompatible: \n- lhs={str_from_shape(a)}\n- rhs={str_from_shape(b)}\n"
raise ValueError(message)
else:
if lhs_bound is None and rhs_bound is None:
c[axis_in_b.name] = None
elif lhs_bound is None and rhs_bound is not None:
c[axis_in_b.name] = rhs_bound
elif lhs_bound is not None and rhs_bound is None:
c[axis_in_b.name] = lhs_bound
else:
# both are scalars, take largest
c[axis_in_b.name] = max(lhs_bound, rhs_bound)
c_shape = tuple(Axis(name=k, bound=v) for k, v in c.items())
return c_shape
def compatibility_from_shapes(a: Shape, b: Shape) -> bool:
try:
_c = shape_from_shapes(a, b)
return True
except ValueError:
return False B: demonstrates how clarifying the shape of the arrays can make similar bugs easier to understand https://colab.research.google.com/drive/1OVkRtwv747jXPI11kfKCGGFHWWtRs3ki?usp=sharing You could also transpose by axis names instead of their order, which is way more flexible because you could swap axes into desired positions by names regardless of their current order This is along the line of what i'm looking for, however requires features of newer versions of python and would screw up pandas backwards compatibility Anyway, then we could have a column that's type hinted to be a specific shape of array, with axes labeled, i bet it looks really cool "Array[u8, (B, H, W, C), (16, 28, 28, 1)]" is wayyy more clear than "array[int]" and that clarity will help type checkers help us avoid / understand / fix shape bugs which are prevalent in python AI code |
From a static type checking perspective, this is really difficult (and maybe impossible) to do, even to track the dtypes within a You're asking for doing something similar with tracking the shape of a So I don't think that what you are asking for is possible from a static typing perspective. |
Outside of @Dr-Irv comments I also think what you are asking for is not quite in the scope of this PDEP. I would suggest opening a separate issue for discussion about supporting that though |
If users don't know which backend they have, and the backends differ in behavior in albeit minor aspects, users will see this differences. Differences in behavior on the same input will be reported as bugs. 0.001% of 10 million PyPI downloads per day is 100. I think the only viable way forward is to treat these as mutually exclusive. |
How would you see us moving back to exclusive backends? Wouldn't that mean that the string data type would have to go back to being an object data type in the "numpy" system, or did you have something else in mind? |
Not sure what is meant by "exclusive backends". Do you mean where the user knows what backend they are using?
Not if we embrace the new string dtype. |
For ML/probability, "float" and "int" are underdetermined in the sense they're not clear on how big those need to be. Making it easier to type hint data types and harder to accidentally screw up numeric precisions could be a solid goal for this PDEP. Assigning various precisions of the same category of data type e.g. "float" "int" "uint" to a shared logical base, we could make it possible to type hint for a category. One struggle I have is type hinting datatype arguments, as sometimes we need to configure the data types more dynamically, and python generics are not done. For example, I pass two kinds of float dtype, one being precise to hold big numbers and another being imprecise to save memory import numpy as np
import ml_dtypes
DEFAULT_IMPRECISE_FLOAT = ml_dtypes.bfloat16
DEFAULT_PRECISE_FLOAT = np.float32
def eval_from_input_and_model(
input: Input,
model: Model,
imprecise_float_dtype: np.floating = DEFAULT_IMPRECISE_FLOAT,
precise_float_dtype: np.floating = DEFAULT_PRECISE_FLOAT,
^^^^^^^ logical dtypes can be precision-agnostic
) -> Eval:
... This causes type checker false positives with the dtype type hints because np.floating is np.floating[unknown] while np.float32 is np.floating[_32Bit] To make something that works with Python type hints / type checking today, it may make sense to make a class dtypes(Enum):
BOOLEAN = DataType(in_python=bool, in_np=np.dtype(bool), in_pa=pa.bool())
F32 = DataType(in_python=float, in_np=np.float32, in_pa=pa.float32())
F64 = DataType(in_python=float, in_np=np.float64, in_pa=pa.float64())
I32 = DataType(in_python=float, in_np=np.float32, in_pa=pa.float32())
I64 = DataType(in_python=int, in_np=np.int64, in_pa=pa.int64())
... so on for all precisions of float/int/uint, for strings, temporal stuff, etc...
) Benefit of this would be: we could always type hint our data types using a pd.dtypes.XYZ variant, and we could keep this "rosetta stone" of dtype conversion in one place, and it would give us a nice place to affix the methods to convert / display / check dtypes. A downside is, all the dtypes are in one giant enum, so it's hard to write
One possible workaround for that downside could be to make Literals for the Enum
The Literal["UTC", "America/New_York", ...] works to enable explicit type hinting for time zones, I've dealt with that a bunch this week. Sometimes these generics lead us into using We could try to experiment with an "Enum of Enums" approach where the class dtypes(Enum):
BOOLEAN = DataType(in_python=bool, in_np=np.dtype(bool), in_pa=pa.bool())
FLOAT = Float
)
class Float(Enum):
F32 = DataType(in_python=float, in_np=np.float32, in_pa=pa.float32())
F64 = DataType(in_python=float, in_np=np.float64, in_pa=pa.float64())
class Int(Enum):
I32 = DataType(in_python=float, in_np=np.float32, in_pa=pa.float32())
I64 = DataType(in_python=int, in_np=np.int64, in_pa=pa.int64()) That's a bit weird, and I've never done it, so I don't know if it even works, but it would have a benefit of grouping the dtypes by logical category and could possibly work better with Another downside to the Enum approach might be, how to fit To get this done faster, we could focus on a bijection between arrow and numpy dtypes. If we line up their dtype systems and count all the distinctions, that's a good list of edge cases, as mentioned above. Arrow has I can't claim to know all the edge cases and issues, but I think this PDEP is a good idea, and one way to make the implementation work could be to make an enum with dataclass values (I didn't even know this was possible until recently). Then we could make a Literal for each logical category of dtype with the variant names in that category. Edge cases:
Sorry for long comment, hope this helps guide us, my bad if these issues are already addressed |
@bionicles thanks for your input. In the PDEP text there is a link to the following image - I think there is some conceptual overlap between being able to lay out a type hierarchy and solve your typing needs: The details of how we actually need to implement that for type checkers might be a little too granular for this PDEP (maybe, maybe not) but hopefully this type of categorization is inline with what you are after |
I don't want to dismiss the type hinting topic (it's interesting and it would be nice if the PDEP could also improve the type hinting situation, although I will have to read and understand it in more detail to say anything constructive about it), but I would suggest moving that to a separate issue. (while this is the main PR for the PDEP, it's perfectly fine to split certain sub-topics into its own issue to have a more focused discussion, as this PR issue is becoming very lengthy) |
@WillAyd I have re-read the discussion of the last days above multiple times now, and I am not fully following (and as Brock mentioned, some different sentences seem a bit contradicting each other).
It's not entirely clear to me what you mean exactly with "just have one way of doing things", but I agree with the others that this does sound as two opposite options for which we need to make a choice (i.e. "mutually exclusive"): Either we say "let's accept that there are significant differences between backends, and we don't put effort in making behaviours consistent". But at that point users will care about those. Or either we say: "most users shouldn't have to care about which backend is being used, because that is an implementation detail that (except for some details) has not much directly visible (behavioural) impact". But at that point we will have to put effort in making the behaviour consistent across backends.
I quoted this a bit out of context, but this again seems a bit contradictory? If we do not want to support different behaviours, i.e. support both "numpy" behaviour and "pyarrow behaviour" at the same time (with which I agree), then that does mean we need to match behaviour (and put effort in making them match) between a backend based on numpy and based on pyarrow, no?
Personally, I think I am firmly in the camp of "let's do put effort in making behaviours consistent".
For our string dtypes (within the NaN-variants and within the NA-variants), we do put effort in making the behaviour consistent between the "python" and "pyarrow" storage. In the So I think this is a concrete example of where we clearly do put effort in making the backends behave consistently. For the people that argued otherwise, do you think we should also stop doing for that the string dtype, or just not follow that pattern for other dtypes? |
Thanks @jorisvandenbossche for the detailed reply. Some quick comments that hopefully clarify:
Mainly that users don't need to make choices based off of how our implementation decides to return different types. Like many of the issues labelled "pyarrow dtype retention" and "Dtype conversions", we have a lot of gaps in the API where we force a user to understand that "this method returns a pyarrow backed boolean data type because we think it can be more performant, but this method returns a NumPy bool because our extension array interface declared it as such." Implementation details like this don't need to be leaked into the API; we can just always return a logical type
The point of this quote within its context was to say that we don't need to try and reimplement NumPy's signed integer overflow handling (both pre NEP-50 and post NEP-50), nor does the use of INT64_MIN as the NaT sentinel in our own implementation of temporal data prevent us from having logical temporal types. It is totally acceptable to have some "undefined behavior," akin to how the C language does not 100% standardize every compiler, and maybe it would help to list those out as part of this PDEP. But generally edge cases shouldn't force us to abandon an entire system It also applies to the datetime example that I've quoted a few times in this document. In today's world a call to
Yes this does go back to the "one way of doing things" comment; I think the way that we are going about strings is great, although strings are even way more extreme than any of the other data types. Just a quick rundown of other types:
My expectation is that converting those all to logical in sum would be far less effort than what is going into the string type Hope that helps clarify but let me know what may still be unclear |
Also with the string example, our behaviors will "match" so far as there aren't differences between utf8proc and the Python stdlib. We definitely aren't going to try and bridge those gaps, but that hasn't prevented us from creating that logical type. |
Thanks, that clarifies. I think part of the confusion is that you said multiple times something along the line of "I think we should not put a lot of time into trying to make all the backends work the same", while I would personally interpret your current stance (but please correct if I am misinterpreting!) as "we should try to make backends work the same to the extent possible", but then the details and discussion are of course about that last italic part .., i.e. what is an acceptable difference and what not, what is possible to make consistent and what not. Because my interpretation is that you also think that (ideally) NA handling is the same across backends? That's clearly something we need to "put time into" if we want to make that happen. But I agree that the capitalization of German Eszett is not easily possible to make consistent (and overflow in operations is an area where we disagree, I think). |
Thanks @jorisvandenbossche . Your wording is much better than mine |
I think missing values can be handled consistently, although its up to you as an end user which NA "semantics" you want to use. Taking the following example from today's world: >>> values = ["foo", "bar", None]
>>> pd.Series(values).str.startswith("f")
0 True
1 False
2 None
dtype: object
>>> pd.Series(values, dtype="string").str.startswith("f")
0 True
1 False
2 <NA>
dtype: boolean With this PDEP you would get the logical data type back in the first example, based off of the rules in the |
I know that is today's world, but I would personally still argue that we should not keep allowing the user to choose the NA semantics in the future, but limit this to one set of behaviours. But yes that is PDEP-16, and it's a bit difficult to keep those separate. |
(sorry, long post coming)
Thinking a bit further on this, this raises two questions: what do we exactly understand with "matching behaviour", and why then have multiple backends if they can't differ? Matching behaviour across backendsAbove, there was some discussion about what we exactly understand under those "subtle" or "significant" differences that we would either want to allow or not allow. It's good to be concrete. For example, quoting Richard:
First, the missing value behaviour is not a "subtle" difference at all (it's actually so significant that there is a separate PDEP for it ;)). So I would say that most aspects related to NA handling (the exact sentinel to use, the propagation behaviour, how it's treated in indexing, whether they sort first or last, etc) are all signifcant behaviour differences, that indeed users do care about (and if we would allow that do differ, something that users should be aware of). But I would personally argue that this is not something that we should allow to differ between backends (hence PDEP-16). Brock gave some other examples:
The first one (support NaT or not) indeed depends a bit on PDEP-16, because if we would decide to not distinguish NA and NaT (the current proposal), then most of the behaviour differences (a certain operation returning NaT or not) go away (except for the tiny corner case of being able to store the smallest possible timedelta or not).
To clarify, pyarrow provides numeric operations in both a version that checks for overflow and one that does not, and we chose to use the checked version. So if we want to match behaviour with numpy, we can easily change our pyarrow dtypes to do so. Now, personally I prefer the behaviour that raises an error for overflow, but again if we want, we can also implement that for our numpy-based dtypes. Of course that comes with a performance penalty (and so maybe it should be possible to turn it off), but that's the case for the pyarrow dtypes now as well. Summarizing the currently raised examples of behaviour differences:
I would personally call none of those "subtle", but see those all as significant differences (and it is for this kind of differences that I personally argue that we should ensure this is consistent across backends). And referring to my older comment at #58455 (comment), I also think many currently known or reported differences with the pyarrow backend (certain type specific methods not yet working, something not working when set as the index, ...) should be considered as "bugs" or "not yet implemented". For the other side of the spectrum, the kind of differences that I personally would consider as edge cases that are acceptable, are for example the capitalization of the German Eszett or the fact that What's the point of multiple backendsIf they should behave exactly the same, what's then the point of having multiple backends? (as that could certainly be a reason that one would want a different backend, for a specific behavioural feature) One reason of course is the ability to have an optional dependency, but this gets us in the PDEP-10/15 discussion of making pyarrow required or not ... Especially when pyarrow would be a required dependency, one could question the need to have multiple backends for several of the default dtypes. For the string dtype there might be sufficient difference in performance characteristics (eg the object-dtype based ond is more efficient when mutating or sorting) that it might be worth to keep both options. But for example for the numeric (int/float) dtypes, if a possible numpy and parrow backend behaves the same and is more or less on par for performance, why maintain both? (and based on performance, I don't think it is necessarily a given that we would then pick pyarrow above numpy) I think an important reason for the backends is to let us allow more easily to experiment with a new implementation, before making something the default. For example, assume for a moment we already had a Another way to look at those backends is rather than a "dtype" backend, is to consider it as a "compute" backend or engine. Similarly like we use Anyway, this is maybe getting a bit far from the current discussion, but for the actual PDEP, I don't think we necessarily need to answer all the above questions (e.g. which backend for which dtype? keep multiple backends long term? etc). I would personally say something along the lines of 1) we provide a logical dtype for all our current dtypes (with a consistent interface, etc), and 2) they all have (obviously) at least one implementation (either using numpy, or using pyarrow, or a mixture where it makes sense), but if there are multiple implementations (i.e. parametrizations of the same dtype class), the intent is that they behave the same ("to the extent possible"). |
I'm a little bit torn on this. One one hand, it would definitely make for easier internal development if these were bundled together, but on the other hand I think there is some advantage to delivering separately, i.e. the logical types here could be a good stepping stone to achieve PDEP-16. For users that are very concerned about missing value handling, using the "legacy" mode (or whatever we want to call it) from the
I think what we might be overlooking with this example though is, assuming we really prefer the non-pyarrow behavior, there isn't really that much of a need to make pyarrow change at all. We can prefer the existing timedelta64 storage / computational behavior and just worry about efficient copy to
This is somewhat of a technical note but we can also punt on that behavior for some time, at least for arithmetic (conversions maybe need some more review). C23 introduces checked arithmetic implemented via
Thanks for this list. I think the PDEP can have a section dedicated to each of these behaviors. Happy to try and draft that, but am open to any content anyone provides up front
I think this is a great section, but I'd also add that we stand to gain a lot from the composition of these different providers. We have a ton of custom calendar functionality in pandas using NumPy that I don't see us reimplementing with pyarrow types. On the flip side, pyarrow has some really cool types that probably won't be implemented in NumPy (aggregate types, decimal, etc...). A the main tenet of this PDEP is that we can strategically offer the strengths of types from the different libraries in a way that is consistent and meaningful to the end user |
I've read all of this discussion, and am not sure how I feel about the behavior issue. My main reason is that while there is a summary of major behavior differences given by @jorisvandenbossche at #58455 (comment) , it seems from the discussion there are some minor behavior differences that would need to be better understood. However, in the future, whether the behavior is entirely consistent between backends, or the behaviors are the same "to the extent possible", documentation will be required to indicate those behavior choices. Even if the behaviors were 100% consistent across backends, if those behaviors differed between how things would work in pandas as compared to non-pandas code using So maybe there should be a discussion post somewhere (or a new piece of documentation) that explicitly documents the differences in behavior between the different backends that exist today, and then we can see where the line could be drawn in terms of what behaviors would be consistent and what behaviors would not be. |
Does it help if we say that this PDEP only concerns itself with how types are stored and not necessarily how they are computed? I think from the storage perspective it is very easy to technically validate what differences there are and are not. With computations, you run into questions like how is overflow handled, why does pandas use kahan summation while NumPy uses pairwise summation while pyarrow uses XXX, why does pandas default to ddof=1 for std while NumPy uses 0? I don't think anything needs to change with our computation engine as part of this. Long term, yes it might make more sense to use a different computation engine that scales out of core or can be integrated into a lazy evaluation context, but that is also a different PDEP |
I've been largely critical, here are some bits that I'm +1 on:
|
There are certainly pros and cons for each option, but maybe this is something we can actually table for later when things get more concrete, as I don't think it is critical for this PDEP whether each dtype will have support for different NA semantics. The dtypes that are currently already extension dtypes and that we would keep will have to support both for at least a transition period (like we now do for
I am not sure if the PDEP text should have a section on each of those, that goes maybe too much in the details? EDIT: I think this broadly matches with the first point ("general statement") of Brock's comment just above.
Yes, definitely. That's also why I included "or a mixture [of numpy and pyarrow] where it makes sense" (and mentioned the notion of considering it as a compute engine).
I think the PDEP should at least state something about behaviour (i.e. not necessarily which library is used, but the result of the computation), as that is the most relevant aspect for users. Not necessarily in detail (as we can't answer every possible question), but as I mentioned above at least in general terms together with some examples.
Not too important, but I wouldn't necessarily say that strings are the biggest motivator. For the dtype constructor |
>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA)).value_counts().dtype | ||
int64[pyarrow] | ||
>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=np.nan)).value_counts().dtype | ||
Int64Dtype() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int64Dtype() | |
dtype('int64') |
* physical_type: Can expose the physical type being used. As an example, StringDtype could return pa.string_view | ||
* na_value: Either pd.NA, np.nan, or pd.NaT. | ||
|
||
While these attributes are exposed as construction arguments to end users, users are highly discouraged from trying to control them directly. Put explicitly, this PDEP allows a user to request a ``pd.XXXDtype(storage="numpy")`` to request a NumPy-backed array, if possible. While pandas may respect that during construction, operations against that data make no guarantees that the storage backend will be persisted through, giving pandas the freedom to convert to whichever storage is internally optimal (Arrow will typically be preferred). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should each dtype necessarily expose each of those metadata items as a construction argument? If there is only one valid option (eg a dtype that is implemented only using numpy or pyarrow for storage, or that has only a single physical type), it seems not necessary to expose this as argument?
To implement this PDEP, we expect all of the logical types to have at least the following metadata: | ||
|
||
* storage: Either "numpy" or "pyarrow". Describes the library used to create the data buffer | ||
* physical_type: Can expose the physical type being used. As an example, StringDtype could return pa.string_view |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's true that there are multiple dtypes where this could be relevant, to Brock's point: it will not be relevant for all dtypes. So maybe make this an optional metadata item? (or maybe just give as example how the logical dtype can hide implementation details)
FWIW, I think for me this are (or should be) the two main points of the PDEP (and the text could mostly focus on those two items, going into more detail on each). |
@pandas-dev/pandas-core