DISC: path to nullable-by-default integers and floats #58243
Comments
The migration path and behaviour impact of moving towards the nullable dtypes is something that will have to be described and decided in the future PDEP. So good to start discussing that here! With the recent experience of the opt-in of …
How would that avoid mixed-propagation behaviour? Do you mean in the case that we would only add integer/boolean dtypes with NA, and not yet for other types? (in which case using NaN instead of NA indeed keeps things more consistent)
That's a good question ... ;), and indeed something we need to discuss more. I personally think there is value in 1) having all dtypes/arrays stored in a pandas object be an ExtensionDtype / ExtensionArray (for internal consistency, but also for users, such that e.g. …

Personally I don't think it will be useful for the large majority of our users to allow that generally, and it would only be confusing (and in theory it's relatively straightforward to write your own EA to store one of the numpy dtypes we don't handle natively). Specifically for int/bool, what would be the value of allowing it? It would feel a bit like having the ability to say that a certain column can have no nulls (like databases often have a nullability flag for a certain column in the schema).
On the other hand, it will make the upgrade path a lot smoother if we still allow using numpy dtypes wherever a user specifies a dtype (and automatically translate that to the equivalent pandas dtype).
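For illustration only, a minimal sketch of what such automatic translation could look like; the `to_nullable_dtype` helper and its mapping are hypothetical, not an existing pandas API:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from numpy dtypes to the equivalent nullable
# pandas extension dtypes (illustrative only, not an existing pandas API).
_NUMPY_TO_NULLABLE = {
    np.dtype("int64"): pd.Int64Dtype(),
    np.dtype("int32"): pd.Int32Dtype(),
    np.dtype("float64"): pd.Float64Dtype(),
    np.dtype("bool"): pd.BooleanDtype(),
}

def to_nullable_dtype(dtype):
    """Translate a numpy dtype specification to its nullable pandas equivalent."""
    dtype = np.dtype(dtype)
    return _NUMPY_TO_NULLABLE.get(dtype, dtype)

# A user writing dtype=np.int64 would then transparently get the nullable Int64:
s = pd.Series([1, 2, None], dtype=to_nullable_dtype(np.int64))
print(s.dtype)  # Int64
```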
The discussed changes with nullable dtypes are indeed going to have a big impact on users. I personally find it hard to judge whether it will be easier or harder for users to update by slicing it into two separate changes (each change is smaller, but you have (somewhat related) breaking changes twice).

But I thought it might be a good exercise to think through all the potential areas of direct user impact of making this change. That is something we will have to do anyway to document for users (and for the PDEP, to decide we are OK with such a set of changes), and it might also give a better idea of the kind of changes we are talking about and how they would be separated if doing the change in two steps.

First attempt at listing the user impact of moving to nullable extension dtypes:
Other things?
I see three main paths:
I've landed on 2) as my preferred option because a) it is a much smaller user-facing change than 1) (mostly c and b).
I've been giving some thought to how we can move towards having nullable integer/bool dtypes by default (from the ice cream agreement last August).
Terminology note: I am using "nullable" to mean "supports some missing sentinel", without taking a stance on what that sentinel is or what semantics it has.
On the user end, I think it will need to be opt-in for a while. This can mirror the future option for the pyarrow-backed hybrid string dtype. In the medium term, we can implement hybrid Integer/Boolean dtypes/EAs that use NaN as their sentinel. This will minimize the behavior changes users see and avoid introducing mixed-propagation behavior. A subsequent deprecation cycle can move to all-propagating.
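For concreteness, the opt-in could follow the same pattern as the existing future option for the pyarrow-backed string dtype; the `future.nullable_numeric` option name below is purely hypothetical:

```python
import pandas as pd

# Existing opt-in pattern (pandas >= 2.1): strings are inferred as the
# pyarrow-backed string dtype instead of object.
pd.set_option("future.infer_string", True)

# A hypothetical analogous opt-in for nullable integer/bool defaults
# (this option does not exist today; the name is illustrative only):
# pd.set_option("future.nullable_numeric", True)
# pd.Series([1, 2, 3]).dtype  # would then be a nullable/hybrid Int64 variant
```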
Open Questions
- When a user specifies `dtype=np.int64`, do we warn/raise or map that to the future dtype (assuming the user has opted in)?
- What happens to checks like `df.dtypes == np.int64`?

Now that I write that out, I'm talking myself into being strict on this front and avoiding headaches down the road.
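To make the second question concrete, here is how that comparison behaves today once a column already uses the nullable `Int64` dtype (output shown as comments, as of current pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": pd.array([1, 2, None], dtype="Int64"),   # nullable extension dtype
    "b": np.array([1, 2, 3], dtype="int64"),      # plain numpy dtype
})

# Code that checks dtypes against numpy types would silently change meaning
# if nullable dtypes became the default:
print(df.dtypes == np.int64)
# a    False   <- Int64 (nullable) does not compare equal to np.int64
# b     True
# dtype: bool
```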
Thoughts?
cc @jorisvandenbossche @phofl