DISC: Supporting numpy StringDType in Pandas #58503
Comments
I think PDEP-13 is going to be important for this. We have so many new string dtypes... while they all have merits in their own right, I don't think this makes for a good end user experience, and it is confusing how to produce and control them throughout their lifecycle in our codebase.
100%
what about
Mixing the dtype systems is a concern to others as well as myself.
I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.
Ah, I see that you did mention this in #57073 (comment), but there has been no direct response to that comment to date.
Thanks @simonjayhawkins for providing all of this input. I would support what I think you are asking for, with either a new PDEP or a revote/reclarification on PDEP-10, before investing a lot of effort into these. I do agree that we have quasi-worked around what we agreed to in a lot of smaller PRs and are not in an ideal state with our string dtypes. Between the different string implementations, nullability semantics, infer_strings settings, dtype_backend arguments, requiring versus not requiring pyarrow, etc., I personally find it challenging to navigate where we stand now. At the very least, having this discussed and communicated in one central location should be beneficial.
Whether to use the new numpy 2.0 string dtype in pandas is IMO unrelated to the question of requiring pyarrow.
If we were to require pyarrow as a hard dependency, then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype, and to converge on a single string dtype implementation in pandas.
Reasonable, but not obvious, e.g. if the user expects to be doing
Motivation
Once numpy 2.0 becomes commonplace, users will probably try to pass StringDType strings into pandas. As long as numpy is a required dependency of ours, I think it makes sense that we support these strings natively and not force conversion to object/Arrow. It would also provide an alternative to needing Arrow for a performant string dtype.
Supporting the new StringDType also has maintenance benefits, since it provides a path to getting rid of the object dtype that doesn't depend on requiring Arrow, because the string ufuncs that operate on StringDType are designed to match Python string semantics.
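For concreteness, a minimal sketch of what the numpy 2.0 side looks like (assuming numpy >= 2.0 is installed; the pandas behaviour at the end is only the status quo today, not the proposed behaviour):

```python
import numpy as np
import pandas as pd

# numpy 2.0 exposes the new variable-width UTF-8 string dtype as
# np.dtypes.StringDType.
arr = np.array(["pandas", "numpy", "strings"], dtype=np.dtypes.StringDType())

# The np.strings ufuncs operate natively on StringDType arrays and are
# designed to match Python str semantics.
print(np.strings.upper(arr))    # ['PANDAS' 'NUMPY' 'STRINGS']
print(np.strings.str_len(arr))  # [6 5 7]

# At the time of writing, pandas does not keep this dtype when ingesting
# the array; it typically converts to object or an Arrow-backed string
# dtype depending on the pandas version and options.
s = pd.Series(arr)
print(s.dtype)
```

The last two lines only illustrate today's conversion behaviour; the point of this issue is to keep the numpy string data natively instead.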
Also, it might be able to supplant the `pyarrow_numpy` dtype. I'm not sure what the plan for the `pyarrow_numpy` stuff will be long term, but I think that if the performance of numpy strings is OK, we can just infer to them by default (if numpy is selected as the dtype backend).

Implementation Details
One thing that we'll probably want to discuss is the dtype naming conventions for pyarrow/numpy strings.
I'm really not a fan of the `string[pyarrow_numpy]` naming scheme (since there's ambiguity in this name as to whether the array is actually backed by an Arrow or numpy array, since both are in the name :) ). Maybe we can deprecate and rename this to something like `string[pyarrow_nplike]`, or just `string[nplike]` if we want to replace the `pyarrow_numpy` strings altogether (where `nplike` would default to numpy 2.0 strings if you have numpy 2.0 installed, and fall back to Arrow if not, should the `pyarrow_numpy` dtype go away in the future).

PDEP-13 may also be tangentially related here (I haven't had the time to go through the discussion there yet, though).
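For reference, a rough sketch of the current spelling versus the suggested one (only `string[pyarrow_numpy]` exists today, added in pandas 2.1 and requiring pyarrow; `string[nplike]` is purely hypothetical, taken from the suggestion above):

```python
import pandas as pd

# Existing name: Arrow-backed storage with numpy-style (NaN-based)
# missing-value semantics -- both backends appear in the name, which is
# the ambiguity described above.
arr = pd.array(["a", "b", None], dtype="string[pyarrow_numpy]")
print(arr.dtype.storage)  # 'pyarrow_numpy'

# Hypothetical replacement spelling (NOT implemented), where "nplike"
# would pick numpy 2.0's StringDType when available and fall back to an
# Arrow-backed array otherwise:
# arr = pd.array(["a", "b", None], dtype="string[nplike]")
```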
Anyone have any thoughts on this?
cc @pandas-dev/pandas-core @ngoldbaum (who I'm working on this with)