Why does pandas work around numpy limitations with custom dtypes instead of fixing them upstream? #8350
Comments
@shoyer I'll give some comments, can copy to the list in a bit. Hopefully not TL;DR! There are 3 'dtype'-likes that exist in pandas that could in theory mostly be migrated back to numpy. These currently exist as the
So when we added support for
So I know I am repeating myself, but it comes down to this. The API/interface of the delegated methods needs to be defined. For ndarrays it is long established and well-known. So easy to gear pandas to that. However with a newer type that is not the case, so pandas can easily decide, hey this is the most correct behavior, let's do it this way, nothing to break, no back compat needed. |
I'm so glad you asked this question; I've been wondering about it, too. |
@ischwabacher does my answer shed some light? anything still not clear? |
It surely does. At one point I did go poking around in numpy to see if I could figure out where one would start to update |
FYI: Here are some of the numpy issue / proposals for datetime: http://numpy-tst.readthedocs.org/en/latest/neps/datetime-proposal3.html http://numpy-discussion.10968.n7.nabble.com/timezones-and-datetime64-td33407.html (I seem to remember a more recent 2014 discussion though) |
Additionally, this all has to work with missing data. |
@cpcloud yeah, I think that's a really key additional point: good handling of missing data. Plus, the need to provide a single API across multiple versions of numpy leads to some additional hacking around. |
@cpcloud maybe you'd like to send a follow-up to the numpy list w.r.t. missing data support et al. |
@jreback I'm not sure that would be very helpful. There's already been a pretty epic discussion: http://www.numpy.org/NA-overview.html. Not sure if there's much more to add. current status: https://github.com/njsmith/numpy/wiki/NA-discussion-status |
@cpcloud good point :) |
All we needed for pandas was NaN for integers and datetimes... alas! |
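As a small illustration of the "NaN for integers" point above, the following sketch shows how a recent NumPy treats missing values in integer and datetime arrays. It assumes a NumPy version that provides `np.isnat`; the releases under discussion in 2014 may have behaved differently, and the variable names are purely illustrative.

```python
import numpy as np

# An integer ndarray has no value reserved to mean "missing".
ints = np.array([1, 2, 3], dtype=np.int64)
try:
    ints[0] = np.nan
    assigned = True
except ValueError:
    # NumPy refuses to store a float NaN in an integer array.
    assigned = False

# Building an array from integers plus NaN silently upcasts to float64,
# which is the kind of workaround pandas falls back on.
upcast = np.array([1, 2, np.nan])

# datetime64 does carry its own missing-value marker, NaT.
dates = np.array(["2014-09-22", "NaT"], dtype="datetime64[D]")
missing = np.isnat(dates)

print(assigned, upcast.dtype, missing)
```

The upcast to float64 keeps the data usable, but it changes the dtype of a column as soon as one value goes missing, which is part of why missing-data support keeps coming up in this thread.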
Maybe I should add-- The reason this question came up is, the only reason
Nathaniel J. Smith |
@njsmith thanks for bringing all this up here's my 2 cents.
same here. we have around i would say 3-7 active devs at any given time, with @jreback doing a large majority of the work. i personally have dropped off quite a bit ever since starting work at continuum.
I agree that it's more awkward and i do understand the need to differentiate a that said, @njsmith is there a TODO list for that branch anywhere? there is a bloomberg hackathon this weekend and maybe folks could make some headway there. plenty of interesting low-level stuff there, i'm sure a ton to learn about numpy internals since i think the impl touches a lot of different parts of the code base i don't have much to say re getting people to contribute. i personally haven't contributed anything huge (i think maybe a bug report or two) but not for any particular reason other than lack of time. |
Thanks everyone for adding your perspective -- this has been very enlightening for me. @njsmith After reading your wiki page and NEPs, I am still not entirely sure I understand the resolution of the NAs in numpy discussion. What is the blocker for your miniNEP2 proposal to implement bit-pattern NAs via optional dtypes? Just the lack of an implementation? Or are there still arguments about whether it is even a good idea? Of course, I think your proposal is/was a great idea :). From a practical perspective, Categorical does seem like a case where -- hypothetically -- it could be done as a custom numpy dtype (if extending dtypes were easier), but probably implemented in pandas to avoid waiting on the numpy release cycle and to enable using klib. In theory, I don't think API design would be a blocker if we wrote the numpy dtype specifically for pandas. |
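To make the "bit-pattern NA" idea mentioned above more concrete, here is a minimal sketch that reserves one int64 value as the missing marker. This is not the miniNEP2 design or any NumPy API; `INT_NA`, `isna` and `na_sum` are hypothetical names chosen for illustration. The last lines check that datetime64 already stores NaT using the same reserved value.

```python
import numpy as np

# Hypothetical convention: the smallest int64 means "missing".
INT_NA = np.iinfo(np.int64).min

def isna(values):
    """Boolean mask of entries carrying the NA bit pattern."""
    return values == INT_NA

def na_sum(values):
    """Sum that skips entries carrying the NA bit pattern."""
    return int(values[~isna(values)].sum())

data = np.array([10, INT_NA, 32], dtype=np.int64)
mask = isna(data)
total = na_sum(data)

# datetime64 uses the same trick: NaT is stored as the minimum int64.
nat_bits = np.array(["NaT"], dtype="datetime64[ns]").view(np.int64)[0]

print(mask, total, nat_bits == INT_NA)
```

The cost of this approach is that every operation has to know about the sentinel, which is presumably why the proposal ties it to dedicated dtypes rather than to helper functions like these.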
Just to chime in on the "nobody wants to write C": I tried to see how On the other hand: I also didn't know where to start such an implementation in pandas, and without @jreback's start I wouldn't have found it either. |
Speaking of fixing things upstream, we should probably weigh in on PEP 431 -- Time zone support improvements. It seems like it would be bad if the stdlib finally implemented decent time zone support and it were incompatible with ours. |
@ischwabacher you and @rockg are the DST sticklers! go for it! |
I was thinking particularly of cases like datetime64, Categorical and GeoSeries (from geopandas).
I recently posted on the numpy discussion mailing list to try to get a sense of what solutions exist for writing custom dtypes without writing C. Unfortunately, it appears there's not much hope!
http://mail.scipy.org/pipermail/numpy-discussion/2014-September/071231.html
@njsmith suggested that I really should ask pandas developers to chime in to find out why they choose to work around numpy's limitations rather than enhance it. I would love it if someone who understands the "why" for the choices pandas made could add their perspective to that thread.
@jreback @JanSchulz any thoughts to add? or did I sum it up well enough with "nobody wants to write C"?
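For readers unfamiliar with the kind of workaround this issue asks about, the sketch below shows the general pattern of a dtype-like implemented in Python on top of a plain ndarray: integer codes plus a lookup table of categories. `TinyCategorical` is a hypothetical class written for this example and is not how pandas' Categorical is actually implemented.

```python
import numpy as np

class TinyCategorical:
    """Hypothetical categorical container; illustration only."""

    def __init__(self, values):
        # np.unique returns the sorted unique labels and, with
        # return_inverse=True, the integer code of each input value.
        self.categories, codes = np.unique(
            np.asarray(values), return_inverse=True
        )
        # Small integer codes are what actually get stored.
        self.codes = codes.astype(np.int8)

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, i):
        # Translate a stored code back into its label.
        return self.categories[self.codes[i]]

    def to_array(self):
        """Materialize the labels as an ordinary ndarray."""
        return self.categories[self.codes]

cat = TinyCategorical(["low", "high", "low", "mid"])
print(cat.codes, cat.categories, cat[3])
```

Everything here is pure Python over existing NumPy dtypes, so no C is needed, but NumPy itself has no idea that the int8 array represents categories.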