ENH: implement DatetimeLikeArray #19902
Seems like a reasonable organization at first glance. Will look closer / think more about it later.
pandas/core/indexes/datetimelike.py
Outdated
# ------------------------------------------------------------------
# Null Handling

@property  # NB: override with cache_readonly in immutable subclasses
Did you have a PR started that made something like a @maybe_cache_readonly? That'll look more appealing with this in place, else we'll just be overriding these just to mark them as cached.
Your PR probably did this, but ideally we would have a class attribute that indicates whether the class is immutable, and a single decorator for both.
Yah the PR you're thinking of did exactly that. At the time there was only one property/cache_readonly affected, but this was the motivation. It'll be easy to revive if it becomes necessary.
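A rough sketch of the idea discussed above: one decorator that caches only when the owning class declares itself immutable. The names `maybe_cache_readonly`, `_immutable`, `ArrayLike`, and `IndexLike` are hypothetical and are not taken from the pandas codebase or from the earlier PR; plain lists stand in for the real data.

```python
class maybe_cache_readonly:
    """Hypothetical descriptor: cache the result only on immutable owners."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not getattr(type(obj), "_immutable", False):
            # Mutable owner: recompute on every access.
            return self.func(obj)
        # Immutable owner: compute once and store on the instance.
        cache = obj.__dict__.setdefault("_cache", {})
        if self.name not in cache:
            cache[self.name] = self.func(obj)
        return cache[self.name]


class ArrayLike:
    _immutable = False  # hypothetical class-level flag

    def __init__(self, values):
        self.values = list(values)
        self.calls = 0  # counts how often the property body runs

    @maybe_cache_readonly
    def hasnans(self):
        self.calls += 1
        return any(v is None for v in self.values)


class IndexLike(ArrayLike):
    _immutable = True  # same property, now cached
```

With this shape the subclass would only flip the flag instead of redefining each property.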
pandas/core/indexes/datetimelike.py
Outdated
""" common ops mixin to support a unified interface datetimelike Index """ | ||
inferred_freq = cache_readonly(DatetimeLikeArray.inferred_freq.fget)
Ah, this isn't so bad...
pandas/core/indexes/datetimes.py
Outdated
@@ -174,8 +174,92 @@ def _new_DatetimeIndex(cls, d):
    return result


class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
                    Int64Index):
class DatetimeArray(DatetimeLikeArray):
I don't think we've discussed this anywhere yet, but I'm not sure if we want a plain DatetimeArray, just a DatetimeTZArray. We'll need to hash that out somewhere. That discussion probably depends on how public these EAs are going to be.
This may be a place where our goals overlap imperfectly. My goal is to ensure that Index/Series/DataFrame comparison/arithmetic behavior is consistent by having shared implementations of those methods. For that purpose I expect it'll be easier to have a single DatetimeArray for both aware/naive than to juggle DatetimeTZArray/ndarray[datetime64[ns]].
I may also be confused about what the "Extension" in Extension Array is for. I'm thinking of it largely as "extending numpy arrays", whereas the canonical usage may be for downstream users to extend pandas.
Regardless, if we reach consensus on this part of the diff, the next step is to move over a handful of methods that require only a 1-line change to wrap an Index object (e.g. DatetimeIndex.to_julian_dates).
For that purpose I expect it'll be easier to have a single DatetimeArray
Completely agreed that a single implementation is the only sane way to achieve that.
I'm thinking of it largely as "extending numpy arrays",
That's right, but NumPy's datetime64[ns] is I think sufficient for us as far as tz-naive datetimes go (I may be wrong about us).
numpy's impl has served us, but it is immensely easier to have a real DTI be the actual underlying implementation as we can easily extend this. So I am onboard with @jbrockmendel here to have a combined DatetimeArray. We actually discussed this I think in 0.17.0 when I created this originally, but it was rejected for compat with numpy. We could still have that (the issue is what .values outputs), but it makes the code much better to have this than not.
@jreback glad to hear you're on board. Any thoughts on the appropriate size/scope per PR to make reviewers' task easier?
I think we could move much of this to core/array/datetime.py.
Doing straight moves first, then changes, is good.
Codecov Report
@@ Coverage Diff @@
## master #19902 +/- ##
==========================================
+ Coverage 91.9% 91.9% +<.01%
==========================================
Files 154 158 +4
Lines 49659 49701 +42
==========================================
+ Hits 45640 45680 +40
- Misses 4019 4021 +2
pandas/core/indexes/datetimelike.py
Outdated
@@ -121,8 +121,149 @@ def ceil(self, freq):
        return self._round(freq, np.ceil)


class DatetimeIndexOpsMixin(object):
class DatetimeLikeArray(object):
Maybe append Mixin to the name here, to indicate that this still can't be constructed and used on its own?
For testing purposes I was planning to implement a bare-bones __new__. That actually raises an important question: what is the canonical attribute to assign the values input to? For DTI/TDI/PI right now it is self._data, but for the Block subclasses it's self.values. Has a convention been established for ExtensionArrays?
(none of which is mutually exclusive with Mixin being a good suggestion)
Now that #19800 is in, the follow-up to this can include all of the comparison methods (the Index ops will need a ~2 line wrapper around the array ops).
Thoughts on where to go with this? The steps after this are going to require a lot of work to carefully port the appropriate tests, so I'd like to keep slow-and-steady momentum.
Sorry for the delay. Overall this looks good.
@jbrockmendel can you sketch out your next few steps here? Can you edit the OP in #19696? If not make a list and I'll add it, maybe as a sublist.
Can you add DatetimeArray, PeriodArray, and TimedeltaArray to pandas.core.arrays.__init__?
pandas/core/arrays/datetimelike.py
Outdated
from pandas.core.algorithms import checked_add_with_arr


class DatetimeLikeArray(object):
Append Mixin to the class name?
Will do.
pandas/core/indexes/datetimelike.py
Outdated
""" common ops mixin to support a unified interface datetimelike Index """ | ||
inferred_freq = cache_readonly(DatetimeLikeArray.inferred_freq.fget)
Comment here to note why we do it like this (array is mutable, index is immutable).
Will do.
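To illustrate the pattern in the reviewed line (reusing the array's property function via `.fget` and caching it on the immutable index), here is a small sketch. It uses the standard library's `functools.cached_property` as a stand-in for pandas' `cache_readonly`, and the class names and the `inferred_step` property are hypothetical, with plain lists standing in for the underlying data.

```python
from functools import cached_property  # stand-in for pandas' cache_readonly


class ArrayMixin:
    """Hypothetical mutable array-like."""

    def __init__(self, values):
        self._data = list(values)

    @property  # recomputed on each access because the array may be mutated
    def inferred_step(self):
        diffs = {b - a for a, b in zip(self._data, self._data[1:])}
        return diffs.pop() if len(diffs) == 1 else None


class IndexMixin(ArrayMixin):
    """Hypothetical immutable index-like."""

    # Same function as the array's property, but cached: the index is
    # immutable, so the value cannot go stale.
    inferred_step = cached_property(ArrayMixin.inferred_step.fget)
```

The array recomputes after mutation, while the index stores the first result in its instance `__dict__`.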
This is ~everything that can be cut/paste directly. The next step I have in mind is a handful of methods that need a 1-line wrapping in the Index classes. e.g. comparison methods are non-Index-specific right up until the last line when they wrap an ndarray in an Index. Following that are the arithmetic methods, for which the wrapping is less trivial. Does that answer the question?
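The "1-line wrapping" described above might look roughly like the following sketch, assuming the array-level method does all the index-agnostic work and returns raw values. `ArrayOpsMixin`, `IndexLike`, `_shift_values`, and `shift` are hypothetical names chosen for illustration, and plain lists stand in for ndarrays.

```python
class ArrayOpsMixin:
    """Hypothetical array-level mixin; plain lists stand in for ndarrays."""

    def __init__(self, values):
        self._data = list(values)

    def _shift_values(self, delta):
        # Index-agnostic work: returns raw values, not an Index.
        return [v + delta for v in self._data]


class IndexLike(ArrayOpsMixin):
    """Hypothetical Index class reusing the array implementation."""

    def shift(self, delta):
        # The only Index-specific step: wrap the raw result.
        return type(self)(self._shift_values(delta))
```

Under this split, the shared logic would live in one place and each Index method would reduce to the wrapping call.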
Sure. BTW let me know if time is an issue on this or other PRs. I've been distracted for the last couple of weeks with a bugfix-fork of statsmodels that's left a bunch of stuff here on hiatus.
It's starting to be. I think we want to do a release candidate in the next couple weeks. It'd be nice to have as much of the EA stuff done as possible. My plan is for groupby to be the last bit of API that we ensure works, and then pick up moving our other extension types over to the new interface. If you're able to take on any of that it'd be great.
That's unfortunate, but understandable :/ I'm hoping to push on a statsmodels release sometime shortly after pandas 0.23.0.
Making suggested changes now. Will push shortly. re re statsmodels: see sm2. The vague hope is that it gets enough community traction to convince jpkt to take technical debt seriously, at which point fixes can be upstreamed and it can become unnecessary.
No preference at all. @jorisvandenbossche / @jreback this LGTM. Any concerns?
How do I have to see this PR? Because if it is the first, I probably have a bunch of comments on what we exactly want to put in the array classes. And also, if that is the case, I am not sure we want the Index to subclass them? I thought we would rather go for composition?
Reorganization.
I've been going back and forth on which approach is best here. I'm slightly coming around to the idea of subclassing, but haven't 100% settled yet. I think that the changes here are going to be helpful either way, correct @jbrockmendel? At some point we'll either make DatetimeLikeArray inherit from ExtensionArray, or the Index classes will change to compose it.
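For readers weighing the two options, a minimal sketch of subclassing versus composition follows. All class names here are hypothetical stand-ins and do not reflect how pandas ultimately structured these classes; plain lists stand in for the underlying data.

```python
class DatetimeLikeArraySketch:
    """Hypothetical array class holding the shared implementation."""

    def __init__(self, values):
        self._data = list(values)

    def isna(self):
        return [v is None for v in self._data]


class SubclassedIndex(DatetimeLikeArraySketch):
    """Option 1, subclassing: the index inherits every array method."""


class ComposedIndex:
    """Option 2, composition: the index holds an array and delegates."""

    def __init__(self, values):
        self._array = DatetimeLikeArraySketch(values)

    def isna(self):
        # Each exposed method is an explicit pass-through.
        return self._array.isna()
```

Subclassing exposes everything by default, while composition makes the index choose which array methods it exposes.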
ResourceWarning in TestMangleDupes appears unrelated
gentle ping
I still have the feeling that this single step rather complicates things (eg the inheritance scheme), but since the goal is that this is only temporary, I suppose I don't really care :-)
pandas/core/arrays/__init__.py
Outdated
@@ -1,2 +1,5 @@
from .base import ExtensionArray  # noqa
from .categorical import Categorical  # noqa
from .datetimes import DatetimeArrayMixin  # noqa
from .periods import PeriodArrayMixin  # noqa
can you make this period? and below timedelta
I am fine with keeping the conflicting one as plural, if you want that
which is "the conflicting one"?
Sorry, I was speaking about the file name, not the order. So to use singular period instead of periods (and same for timedelta), as we discussed before: https://github.com/pandas-dev/pandas/pull/19902/files/3a67bce8169663430005b2a36673132fc1e79f4c#r175774283
@jreback just rebased. If we can push this through I can get the next step up over the weekend and we'll have a shot at finishing the transition at the sprint.
going to merge #21261 then have you rebase (not that I expect conflicts). Then can merge.
pls rebase
ping
thanks!
The medium-term goal: refactor out of DatetimeIndexOpsMixin/DatetimeIndex/TimedeltaIndex/PeriodIndex the bare minimum subset of functionality to implement arithmetic+comparisons for DatetimeArray/TimedeltaArray/PeriodArray. This PR does not do that.
What it does do is refactor out the subset of those methods that can be transplanted directly into the Array classes (i.e. cut/paste).
On its own this PR is not very useful, so think of it as a Proof of Concept/discussion piece.
cc @TomAugspurger since this is a precursor to getting a "real" PeriodArray.