Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: implement DatetimeLikeArray #19902

Merged
merged 37 commits into from
Jul 2, 2018
Merged

Conversation

jbrockmendel
Copy link
Member

The medium-term goal: refactor out of DatetimeIndexOpsMixin/DatetimeIndex/TimedeltaIndex/PeriodIndex the bare minimum subset of functionality to implement arithmetic+comparisons for DatetimeArray/TimedeltaArray/PeriodArray. This PR does not do that.

What it does do is refactor out the subset of those methods that can be transplanted directly into the Array classes (i.e. cut/paste).

On its own this PR is not very useful, so think of it as a Proof of Concept/discussion piece.

cc @TomAugspurger since this is a precursor to getting a "real" PeriodArray.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable organization at first glance. Will look closer / think more about it later.

# ------------------------------------------------------------------
# Null Handling

@property # NB: override with cache_readonly in immutable subclasses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have a PR started that made something like a @maybe_cache_readonly? That'll look more appealing with this in place, else we'll just be overriding these just to mark them as cached.

Your PR probably did this, but ideally would would have a class attribute that indicates whether the class is immutable, and a single decorator for both.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah the PR you're thinking of did exactly that. At the time there was only one property/cache_readonly affected, but this was the motivation. It'll be easy to revive if it becomes necessary.

""" common ops mixin to support a unified interface datetimelike Index """
inferred_freq = cache_readonly(DatetimeLikeArray.inferred_freq.fget)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this isn't so bad...

@@ -174,8 +174,92 @@ def _new_DatetimeIndex(cls, d):
return result


class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
Int64Index):
class DatetimeArray(DatetimeLikeArray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we've discussed this anywhere yet, but I'm not sure if we want a plain DatetimeArray, just a DatetimeTZArray. We'll need to hash that out somewhere. That discussion probably depends on how public these EAs are going to be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a place where our goals overlap imperfectly. My goal is to ensure that Index/Series/DataFrame comparison/arithmetic behavior is consistent by having shared implementations of those methods. For that purpose I expect it'll be easier to have a single DatetimeArray for both aware/naive than to juggle DatetimeTZArray/ndarray[datetime64[ns]]

I may also be confused about what the "Extension" in Extension Array is for. I'm thinking of it largely as "extending numpy arrays", whereas the canonical usage may be for downstream users to extend pandas.

Regardless, if we reach consensus on this part of the diff, the next step is to move over a handful of methods that require only a 1-line change to wrap an Index object (e.g. DatetimeIndex.to_julian_dates)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that purpose I expect it'll be easier to have a single DatetimeArray

Completely agreed that a single implementation is the only sane way to achieve that.

I'm thinking of it largely as "extending numpy arrays",

That's right, but NumPy's datetime64[ns] is I think sufficient for us as far as tz-naive datetimes go (I may be wrong about us).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nump's impl has served us, but we it is immensely easier to have a real DTI be the actual underlying implementation as we can easily extend this. So I am onboard with @jbrockmendel here to have a combined DatetimeArray. We actually discussed this I think in 0.17.0 when I created this originally, but was rejected for compat with numpy. We could still have that (the issue is what .values outputs). but it makes the code much better to have this than not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback glad to hear you're on board. Any thoughts on the appropriate size/scope per PR to make reviewers' task easier?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think could move much of this to core/array/datetime.py

doing straight moves first then changes are good

@codecov
Copy link

codecov bot commented Feb 26, 2018

Codecov Report

Merging #19902 into master will increase coverage by <.01%.
The diff coverage is 96.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19902      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         154      158       +4     
  Lines       49659    49701      +42     
==========================================
+ Hits        45640    45680      +40     
- Misses       4019     4021       +2
Flag Coverage Δ
#multiple 90.28% <96.85%> (ø) ⬆️
#single 41.94% <74.84%> (+0.04%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/timedelta.py 100% <100%> (ø)
pandas/core/indexes/timedeltas.py 91.19% <100%> (-0.06%) ⬇️
pandas/core/indexes/datetimes.py 95.48% <100%> (-0.18%) ⬇️
pandas/core/indexes/period.py 92.58% <100%> (-0.1%) ⬇️
pandas/core/indexes/datetimelike.py 96.72% <100%> (+0.24%) ⬆️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/arrays/period.py 100% <100%> (ø)
pandas/core/arrays/datetimes.py 100% <100%> (ø)
pandas/core/arrays/datetimelike.py 92.75% <92.75%> (ø)
pandas/core/indexes/multi.py 94.96% <0%> (-0.01%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad76ffc...a684c2d. Read the comment docs.

@@ -121,8 +121,149 @@ def ceil(self, freq):
return self._round(freq, np.ceil)


class DatetimeIndexOpsMixin(object):
class DatetimeLikeArray(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe append Mixin to the name here, to indicate that this still can't be constructed and used on its own?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing purposes I was planning to implement a bare-bones __new__. That actually raises an important question: what is the canonical attribute to assign the values input to? For DTI/TDI/PI right now it is self._data, but for the Block subclasses it's self.values. Has a convention been established for ExtensionArrays?

(none of which is mutually exclusive with Mixin being a good suggestion)

@jbrockmendel
Copy link
Member Author

Moved the new classes to core.arrays directory. The set of cut/paste-able methods will be expanded when #19912 and #19903 go through.

@jbrockmendel
Copy link
Member Author

Now that #19800 is in, the follow-up to this can include all of the comparison method (the Index ops will need a ~2 line wrapper around the array ops)

@jbrockmendel
Copy link
Member Author

Thoughts on where to go with this? The steps after this are going to require a lot of work to carefully port the appropriate tests, so I'd like to keep slow-and-steady momentum.

@gfyoung gfyoung added Enhancement Internals Related to non-user accessible pandas implementation labels Mar 2, 2018
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Overall this looks good.

@jbrockmendel can you sketch out your next few steps here? Can you edit the OP in #19696? If not make a list and I'll add it, maybe as a sublist.

Can you add DatetimeArray, PeriodArray, and TimedeltaArray to pandas.core.arrays.__init__?

from pandas.core.algorithms import checked_add_with_arr


class DatetimeLikeArray(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Append Mixin to the class name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

""" common ops mixin to support a unified interface datetimelike Index """
inferred_freq = cache_readonly(DatetimeLikeArray.inferred_freq.fget)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment here to note why we do it like this (array is mutable, index is immutable).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@jbrockmendel
Copy link
Member Author

can you sketch out your next few steps here?

This is ~everything that can be cut/paste directly. The next step I have in mind is a handful of methods that need a 1-line wrapping in the Index classes. e.g. comparison methods are non-Index-specific right up until the last line when they wrap an ndarray in an Index. Following that are the arithmetic methods, for which the wrapping is less trivial. Does that answer the question?

Can you edit the OP in #19696?

Sure.

BTW let me know if time is an issue on this or other PRs. I've been distracted for the last couple of weeks with a bugfix-fork of statsmodels that's left a bunch of stuff here on hiatus.

@TomAugspurger
Copy link
Contributor

BTW let me know if time is an issue on this or other PRs.

It's starting to be. I think we want to do a release candidate in the next couple weeks. It'd be nice to have as much of the EA stuff done as possible. My plan is for groupby to be the last bit of API that we ensure works, and then pick up moving our other extension types over to the new interface. If you're able to take on any of that it'd be great.

bugfix-fork of statsmodels

That's unfortunate, but understandable :/ I'm hoping to push on a statsmodels release sometime shortly after pandas 0.23.0.

@jbrockmendel jbrockmendel mentioned this pull request Mar 18, 2018
15 tasks
@jbrockmendel
Copy link
Member Author

Making suggested changes now. Will push shortly.

re arrays.__init__, any thoughts about defining __all__ and removing the # noqas? Not a big deal, just a slight preference.

re statsmodels: see sm2. The vague hope is that it gets enough community traction to convince jpkt to take technical debt seriously, at which point fixes can be upstreamed and it can become unnecessary.

@TomAugspurger
Copy link
Contributor

re arrays.init, any thoughts about defining all and removing the # noqas? Not a big deal, just a slight preference.

No preference at all.

@jorisvandenbossche / @jreback this LGTM. Any concerns?

@jorisvandenbossche
Copy link
Member

How do I have to see this PR?
Is this is a first actual stab at implementing the array types for datetime/period/timedelta? Or more a reorganisation of code?

Because if it is the first, I probably have a bunch of comments on what we exactly want to put in the array classes. And also, if that is the case, I am not sure we want the Index to subclass them? I thought we would rather go for composition?

@TomAugspurger
Copy link
Contributor

Reorganization. DatetimeLikeArray isn't an ExtensionArray yet.

And also, if that is the case, I am not sure we want the Index to subclass them? I thought we would rather go for composition?

I've been going back and forth on which approach is best here. I'm slightly coming around to the idea of subclassing, but haven't 100% settled yet. I think that the changes here are going to be helpful either way, correct @jbrockmendel? At some point we'll either make DatetimeLikeArray inherit from ExtensionArray, or the Index classes will change to compose it.

@jbrockmendel
Copy link
Member Author

ResourceWarning in TestMangleDupes appears unrelated

@jbrockmendel
Copy link
Member Author

gentle ping

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have the feeling that this single step rather complicates things (eg the inheritance scheme), but since the goal is that this is only temporary, I suppose I don't really care :-)

@@ -1,2 +1,5 @@
from .base import ExtensionArray # noqa
from .categorical import Categorical # noqa
from .datetimes import DatetimeArrayMixin # noqa
from .periods import PeriodArrayMixin # noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this period? and below timedelta
I am fine with keeping the conflicting one as plural, if you want that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is "the conflicting one"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was speaking about the file name, not the order. So to use singular period instead of periods (and same for timedelta), as we discussed about before: https://github.com/pandas-dev/pandas/pull/19902/files/3a67bce8169663430005b2a36673132fc1e79f4c#r175774283

@jbrockmendel
Copy link
Member Author

@jreback just rebased. If we can push this through I can get the next step up over the weekend and we'll have a shot at finishing the transition at the sprint.

@jreback
Copy link
Contributor

jreback commented Jun 29, 2018

going to merge #21261 then have you rebase (not that I expect conflicts). Then can merge.

@jreback jreback added this to the 0.24.0 milestone Jun 29, 2018
@jreback jreback changed the title WIP: implement DatetimeLikeArray ENH: implement DatetimeLikeArray Jun 29, 2018
@jreback
Copy link
Contributor

jreback commented Jun 29, 2018

pls rebase

@jbrockmendel
Copy link
Member Author

ping

@jreback
Copy link
Contributor

jreback commented Jul 2, 2018

thanks!

@jreback jreback merged commit 7cd2679 into pandas-dev:master Jul 2, 2018
@jbrockmendel jbrockmendel deleted the dtarrays3 branch July 2, 2018 23:59
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants