Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add fold support to Timestamp constructor #31563

Merged
merged 137 commits into from
Feb 26, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Feb 2, 2020

PERF Note: Current implementation slows down the Timestamp constructor. I've tested this thoroughly and tracked it down to ba7fcd5 where I changed the function signatures at the very beginning of working on the PR. The performance overhead appeared before any of the logic was implemented.

Reasoning for changes

We currently don't support fold from PEP 495, and this causes us a lot of grief, including, but not limited to, broken Timestamp representations at the edge of DST shift (see #31338). The values can also break.

Support of the fold attribute helps us deal with that. Now, if the wall clock time occurs twice due to a DST shift, we will know exactly at which point in time we are.

This also removes inconsistencies between using pytz and dateutil: now both the values and representations match for both timezone packages near DST summer/winter shifts.

Scope of PR

Implementing fold into Timestamp can easily get out of hand. Parent pydatetime has it easy, as the object doesn't need to sync its underlying epoch time, fold, and the representation, like we do.

This PR is already large, so I propose we limit its scope to minimal consistent fold support:

  • when fold is explicitly supplied, attempt to shift value across the DST boundary according to the value supplied
  • infer fold from value during tz-aware Timestamp construction. For example, if Timestamp is built from epoch time, then we can infer fold.
  • pass the resulting fold to the underlying datetime constructor in create_timestamp_from_ts so that it gets stored in the object (can't assign it directly as it's read-only).
  • implement local timezone support for fold. Don't have any idea how to do this though.

Things I suggest we leave to discussion and other PRs:

  • conflicts resolution. A user can supply fold=1 for a datetime that is nowhere near a DST shift. We can raise and Error or a Warning in the future. For now, we check whether fold can be 1, and if not, we leave it as default fold=0.
  • consider reintroducing ambiguous time errors. Currently, the implementation assumes fold == 0 for ambiguous time like Timestamp("2019-10-27 01:30:00", tz='Europe/London'). The error was dropped to mirror the fold=0 default in pydatetime and to allow the fold attribute to be bint in Cython functions.

Note: pydatetime doesn't infer fold from value and doesn't raise errors when you assign fold=1 nowhere near a fold. Example:

from datetime import datetime
from dateutil.tz import gettz

dt = datetime.datetime(2015, 10, 27, 5, 30, 0, 0, gettz("Europe/London"), fold=1)
dt.fold
OUT:
1

Performance problems

This implementation behaves as I expect it to, but it slows down scalar constructor benchmarks by 30-70%, even for tz-naive benchmarks.

Update: since the benchmark slowdown appeared before any of the functionality, this has nothing to do with the logic introduced and all to do with adding bint fold to function signatures, apparently. I'm afraid I don't know Cython well enough to research further.

Default behaviour near DST boundary is now for fold=0
@AlexKirko
Copy link
Member Author

@jreback
Sorry if this PR is waiting for its turn. Pinging you in case it fell through the cracks.

@jreback jreback added this to the 1.1 milestone Feb 23, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question and some minor comments. @mroeschke and @pganssle ok here?

tz="dateutil/Europe/London", fold=1)
ts

For more, see :ref:`Timezone section <timeseries.timezone>` in the user guide on working with timezones.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference the sub-section here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -2220,6 +2220,32 @@ you can use the ``tz_convert`` method.

rng_pytz.tz_convert('US/Eastern')

.. note::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this a sub-section (e.g. Fold), add a reference (so you can link to it in whatsnew)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be the same level as working with time zones

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"timezones."
)

if hasattr(ts_input, 'fold'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed? (e.g. we are passing fold now to all of the construction functions)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still needed when we pass a naive datetime, fold, and a timezone.

@jreback
Copy link
Contributor

jreback commented Feb 23, 2020

also pls merge master as some cosmetic changes where done in cython. ping on green.

# Allow fold only for unambiguous input
if fold is not None:
if fold not in [0, 1]:
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test that hits this exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I've been out of the loop on this PR.

Looks pretty good from the surface. Just one comment.

@AlexKirko
Copy link
Member Author

AlexKirko commented Feb 25, 2020

Linux 32 bit azure pipeline fails to build for everyone: #32229
Doesn't look like it has anything to do with this PR.

@AlexKirko
Copy link
Member Author

AlexKirko commented Feb 25, 2020

UPDATE: outdated, see below
All green except failures that don't have anything to do with this PR.

Linux py36_32bit fails to build and is being handled in issue #32229. macOS py36_macos has failed test_value_counts_unique_nunique, which is being handled in issue #32220 and PR #32201

Docs changed (tested ref locally, and everything seems to work), test added for invalid fold values, master merged. ts_input.replace within hasattr if block is still necessary, because it sets fold for naive-datetime ts_input passed together with fold and a timezone.

Resolve conflict in timestamps.pyx by accepting
the master version.
@AlexKirko
Copy link
Member Author

@jreback
All green now. Merged in new changes that fix broken CI.

Changes you requested made.

ts_input.replace within hasattr if block is still necessary, because it sets fold for naive-datetime ts_input passed together with fold and a timezone.

@AlexKirko AlexKirko requested a review from jreback February 26, 2020 07:03
@jreback jreback merged commit 6e04264 into pandas-dev:master Feb 26, 2020
@jreback
Copy link
Contributor

jreback commented Feb 26, 2020

thanks @AlexKirko very nice! and thanks to @pganssle and @mroeschke for the lively discussion!

@AlexKirko
Copy link
Member Author

@jreback @pganssle @mroeschke
Great! Thanks to everyone for all the effort you made to make the changes as robust as possible. We didn't hit exactly 256 comments or 128 commits, but other than that I'm very happy with the result.
A special thank you to @pganssle for the in-depth research and discussion!

@pganssle
Copy link
Contributor

Sorry I couldn't give this one last once-over, but thanks so much for putting in the hard work to get this right, @AlexKirko.

And now there's PEP 495 support in pandas, just in time for PEP 615 (discussion, currently underway)!

@mroeschke
Copy link
Member

Great work @AlexKirko! Really happy to have this in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Timezones Timezone data dtype
Projects
None yet
6 participants