Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve accuracy of to_pytimedelta #57841

Merged
merged 9 commits into from
Mar 25, 2024
Merged

Conversation

rohanjain101
Copy link
Contributor

@rohanjain101 rohanjain101 commented Mar 14, 2024

>>> a = pd.Timedelta(1152921504609987375)
>>> a.to_pytimedelta()
datetime.timedelta(days=13343, seconds=86304, microseconds=609988)
>>>

609988 for microseconds but expected is 609987

>>> a = pd.Timedelta(1152921504609987374)
>>> a.to_pytimedelta()
datetime.timedelta(days=13343, seconds=86304, microseconds=609987)
>>>

In this example, a difference of 1 nanosecond, shouldn't result in a difference of 1 microsecond when converted from ns to us unit.

@rohanjain101 rohanjain101 marked this pull request as ready for review March 14, 2024 21:49
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @rohanjain101

I think both the title of the PR and the description are misleading. You are not really improving the accuracy of to_pytimedelta, and an increase in a nanosecond doesn't cause an increase in a microsecond when calling to_pytimedelta. You simply found the point where the rounding change to the next nanosecond.

I do agree that the way the function works is misleading. The problem is that most people would expect this code:

>>> datetime.timedelta(microseconds=12.5)
datetime.timedelta(microseconds=12)

to round up to 13 instead to 12. For good or for bad, this is how Python rounding works:

>>> round(12.5)
12

The round function or the same algorithm is probably used inside of datetime.timedelta, so we get the behavior your experienced when using pandas to_pytimedelta.

The reasoning for how round works is complex, but it's the default rounding strategy for a reason. I'm not sure if in this particular case the rounding strategy you propose could be better. If you think there is a reason to change it, please let us know the reason in the PR description, or maybe better, open an issue to discuss.

I assume from the information provided that you just found Python's rounding strategy misleading, so I'll close this issue, but if there is a better reason than that for this PR, please feel free to add a comment and reopen.

@datapythonista datapythonista added the Timedelta Timedelta data type label Mar 14, 2024
@rohanjain101
Copy link
Contributor Author

rohanjain101 commented Mar 14, 2024

Ok, but the reason this behavior is occurring I don't believe is related to python's round function. It's because python's integer division goes through a floating point which introduces accuracy issues for large integers (and in this case, result is rounded anyway). The issue is with the division, not the rounding done by timedelta.

@datapythonista
Copy link
Member

Sorry, you are correct. I didn't realize the problem with the floating point precision.

I wonder what's the impact on performance of this change, did you execute a benchmark to compare the execution times of both implementations?

@MarcoGorelli @jbrockmendel thoughts on this change?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a whatsnew note in v3.0.0.rst? I think this change looks OK.

(FWIW, Timestamp.to_pydatetime also just seems to truncate the nanoseconds component. It would be nice to match that behavior but that would be an API change)

@rohanjain101
Copy link
Contributor Author

Added whatsnew, are there existing benchmarks for comparing performance?

@rohanjain101 rohanjain101 requested a review from mroeschke March 25, 2024 12:24
@mroeschke mroeschke added this to the 3.0 milestone Mar 25, 2024
@mroeschke mroeschke merged commit cf40e56 into pandas-dev:main Mar 25, 2024
46 checks passed
@mroeschke
Copy link
Member

Thanks @rohanjain101

@rohanjain101 rohanjain101 deleted the to_td branch March 25, 2024 23:04
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* improve accuracy of to_pytimedelta

* f

* f

* whatsnew

* f

---------

Co-authored-by: Rohan Jain <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants