Fix/time series interpolation is wrong 21351 #56515

cbpygit · 2023-12-15T10:57:59Z

closes Time Series Interpolation is wrong #21351
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jbrockmendel · 2023-12-20T17:29:36Z

FAILED pandas/tests/resample/test_time_grouper.py::test_groupby_resample_interpolate - TypeError: '<' not supported between instances of 'int' and 'tuple'

Any idea what is causing this? Seems unlikely to be "expected is wrong" for this one (which i assume is the case for the other one)

jbrockmendel · 2023-12-20T17:33:13Z

pandas/core/resample.py

+        if not is_period_index:
+            final_index = result.index
+            missing_data_points_index = obj.index.difference(final_index)
+            if len(missing_data_points_index) > 0:


in the opposite case where the difference is empty, i think the sort_index and .loc below (which makes a copy) can be avoided

@jbrockmendel Can you please clarify? "in the opposite case where the difference is empty" --> then missing_data_points_index will be empty, so its length will be zero. In that case, the if-expression evaluates to False, so that the loc/sort is indeed not executed. Maybe I am overlooking something.

jbrockmendel · 2023-12-20T17:36:50Z

IIUC in the motivating case we have a series whose index has freq1 and a new freq2 that are not multiples of each other. Is what you're doing here similar to

freq3 = LCD(freq1, freq2)
result = series.resample(freq3).interpolate(...).asfreq(freq2)

And if so, would that be more efficient/clearer than what you're doing here? I guess it wouldn't work if the original index doesn't have a freq? Or might blow up if the LCD happens to be tiny?
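A minimal sketch of the alternative described in this comment, on made-up data. It assumes a toy series sampled every 40 minutes and a target frequency of 60 minutes; the 20-minute intermediate frequency is picked by hand as a grid that contains both. `LCD` above is pseudocode, not a pandas function, and the variable names here are illustrative only.

```python
import pandas as pd

# Toy series sampled every 40 minutes (freq1); the value equals minutes since midnight.
index = pd.date_range("2023-01-01 00:00", periods=4, freq="40min")
series = pd.Series([0.0, 40.0, 80.0, 120.0], index=index)

# Hand-picked common finer frequency (freq3): every 40-minute and every
# 60-minute timestamp also lies on a 20-minute grid.
fine = series.resample("20min").interpolate("linear")

# Thin the fine grid back down to the target frequency (freq2).
result = fine.asfreq("60min")
print(result)
```

This only works because the original timestamps happen to lie on a regular grid. For an index without a frequency there is no such common grid to pick, which is the limitation the comment itself raises.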

jbrockmendel · 2023-12-20T17:37:39Z

is there a doctest this changes/fixes?

cbpygit · 2023-12-20T19:20:40Z

> IIUC in the motivating case we have a series whose index has freq1 and a new freq2 that are not multiples of each other. Is what you're doing here similar to
>
> freq3 = LCD(freq1, freq2)
> result = series.resample(freq3).interpolate(...).asfreq(freq2)
>
> And if so, would that be more efficient/clearer than what you're doing here? I guess it wouldn't work if the original index doesn't have a freq? Or might blow up if the LCD happens to be tiny?

@jbrockmendel I don't think this is a "pathological" case, but in general, this type of resampling should be agnostic to what sort of index it is.

It is helpful to think of the input data as a "measurement" of a function $f(x)$ at some points $x_i$, and of the resampling/interpolation as asking for the function values at a "higher number" of points $x_j$. The code identifies this correctly as upsampling. The interpolator is a "piecewise assumed function" $g(x)$ that (hopefully) describes how the data behaves between adjacent points $x_i, x_{i+1}$. It is irrelevant whether the original sampling followed a precise frequency. Instead, it is important that the interpolator can take into account as many points in the vicinity of the new points $x_j$ as it needs to evaluate $g$. (In the simple linear case 2 adjacent points are enough, but for more complex interpolation more anchor points may be used.)

The problem with the previous implementation is that those anchor points were, more or less randomly, excluded from the interpolation process. All we need to do is make sure none of the original data points is missing because this would mean throwing away valid information. This is achieved by

result = concat(
    [result, obj.loc[missing_data_points_index]]
)

The remaining code is just gymnastics around this. I can help make this more clear in the code or here, but I don't think it can be simplified further.
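The idea in this comment can be sketched outside of pandas internals as follows. This is an illustration under stated assumptions, not the code from the PR: the helper name `interpolate_keeping_anchors` is hypothetical, the data is made up, and `method="time"` is chosen here so that the interpolation weights points by their timestamps.

```python
import pandas as pd


def interpolate_keeping_anchors(obj: pd.Series, freq: str) -> pd.Series:
    """Hypothetical helper: interpolate onto a regular grid without
    dropping original points that lie off that grid."""
    # Regular target grid covering the original data.
    target = pd.date_range(obj.index.min().floor(freq), obj.index.max(), freq=freq)
    result = obj.reindex(target)

    # Original points that are not on the target grid would otherwise be lost.
    missing_data_points_index = obj.index.difference(result.index)
    if len(missing_data_points_index) > 0:
        result = pd.concat([result, obj.loc[missing_data_points_index]]).sort_index()

    # Interpolate with every anchor point present, then keep only the grid.
    return result.interpolate(method="time").loc[target]


# Measurements at 00:10, 00:50 and 01:30; the value equals minutes since midnight.
obj = pd.Series(
    [10.0, 50.0, 90.0],
    index=pd.date_range("2023-01-01 00:10", periods=3, freq="40min"),
)
on_grid = interpolate_keeping_anchors(obj, "30min")
print(on_grid)
```

Reindexing alone keeps only the 01:30 measurement, so nothing earlier on the grid could be interpolated. With the off-grid anchors added back, the grid points at 00:30 and 01:00 are filled from their neighbours, while 00:00 stays NaN because no measurement precedes it.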

cbpygit · 2023-12-21T12:00:17Z

> FAILED pandas/tests/resample/test_time_grouper.py::test_groupby_resample_interpolate - TypeError: '<' not supported between instances of 'int' and 'tuple'
>
> Any idea what is causing this? Seems unlikely to be "expected is wrong" for this one (which i assume is the case for the other one)

@jbrockmendel I spent quite a while looking at this now. The problem here is caused by interpolation applied to a GroupBy object. I could not work out how, inside Resampler.interpolate, to access or recreate the MultiIndex that the original data frame would have had if it had not been resampled 😞. The resampling happens somewhere in self._upsample. So there is a lack of information here, in a sense. I think I would need help to make it work for the groupby case, and it would feel unnatural to support it for non-grouped, but not grouped operations.

cbpygit · 2023-12-21T19:00:37Z

@jbrockmendel I took another look and added an implementation for multi-indexes (resulting from groupby). This fixes test_groupby_resample_interpolate, all tests in resample are green for me now locally.

cbpygit · 2023-12-21T22:28:58Z

And regarding pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate, which seems to be the test that crashes in most of the other test suites: I don't see how this test was ever passing 🤷‍♂️. The input datetime_series has an index created using date_range with freq="B". So this index is not evenly spaced, as it has only business days. A linear interpolation that respects the index will therefore never result in np.arange(len(datetime_series), dtype=float).
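The spacing argument can be checked with a small script. It is a sketch that rebuilds a comparable series by hand (the start date and length are assumptions, not the test fixture itself) and uses `method="time"` to interpolate by timestamp.

```python
import numpy as np
import pandas as pd

# Business-day index: weekends are skipped, so the spacing is uneven.
index = pd.date_range("2000-01-03", periods=30, freq="B")
ts = pd.Series(np.arange(30, dtype=float), index=index)

# Distinct gaps between consecutive timestamps, in days.
gaps_in_days = index.to_series().diff().dropna().dt.days.unique()
print(gaps_in_days)

# Blank out one full business week and interpolate by timestamp.
with_gap = ts.copy()
with_gap.iloc[5:10] = np.nan
by_time = with_gap.interpolate(method="time")
print(by_time.iloc[5:10])
```

The gaps are one day within a week and three days across a weekend. The filled values come out at roughly 5.8, 6.4, 7.0, 7.6 and 8.2 rather than 5.0 to 9.0, because the ten calendar days between the two surrounding anchors are not split evenly by the five missing business days.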

MarcoGorelli · 2023-12-28T19:33:56Z

thanks for your PR

there are still some failing tests:

FAILED pandas/tests/series/methods/test_interpolate.py::TestSeriesInterpolateData::test_interpolate - AssertionError: Series are different

Series values are different (13.33333 %)
[index]: [2000-01-03T00:00:00.000000000, 2000-01-04T00:00:00.000000000, 2000-01-05T00:00:00.000000000, 2000-01-06T00:00:00.000000000, 2000-01-07T00:00:00.000000000, 2000-01-10T00:00:00.000000000, 2000-01-11T00:00:00.000000000, 2000-01-12T00:00:00.000000000, 2000-01-13T00:00:00.000000000, 2000-01-14T00:00:00.000000000, 2000-01-17T00:00:00.000000000, 2000-01-18T00:00:00.000000000, 2000-01-19T00:00:00.000000000, 2000-01-20T00:00:00.000000000, 2000-01-21T00:00:00.000000000, 2000-01-24T00:00:00.000000000, 2000-01-25T00:00:00.000000000, 2000-01-26T00:00:00.000000000, 2000-01-27T00:00:00.000000000, 2000-01-28T00:00:00.000000000, 2000-01-31T00:00:00.000000000, 2000-02-01T00:00:00.000000000, 2000-02-02T00:00:00.000000000, 2000-02-03T00:00:00.000000000, 2000-02-04T00:00:00.000000000, 2000-02-07T00:00:00.000000000, 2000-02-08T00:00:00.000000000, 2000-02-09T00:00:00.000000000, 2000-02-10T00:00:00.000000000, 2000-02-11T00:00:00.000000000]
[left]:  [0.0, 1.0, 2.0, 3.0, 4.0, 5.8, 6.4, 7.0, 7.6, 8.2, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
[right]: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
At positional index 5, first diff: 5.8 != 5.0
FAILED pandas/tests/series/methods/test_interpolate.py::TestSeriesInterpolateData::test_nan_irregular_index - AssertionError: Series are different

If these were wrong to begin with, could you please update them so that they have the correct values and pass? That would need doing as part of this PR

CI would need to be green for this to be mergeable (furthermore, people are much more likely to review a PR which is green)

cbpygit · 2024-01-02T16:51:04Z

All green now @MarcoGorelli @jbrockmendel

MarcoGorelli · 2024-01-03T17:06:25Z

thanks - aiming to get to this this month

cbpygit · 2024-04-08T07:54:21Z

I don't think the failing test cases are related to my code @MarcoGorelli @mroeschke, what should I do in this case? It occurred after syncing with main.

MarcoGorelli

Looks good to me, thanks @cbpygit for having stuck with this one!

Leaving open a bit in case there are further comments

mroeschke

Could you add a whatsnew note in v3.0.0.rst under the Other section of the Bug Fix group? Thanks for sticking with this, almost there!

mroeschke · 2024-04-24T19:42:05Z

Thanks @cbpygit

* fix: Fixes wrong doctest output in `pandas.core.resample.Resampler.interpolate` and the related explanation about consideration of anchor points when interpolating downsampled series with non-aligned result index.
* Resolved merge conflicts
* fix: Fixes wrong test case assumption for interpolation
  Fixes assumption in `test_interp_basic_with_non_range_index`. If the index is [1, 2, 3, 5] and values are [1, 2, np.nan, 4], it is wrong to expect that interpolation will result in 3 for the missing value in case of linear interpolation. It will rather be 2.666...
* fix: Make sure frequency indexes are preserved with new interpolation approach
* fix: Fixes new-style up-sampling interpolation for MultiIndexes resulting from groupby-operations
* fix: Fixes wrong test case assumption when using linear interpolation on series with datetime index using business days only (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate`).
* fix: Fixes wrong test case assumption when using linear interpolation on irregular index (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_nan_irregular_index`).
* fix: Adds test skips for interpolation methods that require scipy if scipy is not installed
* fix: Makes sure keyword arguments "downcast" is not passed to scipy interpolation methods that are not using `interp1d` or spline.
* fix: Adjusted expected warning type in `test_groupby_resample_interpolate_off_grid`.
* fix: Fixes failing interpolation on groupby if the index has `name`=None. Adds this check to an existing test case.
* Trigger Actions
* feat: Raise error on attempt to interpolate a MultiIndex data frame, providing a useful error message that describes a working alternative syntax. Fixed related test cases and added test that makes sure the error is raised.
* Apply suggestions from code review
  Co-authored-by: Matthew Roeschke <[email protected]>
* refactor: Adjusted error type assertion in test case
* refactor: Removed unused parametrization definitions and switched to direct parametrization for interpolation methods in tests.
* fix: Adds forgotten "@" before pytest.mark.parametrize
* refactor: Apply suggestions from code review
* refactor: Switched to ficture params syntax for test case parametrization
* Update pandas/tests/resample/test_time_grouper.py
  Co-authored-by: Matthew Roeschke <[email protected]>
* Update pandas/tests/resample/test_base.py
  Co-authored-by: Matthew Roeschke <[email protected]>
* refactor: Fixes too long line
* tests: Fixes test that fails due to unimportant index name comparison
* docs: Added entry in whatsnew
* Empty-Commit
* Empty-Commit
* Empty-Commit
* docs: Sorted whatsnew
* docs: Adjusted bug fix note and moved it to the right section

---------

Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
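The 2.666... figure quoted in the commit message for index [1, 2, 3, 5] and values [1, 2, np.nan, 4] can be reproduced with a short check. This sketch uses `method="index"`, which is an assumption made here to get index-aware linear interpolation; the thread does not state which method the test itself calls.

```python
import numpy as np
import pandas as pd

# Values 1, 2, NaN, 4 at index positions 1, 2, 3, 5.
s = pd.Series([1.0, 2.0, np.nan, 4.0], index=[1, 2, 3, 5])

# Linear interpolation that uses the numeric index as the x-axis.
filled = s.interpolate(method="index")

# By hand: 2 + (4 - 2) * (3 - 2) / (5 - 2) = 2.666...
by_hand = 2 + (4 - 2) * (3 - 2) / (5 - 2)
print(filled.loc[3], by_hand)
```

The missing point sits one third of the way from index 2 to index 5, so it receives one third of the rise from 2 to 4. A result of 3 would only be right if the index positions were equally spaced.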

cbpygit mentioned this pull request Dec 15, 2023

Time Series Interpolation is wrong #21351

Closed

MarcoGorelli self-requested a review December 15, 2023 18:01

jbrockmendel reviewed Dec 20, 2023

pandas/core/resample.py (outdated)

jbrockmendel reviewed Dec 20, 2023

cbpygit requested review from attack68, rhshadrach, Dr-Irv, WillAyd, datapythonista and mroeschke as code owners January 2, 2024 11:05

cbpygit added 7 commits January 2, 2024 13:17

fix: Fixes wrong doctest output in `pandas.core.resample.Resampler.interpolate` and the related explanation about consideration of anchor points when interpolating downsampled series with non-aligned result index.

ff6d12f

Resolved merge conflicts

1593af0

fix: Make sure frequency indexes are preserved with new interpolation approach

dd8b8d3

fix: Fixes new-style up-sampling interpolation for MultiIndexes resulting from groupby-operations

a04a3a2

fix: Fixes wrong test case assumption when using linear interpolation on series with datetime index using business days only (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate`).

efbba10

fix: Fixes wrong test case assumption when using linear interpolation on irregular index (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_nan_irregular_index`).

0294464

cbpygit force-pushed the fix/time-series-interpolation-is-wrong-21351 branch from 065129c to 0294464 Compare January 2, 2024 12:20

fix: Adds test skips for interpolation methods that require scipy if scipy is not installed

537f8bf

cbpygit requested a review from jbrockmendel January 2, 2024 16:50

cbpygit requested a review from mroeschke April 5, 2024 07:29

Merge branch 'main' into fix/time-series-interpolation-is-wrong-21351

c0547b5

Merge branch 'main' into fix/time-series-interpolation-is-wrong-21351

d6382f8

MarcoGorelli approved these changes Apr 12, 2024

mroeschke reviewed Apr 12, 2024

pandas/tests/resample/test_base.py (outdated)

mroeschke reviewed Apr 12, 2024

pandas/tests/resample/test_time_grouper.py (outdated)

cbpygit and others added 5 commits April 13, 2024 11:07

Update pandas/tests/resample/test_time_grouper.py

4e9a616

Co-authored-by: Matthew Roeschke <[email protected]>

Update pandas/tests/resample/test_base.py

c655bf1

Co-authored-by: Matthew Roeschke <[email protected]>

Merge branch 'main' into fix/time-series-interpolation-is-wrong-21351

e916da9

refactor: Fixes too long line

eaa7e07

tests: Fixes test that fails due to unimportant index name comparison

649bfa2

mroeschke reviewed Apr 15, 2024

cbpygit added 5 commits April 24, 2024 13:22

docs: Added entry in whatsnew

4cfbbf1

Empty-Commit

76794e3

Merge branch 'main' into fix/time-series-interpolation-is-wrong-21351

6ad9b26

Empty-Commit

6555141

Empty-Commit

48850cc

mroeschke reviewed Apr 24, 2024

doc/source/whatsnew/v3.0.0.rst (outdated)

cbpygit added 3 commits April 24, 2024 18:52

Merge branch 'main' into fix/time-series-interpolation-is-wrong-21351

8eea71c

docs: Sorted whatsnew

7f957cf

Merge remote-tracking branch 'origin/fix/time-series-interpolation-is-wrong-21351' into fix/time-series-interpolation-is-wrong-21351

51e95e0

cbpygit requested a review from mroeschke April 24, 2024 17:11

docs: Adjusted bug fix note and moved it to the right section

12bdd90

mroeschke approved these changes Apr 24, 2024

mroeschke added this to the 3.0 milestone Apr 24, 2024

mroeschke merged commit 4f7cb74 into pandas-dev:main Apr 24, 2024
45 of 46 checks passed

Dr-Irv mentioned this pull request May 12, 2024

BUG: In main, using resample().interpolate(inplace=True) raises an exception #58690

Open

3 tasks
