Fix/time series interpolation is wrong 21351 #56515

Merged

cbpygit
Contributor

@cbpygit cbpygit commented Dec 15, 2023

@jbrockmendel
Member

FAILED pandas/tests/resample/test_time_grouper.py::test_groupby_resample_interpolate - TypeError: '<' not supported between instances of 'int' and 'tuple'

Any idea what is causing this? Seems unlikely to be "expected is wrong" for this one (which i assume is the case for the other one)

if not is_period_index:
    final_index = result.index
    missing_data_points_index = obj.index.difference(final_index)
    if len(missing_data_points_index) > 0:
Member

In the opposite case, where the difference is empty, I think the sort_index and .loc below (which makes a copy) can be avoided.

Contributor Author

@jbrockmendel Can you please clarify? "in the opposite case where the difference is empty" --> then missing_data_points_index will be empty, so its length will be zero. In that case, the if-expression evaluates to False, so that the loc/sort is indeed not executed. Maybe I am overlooking something.

@jbrockmendel
Member

IIUC in the motivating case we have a series whose index has freq1 and a new freq2 that are not multiples of each other. Is what you're doing here similar to

freq3 = LCD(freq1, freq2)
result = series.resample(freq3).interpolate(...).asfreq(freq2)

And if so, would that be more efficient/clearer than what you're doing here? I guess it wouldn't work if the original index doesn't have a freq? Or might blow up if the LCD happens to be tiny?
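
The idea in this comment can be sketched roughly as follows. This is only an illustration under assumptions that are not stated in the thread: "LCD" is read as the greatest common divisor of the two step sizes, both steps are whole numbers of seconds, the original index is evenly spaced, and `interpolate_via_common_grid` is a hypothetical helper name.

```python
import math

import pandas as pd


def interpolate_via_common_grid(series: pd.Series, new_freq: str) -> pd.Series:
    """Hypothetical helper: resample onto a grid that contains both the old
    and the new timestamps, interpolate there, then keep only the new grid."""
    # Assumption: the index is evenly spaced, so the first step is "the" step.
    old_step = series.index[1] - series.index[0]
    new_step = pd.Timedelta(new_freq)

    # Assumption: "LCD" is read as the greatest common divisor of both steps,
    # computed here in whole seconds.
    common_seconds = math.gcd(
        int(old_step.total_seconds()), int(new_step.total_seconds())
    )
    common_step = pd.Timedelta(seconds=common_seconds)

    # Every original point lies on the fine grid, so none is dropped before
    # interpolating. The fine grid can become very large if the divisor is tiny.
    fine = series.resample(common_step).interpolate(method="linear")
    return fine.asfreq(new_freq)


hourly = pd.Series(
    [0.0, 6.0, 12.0],
    index=pd.date_range("2024-01-01", periods=3, freq="60min"),
)
every_40_min = interpolate_via_common_grid(hourly, "40min")
print(every_40_min)
```

With a 60-minute input and a 40-minute target, the common grid has a 20-minute step. As the comment already suspects, the sketch has no answer for an index without a regular step.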

@jbrockmendel
Member

is there a doctest this changes/fixes?

@cbpygit
Contributor Author

cbpygit commented Dec 20, 2023

> IIUC in the motivating case we have a series whose index has freq1 and a new freq2 that are not multiples of each other. Is what you're doing here similar to
>
> freq3 = LCD(freq1, freq2)
> result = series.resample(freq3).interpolate(...).asfreq(freq2)
>
> And if so, would that be more efficient/clearer than what you're doing here? I guess it wouldn't work if the original index doesn't have a freq? Or might blow up if the LCD happens to be tiny?

@jbrockmendel I don't think this is a "pathological" case, but in general, this type of resampling should be agnostic to what sort of index it is.

It is helpful to think of the input data as a "measurement" of a function $f(x)$ at some points $x_i$, and the resampling/interpolation as asking the question about function values at a "higher number" of points $x_j$. The code identifies this correctly as upsampling. The interpolator is a "piecewise assumed function" $g(x)$ that (hopefully) describes how the data behaves between adjacent points $x_i, x_{i+1}$. It is irrelevant whether the original sampling followed a precise frequency. Instead, it is important that the interpolator can take into account as many points in the vicinity of the new points $x_j$ as it needs to evaluate $g$. (In the simple linear case 2 adjacent points are enough, but for more complex interpolation more anchor points may be used.)

The problem with the previous implementation is that those anchor points were, more or less randomly, excluded from the interpolation process. All we need to do is make sure none of the original data points is missing because this would mean throwing away valid information. This is achieved by

result = concat(
    [result, obj.loc[missing_data_points_index]]
)

The remaining code is just gymnastics around this. I can help make this more clear in the code or here, but I don't think it can be simplified further.
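
A minimal standalone sketch of this idea, outside of `Resampler`, might look like the following. It is not the PR's code: `interpolate_keeping_anchors` is a hypothetical name, the target grid is passed in explicitly, and `method="time"` is chosen here only so that the uneven spacing of the combined index is respected.

```python
import pandas as pd


def interpolate_keeping_anchors(obj, target_index, method="time"):
    """Hypothetical helper illustrating the 'add the anchors back' step."""
    # Values on the target grid; grid points without data become NaN.
    result = obj.reindex(target_index)

    # Original points that are not on the target grid would otherwise be lost.
    missing_data_points_index = obj.index.difference(target_index)
    if len(missing_data_points_index) > 0:
        result = pd.concat(
            [result, obj.loc[missing_data_points_index]]
        ).sort_index()

    # Interpolate with all anchors present, then drop the off-grid points.
    return result.interpolate(method=method).loc[target_index]


# Data sampled every 45 minutes, requested on a 30-minute grid.
obj = pd.Series(
    [0.0, 90.0, 0.0],
    index=pd.date_range("2024-01-01", periods=3, freq="45min"),
)
grid = pd.date_range("2024-01-01 00:00", "2024-01-01 01:30", freq="30min")

with_anchors = interpolate_keeping_anchors(obj, grid)
without_anchors = obj.reindex(grid).interpolate(method="time")
print(with_anchors)
print(without_anchors)
```

In this example the peak at 00:45 is not on the 30-minute grid. Without the anchor step it is discarded and the result is flat; with it, the neighbouring grid points reflect the peak.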

@cbpygit
Contributor Author

cbpygit commented Dec 21, 2023

> FAILED pandas/tests/resample/test_time_grouper.py::test_groupby_resample_interpolate - TypeError: '<' not supported between instances of 'int' and 'tuple'
>
> Any idea what is causing this? Seems unlikely to be "expected is wrong" for this one (which i assume is the case for the other one)

@jbrockmendel I spent quite a while looking at this now. The problem here is caused by interpolation applied to a GroupBy object. I was not able to work out how to access or recreate, in Resampler.interpolate, the MultiIndex that the original data frame would have if it were not resampled 😞. The resampling happens somewhere in self._upsample, so in a sense there is a lack of information here. I think I would need help to make it work for the groupby case, and it would feel unnatural to support it for non-grouped but not for grouped operations.

@cbpygit
Contributor Author

cbpygit commented Dec 21, 2023

@jbrockmendel I took another look and added an implementation for multi-indexes (resulting from groupby). This fixes test_groupby_resample_interpolate, all tests in resample are green for me now locally.

@cbpygit
Contributor Author

cbpygit commented Dec 21, 2023

And regarding pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate, which seems to be the test that crashes in most of the other test suites: I don't see how this test was ever passing 🤷‍♂️. The input datetime_series has an index created using date_range with freq="B". So this index is not evenly spaced, as it has only business days. A linear interpolation will therefore never result in np.arange(len(datetime_series), dtype=float).
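
The effect of the weekend gaps can be reproduced with a small sketch. The series below is built by hand to resemble the test input (30 business days starting 2000-01-03 with values 0..29); which interpolation call the test itself makes is not shown in this thread, so the comparison of `method="linear"` (ignores the index) and `method="time"` (uses the timestamps) is only an illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the test's datetime_series (an assumption).
idx = pd.date_range("2000-01-03", periods=30, freq="B")
ts = pd.Series(np.arange(30, dtype=float), index=idx)

# Blank out one business week, as a gap to be filled.
gappy = ts.copy()
gappy.iloc[5:10] = np.nan

# "linear" treats the values as equally spaced and ignores the timestamps.
by_position = gappy.interpolate(method="linear")

# "time" weights by elapsed time, so the two weekends inside the gap count.
by_time = gappy.interpolate(method="time")

print(by_position.iloc[5:10].tolist())
print(by_time.iloc[5:10].tolist())
```

Between Friday 2000-01-07 (value 4) and Monday 2000-01-17 (value 10) there are 10 calendar days, so a time-aware interpolation rises by 0.6 per day and gives roughly 5.8, 6.4, 7.0, 7.6, 8.2 for the missing week rather than 5 to 9.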

@MarcoGorelli
Member

MarcoGorelli commented Dec 28, 2023

thanks for your PR

there's still some failing tests:

FAILED pandas/tests/series/methods/test_interpolate.py::TestSeriesInterpolateData::test_interpolate - AssertionError: Series are different

Series values are different (13.33333 %)
[index]: [2000-01-03T00:00:00.000000000, 2000-01-04T00:00:00.000000000, 2000-01-05T00:00:00.000000000, 2000-01-06T00:00:00.000000000, 2000-01-07T00:00:00.000000000, 2000-01-10T00:00:00.000000000, 2000-01-11T00:00:00.000000000, 2000-01-12T00:00:00.000000000, 2000-01-13T00:00:00.000000000, 2000-01-14T00:00:00.000000000, 2000-01-17T00:00:00.000000000, 2000-01-18T00:00:00.000000000, 2000-01-19T00:00:00.000000000, 2000-01-20T00:00:00.000000000, 2000-01-21T00:00:00.000000000, 2000-01-24T00:00:00.000000000, 2000-01-25T00:00:00.000000000, 2000-01-26T00:00:00.000000000, 2000-01-27T00:00:00.000000000, 2000-01-28T00:00:00.000000000, 2000-01-31T00:00:00.000000000, 2000-02-01T00:00:00.000000000, 2000-02-02T00:00:00.000000000, 2000-02-03T00:00:00.000000000, 2000-02-04T00:00:00.000000000, 2000-02-07T00:00:00.000000000, 2000-02-08T00:00:00.000000000, 2000-02-09T00:00:00.000000000, 2000-02-10T00:00:00.000000000, 2000-02-11T00:00:00.000000000]
[left]:  [0.0, 1.0, 2.0, 3.0, 4.0, 5.8, 6.4, 7.0, 7.6, 8.2, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
[right]: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
At positional index 5, first diff: 5.8 != 5.0
FAILED pandas/tests/series/methods/test_interpolate.py::TestSeriesInterpolateData::test_nan_irregular_index - AssertionError: Series are different

If these were wrong to begin with, could you please update them so they have the correct values and pass? That would need doing as part of this PR

CI would need to be green for this to be mergeable (furthermore, people are much more likely to review a PR which is green)

…terpolate` and the related explanation about consideration of anchor points when interpolating downsampled series with non-aligned result index.
Fixes assumption in `test_interp_basic_with_non_range_index`. If the index is [1, 2, 3, 5] and values are [1, 2, np.nan, 4], it is wrong to expect that interpolation will result in 3 for the missing value in case of linear interpolation. It will rather be 2.666...
… on series with datetime index using business days only (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate`).
… on irregular index (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_nan_irregular_index`).
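
The 2.666... figure quoted in the commit message above follows from interpolating along the index values: 2 + (3 - 2) / (5 - 2) * (4 - 2). A small sketch, using `method="index"` as an assumed index-aware variant (the test's actual call is not shown here):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0], index=[1, 2, 3, 5])

# Index-aware: the gap sits one third of the way from index 2 to index 5.
index_aware = s.interpolate(method="index")

# Position-based: the gap sits halfway between its two neighbours.
position_based = s.interpolate(method="linear")

print(index_aware.loc[3], position_based.loc[3])
```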
@cbpygit cbpygit force-pushed the fix/time-series-interpolation-is-wrong-21351 branch from 065129c to 0294464 Compare January 2, 2024 12:20
@cbpygit cbpygit requested a review from jbrockmendel January 2, 2024 16:50
@cbpygit
Contributor Author

cbpygit commented Jan 2, 2024

All green now @MarcoGorelli @jbrockmendel

@MarcoGorelli
Member

thanks - aiming to get to this this month

@cbpygit cbpygit requested a review from mroeschke April 5, 2024 07:29
@cbpygit
Contributor Author

cbpygit commented Apr 8, 2024

I don't think the failing test cases are related to my code @MarcoGorelli @mroeschke , what do I do in this case? It occurred after syncing with main.

Member

@MarcoGorelli MarcoGorelli left a comment

Looks good to me, thanks @cbpygit for having stuck with this one!

Leaving open a bit in case there's further comments

Member

@mroeschke mroeschke left a comment

Could you add a whatsnew note in v3.0.0.rst under the Other section of the Bug Fix group? Thanks for sticking with this, almost there!

@cbpygit cbpygit requested a review from mroeschke April 24, 2024 17:11
@mroeschke mroeschke added this to the 3.0 milestone Apr 24, 2024
@mroeschke mroeschke merged commit 4f7cb74 into pandas-dev:main Apr 24, 2024
45 of 46 checks passed
@mroeschke
Member

Thanks @cbpygit

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* fix: Fixes wrong doctest output in `pandas.core.resample.Resampler.interpolate` and the related explanation about consideration of anchor points when interpolating downsampled series with non-aligned result index.

* Resolved merge conflicts

* fix: Fixes wrong test case assumption for interpolation

Fixes assumption in `test_interp_basic_with_non_range_index`. If the index is [1, 2, 3, 5] and values are [1, 2, np.nan, 4], it is wrong to expect that interpolation will result in 3 for the missing value in case of linear interpolation. It will rather be 2.666...

* fix: Make sure frequency indexes are preserved with new interpolation approach

* fix: Fixes new-style up-sampling interpolation for MultiIndexes resulting from groupby-operations

* fix: Fixes wrong test case assumption when using linear interpolation on series with datetime index using business days only (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate`).

* fix: Fixes wrong test case assumption when using linear interpolation on irregular index (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_nan_irregular_index`).

* fix: Adds test skips for interpolation methods that require scipy if scipy is not installed

* fix: Makes sure the keyword argument "downcast" is not passed to scipy interpolation methods that are not using `interp1d` or spline.

* fix: Adjusted expected warning type in `test_groupby_resample_interpolate_off_grid`.

* fix: Fixes failing interpolation on groupby if the index has `name`=None. Adds this check to an existing test case.

* Trigger Actions

* feat: Raise error on attempt to interpolate a MultiIndex data frame, providing a useful error message that describes a working alternative syntax. Fixed related test cases and added test that makes sure the error is raised.

* Apply suggestions from code review

Co-authored-by: Matthew Roeschke

* refactor: Adjusted error type assertion in test case

* refactor: Removed unused parametrization definitions and switched to direct parametrization for interpolation methods in tests.

* fix: Adds forgotten "@" before pytest.mark.parametrize

* refactor: Apply suggestions from code review

* refactor: Switched to fixture params syntax for test case parametrization

* Update pandas/tests/resample/test_time_grouper.py

Co-authored-by: Matthew Roeschke

* Update pandas/tests/resample/test_base.py

Co-authored-by: Matthew Roeschke

* refactor: Fixes too long line

* tests: Fixes test that fails due to unimportant index name comparison

* docs: Added entry in whatsnew

* Empty-Commit

* Empty-Commit

* Empty-Commit

* docs: Sorted whatsnew

* docs: Adjusted bug fix note and moved it to the right section

---------

Co-authored-by: Marco Edward Gorelli
Co-authored-by: Matthew Roeschke
Labels
Error Reporting: Incorrect or improved errors from pandas
Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Development

Successfully merging this pull request may close these issues.

Time Series Interpolation is wrong
4 participants