Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more upstream-dev cftime failures #3751

Closed
dcherian opened this issue Feb 4, 2020 · 20 comments · Fixed by #3764
Closed

more upstream-dev cftime failures #3751

dcherian opened this issue Feb 4, 2020 · 20 comments · Fixed by #3764

Comments

@dcherian
Copy link
Contributor

dcherian commented Feb 4, 2020

https://dev.azure.com/xarray/xarray/_build/results?buildId=2116&view=logs&jobId=2280efed-fda1-53bd-9213-1fa8ec9b4fa8&j=2280efed-fda1-53bd-9213-1fa8ec9b4fa8&t=175181ee-1928-5a6b-f537-168f7a8b7c2d

46 failed tests but they all seem to result from the same TypeError

=================================== FAILURES ===================================
______________ test_sel_date_scalar_nearest[365_day-sel_kwargs0] _______________

da = <xarray.DataArray (time: 4)>
             0002-02-01 00:00:00],
            dtype='object')
sel_kwargs = {'method': 'nearest'}

    @requires_cftime
    @pytest.mark.parametrize(
        "sel_kwargs",
        [{"method": "nearest"}, {"method": "nearest", "tolerance": timedelta(days=70)}],
    )
    def test_sel_date_scalar_nearest(da, date_type, index, sel_kwargs):
        expected = xr.DataArray(2).assign_coords(time=index[1])
>       result = da.sel(time=date_type(1, 4, 1), **sel_kwargs)

xarray/tests/test_cftimeindex.py:460: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xarray/core/dataarray.py:1056: in sel
    ds = self._to_temp_dataset().sel(
xarray/core/dataset.py:2056: in sel
    pos_indexers, new_indexes = remap_label_indexers(
xarray/core/coordinates.py:391: in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
xarray/core/indexing.py:270: in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
xarray/core/indexing.py:189: in convert_label_indexer
    indexer = index.get_loc(
xarray/coding/cftimeindex.py:334: in get_loc
    return pd.Index.get_loc(self, key, method=method, tolerance=tolerance)
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/pandas/core/indexes/base.py:2899: in get_loc
    indexer = self.get_indexer([key], method=method, tolerance=tolerance)
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/pandas/core/indexes/base.py:2992: in get_indexer
    indexer = self._get_nearest_indexer(target, limit, tolerance)
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/pandas/core/indexes/base.py:3076: in _get_nearest_indexer
    left_distances = np.abs(self[left_indexer] - target)
xarray/coding/cftimeindex.py:444: in __sub__
    return CFTimeIndex(np.array(self) - other)
xarray/coding/cftimeindex.py:248: in __new__
    assert_all_valid_date_type(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data = TimedeltaIndex(['-59 days'], dtype='timedelta64[ns]', freq=None)

    def assert_all_valid_date_type(data):
        import cftime
    
        if len(data) > 0:
            sample = data[0]
            date_type = type(sample)
            if not isinstance(sample, cftime.datetime):
>               raise TypeError(
                    "CFTimeIndex requires cftime.datetime "
                    "objects. Got object of {}.".format(date_type)
                )
E               TypeError: CFTimeIndex requires cftime.datetime objects. Got object of <class 'pandas._libs.tslibs.timedeltas.Timedelta'>.

xarray/coding/cftimeindex.py:206: TypeError
@keewis keewis mentioned this issue Feb 5, 2020
3 tasks
@spencerkclark
Copy link
Member

spencerkclark commented Feb 6, 2020

Part of the hazard of using a pd.Index subclass I suppose...

It looks like pandas-dev/pandas#31511 was the cause of the issue:

$ git bisect good
64336ff8414f8977ff94adb9a5bc000a3a4ef454 is the first bad commit
commit 64336ff8414f8977ff94adb9a5bc000a3a4ef454
Author: Kevin Anderson <[email protected]>
Date:   Sun Feb 2 20:48:28 2020 -0700

    BUG: fix reindexing with a tz-aware index and method='nearest' (#31511)

 doc/source/whatsnew/v1.1.0.rst               |  2 +-
 pandas/core/indexes/base.py                  |  5 ++---
 pandas/tests/frame/indexing/test_indexing.py | 10 ++++++++++
 3 files changed, 13 insertions(+), 4 deletions(-)

A way to fix this upstream would be to make sure that target has the same type as the index (here it is a generic pd.Index instead of a CFTimeIndex), but I'm not sure how hard that would be (or if it makes sense in all cases):
https://github.com/pandas-dev/pandas/blob/a2a35a86c4064d297c8b48ecfea80e9f05e27712/pandas/core/indexes/base.py#L3080

I think it's possible we could work around this in xarray. It comes down to properly recognizing what to do when you subtract a generic pd.Index of cftime.datetime objects from a CFTimeIndex. Previously this code in pandas operated strictly using NumPy arrays, so there was no casting issue when doing the subtraction.

@dcherian
Copy link
Contributor Author

dcherian commented Feb 6, 2020

Thanks for narrowing it down @spencerkclark . Let's see what @TomAugspurger thinks about an xarray workaround vs a pandas fix.

@TomAugspurger
Copy link
Contributor

FWIW, I think @jbrockmendel is still progressing on an "extension index" interface where you could have a custom dtype / Index subclass that would be properly supported. Long-term, that's the best solution.

Short-term, I'm less sure what's best.

@jbrockmendel
Copy link

xarray/coding/cftimeindex.py:444: in __sub__
    return CFTimeIndex(np.array(self) - other)

any idea what other is here? looks like it might be a DatetimeIndex

@spencerkclark
Copy link
Member

FWIW, I think @jbrockmendel is still progressing on an "extension index" interface where you could have a custom dtype / Index subclass that would be properly supported. Long-term, that's the best solution.

Nice -- I look forward to being able to try that out!

any idea what other is here? looks like it might be a DatetimeIndex

@jbrockmendel agreed that it's unclear -- we probably should have written the code for that method in a clearer way. I think it's mainly used for subtracting a single datetime.timedelta or NumPy array of datetime.timedelta objects from a CFTimeIndex. In both of those cases we would expect the result to remain a CFTimeIndex.

cftime.datetime objects often represent dates from non-standard calendars (e.g. calendars with no leap year, or calendars where all months have 30 days), so in general they are not compatible with the dates used in a DatetimeIndex. Subtracting like-calendar cftime.datetime objects is fair game though, and we'd like that to produce timedeltas.

@keewis
Copy link
Collaborator

keewis commented Feb 7, 2020

when the tests fail, other is a pandas.Index containing whatever has been used to index. The subtraction results in a normal TimedeltaIndex which is then passed to the CFTimeIndex:

np.array(self) = array([cftime.DatetimeNoLeap(0001-02-01 00:00:00)], dtype=object)
other = Index([0001-05-01 00:00:00], dtype='object')
other[0] = cftime.DatetimeNoLeap(0001-05-01 00:00:00)
np.array(self) - other = TimedeltaIndex(['-89 days'], dtype='timedelta64[ns]', freq=None)

I think the issue here is that other is a pandas.Index instead of a CFTimeIndex.

@spencerkclark
Copy link
Member

I think the issue here is that other is a pandas.Index instead of a CFTimeIndex.

Yes, I noted that in my original post (sorry if that wasn't clear).

@keewis
Copy link
Collaborator

keewis commented Feb 7, 2020

no, that's my bad, it is pretty clear but I seem to have skipped over it

@spencerkclark
Copy link
Member

Another kind of failure came up in the context of indexing a Series with a cftime.datetime object:

Example failure

____________________________ test_indexing_in_series_getitem[365_day] _____________________________

series = 0001-01-01 00:00:00    1
0001-02-01 00:00:00    2
0002-01-01 00:00:00    3
0002-02-01 00:00:00    4
dtype: int64
index = CFTimeIndex([0001-01-01 00:00:00, 0001-02-01 00:00:00, 0002-01-01 00:00:00,
             0002-02-01 00:00:00],
            dtype='object')
scalar_args = [cftime.DatetimeNoLeap(0001-01-01 00:00:00)]
range_args = ['0001', slice('0001-01-01', '0001-12-30', None), slice(None, '0001-12-30', None), slice(cftime.DatetimeNoLeap(0001-01...:00), cftime.DatetimeNoLeap(0001-12-30 00:00:00), None), slice(None, cftime.DatetimeNoLeap(0001-12-30 00:00:00), None)]

    @requires_cftime
    def test_indexing_in_series_getitem(series, index, scalar_args, range_args):
        for arg in scalar_args:
>           assert series[arg] == 1

test_cftimeindex.py:597:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../pandas/pandas/core/series.py:884: in __getitem__
    return self._get_with(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 0001-01-01 00:00:00    1
0001-02-01 00:00:00    2
0002-01-01 00:00:00    3
0002-02-01 00:00:00    4
dtype: int64
key = cftime.DatetimeNoLeap(0001-01-01 00:00:00)

    def _get_with(self, key):
        # other: fancy integer or otherwise
        if isinstance(key, slice):
            # _convert_slice_indexer to determing if this slice is positional
            #  or label based, and if the latter, convert to positional
            slobj = self.index._convert_slice_indexer(key, kind="getitem")
            return self._slice(slobj)
        elif isinstance(key, ABCDataFrame):
            raise TypeError(
                "Indexing a Series with DataFrame is not "
                "supported, use the appropriate DataFrame column"
            )
        elif isinstance(key, tuple):
            try:
                return self._get_values_tuple(key)
            except ValueError:
                # if we don't have a MultiIndex, we may still be able to handle
                #  a 1-tuple.  see test_1tuple_without_multiindex
                if len(key) == 1:
                    key = key[0]
                    if isinstance(key, slice):
                        return self._get_values(key)
                raise

        if not isinstance(key, (list, np.ndarray, ExtensionArray, Series, Index)):
>           key = list(key)
E           TypeError: 'cftime._cftime.DatetimeNoLeap' object is not iterable

../../../pandas/pandas/core/series.py:911: TypeError

Admittedly I think most people probably use a CFTimeIndex within xarray data structures, but it would be nice to maintain some ability to use it in pandas data structures too. This issue stems from the changes made in pandas-dev/pandas#31399. I think the problem is that pandas.core.dtypes.common.is_scalar returns False for a cftime.datetime object:

In [1]: import cftime

In [2]: import pandas

In [3]: pandas.core.dtypes.common.is_scalar(cftime.DatetimeNoLeap(2000, 1, 1))
Out[3]: False

Could there be a simple upstream fix for this?

@jbrockmendel
Copy link

Could there be a simple upstream fix for this?

Yah, pandas recently added a check for is_scalar(key) in Series.__getitem__ to try to avoid some unnecessary lookups. That will need to be changed to accommodate unexpected scalars.

spencerkclark added a commit to spencerkclark/xarray that referenced this issue Feb 28, 2020
@spencerkclark
Copy link
Member

@jbrockmendel @TomAugspurger it turns out that fixing indexing with the "nearest" method without overriding private methods of pandas.Index is harder than I expected within xarray alone (see #3764 (comment) for more details). Do you think an upstream fix would be acceptable here? My understanding is that the issue that prompted the change in pandas only pertained to DatetimeIndexes (or perhaps maybe also PeriodIndexes in the future); would it be possible to limit the scope of the updates in pandas-dev/pandas#31511 to just those?

max-sixty pushed a commit that referenced this issue Feb 28, 2020
@jbrockmendel
Copy link

Do you think an upstream fix would be acceptable here?

Definitely. Is it still the case that the identified problems would all be solved by having is_scalar recognize cftime.datetime?

@spencerkclark
Copy link
Member

Thanks @jbrockmendel -- I think there are two separate issues:

@jbrockmendel
Copy link

@spencerkclark can you open an issue on the pandas tracker about this and ping me there; I dont want this to fall off my radar

@spencerkclark
Copy link
Member

Thanks @jbrockmendel -- I'll try to do that this weekend.

@spencerkclark
Copy link
Member

Thanks so much @jbrockmendel for looking into the __getitem__ issue upstream. That should be the last of the issues that remains from this thread. As you probably noticed, we ended up merging #3764, which fixed indexing with the "nearest" method.

Once pandas-dev/pandas#32684 is merged, we should be able to un-xfail the Series __getitem__ tests on our end.

@jbrockmendel
Copy link

We're making progress on pandas-dev/pandas#32684, I'd also like to see if we can do an in-pandas fix to avoid the need to override more things here (i.e. #3764). Seems like that will lead to fewer headaches long-term.

Since pandas-dev/pandas#31511, the method has been updated to remove an np.asarray call, and I'm curious if that is enough to make #3764 unnecessary. The only remaining difference AFAICT is np.abs vs abs, which seems like it shouldnt be too tough to resolve.

@spencerkclark
Copy link
Member

spencerkclark commented Mar 21, 2020

Thanks @jbrockmendel; it's great to see that pandas-dev/pandas#32684 was merged.

Regarding #3764, I gave things a try with pandas master and removing our overrides to _get_nearest_indexer and _filter_indexer_tolerance. I got similar failures to what we were getting before.

Example failure

_______________________________ test_sel_date_scalar_nearest[proleptic_gregorian-sel_kwargs2] _______________________________

da = <xarray.DataArray (time: 4)>
array([1, 2, 3, 4])
Coordinates:
  * time     (time) object 0001-01-01 00:00:00 ... 0002-02-01 00:00:00
date_type = <class 'cftime._cftime.DatetimeProlepticGregorian'>
index = CFTimeIndex([0001-01-01 00:00:00, 0001-02-01 00:00:00, 0002-01-01 00:00:00,
             0002-02-01 00:00:00],
            dtype='object')
sel_kwargs = {'method': 'nearest', 'tolerance': datetime.timedelta(days=1800000)}

    @requires_cftime
    @pytest.mark.parametrize(
        "sel_kwargs",
        [
            {"method": "nearest"},
            {"method": "nearest", "tolerance": timedelta(days=70)},
            {"method": "nearest", "tolerance": timedelta(days=1800000)},
        ],
    )
    def test_sel_date_scalar_nearest(da, date_type, index, sel_kwargs):
        expected = xr.DataArray(2).assign_coords(time=index[1])
>       result = da.sel(time=date_type(1, 4, 1), **sel_kwargs)

test_cftimeindex.py:471:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../core/dataarray.py:1061: in sel
    **indexers_kwargs,
../core/dataset.py:2066: in sel
    self, indexers=indexers, method=method, tolerance=tolerance
../core/coordinates.py:392: in remap_label_indexers
    obj, v_indexers, method=method, tolerance=tolerance
../core/indexing.py:270: in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
../core/indexing.py:190: in convert_label_indexer
    label.item(), method=method, tolerance=tolerance
../coding/cftimeindex.py:365: in get_loc
    return pd.Index.get_loc(self, key, method=method, tolerance=tolerance)
../../../pandas/pandas/core/indexes/base.py:2874: in get_loc
    indexer = self.get_indexer([key], method=method, tolerance=tolerance)
../../../pandas/pandas/core/indexes/base.py:2967: in get_indexer
    indexer = self._get_nearest_indexer(target, limit, tolerance)
../../../pandas/pandas/core/indexes/base.py:3062: in _get_nearest_indexer
    indexer = self._filter_indexer_tolerance(target, indexer, tolerance)
../../../pandas/pandas/core/indexes/base.py:3069: in _filter_indexer_tolerance
    indexer = np.where(distance <= tolerance, indexer, -1)
../../../pandas/pandas/core/indexes/extension.py:129: in wrapper
    return op(other)
../../../pandas/pandas/core/ops/common.py:63: in new_method
    return method(self, other)
../../../pandas/pandas/core/arrays/datetimelike.py:75: in wrapper
    other = self._scalar_type(other)
pandas/_libs/tslibs/timedeltas.pyx:1233: in pandas._libs.tslibs.timedeltas.Timedelta.__new__
    ???
pandas/_libs/tslibs/timedeltas.pyx:209: in pandas._libs.tslibs.timedeltas.convert_to_timedelta64
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C long

pandas/_libs/tslibs/timedeltas.pyx:154: OverflowError

In my testing, I can only get things to work if the argument to abs (or np.abs) is a NumPy array (instead of an Index). An upstream change like this would work (it still maintains the behavior desired from pandas-dev/pandas#31511), but I'm not sure how palatable it would be.

@jbrockmendel
Copy link

Can you open a pandas PR for the cftime-nearest-fix branch, and add a test in tests/test_downstream.py for the problematic behavior? That branch won't be merged as-is, but it will be easier to discuss options in-line.

@spencerkclark
Copy link
Member

Thanks @jbrockmendel. I didn't realize you had a few downstream tests; that's great. See pandas-dev/pandas#32905.

jreback pushed a commit to pandas-dev/pandas that referenced this issue Mar 26, 2020
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this issue May 6, 2020
simonjayhawkins added a commit to pandas-dev/pandas that referenced this issue May 6, 2020
clrpackages pushed a commit to clearlinux-pkgs/pandas that referenced this issue May 30, 2020
MeeseeksMachine (10):
      Backport PR #32833: DOC: FutureWarning in Sphinx build when calling read_parquet (#32847)
      Backport PR #32827: DOC: Fixed contributors for bugfix releases (#32928)
      Backport PR #33566: CI: Fix jedi deprecation warning for 0.17.0 on IPython (#33972)
      Backport PR #33080: CI troubleshoot azure (#33973)
      Backport PR #33309: DOC: include Offset.__call__ to autosummary to fix sphinx warning (#33975)
      Backport PR #31146: Remove possibly illegal test data (#33976)
      Backport PR #33241: tostring->tobytes (#33993)
      Backport PR #33954: DOC: update conf.py to use new numpy doc url (#33996)
      Backport PR #33102: PERF: fix performance regression in memory_usage(deep=True) for object dtype (#33157)
      Backport PR #33968: CI: Bump numpydev URL (#34250)

MomIsBestFriend (1):
      Backport PR #32840 on branch 1.0.x (DOC: use new pydata-sphinx-theme name) (#32848)

Pandas Development Team (1):
      RLS: 1.0.4

Simon Hawkins (24):
      CI: test_unsupported_other fails on pyarrow 0.17 (#33990)
      CI/DEP: Use numba.extending.is_jitted for numba version > 0.49.0 (#33994)
      DOC: start 1.0.4 (#33970)
      BLD: recursive inclusion of DLLs in package data (#33246) (#33995)
      Backport PR #33462 on branch 1.0.x (BUG: None converted to NaN after groupby first and last) (#33998)
      Backport PR #33761 on branch 1.0.x (REGR: fix DataFrame reduction with EA columns and numeric_only=True) (#34000)
      Backport PR #33292 on branch 1.0.x (REGR: Fix bug when replacing categorical value with self) (#34004)
      Backport PR #32870 on branch 1.0.x (DOC: Remove latest whatsnew from header) (#34003)
      Backport PR #33629 on branch 1.0.x (BUG: Fix Categorical use_inf_as_na bug) (#34001)
      release note for #33102 (#34005)
      Backport PR #33513 on branch 1.0.x  (BUG: Fix Categorical.min / max bug) (#34022)
      Backport PR #32905 on branch 1.0.x (Fix to _get_nearest_indexer for pydata/xarray#3751) (#34025)
      Backport PR #33693 on branch 1.0.x (BUG: Fix memory issues in rolling.min/max) (#34027)
      Backport PR #33089 on branch 1.0.x (BUG: Don't cast nullable Boolean to float in groupby) (#34023)
      Backport PR #33645, #33632 and #34087 on branch 1.0.x (#34173)
      Backport PR #34048 on branch 1.0.x (Bug in DataFrame.replace casts columns to ``object`` dtype if items in ``to_replace`` not in values) (#34115)
      REGR: exceptions not caught in _call_map_locations (#34113)
      Backport PR #33983 on branch 1.0.x (BUG: Use args and kwargs in Rolling.apply) (#34190)
      Backport PR #34049 on branch 1.0.x (Bug in Series.groupby would raise ValueError when grouping by PeriodIndex level) (#34247)
      Backport PR #34053 on branch 1.0.x (more informative error message with np.min or np.max on unordered Categorical) (#34246)
      Backport PR #32479 on branch 1.0.x (BUG: Fix issue with datetime[ns, tz] input in Block.setitem) (#34369)
      Backport PR #33644 on branch 1.0.x  (BUG: Groupby quantiles incorrect bins) (#34382)
      DOC: intersphinx inventory link for statsmodels (#34424)
      DOC: 1.0.4 release notes and date (#34425)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants