Improve typehints of xr.Dataset.__getitem__ #4144

Merged: 7 commits, Jun 15, 2020

nbren12
Contributor

@nbren12 nbren12 commented Jun 10, 2020

To resolve some common type-related errors, this PR adds some overload type hints to Dataset.__getitem__. Now mypy can correctly infer that hashable inputs return DataArrays.

@pep8speaks

pep8speaks commented Jun 10, 2020

Hello @nbren12! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-13 17:43:56 UTC

Sadly this is not working with my version of mypy. See python/mypy#7328
@mathause
Collaborator

The mypy check throws an error:

    xarray/core/dataset.py:1250: error: Overloaded function signature 2 will never be matched: signature 1's parameter type(s) are the same

I think you can ignore the other failures. Should that be:

    @overload
    def __getitem__(self, key: Hashable) -> DataArray:
        ...

    @overload
    def __getitem__(self, key: Iterable[Hashable]) -> "Dataset":
        ...

? Only guessing, though - it was Any so there may be more options. @crusaderky

@nbren12
Contributor Author

nbren12 commented Jun 12, 2020

@mathause On further consideration, I think it might not be possible to get this to work. This method has three behaviors:

  • Mapping -> Dataset
  • Hashable -> DataArray
  • else (List): -> Dataset

With my limited understanding of mypy, I think that any two of these are supported by overload, but I'm not sure it's possible to support all three. I tried several different options, but maybe I am missing something.
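A quick runtime check may help show why these three cases are hard to separate with overloads: the abstract key types are not mutually exclusive. This is only a sketch using the standard library ABCs; `categories` is a hypothetical helper, not part of xarray.

```python
from collections.abc import Hashable, Iterable, Mapping


def categories(key):
    """Return the names of the abstract key types that `key` satisfies."""
    checks = (("Hashable", Hashable), ("Iterable", Iterable), ("Mapping", Mapping))
    return sorted(name for name, abc in checks if isinstance(key, abc))


print(categories("a"))         # a str is both Hashable and Iterable
print(categories(("a", "b")))  # so is a tuple
print(categories(["a", "b"]))  # a list is Iterable but not Hashable
print(categories({"x": 0}))    # a dict is an Iterable Mapping, not Hashable
```

Because a string or tuple matches both Hashable and Iterable, an overload pair keyed on those two types overlaps, and a user-defined class can be both Hashable and a Mapping.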

Would a good middle ground be something like this?

  • Hashable -> DataArray
  • Any -> Union[DataArray, Dataset]

I think this would work since both the input/outputs of the first one are subtypes of the second one. It's not a complete solution, but it would solve the most common problem of ds['a'] returning a union type rather than a DataArray.
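The middle ground described above could be sketched roughly as follows. `Column` and `Frame` are hypothetical stand-ins for DataArray and Dataset, and the runtime dispatch is an assumption made for illustration, not xarray's implementation.

```python
from typing import Any, Dict, Hashable, List, Union, overload


class Column:
    """Hypothetical stand-in for DataArray: a single named column."""

    def __init__(self, name: Hashable, values: List[float]) -> None:
        self.name = name
        self.values = values


class Frame:
    """Hypothetical stand-in for Dataset: a collection of named columns."""

    def __init__(self, columns: Dict[Hashable, List[float]]) -> None:
        self._columns = dict(columns)

    # Narrow case first: a hashable key always yields a single column.
    @overload
    def __getitem__(self, key: Hashable) -> Column:
        ...

    # Fallback: anything else keeps the broad union return type.
    @overload
    def __getitem__(self, key: Any) -> Union[Column, "Frame"]:
        ...

    def __getitem__(self, key):
        try:
            hash(key)
        except TypeError:
            # Unhashable key, e.g. a list of names: return a sub-frame.
            return Frame({name: self._columns[name] for name in key})
        return Column(key, self._columns[key])


frame = Frame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
single = frame["a"]      # intended to be inferred as Column
subset = frame[["a"]]    # intended to be inferred as Union[Column, Frame]
```

The first overload's return type is a member of the second overload's union, which is the property the comment above relies on.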

nbren12 added 3 commits June 11, 2020 22:18
Given mypy's use of overloads, I think this is all we can do. If the argument is not Hashable, then return the Union type as before.
@nbren12 nbren12 marked this pull request as ready for review June 12, 2020 05:46
@nbren12
Contributor Author

nbren12 commented Jun 12, 2020

Okay. Assuming the tests pass, I think this is ready for review. I tried adding a test, but mypy didn't seem to find problems even with code that I know doesn't work (e.g. 'a'+ 1). Is there some strategy for testing tricky type hints like this?
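One possible strategy, offered as a sketch rather than an established xarray practice, is to run mypy programmatically on a small snippet and assert on the `Revealed type` notes it prints. `revealed_types` is a hypothetical helper; `mypy.api.run` is mypy's programmatic entry point, and the helper returns `None` when mypy is not installed.

```python
import importlib.util
from typing import List, Optional


def revealed_types(source: str) -> Optional[List[str]]:
    """Type-check `source` with mypy and return its 'Revealed type' notes.

    Returns None when mypy is not installed, so callers can skip the check.
    """
    if importlib.util.find_spec("mypy") is None:
        return None
    from mypy import api  # imported lazily: mypy is treated as optional here

    # api.run takes command-line arguments and returns
    # (stdout, stderr, exit_status).
    stdout, _stderr, _status = api.run(["--no-incremental", "-c", source])
    return [line for line in stdout.splitlines() if "Revealed type is" in line]


notes = revealed_types("x = 1\nreveal_type(x)\n")
if notes is not None:
    # A test can assert on the inferred type instead of eyeballing the output.
    assert any("builtins.int" in note for note in notes)
```

A test along these lines could feed in `ds["a"]` and `ds[["a"]]` and check that the revealed types mention DataArray and Dataset respectively.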

In any case, this code does work:

$ cat test_mypy.py
import xarray as xr
ds = xr.Dataset({"a": ()})


arr = ds['a']
union_obj = ds[['a']]

reveal_locals()
$ mypy test_mypy.py
test_mypy.py:8: note: Revealed local types are:
test_mypy.py:8: note:     arr: xarray.core.dataarray.DataArray
test_mypy.py:8: note:     ds: xarray.core.dataset.Dataset
test_mypy.py:8: note:     union_obj: Union[xarray.core.dataarray.DataArray, xarray.core.dataset.Dataset]

@mathause
Collaborator

Seems this was already discussed in GH3210 (comment) and see also the TODO:

xarray/xarray/core/dataset.py

Lines 1244 to 1250 in 8f688ea

    def __getitem__(self, key: Any) -> "Union[DataArray, Dataset]":
        """Access variables or coordinates this dataset as a
        :py:class:`~xarray.DataArray`.
        Indexing with a list of names will return a new ``Dataset`` object.
        """
        # TODO(shoyer): type this properly: https://github.com/python/mypy/issues/7328

(although it is not entirely clear to me whether this is actually fixed or not)

@max-sixty
Collaborator

Nice find @mathause ; I remember that discussion now.

As @nbren12 said, this doesn't go all the way given mypy's restrictions, but seems like a dominant improvement.

Test failures are unrelated.

Thanks @nbren12 !

@nbren12
Contributor Author

nbren12 commented Jun 12, 2020

No problem! I think I am done with this one unless you think it's important that I document or test this somehow. Can someone review it?

@crusaderky
Contributor

I took the liberty to rework it, please have a look
Test script:

from typing import Hashable, Mapping
import xarray
ds: xarray.Dataset


class D(Hashable, Mapping):
    def __hash__(self): ...
    def __getitem__(self, item): ...
    def __iter__(self): ...
    def __len__(self): ...

reveal_type(ds["foo"])
reveal_type(ds[["foo", "bar"]])
reveal_type(ds[{}])
reveal_type(ds[D()])

mypy output:

t1.py:12: note: Revealed type is 'xarray.core.dataarray.DataArray'
t1.py:13: note: Revealed type is 'xarray.core.dataset.Dataset'
t1.py:14: note: Revealed type is 'xarray.core.dataset.Dataset'
t1.py:15: note: Revealed type is 'xarray.core.dataset.Dataset'

@crusaderky crusaderky self-requested a review June 13, 2020 17:44
@@ -1241,13 +1242,25 @@ def loc(self) -> _LocIndexer:
"""
return _LocIndexer(self)

def __getitem__(self, key: Any) -> "Union[DataArray, Dataset]":
# FIXME https://github.com/python/mypy/issues/7328
Member

It looks like this is fixed now and can be removed? Or perhaps we move it down to just above the third @overload and add a comment that Any means list?

Contributor

I don't think it's fixed. Specifically, mypy can't handle a signature that overloads on both Mapping and Hashable. Curiously, however, once you add # type: ignore to the overloaded signature, the actual type inspection works just fine (see my test script above).
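As a rough illustration of the workaround described here, three overloads with `# type: ignore` on the overlapping signatures might look like the sketch below. `Table` is a hypothetical stand-in for Dataset, a plain list stands in for DataArray, and both the placement of the ignores and the runtime dispatch are assumptions rather than the merged code.

```python
from collections.abc import Hashable, Mapping
from typing import Any, Dict, List, overload


class Table:
    """Hypothetical stand-in for a Dataset-like container of named columns."""

    def __init__(self, columns: Dict[Hashable, List[float]]) -> None:
        self._columns = dict(columns)

    # Assumption: the ignores silence mypy's complaint that Mapping and
    # Hashable can overlap (one class may be both) with different returns.
    @overload
    def __getitem__(self, key: Mapping) -> "Table":  # type: ignore
        ...

    @overload
    def __getitem__(self, key: Hashable) -> List[float]:  # type: ignore
        ...

    # Here Any effectively means "a list of column names".
    @overload
    def __getitem__(self, key: Any) -> "Table":
        ...

    def __getitem__(self, key):
        if isinstance(key, Mapping):
            # {column: slice}: slice the named columns, keep the rest whole.
            return Table(
                {
                    name: values[key.get(name, slice(None))]
                    for name, values in self._columns.items()
                }
            )
        try:
            hash(key)
        except TypeError:
            # Unhashable key, e.g. a list of names: select those columns.
            return Table({name: self._columns[name] for name in key})
        return self._columns[key]


table = Table({"a": [1.0, 2.0], "b": [3.0, 4.0]})
column = table["a"]
```

The Mapping overload comes first so that a key that is both a Mapping and Hashable resolves to the container type, matching the order of the runtime checks.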

@crusaderky crusaderky merged commit bc5c79e into pydata:master Jun 15, 2020
@nbren12
Contributor Author

nbren12 commented Jun 15, 2020

@crusaderky Thanks for the re-work. For my own benefit, could you explain why that code worked? I remember writing something very similar, and running into mypy errors. My understanding of how mypy interprets overload seems incomplete.

@crusaderky
Contributor

@nbren12 it seems to me that mypy is being overly aggressive when checking the overloaded definitions themselves (which is why I had to put # type: ignore on them), but it is more lax when the same method is called from somewhere else, as in my test script. Overall I suspect it may be fragile and break in future mypy versions...

@nbren12 nbren12 deleted the feature/overloaded-typehint branch June 17, 2020 01:41
dcherian added a commit to TomNicholas/xarray that referenced this pull request Jun 24, 2020
dcherian added a commit to raphaeldussin/xarray that referenced this pull request Jul 1, 2020
Successfully merging this pull request may close these issues.

Improving typing of xr.Dataset.__getitem__
6 participants