Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-string dimension/variable names #2292

Closed
joshburkart opened this issue Jul 16, 2018 · 20 comments
Closed

Support non-string dimension/variable names #2292

joshburkart opened this issue Jul 16, 2018 · 20 comments

Comments

@joshburkart
Copy link

Problem description

Currently, it appears that "dimension"/"coordinate" labels must be strings. However, in more rigorous software engineering applications it is often desirable to use something more organized/structured for labels, e.g. enums. I think it would be great if xarray supported this.

Obviously storing to e.g. NetCDF necessitates string-valued field names, so I would think calling str could be appropriate when performing this sort of serialization. This is what pandas seems to do (see below). But I imagine there might be other issues that would need to be resolved to do what I'm suggesting...?

Code sample

import enum

import numpy as np
import pandas as pd
import xarray as xr

class CoordId(enum.Enum):
    LAT = 'lat'
    LON = 'lon'

pd.DataFrame({CoordId.LAT: [1,2,3]}).to_csv()
# Returns: ',CoordId.LAT\n0,1\n1,2\n2,3\n'

xr.DataArray(
    data=np.arange(3 * 2).reshape(3, 2),
    coords={CoordId.LAT: [1, 2, 3], CoordId.LON: [7, 8]},
    dims=[CoordId.LAT, CoordId.LON],
)
# Fails: TypeError: dimension CoordId.LAT is not a string

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 4.15.0-1010-gcp machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: en_US.UTF-8

xarray: 0.10.7
pandas: 0.23.1
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.3.1
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.1
cartopy: 0.16.0
seaborn: None
setuptools: 39.2.0
pip: 9.0.1
conda: None
pytest: 3.6.1
IPython: 6.4.0
sphinx: None

@shoyer
Copy link
Member

shoyer commented Jul 16, 2018

Hi @joshburkart -- thanks for raising this concern.

I agree, it would be nice to support enums (really any hashable value) as dimension names. Our current checks for strings are somewhat inconsistent, e.g., you can actually use these in an xarray.Dataset if you use the fully explicit API for constructing a dataset:

ds = xr.Dataset(
    data_vars={'foo': ((CoordId.LAT, CoordId.LON), np.arange(3 * 2).reshape(3, 2))},
    coords={CoordId.LAT: ((CoordId.LAT,), [1, 2, 3]),
            CoordId.LON: ((CoordId.LON,), [7, 8])},
)

But now if you try to print the dataset, you get an error about sorting:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

/usr/local/lib/python3.6/dist-packages/IPython/lib/pretty.py in pretty(self, obj)
    396                             if callable(meth):
    397                                 return meth(obj, self, cycle)
--> 398             return _default_pprint(obj, self, cycle)
    399         finally:
    400             self.end_group()

/usr/local/lib/python3.6/dist-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    516     if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    517         # A user-provided repr. Find newlines and replace them with p.break_()
--> 518         _repr_pprint(obj, p, cycle)
    519         return
    520     p.begin_group(1, '<')

/usr/local/lib/python3.6/dist-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    707     """A pprint that just redirects to the normal repr function."""
    708     # Find newlines and replace them with p.break_()
--> 709     output = repr(obj)
    710     for idx,output_line in enumerate(output.splitlines()):
    711         if idx:

/usr/local/lib/python3.6/dist-packages/xarray/core/formatting.py in __repr__(self)
     62 
     63     def __repr__(self):
---> 64         return ensure_valid_repr(self.__unicode__())
     65 
     66 

/usr/local/lib/python3.6/dist-packages/xarray/core/dataset.py in __unicode__(self)
   1188 
   1189     def __unicode__(self):
-> 1190         return formatting.dataset_repr(self)
   1191 
   1192     def info(self, buf=None):

/usr/local/lib/python3.6/dist-packages/xarray/core/formatting.py in dataset_repr(ds)
    415 
    416     dims_start = pretty_print(u'Dimensions:', col_width)
--> 417     summary.append(u'%s(%s)' % (dims_start, dim_summary(ds)))
    418 
    419     if ds.coords:

/usr/local/lib/python3.6/dist-packages/xarray/core/formatting.py in dim_summary(obj)
    324 
    325 def dim_summary(obj):
--> 326     elements = [u'%s: %s' % (k, v) for k, v in obj.sizes.items()]
    327     return u', '.join(elements)
    328 

/usr/local/lib/python3.6/dist-packages/xarray/core/formatting.py in <listcomp>(.0)
    324 
    325 def dim_summary(obj):
--> 326     elements = [u'%s: %s' % (k, v) for k, v in obj.sizes.items()]
    327     return u', '.join(elements)
    328 

/usr/lib/python3.6/_collections_abc.py in __iter__(self)
    741 
    742     def __iter__(self):
--> 743         for key in self._mapping:
    744             yield (key, self._mapping[key])
    745 

/usr/local/lib/python3.6/dist-packages/xarray/core/utils.py in __iter__(self)
    311 
    312     def __iter__(self):
--> 313         return iter(self.mapping)
    314 
    315     def __len__(self):

/usr/local/lib/python3.6/dist-packages/xarray/core/utils.py in __iter__(self)
    347 
    348     def __iter__(self):
--> 349         return iter(sorted(self.mapping))
    350 
    351     def __len__(self):

TypeError: '<' not supported between instances of 'CoordId' and 'CoordId'

I would be open to PRs to improve the situation.

@joshburkart
Copy link
Author

Thanks @shoyer. I'll see if I can take a look in the near future...

@shoyer
Copy link
Member

shoyer commented Jul 16, 2018

It would be an improvement even to clearly document the requirements for dimension/variable names.

I suspect we don't actually need them to be sortable, though we do using sorting as part of the current repr() for some xarray objecs. This is mostly to ensure reproducible displays across multiple loads/runs of a file, but it's increasingly less relevant now than Python's dict is ordered by default (since Python 3.6).

@shoyer
Copy link
Member

shoyer commented Jul 19, 2018

Another choice would be to intentionally simplify xarray's data model and not allow anything other than strings for variable/dimension names.

@joshburkart
Copy link
Author

Some options that come to mind:

  1. Allow any object with a __str__ method to be supplied as a variable/dimension label, but then delegate all internal sorting/printing/etc. logic to str(label).
  2. Just implicitly run str on everything a user tries to input as a label (both when creating an xarray object and when accessing fields from an existing object), so that only strings are used internally.
  3. Put this on the user, and only allow calling xarray objects/methods/etc. with labels already strings, as you suggested @shoyer.

I dunno. Whatever the maintainers think is best? (3) seems least complex on the xarray side, but (1) or (2) might be more convenient for users.

@shoyer shoyer changed the title Support non-string dim/coord labels Support non-string dimension/variable names Jul 19, 2018
@shoyer
Copy link
Member

shoyer commented Jul 19, 2018

(2) (1) seems like a pretty decent option to me. It's compatible with Python's duck-typing philosophy, and we don't really need string variable/dimension names for anything other than various serialization formats like netCDF. So the full requirement for names would be "Hashable, can be coerced with str() and not None" (we use None as a sentinel value to indicate the lack of a name in xarray).

CC @pydata/xarray in case anyone else has opinions.

@joshburkart
Copy link
Author

Just to clarify @shoyer, you said (2) sounds best to you, but your other comments (e.g. duck typing, requiring hashable) seem to describe (1)...? Slightly confused...

@shoyer
Copy link
Member

shoyer commented Jul 20, 2018

Oops, I was reading a little too quickly. I did indeed mean (1) above. The one thing I would emphasize is that we don't actually want to check for something like hasattr(dim, '__str__') if possible, but rather just call str(dim). (Though I guess Python's object type defines a default __str__ method, so pretty much everything will pass that test.)

@ttung
Copy link
Contributor

ttung commented Aug 2, 2018

We're using xarray in a project that is encouraging use of python typing, and we too would like to use enums as data dimension names. How do you feel about using a base class that data dimension classes need to subclass?

Here's a really simple proof-of-concept (though not very thorough, as it would certainly fail serialization): ttung@8e623eb

In [1]: import xarray as xr

In [2]: import numpy as np

In [5]: from enum import Enum

In [6]: class A(xr.core.dataarray.DimensionBase, Enum):
   ...:     X = "abc"
   ...:     Y = "def"
   ...:     Z = "ghi"
   ...:     

In [7]: a = xr.DataArray(np.random.randint(0, 255, size=(4, 3, 5)), dims=[A.X, A.Y, A.Z])

In [8]: a[A.X]
Out[8]: 
<xarray.DataArray <A.X: 'abc'> (A.X: 4)>
array([0, 1, 2, 3])
Dimensions without coordinates: A.X

In [9]: a.max(A.X)
Out[9]: 
<xarray.DataArray (A.Y: 3, A.Z: 5)>
array([[254, 226, 181, 191, 233],
       [139, 195, 212, 167, 169],
       [191, 241, 199, 174, 208]])
Dimensions without coordinates: A.Y, A.Z

In [10]: 

@shoyer
Copy link
Member

shoyer commented Aug 2, 2018

Most of the places in the code where we do the isinstance(obj, string) checks are where we allow passing in a single string as a convenient shortcut to a list of names. So I'm not sure it's really essential to allow flexible types there.

That said, checking explicitly for strings wasn't take a careful API choice. I could see a case for replacing all these "as sequence" casts with something more generic, e.g., based on checking explicitly for more generic scalar types. Certainly enums should be scalars.

If possible, I would rather stick to duck typing for any requirements we put on names. Base classes don't feel terribly Pythonic.

@ttung
Copy link
Contributor

ttung commented Aug 2, 2018

The problem with generic scalar types is that it wouldn't work after serialization and deserialization (which would presumably go to strings). My suggestion has the advantage of being able to create a __eq__ method in the base class that would match both the object itself or its string equivalent, so that one could use the scalar type even after ser/deser. I disagree that base classes aren't very pythonic.

However, I think (1)/(2) are both reasonable solution (in fact, they seem to be identical except for when you call str). It has its warts, as even a mutable sequence would pass muster. :)

If that's the direction you'd like to see the project go towards, I'd be happy to take a stab at it.

@shoyer
Copy link
Member

shoyer commented Aug 3, 2018

I disagree that base classes aren't very pythonic.

I should have said that required base classes don't feel very Pythonic. I'm not opposed to base classes in principle, and I'm definitely sympathetic to a desire to use static typing. See also #1900 for related discussion.

One consideration is what the advantages are of using enums over "dummy enums" like:

class A:
    X = 'X'
    Y = 'Y'
    Z = 'Z'

(i.e., constants in a namespace)

You can still refer to these programmatically like A.X, but I guess the string repr is different. On the plus side, "dummy enums" will serialize/deserialize perfectly to strings (because they are indeed strings).

I don't love the sound of names that deserialize to different types than their inputs. That seems very error prone, even if you do your best to overload all the special methods like __eq__.

What does seems like potentially a better idea to me would be a library with dedicated loader functions that "destringify" names by turning them back into enum objects.

@gimperiale
Copy link
Contributor

gimperiale commented May 9, 2019

There are problems with typing. I already mentioned them in #2929 but I'll summarize here.

The vast majority of xarray functions/methods allow for "string or sequence of strings, optional". When you move to "hashable or sequence of hashables, optional", however, you want to specifically avoid tuples, which are both Sequence and Hashable instances.

Most functions currently look like this:

if isinstance(x, str):
    x = [x]
elif x is None:
    x = [DEFAULT]
for xi in x:
    ...

After the change they would become:

if x is None:
    x = [DEFAULT]
elif isinstance(x, Hashable) and not isinstance(x, tuple):
    x = [x]
for xi in x:
    ...

Or:

if x is None:
    x = [DEFAULT]
elif isinstance(x, str) or not isinstance(x, Sequence):
    x = [x]
for xi in x:
    ...

Note how I moved the test for None above. This matters, because isinstance(None, Hashable) returns True.
This is very error-prone and expensive to maintain, which will very easily cause beginner contributors to introduce bugs. Every test that currently runs three use cases, one for None, one for str and another for a sequence of str, will now be forced to be expanded to SIX test cases:

  1. str
  2. tuple (hashable sequence) of str
  3. list (non-hashable sequence) of str
  4. enum (non-str, non-sequence hashable)
  5. sequence of non-sortable hashables
  6. None

One way to mitigate it would be to have an helper function, which would be invoked everywhere around the codebase, and then religiously make sure that the helper function is always used.

_no_default = [object()]
def ensure_sequence(name: str, x: Union[Hashable, Sequence[Hashable]], default: Sequence[Hashable] = _no_default) -> Sequence[Hashable]:
    if x is None:
        if default is _no_default:
            raise ValueError(name + ' must be explicitly defined')
        return default
    if isinstance(x, Sequence) and not isinstance(x, str):
        return x
    if isinstance(x, Hashable):
        return [x]
    raise TypeError(name + ' must be a Hashable or Sequence of Hashable')

You would still be forced to implement the test for non-sortable hashables, though.


A completely separate problem with typing is that I expect a huge amount of xarray users to just assume variable names and dims are always strings. They'll have things like

for k, v in ds.data_vars:
    if k.startswith('foo'):
        ...

or

[dim for dim in da.dims if "foo" in dim]

The above will fill the mypy output with errors as soon as xarray becomes integrated in mypy (#2929), and the user will have to go through a lot of explicitly forcing dims and variable names to str, even if in their project all dims and variables names are always str.


The final problem is that integers are Hashables, and there's a wealth of cases in xarray where there is special logic that dynamically treats ints as positional indices.

@gimperiale
Copy link
Contributor

A possible way out would be to open a PEP for "and" and "not" operators in the typing module. That way we could define a "variable-name-like" type and use it throughout the module:

xarray.utils:

from typing import AllOf, Hashable, NoneOf
VarName = AllOf[Hashable, NoneOf[None, tuple]]

Elsewhere:

from .utils import VarName
def f(x: Union[VarName, Sequence[VarName], None]):
    if x is None:
        x = [DEFAULT]
    elif isinstance(x, VarName):
        x = [x]
    elif not isinstance(x, Sequence):
        raise TypeError('x: expected hashable or sequence of hashables)

@shoyer
Copy link
Member

shoyer commented May 27, 2019

From a typing perspective, what if xarray.Dataset was a generic type? Then you could write something like xarray.Dataset[str, np.ndarray] to get a Dataset specialized to string keys and numpy arrays in the .data attribute of its constituent data.

I don't think we need to change the signature of xarray functions to support "hashable or sequence of hashable". It's OK if convenience features (like support for passing in only a single argument as a string) don't work in all cases. I agree that it would be a good idea to use a centralized helper function for this, though.

It is unfortunate that there doesn't seem to be a good way to distinguish between "string" and "non-string sequence of strings" in Python's typing system. But I don't know a good way to solve this problem. Certainly the folks who work on typing in Python are aware of this.

@crusaderky
Copy link
Contributor

@shoyer the biggest problem I see with your suggestion is that, for DataArrays, you'd likewise need to write xarray.DataArray[str, np.ndarray], except that str in this case refers to the coords alone, which I think novice users may find confusing as they won't mentally associate a DataArray to a dict-like - even if you can write da[coord name].

@DerWeh
Copy link

DerWeh commented Nov 5, 2020

I just came along this question as I tried something similar to @joshburkart. Using a string-enum instead, the code works in principle:

import enum

import numpy as np
import pandas as pd
import xarray as xr

class CoordId(str, enum.Enum):
    LAT = 'lat'
    LON = 'lon'

pd.DataFrame({CoordId.LAT: [1,2,3]}).to_csv()
# Returns: ',CoordId.LAT\n0,1\n1,2\n2,3\n'

xr.DataArray(
    data=np.arange(3 * 2).reshape(3, 2),
    coords={CoordId.LAT: [1, 2, 3], CoordId.LON: [7, 8]},
    dims=[CoordId.LAT, CoordId.LON],
)
# output
# <xarray.DataArray (lat: 3, lon: 2)>
# array([[0, 1],
#        [2, 3],
#        [4, 5]])
# Coordinates:
#   * lat          (CoordId.LAT) int64 1 2 3
#   * lon          (CoordId.LON) int64 7 8

We however got somewhat ambivalent results, that the dimensions are still enum elements dims = (<CoordId.LAT: 'lat'>, <CoordId.LON: 'lon'>), but the coordinate names are the strings. After writing and reading the DataArray, everything is a plain string, we can still access the elements using the enum elements, as they are equal to the strings.

@derhintze
Copy link

derhintze commented Nov 26, 2021

Would like to chime in that we use a similar approach as in the last comment of @DerWeh . But we extend this by overloading the __str__ method of the enum.Enum base class, and implement a Dimension base class to use for our enum dimensions:

class Dimension(str, enum.Enum):
    """Base class for all dimension enums

    It is of type string because this is needed for xarray.
    """

    def __str__(self):
        return type(self).__name__ + "." + self.name

Using this the xarray output is more consistent:

>>> class CoordId(Dimension):
...     LAT = 'lat'
...     LON = 'lon'
... 
>>> xr.DataArray(
...     data=np.arange(3 * 2).reshape(3, 2),
...     coords={CoordId.LAT: [1, 2, 3], CoordId.LON: [7, 8]},
...     dims=[CoordId.LAT, CoordId.LON],
... )
<xarray.DataArray (CoordId.LAT: 3, CoordId.LON: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * CoordId.LAT  (CoordId.LAT) int64 1 2 3
  * CoordId.LON  (CoordId.LON) int64 7 8

We then have deserialization code, that re-creates enum members when reading NetCDF files with corresponding dimensions (and coordinates). Access to coordinates works with enum members as well as their string value.

@max-sixty
Copy link
Collaborator

max-sixty commented Sep 10, 2024

The motivating example works when the coord keys are passed as tuples, since in the last few years we're coalesced on dim types being str | Iterable[Hashable] — i.e. to pass arbitrary hashables, we need to use an iterable; xarray will only coerce strings to a sequence:

import enum

import numpy as np
import pandas as pd
import xarray as xr

class CoordId(enum.Enum):
    LAT = 'lat'
    LON = 'lon'

pd.DataFrame({CoordId.LAT: [1,2,3]}).to_csv()

xr.DataArray(
    data=np.arange(3 * 2).reshape(3, 2),
    coords={(CoordId.LAT,): [1, 2, 3], (CoordId.LON,): [7, 8]},  # Change here
    dims=[CoordId.LAT, CoordId.LON],
)

Marking as close-able, please add a comment if that's not quite right

@max-sixty max-sixty added the plan to close May be closeable, needs more eyeballs label Sep 10, 2024
@max-sixty
Copy link
Collaborator

We're discussing this at #8836. Some of these were five years ago. In particular, I think the issues in this comment #2292 (comment) were solved by having str | Iterable[Hashable] as the dim type, which:

  • no longer includes None
  • no longer is ambiguous with a tuple

Am I mistaken there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants