Deprecate passing pd.MultiIndex implicitly #8140

Open
wants to merge 12 commits into
base: main
from

Conversation

benbovy
Member

@benbovy benbovy commented Sep 3, 2023

This PR should normally raise a warning each time indexed coordinates are created implicitly from a pd.MultiIndex object.

I updated the tests to create coordinates explicitly using Coordinates.from_pandas_multiindex().

I also refactored some parts where a pd.MultiIndex could still be passed and promoted internally, with the exception of:

  • swap_dims(): it should raise a warning! Right now the warning message is a bit confusing for this case, but instead of adding a special case we should probably deprecate the whole method? As it is suggested as a TODO comment... This method was to circumvent the limitations of dimension coordinates, which isn't needed anymore (rename_dims and/or set_xindex is equivalent and less confusing).
  • xr.DataArray(pandas_obj_with_multiindex, dims=...): I guess it should raise a warning too?
  • da.stack(z=...).groupby("z"): it shouldn't raise a warning, but this requires a (heavy?) refactoring of groupby. During building the "grouper" objects, grouper.group1d or grouper.unique_coord may still be built by extracting only the multi-index dimension coordinate. I'd greatly appreciate it if anyone familiar with the groupby implementation could help me with this! @dcherian ?

@max-sixty
Collaborator

xr.DataArray(pandas_obj_with_multiindex, dims=...): I guess it should raise a warning too?

I've been out of the loop of discussions recently (and less recently...). To the extent this isn't firmly decided — is this necessary? Is there a downside to having a good default when pandas objects are passed in? Is there significant ambiguity on what the result should be? What do we recommend for converting from pandas-object-with-multiindex to dataset/dataarray?

@benbovy
Member Author

benbovy commented Sep 7, 2023

I've been out of the loop of discussions recently (and less recently...)

No worries! There's more context in #6293 (comment) and in #6392 (comment).

Is there a downside to having a good default when pandas objects are passed in? Is there significant ambiguity on what the result should be? What do we recommend for converting from pandas-object-with-multiindex to dataset/dataarray?

The main source of ambiguity is the extraction of each multi-index level as a coordinate and the possible conflict with the other coordinates.

More generally, maintaining the special cases for pandas multi-index has been a big hassle ever since support for it was added in Xarray. I share a lot of responsibility since I mainly contributed to adding that support :-). There have been numerous subtle bugs and it really makes the internal logic more complicated than it should be in many places of the Xarray code base. Removing all those special cases will be a big relief!

I think that a good default behavior is to treat the pandas objects passed as data or coordinate variables like any other duck array. If we want a more specific behavior leveraging the index contained in those objects, the recommended way is to convert them using the explicit conversion methods provided by Xarray, e.g.,

  • For a pd.MultiIndex, use xr.Coordinates.from_pandas_multiindex(...)
  • For a pd.Series with a multi-index, use xr.DataArray.from_series(...).stack(...)
  • For a pd.DataFrame with a multi-index, use xr.Dataset.from_dataframe(...).stack(...)

(note: for the two latter we might want to add an option to skip expanding the multi-index so that we don't need to re-stack the dimensions)
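
As a rough illustration of the explicit conversions listed above, here is a minimal sketch. It assumes a recent xarray release that provides `xr.Coordinates.from_pandas_multiindex`; the dimension name `"x"` and the variable names are made up for the example.

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# 1. A bare pd.MultiIndex: build the coordinates (and their index) explicitly,
#    naming the dimension ourselves ("x" is an arbitrary choice here).
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
ds = xr.Dataset(coords=coords)

# 2. A pd.Series with a multi-index: from_series() expands the levels into
#    separate dimensions, so stack() is needed to return to a single dimension.
series = pd.Series(range(4), index=midx, name="a")
da = xr.DataArray.from_series(series)    # dims: ("one", "two")
stacked = da.stack(x=("one", "two"))     # dims: ("x",)
```

The `pd.DataFrame` case follows the same pattern as the series one, with `xr.Dataset.from_dataframe(...)` followed by `.stack(...)`.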

Add suggestions for the cases where the pandas multi-index is passed via
a pandas dataframe or series.
@dcherian
Contributor

dcherian commented Sep 7, 2023

During building the "grouper" objects, grouper.group1d or grouper.unique_coord may still be built by extracting only the multi-index dimension coordinate.

Can you describe what change you'd like to see?

@benbovy
Member Author

benbovy commented Sep 7, 2023

@dcherian ideally GroupBy._infer_concat_args() would return a xr.Coordinates object that contains both the coordinate(s) and their (multi-)index to assign to the result (combined) object.

The goal is to avoid calling create_default_index_implicit(coord) below where coord is a pd.MultiIndex or a single IndexVariable wrapping a multi-index. If coord is a Coordinates object, we could do combined = combined.assign_coords(coord) instead.

xarray/xarray/core/groupby.py

Lines 1573 to 1587 in e2b6f34

def _combine(self, applied):
    """Recombine the applied objects like the original."""
    applied_example, applied = peek_at(applied)
    coord, dim, positions = self._infer_concat_args(applied_example)
    combined = concat(applied, dim)
    (grouper,) = self.groupers
    combined = _maybe_reorder(combined, dim, positions, N=grouper.group.size)
    # assign coord when the applied function does not return that coord
    if coord is not None and dim not in applied_example.dims:
        index, index_vars = create_default_index_implicit(coord)
        indexes = {k: index for k in index_vars}
        combined = combined._overwrite_indexes(indexes, index_vars)
    combined = self._maybe_restore_empty_groups(combined)
    combined = self._maybe_unstack(combined)
    return combined
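
The replacement suggested above (assigning a `Coordinates` object instead of calling `create_default_index_implicit`) can be sketched outside of groupby. This is only an illustration under assumptions: `combined` and `coord` are hand-built stand-ins for the concatenated result and for whatever a refactored `_infer_concat_args()` might return, not the actual groupby internals.

```python
import numpy as np
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# Stand-in for the concatenated result, which carries no "x" coordinate yet.
combined = xr.Dataset({"a": ("x", np.arange(4))})

# Stand-in for a Coordinates object holding the coordinates and their index.
coord = xr.Coordinates.from_pandas_multiindex(midx, "x")

# Coordinates and index are assigned together; no implicit promotion is needed.
combined = combined.assign_coords(coord)

# Label-based selection on a level works because the index came along.
selected = combined.sel(one="a")
```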

There are actually more general issues:

  • The group parameter of Dataset.groupby being a single variable or variable name, it won't be possible to do groupby on a full pandas multi-index once we drop its dimension coordinate (Deprecate the multi-index dimension coordinate #8143). How can we still support it? Maybe passing a dimension name to group and check that there's only one index for that dimension?
  • How can we support custom, multi-coordinate indexes with groupby? I don't have any practical example in mind, but in theory just passing a single coordinate name as group will invalidate the index. Should we drop the index in the result? Or, as suggested above, pass a dimension name as group and check the index?

@max-sixty
Collaborator

Thanks @benbovy !

More generally, maintaining the special cases for pandas multi-index has been a big hassle ever since support for it was added in Xarray. I share a lot of responsibility since I mainly contributed to adding that support :-). There have been numerous subtle bugs and it really makes the internal logic more complicated than it should be in many places of the Xarray code base. Removing all those special cases will be a big relief!

I totally agree with not having native MultiIndex support within a DataArray / Dataset. I'm wondering whether we can still do something reasonable when a MultiIndex is passed in, since that's quite common IME, and it's common with folks who want to do something quickly, possibly are less experienced xarray users — and so the costs of explicit conversions might have the largest impact.

I think that a good default behavior is to treat the pandas objects passed as data or coordinate variables like any other duck array.

OK great, I'm less familiar with what this would be like — would .sel still work? (Or feel free to point me to issues, thank you for your patience in advance...)

  • For a pd.MultiIndex, use xr.Coordinates.from_pandas_multiindex(...)
  • For a pd.Series with a multi-index, use xr.DataArray.from_series(...).stack(...)
  • For a pd.DataFrame with a multi-index, use xr.Dataset.from_dataframe(...).stack(...)

To the extent xr.Coordinates.from_pandas_multiindex(...) is what's required to get reasonable behavior, we could do that implicitly, and then for something more specific, folks can be explicit.

(FYI my guess is that often we don't want to .stack, since the indexes can be quite sparse)

@benbovy
Member Author

benbovy commented Sep 7, 2023

I'm wondering whether we can still do something reasonable when a MultiIndex is passed in, since that's quite common IME, and it's common with folks who want to do something quickly, possibly are less experienced xarray users — and so the costs of explicit conversions might have the largest impact.

Hmm even with the most reasonable option, extracting one or more level coordinates from a MultiIndex passed as a single variable feels too magical and is hardly predictable, IMHO. That's not the kind of behavior one usually expects for generic mapping types.

What if the MultiIndex is wrapped in another object, e.g., a pandas.Series, xarray.Variable, xarray.DataArray? What would be the most reasonable behavior for those cases? Here are a few examples:

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# extracts the multi-index levels as coordinates with dimension "x"
xr.Dataset({"x": midx})
xr.Dataset(coords={"x": midx})
xr.Dataset(coords={"x": xr.Variable("x", midx)})
xr.Dataset({"x": xr.DataArray(midx, dims="x")})

# creates only one dimension coordinate "x" with tuple values
xr.Dataset({"x": xr.DataArray(xr.Variable("x", midx))})

# creates one dimension coordinate "x" with tuple values
# and two indexed coordinates "one", "two" sharing the same index
xr.Dataset({"x": xr.DataArray(xr.IndexVariable("x", midx))})

# extracts the multi-index levels as coordinates with dimension "dim_0"
xr.Dataset({"x": pd.Series(range(4), index=midx)})

# creates a dimension coordinate "x" with values [0, 1, 2, 3] 
xr.Dataset(coords={"x": pd.Series(range(4), index=midx)})
xr.Dataset({"x": ("x", pd.Series(range(4), index=midx))})

I doubt that all these results would have been accurately predicted by even experienced xarray users (the nested DataArray / IndexVariable example is certainly a bug).

Another question: how common will using a pandas MultiIndex be compared to other Xarray indexes that will be available in the future? To what extent is it justified to treat PandasMultiIndex so differently from any other Xarray multi-coordinate index?

To the extent xr.Coordinates.from_pandas_multiindex(...) is what's required to get reasonable behavior

I'm afraid it is more complicated than that.

@max-sixty
Collaborator

Those are great examples!

Hmm even with the most reasonable option, extracting one or more level coordinates from a MultiIndex passed as a single variable feels too magical and is hardly predictable, IMHO. That's not the kind of behavior one usually expects for generic mapping types.

OK. FWIW, extracting coords is what I was thinking... 😁

Another question: how common will using a pandas MultiIndex be compared to other Xarray indexes that will be available in the future? To what extent is it justified to treat PandasMultiIndex so differently from any other Xarray multi-coordinate index?

My mental model of this user is that they don't so much care about the MultiIndex object per se, but MultiIndexes are common in pandas, and they expect some reasonable-looking xarray object when implicitly converting from a pandas object. It remaining a literal MultiIndex within the da isn't important to them.

I do worry that if we say "oh you want to pass in a dataframe with a multiindex, now you have to make a bunch of choices on how that should happen", that it won't be friendly.

(I'm by no means claiming this is every user; I'm loading on my own experience working with folks who use both pandas & xarray)

For example, this is quite sufficient: pass in a DataFrame with a multiindex...

df = pd.DataFrame(dict(a=range(7,11)), index=midx)

df

Out[32]:
          a
one two
a   0     7
    1     8
b   0     9
    1    10

...and then we can use .sel on each of the levels:

xr.Dataset(df).sel(one='a', two=0)

Out[37]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    dim_0    object ('a', 0)
    one      <U1 'a'
    two      int64 0
Data variables:
    a        int64 7

It's not perfect — we have this dim_0 which has tuples since we didn't name the coord, but it does work pretty well.


You know this 10x better than I do, so I really don't mean to do a drive-by and slow anything down. I do wonder whether there's some synthesis of the two approaches — we make things robust once they're in the xarray data model, while remaining generous about accepting inputs.

@benbovy
Member Author

benbovy commented Sep 8, 2023

Yes I guess more generally it all depends on whether we see an Xarray Dataset as a kind of multi-dimensional dataframe or as a mapping of n-dimensional arrays with labels.

While both points of view are valid, they are hard to reconcile through the same API. Trying to accommodate it too generously (or even with the barest amount of generosity) may reach a point where it is more harmful than beneficial for both the dataframe and the array points of view (actually, I think we've already reached this point).

After working on the index refactor, my point of view shifted more towards n-d arrays (so I'm biased!). Unlike a dataframe, the concept of an array rarely encapsulates an index. Now that indexes are 1st class members of the Xarray data model, it makes better sense IMO to handle them (and dataframe objects) through an explicit API rather than trying to continue mixing them with arrays in the same API function or method arguments.

That said, I totally agree that we should never make Xarray unfriendly for you and other users using both Pandas & Xarray! We should continue to offer Premium™ builtin support, notably by keeping default PandasIndex objects for dimension coordinates and via API like .from_dataframe, .from_series, .from_pandas_multiindex, etc.

If we require passing (pandas) index, series or dataframe objects via explicit conversion methods, we should indeed try to minimize the friction as much as possible. But I think that we are not far from that goal. Taking your example

xr.Dataset(df).sel(one='a', two=0)

Doing instead

xr.Dataset.from_dataframe(df).sel(one='a', two=0)

doesn't look like adding a lot of friction to me (note: the latter dataset doesn't have any dim_0 added).
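
For reference, a self-contained version of the explicit variant, reusing the `midx` and `df` definitions from earlier in the thread. It relies on the current `from_dataframe` behavior of expanding each level into its own dimension.

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7, 11)), index=midx)

# Explicit conversion: each multi-index level becomes its own dimension,
# so there is no tuple-valued "dim_0" coordinate in the result.
ds = xr.Dataset.from_dataframe(df)

# Selecting on both levels yields a scalar dataset.
point = ds.sel(one="a", two=0)
```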

I do worry that if we say "oh you want to pass in a dataframe with a multiindex, now you have to make a bunch of choices on how that should happen", that it won't be friendly.

I also agree with this. So if we choose to deprecate the current default behavior, we should consider a long deprecation cycle and make it clear what is the alternative to get the desired behavior.

@max-sixty
Collaborator

Thank you for the very thoughtful responses. I actually think we're quite close in how we're thinking about it. I like your distinction of "Xarray Dataset as a kind of multi-dimensional dataframe or as a mapping of n-dimensional arrays with labels.", and I tend towards the latter too, even if it's nice to occasionally orient around the former.

If we require passing (pandas) index, series or dataframe objects via explicit conversion methods, we should indeed try to minimize the friction as much as possible. But I think that we are not far from that goal. Taking your example

xr.Dataset(df).sel(one='a', two=0)

Doing instead

xr.Dataset.from_dataframe(df).sel(one='a', two=0)

doesn't look like adding a lot of friction to me (note: the latter dataset doesn't have any dim_0 added).

For me the main issues here are:

  • How would someone use .from_pandas_multiindex to convert a df to a ds? I tried a couple of things but couldn't get the correct indexes, and couldn't immediately see an example in the tests (sorry if this is basic / covered elsewhere — please feel very free to say "read X")
xr.Dataset(df.reset_index(drop=True), coords=xr.Coordinates.from_pandas_multiindex(df.index, dim='foo'))
Out[23]:
<xarray.Dataset>
Dimensions:  (dim_0: 4, foo: 4)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
  * foo      (foo) object MultiIndex
  * one      (foo) object 'a' 'a' 'b' 'b'
  * two      (foo) int64 0 1 0 1
Data variables:
    a        (dim_0) int64 7 8 9 10
  • Is there somewhere I can read about the end-state? I agree that supporting all of pandas' warts is something we can ideally avoid. What would "treat the pandas objects passed as data or coordinate variables like any other duck array" look like? What would the object from the previous bullet be, assuming this were correctly converted? Is there some state that the Dataset should be in, which allows for some notion of sparse data, which we can try and automagically move the dataset closer towards? While we want to generally be robust and explicit, we do have prior art of auto-magic (e.g. combine_by_coords).

  • Using .from_dataframe unstacks the array, which would obv be quite bad for sparse indexes. For example if a multiindex was used to label a date dimension with a n_days_counter (which we would use a coord for in xarray), then it would expand to an n x n array, with data only on the diagonals.
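
The sparse-index concern in the last bullet can be made concrete with a small sketch. The column name `value`, the level names and the size `n` are invented for the illustration; the point is only that the current unstacking behavior of `from_dataframe` turns `n` rows into an `n x n` array.

```python
import numpy as np
import pandas as pd
import xarray as xr

n = 5
dates = pd.date_range("2023-01-01", periods=n)

# Two levels that label the same rows: a date and a running day counter.
midx = pd.MultiIndex.from_arrays([dates, np.arange(n)], names=("date", "n_days"))
df = pd.DataFrame({"value": np.arange(n, dtype=float)}, index=midx)

# from_dataframe expands the levels into a full (date, n_days) grid and
# fills every combination that is absent from the dataframe with NaN.
ds = xr.Dataset.from_dataframe(df)

grid_shape = ds["value"].shape
n_valid = ds["value"].notnull().sum().item()
```

Here `grid_shape` is `(n, n)` while only `n` cells hold data, which is the diagonal pattern described above.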


One note: I'm hesitant to push too hard here given how much work and thought has gone into it, and how absent I've been in the past year. So please forgive the continued questions if they feel like an imposition. I'm persevering because I do think it's important, and I do think there are a large number of users who may be more casual and so less represented here. I found xarray back in 2016 because of pandas dissatisfaction, so I'm keen to keep that immigration channel open for folks...

@dcherian
Contributor

dcherian commented Sep 9, 2023

GroupBy._infer_concat_args() would return a xr.Coordinates object that contains both the coordinate(s) and their (multi-)index to assign to the result (combined) object.

This may take some time. I opened #8162 to track it

@benbovy
Member Author

benbovy commented Sep 9, 2023

@max-sixty your questions and thoughts are very much appreciated, please keep them coming! While it seems to me that there is broad agreement about deprecating special multi-index behavior in general, there hasn't been much discussion about it, especially about all the possible impacts that this would have.

Using .from_dataframe unstacks the array, which would obv be quite bad for sparse indexes.

Do you think adding a dim=None argument to Dataset.from_dataframe (and DataArray.from_series) would be a reasonable option?

  • dim=None (default) corresponds to the current behavior
    • single index: a dimension coordinate is created and is named like the index name (or "dim_0" if the index has no name)
    • MultiIndex: the dataframe is unstacked and each multi-index level is extracted as a dimension coordinate
  • dim="x":
    • single index: if it has no name a dimension coordinate "x" is created, otherwise an indexed (non-dimension) coordinate is created, is named like the index and has dimension "x"
    • MultiIndex: the dataframe is not unstacked and the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"

I think that if users set dim="x" explicitly, it is pretty clear that they want to keep the Dataset as 1-dimensional (so no expansion of a MultiIndex into a tensor product).
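
To make the proposed `dim="x"` behavior for the MultiIndex case tangible, here is a sketch built only from existing API. The helper `dataset_from_dataframe_stacked` is hypothetical (it is not part of xarray) and it deliberately leaves out the single-index case.

```python
import pandas as pd
import xarray as xr


def dataset_from_dataframe_stacked(df: pd.DataFrame, dim: str) -> xr.Dataset:
    """Hypothetical helper: convert without unstacking the MultiIndex."""
    if not isinstance(df.index, pd.MultiIndex):
        raise TypeError("this sketch only covers the MultiIndex case")
    # Level coordinates plus their shared index, all along dimension `dim`.
    coords = xr.Coordinates.from_pandas_multiindex(df.index, dim)
    # Each column becomes a 1-d data variable along the same dimension.
    data_vars = {name: (dim, column.to_numpy()) for name, column in df.items()}
    return xr.Dataset(data_vars, coords=coords)


midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7, 11)), index=midx)

ds = dataset_from_dataframe_stacked(df, "x")
picked = ds.sel(one="b", two=1)
```

The result stays 1-dimensional along `"x"`, and selection by level name still works because the levels share a multi-index.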

Is there somewhere I can read about the end-state?

Not yet, but once it is clarified we should document it somewhere! I actually haven't thought much about dataframe objects passed directly to Dataset.__init__. If we don't try anymore to extract any index, so no special case anymore for pandas.DataFrame, we could naively consider it like any other input passed to Dataset, i.e., as a mapping of arrays. This could look like:

xr.Dataset({k: np.asarray(v) for k, v in df.items()})
# <xarray.Dataset>
# Dimensions:  (a: 4)
# Coordinates:
#   * a        (a) int64 7 8 9 10
# Data variables:
#    *empty*

Now, that's not super nice to have as many dimensions as there are columns.

Alternatively, we could have some special case for a dataframe but not trying to do too much (i.e., not trying to extract and convert the index). For example:

xr.Dataset({k: ("dim_0", np.asarray(v)) for k, v in df.reset_index().items()})
# <xarray.Dataset>
# Dimensions:  (dim_0: 4)
# Dimensions without coordinates: dim_0
# Data variables:
#     one      (dim_0) object 'a' 'a' 'b' 'b'
#     two      (dim_0) int64 0 1 0 1
#     a        (dim_0) int64 7 8 9 10

What do you think?

@dcherian
Contributor

dcherian commented Sep 9, 2023

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me.

To me, it seems sensible that Dataset.from_dataframe(df) automatically creates a Dataset with PandasMultiIndex if df has a MultiIndex. The user can then use that or quite easily unstack to a dense or sparse array.

@benbovy
Member Author

benbovy commented Sep 9, 2023

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me.

Agreed. I guess that's because it has been there before any multi-index support in Xarray? I'm +1 for changing this behavior.

A smooth transition could be using the dim argument as proposed above to turn on the new behavior. Eventually dim=None won't unstack anymore.

@dcherian
Contributor

dcherian commented Sep 9, 2023

Can we get away with an unstack: bool kwarg instead (that is eventually removed) and have the user manually rename as an extra step?

@benbovy
Member Author

benbovy commented Sep 9, 2023

Yes we certainly can!

We can also have both and keep dim afterwards, assuming that a MultiIndex rarely has its .name set (that's why I added a dim argument in Coordinates.from_pandas_multiindex).

@max-sixty
Collaborator

Excellent, this is sounding good!

  • MultiIndex: the dataframe is not unstacked and the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"

This will betray how long I've been out for, but was there any progress on allowing .sel to work with coords? IIRC there were some plans to allow indexes on coords beyond those named the same as a dimension.

If that is possible, then this proposal would be ideal — basically a much better MultiIndex.

If that's not, then it's awkward, because it's no longer possible to .sel from that dimension, which seems quite important.


+1 to not unstacking automatically

The dim="x" (rather than unstack=False) I think might be required, because IIUC a MultiIndex doesn't have a .name, only a .names (referring to the level names), so a bool doesn't give the information for which dimension it should be on.


(thanks for suggestions on the default __init__ behavior, let me think; possibly it somewhat depends on whether we can still have "multi-level" indexes that can be accessed with .sel)

@benbovy
Member Author

benbovy commented Sep 9, 2023

was there any progress on allowing .sel to work with coords? IIRC there were some plans to allow indexes on coords beyond those named the same as a dimension.

Yes it is now supported since v2022.06.0.

a MultiIndex doesn't have a .name, only a .names

Technically a pd.MultiIndex has a .name property (inherited from pd.Index) but in practice it is mostly ignored I think. In xarray.core.indexes.PandasMultiIndex we keep it in sync with the dimension name of the level coordinates, but I doubt that this is really useful (it might become useful for round-trip conversion between xarray.Dataset and pandas.DataFrame if we don't unstack anymore).
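
A quick pandas-only check of the `.name` versus `.names` distinction, as a sketch. Nothing here is xarray-specific, and `"foo"` is an arbitrary name.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# .names holds one entry per level.
level_names = list(midx.names)

# .name is inherited from pd.Index and is unset unless assigned explicitly.
name_before = midx.name
midx.name = "foo"
name_after = midx.name
```

Assigning `.name` leaves the level names untouched.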

@max-sixty
Collaborator

was there any progress on allowing .sel to work with coords? IIRC there were some plans to allow indexes on coords beyond those named the same as a dimension.

Yes it is now supported since v2022.06.0.

To confirm the question (sorry if I'm being unclear), If we do this:

MultiIndex: the dataframe is not unstacked and the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"

...then the result of:

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7,11)), index=midx)
ds = xr.Dataset(df)  # (or `.from_dataframe` with a dim arg)
ds

...would change to something like:

<xarray.Dataset>
Dimensions:  (dim_0: 4)
Coordinates:
-  * dim_0    (dim_0) object MultiIndex
  * one      (dim_0) object 'a' 'a' 'b' 'b'
  * two      (dim_0) int64 0 1 0 1
Data variables:
    a        (dim_0) int64 7 8 9 10

...but then we'd still have some way of calling ds.sel(one='a').

I know we can currently do ds.sel(one='a') — but IIUC that's only because the MultiIndex is there.

Or does 'the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"' mean that we would still have a MultiIndex, and the change is smaller than I was envisaging — instead it's just that it needs to be specified with a dim when it's passed?

@benbovy
Member Author

benbovy commented Sep 9, 2023

Or does 'the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"' mean that we would still have a MultiIndex, and the change is smaller than I was envisaging

Yes exactly (sorry that was a bit confusing). What I wanted to say is: xarray.Dataset.from_dataframe(df) with no unstack would preserve the MultiIndex of df, i.e., wrap it in a xarray.core.indexes.PandasMultiIndex, create 1-d coordinates from it and then put everything in the newly created Dataset.

Those 1-d coordinates currently include both the dimension coordinate "dim_0" and the level coordinates "one", "two". If we consider #8143, eventually they will only include the level coordinates. In both cases, the level coordinates have a PandasMultiIndex so ds.sel(one='a') is supported. The latter case is possible because Xarray now allows setting an index for any set of arbitrary coordinate(s).
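
As a small illustration of an index on a coordinate that is not a dimension coordinate, here is a sketch with a single-coordinate index. It assumes an xarray release that provides `Dataset.set_xindex`; the coordinate name `label` and its values are invented for the example.

```python
import xarray as xr

# "label" lies along dimension "x" but is not named after it, so it starts
# out as a plain coordinate without an index.
ds = xr.Dataset(
    {"a": ("x", [7, 8, 9, 10])},
    coords={"label": ("x", ["w", "x", "y", "z"])},
)

# Attach a (default, pandas-backed) index to the non-dimension coordinate.
indexed = ds.set_xindex("label")

# Label-based selection now works on "label" directly.
picked = indexed.sel(label="y")
```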

@max-sixty
Collaborator

I see — great — I was conflating this & #8143 a bit, then.

One note as I'm looking at some of my existing code which uses xarray — the current behavior of xr.Dataset(df) is fairly sane; it's what I & folks I work with use a lot:

[ins] In [25]: df.index.name = 'foo'

[ins] In [26]: df
Out[26]:
          a
one two
a   0     7
    1     8
b   0     9
    1    10

[ins] In [27]: xr.Dataset(df)
Out[27]:
<xarray.Dataset>
Dimensions:  (foo: 4)
Coordinates:
  * foo      (foo) object MultiIndex
  * one      (foo) object 'a' 'a' 'b' 'b'
  * two      (foo) int64 0 1 0 1
Data variables:
    a        (foo) int64 7 8 9 10

...so no unstacking. But it does rely on renaming the dim after creation (or, as in this case, using the .name property of a multiindex, which I hadn't even known was a thing; thanks for the pointer above).

So I think we're nearing consensus. Let me write a few things down as a starter — I imagine this is 80% right so please correct me:

  • We'll try to move away from .unstack-ing in .from_dataframe
  • We'll have a deprecation warning for .from_dataframe without a dim arg
  • The dim arg will be used as the name for the "index" dimension (the columns are data vars)
  • The dim arg will cause it to not unstack?
  • And then the direction of Deprecate the multi-index dimension coordinate #8143 can mean we can get the level coords without the parent name

Thank you very much for the discussion @benbovy

@benbovy
Member Author

benbovy commented Sep 10, 2023

We'll have a deprecation warning for .from_dataframe without a dim arg
The dim arg will cause it to not unstack?

Either that (warning without a dim arg and when the passed df has a MultiIndex) or via another, temporary unstack argument as @dcherian suggests. The latter is clearer but the advantage of temporarily controlling unstack via dim is that we won't need to later introduce any breaking change in the API.

@benbovy
Member Author

benbovy commented Sep 10, 2023

I opened #8140 to continue the discussion about Dataset.from_dataframe.

@max-sixty
Collaborator

Sorry I dropped this a while ago — I was just ramping up and lost it in my inbox.

I think we were quite close to consensus, with the unstack kwarg. Was there anything else to cover, or was this just waiting on me to test it out?

The one request I'd have is to be able to call xr.Dataset(df), where df has a multiindex, and have that work as it always has. That has had very reasonable behavior — it doesn't unstack. Recent Xarray code prints a deprecation warning — I think it would be quite unfriendly to force folks to instead take apart the dataframe, extract the multiindex, run through xr.Coordinates.from_pandas_multiindex, and then pass it all into the constructor....

@max-sixty
Collaborator

xarray.Dataset.from_dataframe(df) with no unstack would preserve the MultiIndex of df, i.e., wrap it in a xarray.core.indexes.PandasMultiIndex, create 1-d coordinates from it and then put everything in the newly created Dataset.

Those 1-d coordinates currently include both the dimension coordinate "dim_0" and the level coordinates "one", "two". If we consider #8143, eventually they will only include the level coordinates. In both cases, the level coordinates have a PandasMultiIndex so ds.sel(one='a') is supported. The latter case is possible because Xarray now allows setting an index for any set of arbitrary coordinate(s).

Coming back to this a while later — this seems very reasonable indeed.

...and seems consistent with (my) suggestion:

be able to call xr.Dataset(df), where df has a multiindex, and have that work as it always has [edit: have that work without unstacking, even if the exact behavior of the multiindex changes to multiple indexes]. The existing behavior is very reasonable — it doesn't unstack. Recent Xarray code prints a deprecation warning — I think it would be quite unfriendly to force folks to instead take apart the dataframe, extract the multiindex, run through xr.Coordinates.from_pandas_multiindex, and then pass it all into the constructor....

I think we're in broad consensus on the goals. Is that right? :)

Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

refactor broadcast for flexible indexes
3 participants