Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiindex selection #2621

Merged
merged 8 commits into from
Dec 24, 2018
Merged

Conversation

fujiisoup
Copy link
Member

Fix using MultiIndex.remove_unused_levels()

@pep8speaks
Copy link

pep8speaks commented Dec 19, 2018

Hello @fujiisoup! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 24, 2018 at 12:51 Hours UTC

@shoyer
Copy link
Member

shoyer commented Dec 19, 2018

Should we update levels after indexing or when stacking? Pandas does the later:

In [31]: import pandas as pd

In [32]: mindex = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']])

In [33]: s = pd.Series(range(6), mindex)

In [34]: s.loc[:1].index
Out[34]:
MultiIndex(levels=[[1, 2, 3], ['a', 'b']],
           labels=[[0, 0], [0, 1]])

In [35]: s.loc[:1].unstack()
Out[35]:
   a  b
1  0  1

The advantage of waiting until unstacking is that removing unused levels could be expensive compared to indexing: it runs in time O(L+N), where L is the size of the original level and N is the length of the new multi-index.

@fujiisoup
Copy link
Member Author

@shoyer

This issue does not matter only in unstack but also in sel (reindex also?).
We can do it in both methods for efficiency, but I just prefered to do it in one place for the quick fix.

But what is the good place to do it actually?
If we invoke this in sel, it would be again inefficient as this is called every time we did .sel.

Probably the best for the efficiency is to keep a flag for it in PandasIndexAdapter and do it in _get_item_with_mask and get_loc only if flag is raised?

@shoyer
Copy link
Member

shoyer commented Dec 20, 2018

This issue does not matter only in unstack but also in sel (reindex also?).

Can you explain why matters for .sel()? I guess it slows down repeated indexing?

My inclination was just to copy what pandas does (which is only removing unused levels in unstack)

@fujiisoup
Copy link
Member Author

@shoyer

My inclination was just to copy what pandas does (which is only removing unused levels in unstack)

Thanks. I will take a look the source later.

Can you explain why matters for .sel()?

Selection of non-exsiting level variable should be KeyError, but it gives a 0-size index.

In [8]: ds = xr.DataArray(np.arange(40).reshape(8, 5), dims=['x', 'y'],  
                  coords={'x': np.arange(8), 'y': np.arange(5)}).stack(xy=['x', 'y']) 

In [9]: ds.isel(xy=ds['x'] < 4).sel(x=5)  # should be KeyError
Out[9]: 
<xarray.DataArray (y: 0)>
array([], dtype=int64)
Coordinates:
  * y        (y) int64 

@fujiisoup
Copy link
Member Author

But this problem of .sel can be simply solved by manually raising a keyerror if the result of get_loc is size 0 array.

@fujiisoup
Copy link
Member Author

I moved MultiIndex.remove_unused_levels() to the inside of unstack. Now it is invoked only when unstack is called.

@@ -0,0 +1,79 @@
import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do a local import here instead the function?

I'm a little nervous that some future version of pandas may drop this (private) module, which would then break imports even though this isn't actually used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed.

import pandas as pd
import pandas.core.algorithms as algos


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you copy the pandas license directly into this file, too? See coding/cftimesindex.py for an example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@shoyer shoyer merged commit b5059a5 into pydata:master Dec 24, 2018
dcherian pushed a commit to yohai/xarray that referenced this pull request Jan 2, 2019
* master:
  DEP: drop python 2 support and associated ci mods (pydata#2637)
  TST: silence warnings from bottleneck (pydata#2638)
  revert to dev version
  DOC: fix docstrings and doc build for 0.11.1
  Source encoding always set when opening datasets (pydata#2626)
  Add flake check to travis (pydata#2632)
  Fix dayofweek and dayofyear attributes from dates generated by cftime_range (pydata#2633)
  silence import warning (pydata#2635)
  fill_value in shift (pydata#2470)
  Flake fixed (pydata#2629)
  Allow passing of positional arguments in `apply` for Groupby objects (pydata#2413)
  Fix failure in time encoding for pandas < 0.21.1 (pydata#2630)
  Fix multiindex selection (pydata#2621)
  Close files when CachingFileManager is garbage collected (pydata#2595)
  added some logic to deal with rasterio objects in addition to filepaths (pydata#2589)
  Get 0d slices of ndarrays directly from indexing (pydata#2625)
  FIX Don't raise a deprecation warning for xarray.ufuncs.{angle,iscomplex} (pydata#2615)
  CF: also decode time bounds when available (pydata#2571)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants