Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support rolling over one level of a MultiIndex #34642

Open
1 of 2 tasks
miccoli opened this issue Jun 8, 2020 · 13 comments
Open
1 of 2 tasks

ENH: Support rolling over one level of a MultiIndex #34642

miccoli opened this issue Jun 8, 2020 · 13 comments
Labels
Enhancement Window rolling, ewma, expanding

Comments

@miccoli
Copy link
Contributor

miccoli commented Jun 8, 2020

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


I had a hard time understanding how df.rolling works when df is indexed by a MultiIndex

This is an example data frame:

import pandas as pd
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", "2020-1-10"), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))

print(df)

which outputs

                c1
date       obs    
2020-01-01 a     0
           b     1
2020-01-02 a     2
           b     3
2020-01-03 a     4
           b     5
2020-01-04 a     6
           b     7
2020-01-05 a     8
           b     9
2020-01-06 a    10
           b    11
2020-01-07 a    12
           b    13
2020-01-08 a    14
           b    15
2020-01-09 a    16
           b    17
2020-01-10 a    18
           b    19

Now I want to apply a rolling window on the date level, keeping the obs level separate.

I tried with no success obvious and simple (least surprise) commands like

  • df.rolling("7d", index="date") or
  • df.rolling("7d", on="date")

but finally the desired result is obtained by

df_r = df.groupby(by="obs", group_keys=False).rolling(
    "7d", on=df.index.levels[0]
).mean().sort_index()

print(df_r)

which gives me the correct result:

                  c1
date       obs      
2020-01-01 a     0.0
           b     1.0
2020-01-02 a     1.0
           b     2.0
2020-01-03 a     2.0
           b     3.0
2020-01-04 a     3.0
           b     4.0
2020-01-05 a     4.0
           b     5.0
2020-01-06 a     5.0
           b     6.0
2020-01-07 a     6.0
           b     7.0
2020-01-08 a     8.0
           b     9.0
2020-01-09 a    10.0
           b    11.0
2020-01-10 a    12.0
           b    13.0

It seams to me that this should be a quite common situation, so I was wondering if there is a simpler way to obtain the same results. By the way my solution is not very robust, because there are hidden assumptions on how the objects returned by groupby are indexed, which do not necessarily hold for a generic data frame.

Moreover the doc of the on parameter in rolling was almost incomprehensible to me: I'm still wondering if my usage rolling( "7d", on=df.index.levels[0]) is the intended one or not.

@miccoli miccoli added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jun 8, 2020
@daskol
Copy link

daskol commented Jun 8, 2020

Probably, there is a related issue #34617. By the ways, if data frame is indexed with integer index then one can use level keyword of .rolling(). This example is not consistent and just shows API features.

import pandas as pd
idx = pd.MultiIndex.from_product(
    [range(7), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))

print(df)

df_r = df.groupby(by="obs", group_keys=False).rolling(
    7, level=1
).mean().sort_index()

print(df_r)

@miccoli
Copy link
Contributor Author

miccoli commented Jun 8, 2020

In @daskol example the level keyword seems a noop, as any other arg passed to rolling.

from pandas.testing import assert_frame_equal

assert_frame_equal(
    df.groupby(by="obs", group_keys=False).rolling(7).mean(),
    df.groupby(by="obs", group_keys=False).rolling(7, level=1).mean(),
)
assert_frame_equal(
    df.groupby(by="obs", group_keys=False).rolling(7).mean(),
    df.groupby(by="obs", group_keys=False).rolling(7, foo=8, bar='8').mean(),
)

I'm still very confused.

@jbrockmendel jbrockmendel added Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@miccoli
Copy link
Contributor Author

miccoli commented Sep 7, 2020

A brief update. After updating pandas to v1.1.1 the above walk around fails with SEGFAULT!

>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0])
RollingGroupby [window=604800000000000,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', name='date', freq=None)]

but

>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()
Segmentation fault: 11

It seems to me that this issue should be labeled as BUG

pandas version info
INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.8.2.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.1.1
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : 0.5.0
scipy            : 1.5.2
sqlalchemy       : None
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.16.0
xlrd             : 1.2.0
xlwt             : None
numba            : 0.51.1

@miccoli
Copy link
Contributor Author

miccoli commented Sep 7, 2020

SEGFAULT possibly linked to #36018

@mroeschke mroeschke changed the title DOC: Is it possible to apply 'rolling' to a level in a multindex? ENH: Support rolling over one level of a MultiIndex Oct 5, 2020
@drlakecoder
Copy link

hi, I wonder if there were any update on this issue, in my situation, I am also unable to do rolling with non-integer windows when there is multi-index in the dataframe. With the simple example that was used in this thread, I would get different error like the following:

df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0])
Out[21]:
RollingGroupby [window=7d,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10'],
dtype='datetime64[ns]', name='date', freq='D'),method=single]

df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()
Traceback (most recent call last):

File "C:\Users\c_yy\AppData\Local\Temp/ipykernel_15476/3165508401.py", line 1, in
df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1855, in mean
return super().mean(*args, engine=engine, engine_kwargs=engine_kwargs, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1309, in mean
return self._apply(window_func, name="mean", **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 594, in _apply
result = super()._apply(

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 545, in _apply
return self._apply_blockwise(homogeneous_func, name)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 441, in _apply_blockwise
new_mgr = mgr.apply(hfunc, ignore_failures=True)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 325, in apply
applied = b.apply(f, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 381, in apply
result = func(self.values, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 431, in hfunc
res_values = homogeneous_func(values)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 535, in homogeneous_func
result = np.apply_along_axis(calc, self.axis, values)

File "<array_function internals>", line 5, in apply_along_axis

File "C:\Users\c_yy\anaconda3\lib\site-packages\numpy\lib\shape_base.py", line 379, in apply_along_axis
res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 521, in calc
start, end = window_indexer.get_window_bounds(

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\indexers.py", line 333, in get_window_bounds
index_array = self.index_array.take(ensure_platform_int(indices))

IndexError: index 10 is out of bounds for axis 0 with size 10

When using my own real data set (see below), I get different error:

rxx.loc['2015-12-01':'2017-12-29',:]
TotRet SpecRet ... volumedays IsTrading
date order_book_id ...
2015-12-01 000155.XSHE 0.014580 0.007206 ... 53.0 1.0
2015-12-02 000155.XSHE -0.004785 0.002132 ... 53.0 1.0
2015-12-03 000155.XSHE 0.034894 0.006537 ... 53.0 1.0
2015-12-04 000155.XSHE 0.050001 0.021692 ... 53.0 1.0
2015-12-07 000155.XSHE 0.049832 0.026999 ... 53.0 1.0
... ... ... ... ...
2017-12-25 000155.XSHE -0.049707 -0.053313 ... 5.0 1.0
2017-12-26 000155.XSHE -0.023771 -0.013894 ... 6.0 1.0
2017-12-27 000155.XSHE 0.006497 0.014989 ... 7.0 1.0
2017-12-28 000155.XSHE -0.030648 -0.026356 ... 8.0 1.0
2017-12-29 000155.XSHE 0.003335 -0.005979 ... 9.0 1.0

[104 rows x 112 columns]

rxx.groupby(level='order_book_id',group_keys=False).rolling('63d',on=rxx.index.levels[0]).mean()
***** ValueError: Length of values (567) does not match length of index (150)**

pd.show_versions()

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United States.1252

pandas : 1.3.4
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.2
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.08.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

@jpodivin
Copy link

I have to concur. This seems like a useful thing to have. Or if there is an alternative way to handle this use case it could be described in docs.

@d-otto
Copy link

d-otto commented Dec 6, 2022

I've run into this issue as well.

The documentation for df.rolling() states on= should be: "a column label or Index level on which to calculate the rolling window". My expectation was that I could pass the name of a multiindex level and .rolling() would group rows by unique index level values. This all might be better handled by .groupby(), but I'd love to see more integrated multiindex handling where convenient.

@pollackscience
Copy link

I have a similar issue, which again involves performing rolling actions on a dataframe with duplicate datetimes. I've gotten a complicated workaround using groupby and some other hacks, but a more native support would be greatly appreciated.

@Kalaweksh
Copy link

The groupby transform method has this exact functionality. The docs explain it pretty well. Although I agree that the windowing API should definitely natively have this functionality.

@claudia-buss
Copy link

@meta-ks
Copy link

meta-ks commented Nov 9, 2023

The doc still mentions this as funtctionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

@jdkworld
Copy link

jdkworld commented Jan 4, 2024

The doc still mentions this as funtctionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

This is still an issue

@zdwhite
Copy link

zdwhite commented May 9, 2024

The doc still mentions this as functionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

This is still an issue however there is a hacky(?) work around.

Consider the following multi indexed series

 test.info()
<class 'pandas.core.series.Series'>
MultiIndex: 5442161 entries, ('1A1', Timestamp('2024-04-07 17:00:00.431000+0000', tz='UTC')) to ('3B9', Timestamp('2024-04-> 28 16:59:58.468000+0000', tz='UTC'))
Series name: supplytemp
Non-Null Count    Dtype  
--------------    -----  
5442159 non-null  float64
dtypes: float64(1)

test.index.names
FrozenList(['site', 'TimeStamp'])

Test['rolling_avg_supply_temp'] = Test.groupby(level='site').rolling('1h',on=Test.index.get_level_values(level='TimeStamp'))['supplytemp'].mean()

While this is a tad more verbose then pandas just acknowledging the closest index this is a workable replacement. The descriptive statistics for the series line up with this method and the non indexed method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests