Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Segmentation-fault during a "custom" rolling operation #44470

Closed
2 of 3 tasks
erezinman opened this issue Nov 15, 2021 · 8 comments · Fixed by #44497
Closed
2 of 3 tasks

BUG: Segmentation-fault during a "custom" rolling operation #44470

erezinman opened this issue Nov 15, 2021 · 8 comments · Fixed by #44497
Labels
Bug Error Reporting Incorrect or improved errors from pandas Segfault Non-Recoverable Error Window rolling, ewma, expanding
Milestone

Comments

@erezinman
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

from pandas.core.window.indexers import BaseIndexer
from typing import Union, Tuple, Optional

import numpy as np
import pandas as pd


class RollingFixedOffsetVariableWindowIndexer(BaseIndexer):

    def get_window_bounds(self, num_values: int = 0, min_periods: Optional[int] = None, center: Optional[bool] = None,
                          closed: Optional[str] = None) -> Tuple[np.ndarray, np.ndarray]:
        # Every hour
        start = np.arange(0, num_values, 60 * 60)
        # Every hour + day
        end = start + 60 * 60 * 24

        return start, end


ser = pd.Series(np.arange(100000), index=pd.date_range(start='2020', freq='1s', periods=100000))
idxr = RollingFixedOffsetVariableWindowIndexer(window_size=60 * 60 * 24)
out = ser.rolling(idxr, min_periods=10).mean()

Issue Description

(Maybe related to #43267 ? Couldn't tell.)

I'm trying to create a custom BaseIndexer for a .rolling operation. The extraction of the window is quick and efficient, yet when I attempt to perform various operations on it I get the seg-fault error:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I have created an MWE for this behavior using a fixed-indexer (that carries the same name). The original indexer was supposed to produce a rolling-operation with overlaps, so the above indexer is similar in the sense that it creates a 1-day windows 1-hour apart for a one-second index.

Expected Behavior

Hmmm... Not to have a seg-fault error?

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-77-generic
Version : #86~18.04.1-Ubuntu SMP Fri Jun 18 01:23:22 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.18.5
pytz : 2018.3
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : 3.5.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : None
pandas_datareader: None
bs4 : 4.6.0
bottleneck : None
fsspec : 0.8.7
fastparquet : None
gcsfs : None
matplotlib : 3.1.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None
None

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

@erezinman erezinman added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 15, 2021
@erezinman erezinman changed the title BUG: Segmentation-fault during "custom" rolling opertation BUG: Segmentation-fault during "custom" rolling operation Nov 15, 2021
@erezinman erezinman changed the title BUG: Segmentation-fault during "custom" rolling operation BUG: Segmentation-fault during a "custom" rolling operation Nov 15, 2021
@CloseChoice CloseChoice added Segfault Non-Recoverable Error Window rolling, ewma, expanding labels Nov 16, 2021
@mroeschke
Copy link
Member

A stipulation of rolling operations is that the result is the same shape as the input; therefore, the arrays in get_window_bounds needs to return start and end arrays that are also the same shape as the input.

Instead of segfaulting, pandas should return an error checking that these lengths are equal.

@mroeschke mroeschke added Enhancement Error Reporting Incorrect or improved errors from pandas Bug and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Nov 17, 2021
@mroeschke mroeschke added this to the 1.4 milestone Nov 17, 2021
@erezinman
Copy link
Author

erezinman commented Nov 18, 2021

@mroeschke
Hi,
first, please document this issue. Maybe in the BaseIndexer.get_window_bounds you could add this assumption to the doc-string?

Also, when a rolling operation that requires equal window-sizes, you can check that it does satisfy that, and raise an error only if needed. Maybe .sum requires it, but .apply surely doesn't.

Moreover - why is this a requirement? I mean .groupby, for example, doesn't have this requirement, so why isn't the underlying mechanism more similar?

@jreback
Copy link
Contributor

jreback commented Nov 18, 2021

@erezinman there is already #44497

@erezinman
Copy link
Author

Yes, thank you. I see it takes care of my first statement.

@mroeschke
Copy link
Member

Moreover - why is this a requirement? I mean .groupby, for example, doesn't have this requirement, so why isn't the underlying mechanism more similar?

This is to achieve API output consistency with native rolling operations which also returns a result is the same shape as the input. https://pandas.pydata.org/docs/user_guide/window.html

If this API breaking change was made, it's also unclear what the corresponding index (or column with axis=1) label should be associated with a window result that is different than the shape of the input.

@erezinman
Copy link
Author

One way is to solve this is to allow returning the labels (much like in .groupby), but that might be too big a change (even though it can be made backward-compatible so it's mostly a matter of an internal change that you can test).

Another possibility is to just have the first/center item in the window (depending on the center argument) be the label. This would just result in non-unique indices which could occur only in future flows. That shouldn't be much of a change.

What do you think?

@mroeschke
Copy link
Member

The second sound reasonable, but currently the code needs significant refactoring to support outputs that are not the same shape as the input.

@mroeschke
Copy link
Member

As a work around, you can always iterate over the window object and use concat.

e.g.
result = pd.concat([window.mean() for window in df.rolling(MyIndexer)])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Error Reporting Incorrect or improved errors from pandas Segfault Non-Recoverable Error Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants