BUG: hash of Timestamp on fold=1 create a Segfault #33931

Closed
3 tasks done
hasB4K opened this issue May 1, 2020 · 5 comments · Fixed by #44282
Labels
Bug, Segfault (Non-Recoverable Error), Timezones (Timezone data dtype)

@hasB4K
Member

hasB4K commented May 1, 2020

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
america_chicago = "dateutil//usr/share/zoneinfo/America/Chicago"
transition_1 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, tz=america_chicago)
transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz=america_chicago)

print(transition_1, transition_2)
print(hash(transition_1))
print(hash(transition_2))

$ python3 -q -X faulthandler hash_bug.py
2013-11-03 01:00:00-05:00 2013-11-03 01:00:00-06:00
780959649129526403
Fatal Python error: Segmentation fault

Current thread 0x00007f908caf1740 (most recent call first):
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 1814 in _datetime_to_timestamp
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 717 in _find_last_transition
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 809 in _resolve_ambiguous_time
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 739 in _find_ttinfo
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 828 in utcoffset
  File "t.py", line 8 in <module>
[1]    2761996 segmentation fault (core dumped)  python3 -q -X faulthandler t.py

Problem description

It should return a correct hash value, and it should not segfault.
This causes problems when using a Timestamp as a dictionary key or a set member.
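
Because the failure is a segfault rather than a Python exception, reproducing it in-process kills the interpreter (and any test runner around it). A minimal sketch, assuming a POSIX system, of running a snippet in a child interpreter with faulthandler enabled so the crash can be observed from the outside. The helper names `run_isolated` and `died_from_signal` are hypothetical and not part of pandas or dateutil; the snippet below passes a harmless expression rather than the crashing one.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run `snippet` in a fresh interpreter with faulthandler enabled (hypothetical helper)."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def died_from_signal(proc: subprocess.CompletedProcess) -> bool:
    # On POSIX, a negative return code -N means the child was killed by signal N
    # (for example SIGSEGV), instead of exiting normally.
    return proc.returncode < 0


# A harmless snippet; the reproduction script from this issue could be passed instead.
ok = run_isolated("print(hash('x') == hash('x'))")
print(ok.returncode, ok.stdout.strip(), died_from_signal(ok))
```

With the reproduction script as the snippet, the faulthandler traceback would be expected in `proc.stderr`, and `died_from_signal` would distinguish a crash from an ordinary failing exit code.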

Expected Output

I would have expected the same behavior as datetime in Python:

import datetime as dt
from dateutil.tz import gettz

america_chicago = gettz("America/Chicago")
transition_1 = dt.datetime(year=2013, month=11, day=3, hour=1, minute=0, tzinfo=america_chicago)
transition_2 = dt.datetime(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tzinfo=america_chicago)

print(transition_1, transition_2)
print(hash(transition_1))
print(hash(transition_2))

2013-11-03 01:00:00-05:00 2013-11-03 01:00:00-06:00
780959649129526403
780959649129526403

So it seems that this bug is coming from pandas.
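
The identical hashes in the datetime output above follow from how the standard library treats fold: it is ignored when hashing and when comparing two datetimes that share the same tzinfo object. A small sketch of that behavior using only the standard library. `ToyCentral` is a hypothetical toy tzinfo written for this illustration (it is not dateutil's implementation) and is only meaningful inside the repeated 01:00 to 02:00 hour on 2013-11-03.

```python
import datetime as dt


class ToyCentral(dt.tzinfo):
    """Hypothetical toy zone: only valid inside the repeated hour on 2013-11-03 (US Central)."""

    def utcoffset(self, when):
        # fold=0 is the first pass through the hour (CDT, UTC-5);
        # fold=1 is the second pass (CST, UTC-6).
        return dt.timedelta(hours=-6 if when.fold else -5)

    def dst(self, when):
        return dt.timedelta(0) if when.fold else dt.timedelta(hours=1)

    def tzname(self, when):
        return "CST" if when.fold else "CDT"


tz = ToyCentral()
first = dt.datetime(2013, 11, 3, 1, 0, tzinfo=tz)
second = dt.datetime(2013, 11, 3, 1, 0, fold=1, tzinfo=tz)

print(first, second)                          # the two UTC offsets differ
print(hash(first) == hash(second))            # fold is ignored by datetime's hash
print(first == second)                        # and by same-tzinfo comparison
print(len({first, second}))                   # so a set keeps a single entry
```

This only describes the standard library's datetime; whether pandas Timestamp should match it exactly is a separate design question.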

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 1c88e6aff94cc9183909b7c110f554df42509073
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.5.13-arch2-1
Version          : #1 SMP PREEMPT Mon, 30 Mar 2020 20:42:41 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+1446.g1c88e6aff
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.3
Cython           : 0.29.16
pytest           : 5.4.1
hypothesis       : None
sphinx           : 3.0.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@hasB4K hasB4K added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 1, 2020
@mroeschke
Member

Seems specific to dateutil timezones

In [1]: transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz="America/Chicago")

In [2]: hash(transition_2)
Out[2]: -188403566336196362

# UTC okay though
In [1]: transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz="dateutil/UTC")

In [2]: hash(transition_2)
Out[2]: -174076368429567082

In [3]: transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz="dateutil/America/Chicago")

In [4]: hash(transition_2)
Segmentation fault: 11

@mroeschke mroeschke added the Segfault (Non-Recoverable Error) and Timezones (Timezone data dtype) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 2, 2020
@hasB4K
Member Author

hasB4K commented May 2, 2020

Seems specific to dateutil timezones

From the doc of #31563: "Support is limited to dateutil timezones as pytz doesn’t support fold."
Could this be related?

Also, I tried to figure out the reason for this bug. It seems that this initialization creates undefined behavior (value, freq, etc. become 0/NULL later on...):

dts.sec, dts.us, tz, fold=fold)

I tried to support the fold argument (without using the keyword argument in init) by adding it as a field to _Timestamp here:

int64_t value, nanosecond

Anyway, it kind of worked in the end, and the segfault was no longer present, but I had weird hash values (always the same one for different timestamps). I am not sure of the origin of the bug, and I don't really have the time to mess around too much with this. But hopefully this can help 🤷‍♂️.

Also, maybe @AlexKirko could have a look at that? He did a great job on #31563, and he might have more insight into what's going on here.

hasB4K added a commit to hasB4K/pandas that referenced this issue May 2, 2020
@hasB4K
Member Author

hasB4K commented May 2, 2020

I commented in this commit on the change that I made: hasB4K@b7200a2. It fixes the segfault (I have not added a proper test in this commit though), and it seems that the hash values are correct after all. The only difference is that when fold=1, the hash value differs from the hash value of dt.datetime with fold=1.

Like I said, I don't think I will have the time to dig more into this issue anytime soon, so I'm not planning to create a PR for now, but maybe my debugging will help. 🤷‍♂️

@dlopuch

dlopuch commented Apr 27, 2021

Found another way of hitting this (might be useful as a test case perhaps?): it happens on the fall DST boundary (not the spring one) when comparing DatetimeIndexes with mixed timezone sources, one of which comes from dateutil.

python 3.9.4, pandas 1.2.4, dateutil 2.8.1:

# pandas_segfault.py:

import pandas as pd
import dateutil

DATEUTIL_US_PAC = dateutil.tz.gettz('US/Pacific')

# df_1 uses pandas timezone string resolution:
df_1 = pd.DataFrame(
    'aaa',
    columns=['A'],

    # Spring US/Pacific DST: Works fine
    # index=pd.date_range(start='2021-03-14', end='2021-03-15', tz='US/Pacific', freq='H'),

    # Fall US/Pacific DST: SEGFAULT!!!
    index=pd.date_range(start='2020-11-01', end='2020-11-02', tz='US/Pacific', freq='H'),
)

# df_2 uses dateutil timezone objects
df_2 = pd.DataFrame(
    'bbb', 
    columns=['B'],

    # Spring US/Pacific DST: Works fine
    # index=pd.date_range(start='2021-03-14', end='2021-03-15', tz=DATEUTIL_US_PAC, freq='H'),

    # Fall US/Pacific DST: SEGFAULT!!!
    index=pd.date_range(start='2020-11-01', end='2020-11-02', tz=DATEUTIL_US_PAC, freq='H'),
)

# Here we do an operation that compares the two mixed-tz DateTimeIndexes
print(pd.concat([df_1, df_2], axis=1))

If both DataFrames use DATEUTIL_US_PAC or both use 'US/Pacific', it works fine. Otherwise, we get the segfault (I'm assuming this hashing bug is triggered when some datetime normalization is needed):

$ python3 -q -X faulthandler pandas_segfault.py
Fatal Python error: Segmentation fault

Current thread 0x0000000114b92e00 (most recent call first):
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 1814 in _datetime_to_timestamp
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 717 in _find_last_transition
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 809 in _resolve_ambiguous_time
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 739 in _find_ttinfo
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 828 in utcoffset
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 1769 in is_unique
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3170 in get_indexer
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3166 in get_indexer
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 516 in get_result
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 298 in concat
  File ".../Documents/developer/XXX/pandas_segfault.py", line 31 in <module>
[1]    26945 segmentation fault  python3 -q -X faulthandler pandas_segfault.py

@dlopuch

dlopuch commented Apr 28, 2021

fwiw, pandas 1.0.5 doesn't segfault, pandas 1.1.5 does.

(segfault on my DateTimeIndex-mixed-tz-comparison use-case, not the initial fold=1 script here... the fold kwarg isn't supported in 1.0.x)

Workaround for me is to downgrade to the 1.0.x branch.

4 participants