BUG: hash of Timestamp on fold=1 create a Segfault #33931

hasB4K · 2020-05-01T20:33:51Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
america_chicago = "dateutil//usr/share/zoneinfo/America/Chicago"
transition_1 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, tz=america_chicago)
transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz=america_chicago)

print(transition_1, transition_2)
print(hash(transition_1))
print(hash(transition_2))

$ python3 -q -X faulthandler hash_bug.py

2013-11-03 01:00:00-05:00 2013-11-03 01:00:00-06:00
780959649129526403
Fatal Python error: Segmentation fault

Current thread 0x00007f908caf1740 (most recent call first):
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 1814 in _datetime_to_timestamp
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 717 in _find_last_transition
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 809 in _resolve_ambiguous_time
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 739 in _find_ttinfo
  File "/usr/lib/python3.8/site-packages/dateutil/tz/tz.py", line 828 in utcoffset
  File "t.py", line 8 in <module>
[1]    2761996 segmentation fault (core dumped)  python3 -q -X faulthandler t.py

Problem description

It should return a correct hash value and should not segfault.
This causes problems when a Timestamp is used as a dictionary key or stored in a set.

Expected Output

I would have expected the same behavior as datetime in Python:

import datetime as dt
from dateutil.tz import gettz

america_chicago = gettz("America/Chicago")
transition_1 = dt.datetime(year=2013, month=11, day=3, hour=1, minute=0, tzinfo=america_chicago)
transition_2 = dt.datetime(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tzinfo=america_chicago)

print(transition_1, transition_2)
print(hash(transition_1))
print(hash(transition_2))

2013-11-03 01:00:00-05:00 2013-11-03 01:00:00-06:00
780959649129526403
780959649129526403

So it seems that this bug is coming from pandas.
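
For context on the matching hashes above: as far as I understand it, CPython computes the hash of an aware datetime with fold reset to 0 (per PEP 495), so both sides of an ambiguous hour hash alike even though their UTC offsets differ. A minimal standard-library sketch, where FoldAwareCentral is a hypothetical toy tzinfo (not dateutil's implementation) that simply returns -05:00 for fold=0 and -06:00 for fold=1:

```python
import datetime as dt


class FoldAwareCentral(dt.tzinfo):
    """Hypothetical toy zone: CDT (-05:00) when fold=0, CST (-06:00) when fold=1."""

    def utcoffset(self, d):
        return dt.timedelta(hours=-6 if d.fold else -5)

    def dst(self, d):
        return dt.timedelta(0) if d.fold else dt.timedelta(hours=1)

    def tzname(self, d):
        return "CST" if d.fold else "CDT"


tz = FoldAwareCentral()
first = dt.datetime(2013, 11, 3, 1, 0, tzinfo=tz)
second = dt.datetime(2013, 11, 3, 1, 0, fold=1, tzinfo=tz)

# The two wall-clock readings map to different UTC offsets...
offsets_differ = first.utcoffset() != second.utcoffset()
# ...but hashing ignores fold, so the hashes agree.
hashes_match = hash(first) == hash(second)
print(offsets_differ, hashes_match)
```

This only illustrates the datetime side of the comparison; it says nothing about how Timestamp computes its hash.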

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : 1c88e6aff94cc9183909b7c110f554df42509073
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.5.13-arch2-1
Version          : #1 SMP PREEMPT Mon, 30 Mar 2020 20:42:41 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+1446.g1c88e6aff
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.3
Cython           : 0.29.16
pytest           : 5.4.1
hypothesis       : None
sphinx           : 3.0.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None


mroeschke · 2020-05-02T03:09:14Z

Seems specific to dateutil timezones

In [1]: transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz="America/Chicago")

In [2]: hash(transition_2)
Out[2]: -188403566336196362

# UTC okay though
In [1]: transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz="dateutil/UTC")

In [2]: hash(transition_2)
Out[2]: -174076368429567082

In [3]: transition_2 = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1, tz="dateutil/America/Chicago")

In [4]: hash(transition_2)
Segmentation fault: 11

hasB4K · 2020-05-02T03:45:29Z

Seems specific to dateutil timezones

From the doc of #31563:
Support is limited to dateutil timezones as pytz doesn’t support fold.
Could this be related?

I also tried to figure out the cause of this bug. It seems that this initialization triggers undefined behavior (value, freq, etc. become 0/NULL later on):

pandas/pandas/_libs/tslibs/timestamps.pyx

Line 48 in 911e19b

dts.sec, dts.us, tz, fold=fold)

I tried to support the fold argument (without passing it as a keyword argument to init) by adding it as a field to _Timestamp here:

pandas/pandas/_libs/tslibs/c_timestamp.pxd

Line 7 in 911e19b

int64_t value, nanosecond

Anyway, it kind of worked in the end, and the segfault was no longer present, but I got weird hash values (always the same one for different timestamps). I am not sure of the origin of the bug, and I don't really have the time to dig into this much further. Hopefully this helps 🤷‍♂️.

Also, maybe @AlexKirko could have a look at this? He did a great job on #31563, and he might have more insight into what's going on here.

hasB4K · 2020-05-02T04:06:52Z

I commented, in this commit, on the change I made: hasB4K@b7200a2. It fixes the segfault (I have not added a proper test in this commit though), and it seems that the hash values are correct after all. The only difference is that when fold=1, the hash value differs from the hash of the equivalent dt.datetime with fold=1.

Like I said, I don't think I will have the time to dig further into this issue anytime soon, so I'm not planning to create a PR for now, but maybe my debugging will help. 🤷‍♂️

dlopuch · 2021-04-27T23:26:27Z

Found another way of hitting this (might be useful as a test case, perhaps?): it happens on the fall DST boundary (not the spring one) when comparing DatetimeIndexes with mixed timezone sources, one of which comes from dateutil.

python 3.9.4, pandas 1.2.4, dateutil 2.8.1:

# pandas_segfault.py:

import pandas as pd
import dateutil.tz  # import the submodule explicitly rather than relying on pandas having loaded it

DATEUTIL_US_PAC = dateutil.tz.gettz('US/Pacific')

# df_1 uses pandas timezone string resolution:
df_1 = pd.DataFrame(
    'aaa',
    columns=['A'],

    # Spring US/Pacific DST: Works fine
    # index=pd.date_range(start='2021-03-14', end='2021-03-15', tz='US/Pacific', freq='H'),

    # Fall US/Pacific DST: SEGFAULT!!!
    index=pd.date_range(start='2020-11-01', end='2020-11-02', tz='US/Pacific', freq='H'),
)

# df_2 uses dateutil timezone objects
df_2 = pd.DataFrame(
    'bbb', 
    columns=['B'],

    # Spring US/Pacific DST: Works fine
    # index=pd.date_range(start='2021-03-14', end='2021-03-15', tz=DATEUTIL_US_PAC, freq='H'),

    # Fall US/Pacific DST: SEGFAULT!!!
    index=pd.date_range(start='2020-11-01', end='2020-11-02', tz=DATEUTIL_US_PAC, freq='H'),
)

# Here we do an operation that compares the two mixed-tz DatetimeIndexes
print(pd.concat([df_1, df_2], axis=1))

If both DataFrames use DATEUTIL_US_PAC or both use 'US/Pacific', it works fine. Otherwise, we get the segfault (I'm assuming this hashing bug is triggered when some datetime normalization is needed):

$ python3 -q -X faulthandler pandas_segfault.py
Fatal Python error: Segmentation fault

Current thread 0x0000000114b92e00 (most recent call first):
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 1814 in _datetime_to_timestamp
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 717 in _find_last_transition
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 809 in _resolve_ambiguous_time
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 739 in _find_ttinfo
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/dateutil/tz/tz.py", line 828 in utcoffset
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 1769 in is_unique
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3170 in get_indexer
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3166 in get_indexer
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 516 in get_result
  File ".../.local/share/virtualenvs/XXX-D1SziJ_u/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 298 in concat
  File ".../Documents/developer/XXX/pandas_segfault.py", line 31 in <module>
[1]    26945 segmentation fault  python3 -q -X faulthandler pandas_segfault.py
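
Since a segfault kills the interpreter, try/except in a test cannot catch it. One way to probe for it safely is to run the reproducer in a child process and inspect the exit status. A minimal sketch, assuming pandas and dateutil are installed in the child's environment; run_isolated and REPRO are hypothetical names made up for this illustration, not part of pandas:

```python
import subprocess
import sys

# Reproducer from earlier in this thread, using pandas' "dateutil/" tz prefix.
REPRO = """
import pandas as pd
ts = pd.Timestamp(year=2013, month=11, day=3, hour=1, minute=0, fold=1,
                  tz="dateutil/America/Chicago")
print(hash(ts))
"""


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return the child's exit status."""
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    status = run_isolated(REPRO)
    # On POSIX, a negative status means the child was killed by a signal
    # (-11 corresponds to SIGSEGV on Linux); 0 means the hash was computed.
    print("child exit status:", status)
```

The parent process survives either way, so the same helper could back a regression check without taking down the whole test run.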

dlopuch · 2021-04-28T17:29:28Z

fwiw, pandas 1.0.5 doesn't segfault, pandas 1.1.5 does.

(segfault on my DateTimeIndex-mixed-tz-comparison use-case, not the initial fold=1 script here... the fold kwarg isn't supported in 1.0.x)

Workaround for me is to downgrade to the 1.0.x branch.

hasB4K added the Bug and Needs Triage labels May 1, 2020

mroeschke added the Segfault and Timezones labels and removed the Needs Triage label May 2, 2020

hasB4K added a commit to hasB4K/pandas that referenced this issue May 2, 2020

BUG: fix segfault of fold (pandas-dev#33931) - WIP

b7200a2

mroeschke mentioned this issue Apr 7, 2021

BUG: Reindexing a DatetimeIndex during Daylight Saving Transition Causes Segmentation Fault with dateutil tz #40817

Closed

3 tasks

mzeitlin11 mentioned this issue Jul 1, 2021

BUG: segfault when using datetime.datetime.replace on Timestamp #42305

Closed

3 tasks

jonwiggins mentioned this issue Nov 2, 2021

BUG: Add fix for hashing timestamps with folds #44282

Merged

4 tasks

jreback added this to the 1.4 milestone Dec 31, 2021

jreback closed this as completed in #44282 Jan 4, 2022
