[Python] datetime shifted when using pyarrow.Table.from_pandas to load a pandas DataFrame containing datetime with timezone #20493

Closed
asfimport opened this issue Nov 10, 2022 · 3 comments

Comments

@asfimport

asfimport commented Nov 10, 2022

Problem:

When using pyarrow.Table.from_pandas to load a pandas DataFrame which contains a timestamp object with timezone information, the created Table object will shift the datetime, while still keeping the timezone information. Please see my scripts.

Reproduction script:

import pandas as pd
import pyarrow
ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
df = pd.DataFrame({"TS": [ts]})
table = pyarrow.Table.from_pandas(df)

print(df)
"""
                         TS
0 2022-10-21 22:46:17-07:00
"""

print(table)
"""
pyarrow.Table
TS: timestamp[ns, tz=America/Los_Angeles]
----
TS: [[2022-10-22 05:46:17.000000000]]
"""

Expected results:

The table should not shift the datetime when timezone information is provided.

Environment: MacOS M1, Python 3.8.13
Reporter: Adam Ling

Related issues:

Note: This issue was originally created as ARROW-18298. Please see the migration documentation for further details.

@asfimport

Miles Granger / @milesgranger:
I thought initially it was just how it was presented, as going back to pandas in this example from the table gives the "correct" representation of the value:

import pandas as pd
import pyarrow
ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
df = pd.DataFrame({"TS": [ts]})
table = pyarrow.Table.from_pandas(df)

print(df)
#                          TS
# 0 2022-10-21 22:46:17-07:00

print(table.to_pandas())
#                          TS
# 0 2022-10-21 22:46:17-07:00

However, placing mixed timezones makes the behavior more apparent in that it is coercing to the first timezone.

ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
df = pd.DataFrame({"TS": [ts, pd.Timestamp("2022-10-21 22:46:17", tz="UTC")]})
table = pyarrow.Table.from_pandas(df)

print(df)
#                           TS
# 0  2022-10-21 22:46:17-07:00
# 1  2022-10-21 22:46:17+00:00

print(table)
# pyarrow.Table
# TS: timestamp[us, tz=America/Los_Angeles]
# ----
# TS: [[2022-10-22 05:46:17.000000,2022-10-21 22:46:17.000000]]

print(table.to_pandas())
#                          TS
# 0 2022-10-21 22:46:17-07:00
# 1 2022-10-21 15:46:17-07:00

I believe a TimestampArray has to store every value in the array with the same timezone, which is why it does this. I'm not sure what the right solution is at the moment. In a way it seems to be doing us a favor by aligning the values to one timezone: the original mix of timezones gives that column an object dtype, while after the roundtrip the pandas Series gets the arguably better datetime64[ns, America/Los_Angeles] dtype.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:

> I thought initially it was just how it was presented, as going back to pandas in this example from the table gives the "correct" representation of the value:

Yes, that is the cause of the confusion in this case. The dates are not "wrong" after conversion to Arrow; they are just confusingly printed in UTC without any indication of this. We have ARROW-14567 to track this issue.
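To make the point above concrete, here is a minimal standard-library sketch (it deliberately does not call pyarrow). It assumes a fixed -07:00 offset as a stand-in for America/Los_Angeles, which was on Pacific Daylight Time on that date, and it models a timezone-aware timestamp as an offset from the Unix epoch in UTC with the timezone kept as a separate label. The names `PDT`, `epoch_seconds`, `as_utc` and `as_local` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed -07:00 offset stands in for America/Los_Angeles,
# which was on Pacific Daylight Time on 2022-10-21.
PDT = timezone(timedelta(hours=-7), "PDT")

local = datetime(2022, 10, 21, 22, 46, 17, tzinfo=PDT)

# Model of a timezone-aware timestamp: the instant is kept as an offset
# from the Unix epoch (UTC); the timezone is only a label next to it.
epoch_seconds = int(local.timestamp())

# Rendering the stored instant in UTC matches what print(table) showed.
as_utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
# Rendering it in the labelled timezone matches what print(df) showed.
as_local = datetime.fromtimestamp(epoch_seconds, tz=PDT)

print(as_utc.isoformat())    # 2022-10-22T05:46:17+00:00
print(as_local.isoformat())  # 2022-10-21T22:46:17-07:00
print(as_utc == as_local)    # True: same instant, two renderings
```

Both renderings come from the same stored integer, so nothing has been shifted; only the printed form differs.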

> However, placing mixed timezones makes the behavior more apparent in that it is coercing to the first timezone.

That's a separate issue (and one that doesn't come up that often; pandas, for example, also requires a single timezone per column if you have a datetime64 dtype). But indeed, Arrow's timestamp type requires a single timezone, so when we encounter multiple ones we currently coerce to the first one. I think it would be better to coerce to UTC instead (-> ARROW-5912).
There is some discussion about the use case of actually having multiple timezones in a single array at ARROW-16540.
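As a rough illustration of the two coercion choices discussed here, the following standard-library sketch re-expresses a mixed-timezone list first in the timezone of its first element and then in UTC. It is not the Arrow implementation; the helper `coerce` and the fixed -07:00 offset used for America/Los_Angeles are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed -07:00 offset standing in for America/Los_Angeles.
PDT = timezone(timedelta(hours=-7), "PDT")
UTC = timezone.utc

mixed = [
    datetime(2022, 10, 21, 22, 46, 17, tzinfo=PDT),
    datetime(2022, 10, 21, 22, 46, 17, tzinfo=UTC),
]


def coerce(values, tz):
    """Re-express every value in one timezone without changing the instant."""
    return [v.astimezone(tz) for v in values]


to_first = coerce(mixed, mixed[0].tzinfo)  # "coerce to the first timezone"
to_utc = coerce(mixed, UTC)                # "coerce to UTC"

print([v.isoformat() for v in to_first])
# ['2022-10-21T22:46:17-07:00', '2022-10-21T15:46:17-07:00']
print([v.isoformat() for v in to_utc])
# ['2022-10-22T05:46:17+00:00', '2022-10-21T22:46:17+00:00']
```

Either choice keeps every instant intact; the difference is only which single timezone label the column ends up with, and UTC does not depend on the order of the input rows.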

@asfimport

Miles Granger / @milesgranger:
Thanks for the report! :) Closing, as it is covered by the two other referenced issues.
