ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462
@@ -107,10 +107,6 @@ def _check_series_roundtrip(s, type_=None, expected_pa_type=None):
    assert arr.type == expected_pa_type

    result = pd.Series(arr.to_pandas(), name=s.name)
    if pa.types.is_timestamp(arr.type) and arr.type.tz is not None:
        result = (result.dt.tz_localize('utc')
                  .dt.tz_convert(arr.type.tz))
The Series roundtrip with a timezone was actually already tested in this file, but this fixup of the result was masking the issue. So before, it was expected that Array.to_pandas would lose the timezone (maybe because this method could also return NumPy arrays in some cases). But I don't see a reason not to keep it, certainly now that it always returns a Series and now that Column, which retained the timezone, is gone.
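To illustrate how a fixup like this can hide a lost timezone, here is a minimal, library-free sketch. It uses the standard-library `datetime` as a stand-in for the pandas Series in the test; `TARGET_TZ`, `fixup`, `buggy_result`, and the other names are hypothetical and not part of the pyarrow test suite.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for arr.type.tz; a fixed offset avoids needing a tz database.
TARGET_TZ = timezone(timedelta(hours=-8))


def fixup(value, tz):
    """Mimic the removed test fixup: treat a naive value as UTC, then convert to tz."""
    return value.replace(tzinfo=timezone.utc).astimezone(tz)


# What a timezone-preserving conversion should return: a tz-aware value.
expected = datetime(1970, 1, 1, tzinfo=timezone.utc).astimezone(TARGET_TZ)

# What a timezone-dropping conversion returns: the same UTC wall time, but naive.
buggy_result = datetime(1970, 1, 1)

# With the fixup applied, the comparison passes even though the timezone was lost.
masked_ok = fixup(buggy_result, TARGET_TZ) == expected

# Without the fixup, checking the result directly exposes the missing timezone.
unmasked_ok = buggy_result.tzinfo is not None
```

Here `masked_ok` is true while `unmasked_ok` is false, which is the situation described above: the test re-attached the timezone itself, so it could not notice that the conversion had dropped it.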
Codecov Report
@@ Coverage Diff @@
## master #5462 +/- ##
===========================================
- Coverage 88.62% 66.21% -22.42%
===========================================
Files 958 505 -453
Lines 127421 69783 -57638
Branches 1495 0 -1495
===========================================
- Hits 112926 46206 -66720
- Misses 14130 23577 +9447
+ Partials 365 0 -365
+1. Thanks @jorisvandenbossche!
LGTM, thanks for the quick fix!
It looks like this didn't completely solve the issue. When there is a
>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> table = pa.Table.from_arrays([a], ['a'])
>>> c = table.column(0)
>>> c
<pyarrow.lib.ChunkedArray object at 0x7f1eecf51318>
[
  [
    1970-01-01 00:00:00.000001
  ]
]
>>> c.to_pandas()
0   1970-01-01 00:00:00.000001
Name: a, dtype: datetime64[ns]
My fault, I should have also mentioned
@jorisvandenbossche is it just a matter of applying the same fix to the ChunkedArray method?
Ah, yes, didn't think of ChunkedArray. The fix is probably the same, yes. Will have a look.
Applied the fix for ChunkedArray in #5471
Follow-up on #5462 to also apply this fix for ChunkedArray.

Closes #5471 from jorisvandenbossche/ARROW-6652-chunked-array-timezone and squashes the following commits:

89d0044 <Joris Van den Bossche> add helper function
5122451 <Joris Van den Bossche> ARROW-6652: Fix ChunkedArray.to_pandas to retain timezone

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
https://issues.apache.org/jira/browse/ARROW-6652