Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462

Conversation

jorisvandenbossche
Copy link
Member

@@ -107,10 +107,6 @@ def _check_series_roundtrip(s, type_=None, expected_pa_type=None):
assert arr.type == expected_pa_type

result = pd.Series(arr.to_pandas(), name=s.name)
if pa.types.is_timestamp(arr.type) and arr.type.tz is not None:
result = (result.dt.tz_localize('utc')
.dt.tz_convert(arr.type.tz))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The series roundtrip with timezone was actually already tested in this file, but this fixup of the result was masking the issue. So before, it was expected that Array.to_pandas would loose the timezone (maybe because this method could also return numpy arrays in some cases). But I don't see a reason to not keep it (certainly now it always returns Series, and now Column is gone, which retained the timezone)

@codecov-io
Copy link

Codecov Report

Merging #5462 into master will decrease coverage by 22.41%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5462       +/-   ##
===========================================
- Coverage   88.62%   66.21%   -22.42%     
===========================================
  Files         958      505      -453     
  Lines      127421    69783    -57638     
  Branches     1495        0     -1495     
===========================================
- Hits       112926    46206    -66720     
- Misses      14130    23577     +9447     
+ Partials      365        0      -365
Impacted Files Coverage Δ
python/pyarrow/tests/test_pandas.py 94.57% <ø> (-0.01%) ⬇️
python/pyarrow/array.pxi 79.24% <100%> (+0.29%) ⬆️
python/pyarrow/tests/test_array.py 93.54% <100%> (+0.02%) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/logical_type.h 0% <0%> (-100%) ⬇️
cpp/src/parquet/hasher.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/basic_decimal_scalar.h 0% <0%> (-100%) ⬇️
... and 703 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2785d3...0e15f8f. Read the comment docs.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thanks @jorisvandenbossche!

@wesm wesm closed this in b95c9bf Sep 22, 2019
@BryanCutler
Copy link
Member

LGTM, thanks for the quick fix!

@BryanCutler
Copy link
Member

It looks like this didn't completely solve the issue. When there is a ChunkedArray it still removes the timezone:

>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> table = pa.Table.from_arrays([a], ['a'])
>>> c = table.column(0)
>>> c
<pyarrow.lib.ChunkedArray object at 0x7f1eecf51318>
[
  [
    1970-01-01 00:00:00.000001
  ]
]
>>> c.to_pandas()
0   1970-01-01 00:00:00.000001
Name: a, dtype: datetime64[ns]

My fault, I should have also mentioned ChunkedArray in the JIRA, but I assumed it was the same code path.

@BryanCutler
Copy link
Member

@jorisvandenbossche is it just a matter of applying the same fix to the ChunkedArray method?

@jorisvandenbossche jorisvandenbossche deleted the ARROW-6652-array-to-pandas-timezone branch September 23, 2019 06:25
@jorisvandenbossche
Copy link
Member Author

Ah, yes, didn't think of ChunkedArray. The fix is probably the same yes. Will have a look.

@jorisvandenbossche
Copy link
Member Author

Applied the fix for ChunkedArray in #5471

wesm pushed a commit that referenced this pull request Sep 24, 2019
Follow-up on #5462 to also apply this fix for ChunkedArray.

Closes #5471 from jorisvandenbossche/ARROW-6652-chunked-array-timezone and squashes the following commits:

89d0044 <Joris Van den Bossche> add helper function
5122451 <Joris Van den Bossche> ARROW-6652:  Fix ChunkedArray.to_pandas to retain timezone

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants