ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462

jorisvandenbossche · 2019-09-21T18:01:46Z

https://issues.apache.org/jira/browse/ARROW-6652

jorisvandenbossche · 2019-09-21T18:04:38Z

python/pyarrow/tests/test_pandas.py

@@ -107,10 +107,6 @@ def _check_series_roundtrip(s, type_=None, expected_pa_type=None):
        assert arr.type == expected_pa_type

    result = pd.Series(arr.to_pandas(), name=s.name)
-    if pa.types.is_timestamp(arr.type) and arr.type.tz is not None:
-        result = (result.dt.tz_localize('utc')
-                  .dt.tz_convert(arr.type.tz))


The series roundtrip with timezone was actually already tested in this file, but this fixup of the result was masking the issue. So before, it was expected that Array.to_pandas would loose the timezone (maybe because this method could also return numpy arrays in some cases). But I don't see a reason to not keep it (certainly now it always returns Series, and now Column is gone, which retained the timezone)

codecov-io · 2019-09-21T18:36:34Z

Codecov Report

Merging #5462 into master will decrease coverage by 22.41%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #5462       +/-   ##
===========================================
- Coverage   88.62%   66.21%   -22.42%     
===========================================
  Files         958      505      -453     
  Lines      127421    69783    -57638     
  Branches     1495        0     -1495     
===========================================
- Hits       112926    46206    -66720     
- Misses      14130    23577     +9447     
+ Partials      365        0      -365

Impacted Files	Coverage Δ
python/pyarrow/tests/test_pandas.py	`94.57% <ø> (-0.01%)`	⬇️
python/pyarrow/array.pxi	`79.24% <100%> (+0.29%)`	⬆️
python/pyarrow/tests/test_array.py	`93.54% <100%> (+0.02%)`	⬆️
cpp/src/arrow/util/memory.h	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/date_utils.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/util/memory.cc	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/decimal_type_util.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/compute/logical_type.h	`0% <0%> (-100%)`	⬇️
cpp/src/parquet/hasher.h	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/basic_decimal_scalar.h	`0% <0%> (-100%)`	⬇️
... and 703 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2785d3...0e15f8f. Read the comment docs.

wesm

+1. Thanks @jorisvandenbossche!

BryanCutler · 2019-09-22T04:52:34Z

LGTM, thanks for the quick fix!

BryanCutler · 2019-09-22T07:16:11Z

It looks like this didn't completely solve the issue. When there is a ChunkedArray it still removes the timezone:

>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> table = pa.Table.from_arrays([a], ['a'])
>>> c = table.column(0)
>>> c
<pyarrow.lib.ChunkedArray object at 0x7f1eecf51318>
[
  [
    1970-01-01 00:00:00.000001
  ]
]
>>> c.to_pandas()
0   1970-01-01 00:00:00.000001
Name: a, dtype: datetime64[ns]

My fault, I should have also mentioned ChunkedArray in the JIRA, but I assumed it was the same code path.

BryanCutler · 2019-09-22T07:17:41Z

@jorisvandenbossche is it just a matter of applying the same fix to the ChunkedArray method?

jorisvandenbossche · 2019-09-23T06:26:35Z

Ah, yes, didn't think of ChunkedArray. The fix is probably the same yes. Will have a look.

jorisvandenbossche · 2019-09-23T06:55:13Z

Applied the fix for ChunkedArray in #5471

Follow-up on #5462 to also apply this fix for ChunkedArray. Closes #5471 from jorisvandenbossche/ARROW-6652-chunked-array-timezone and squashes the following commits: 89d0044 <Joris Van den Bossche> add helper function 5122451 <Joris Van den Bossche> ARROW-6652: Fix ChunkedArray.to_pandas to retain timezone Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Wes McKinney <[email protected]>

ARROW-6652: [Python] Fix Array.to_pandas to retain timezone

0e15f8f

jorisvandenbossche commented Sep 21, 2019

View reviewed changes

wesm approved these changes Sep 22, 2019

View reviewed changes

wesm closed this in b95c9bf Sep 22, 2019

jorisvandenbossche deleted the ARROW-6652-array-to-pandas-timezone branch September 23, 2019 06:25

jorisvandenbossche mentioned this pull request Sep 23, 2019

ARROW-6652: [Python] Fix ChunkedArray.to_pandas to retain timezone #5471

Closed

asfimport mentioned this pull request Sep 24, 2019

[Python] to_pandas conversion removes timezone from type #23003

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462

ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462

jorisvandenbossche commented Sep 21, 2019

jorisvandenbossche Sep 21, 2019

codecov-io commented Sep 21, 2019

wesm left a comment

BryanCutler commented Sep 22, 2019

BryanCutler commented Sep 22, 2019

BryanCutler commented Sep 22, 2019

jorisvandenbossche commented Sep 23, 2019

jorisvandenbossche commented Sep 23, 2019

ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462

ARROW-6652: [Python] Fix Array.to_pandas to retain timezone #5462

Conversation

jorisvandenbossche commented Sep 21, 2019

jorisvandenbossche Sep 21, 2019

Choose a reason for hiding this comment

codecov-io commented Sep 21, 2019

Codecov Report

wesm left a comment

Choose a reason for hiding this comment

BryanCutler commented Sep 22, 2019

BryanCutler commented Sep 22, 2019

BryanCutler commented Sep 22, 2019

jorisvandenbossche commented Sep 23, 2019

jorisvandenbossche commented Sep 23, 2019