Fix underlying type for DATE64 #35494

sam-lunt · 2023-05-08T19:46:53Z

When converting a column of type DATE64 to parquet, the underlying type is INT32 instead of INT64. This updates the type to be the correct width.

When converting a column of type `DATE64` to parquet, the underlying type is `INT32` instead of `INT64`

github-actions · 2023-05-08T19:47:13Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

wjones127 · 2023-05-08T19:58:54Z

I don't think this is quite right. There isn't a 1-1 correspondence between Arrow's physical types and Parquet's physical types. Parquet always stores dates as int32: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date. We map to Arrow's Date64 if that's what the original Arrow type was.

sam-lunt · 2023-05-09T00:37:14Z

I see that the unit tests failed, so I'd be inclined to agree that this change isn't quite right, haha.

That said, I do think there is an issue with the current implementation, since date64 doesn't survive a round trip from Arrow to Parquet and back:

import datetime
import sys
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as paq

print(f"{sys.version=}")
print(f"{pa.__version__=}")
print()

schema = pa.schema(
    [
        ("date32", pa.date32()),
        ("date64", pa.date64()),
    ]
)

print("schema:", schema, "", sep="\n")

df = pd.DataFrame(
    {
        "date32": datetime.date(2020, 3, 15),
        "date64": datetime.date(2020, 3, 15),
    },
    index=pd.RangeIndex(1),
)

rb = pa.RecordBatch.from_pandas(df, schema=schema)
print("rb.schema:", rb.schema, "", sep="\n")

w = paq.ParquetWriter("test.parquet", version="2.6", schema=schema)
w.write(rb)
w.close()

f = paq.ParquetFile("test.parquet")
t = f.read()
print("t.schema:", t.schema, sep="\n")

On my machine (running x86_64 Linux), I see the following output:

sys.version='3.11.3 (main, Apr  5 2023, 15:52:25) [GCC 12.2.1 20230201]'
pa.__version__='10.0.1'

schema:
date32: date32[day]
date64: date64[ms]

rb.schema:
date32: date32[day]
date64: date64[ms]
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 412

t.schema:
date32: date32[day]
date64: date32[day]

wjones127 · 2023-05-09T18:00:36Z

That said, I do think there is an issue with the current implementation, since date64 doesn't survive a round trip from Arrow to Parquet and back:

Ah that makes sense. I think this is related to #15032

We should probably improve it so that all types roundtrip.

I'm closing this PR, but if you are interested in making a PR to make more types round trip I would review it.

Fix underlying type for DATE64

6f5cd9e

When converting a column of type `DATE64` to parquet, the underlying type is `INT32` instead of `INT64`

sam-lunt requested a review from wjones127 as a code owner May 8, 2023 19:46

github-actions bot added Component: C++ Component: Parquet awaiting review Awaiting review labels May 8, 2023

wjones127 closed this May 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix underlying type for DATE64 #35494

Fix underlying type for DATE64 #35494

sam-lunt commented May 8, 2023

github-actions bot commented May 8, 2023

wjones127 commented May 8, 2023

sam-lunt commented May 9, 2023

wjones127 commented May 9, 2023

Fix underlying type for DATE64 #35494

Fix underlying type for DATE64 #35494

Conversation

sam-lunt commented May 8, 2023

github-actions bot commented May 8, 2023

wjones127 commented May 8, 2023

sam-lunt commented May 9, 2023

wjones127 commented May 9, 2023