Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix underlying type for DATE64 #35494

Closed
wants to merge 1 commit into from

Conversation

sam-lunt
Copy link

@sam-lunt sam-lunt commented May 8, 2023

When converting a column of type DATE64 to parquet, the underlying type is INT32 instead of INT64. This updates the type to be the correct width.

When converting a column of type `DATE64` to parquet, the underlying type is `INT32` instead of `INT64`
@sam-lunt sam-lunt requested a review from wjones127 as a code owner May 8, 2023 19:46
@github-actions
Copy link

github-actions bot commented May 8, 2023

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@wjones127
Copy link
Member

I don't think this is quite right. There isn't a 1-1 correspondence between Arrow's physical types and Parquet's physical types. Parquet always stores dates as int32: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date. We map to Arrow's Date64 if that's what the original Arrow type was.

@sam-lunt
Copy link
Author

sam-lunt commented May 9, 2023

I see that the unit tests failed, so I'd be inclined to agree that this change isn't quite right, haha.

That said, I do think there is an issue with the current implementation, since date64 doesn't survive a round trip from Arrow to Parquet and back:

import datetime
import sys
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as paq

print(f"{sys.version=}")
print(f"{pa.__version__=}")
print()

schema = pa.schema(
    [
        ("date32", pa.date32()),
        ("date64", pa.date64()),
    ]
)

print("schema:", schema, "", sep="\n")

df = pd.DataFrame(
    {
        "date32": datetime.date(2020, 3, 15),
        "date64": datetime.date(2020, 3, 15),
    },
    index=pd.RangeIndex(1),
)

rb = pa.RecordBatch.from_pandas(df, schema=schema)
print("rb.schema:", rb.schema, "", sep="\n")

w = paq.ParquetWriter("test.parquet", version="2.6", schema=schema)
w.write(rb)
w.close()

f = paq.ParquetFile("test.parquet")
t = f.read()
print("t.schema:", t.schema, sep="\n")

On my machine (running x86_64 Linux), I see the following output:

sys.version='3.11.3 (main, Apr  5 2023, 15:52:25) [GCC 12.2.1 20230201]'
pa.__version__='10.0.1'

schema:
date32: date32[day]
date64: date64[ms]

rb.schema:
date32: date32[day]
date64: date64[ms]
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 412

t.schema:
date32: date32[day]
date64: date32[day]

@wjones127
Copy link
Member

That said, I do think there is an issue with the current implementation, since date64 doesn't survive a round trip from Arrow to Parquet and back:

Ah that makes sense. I think this is related to #15032

We should probably improve it so that all types roundtrip.

I'm closing this PR, but if you are interested in making a PR to make more types round trip I would review it.

@wjones127 wjones127 closed this May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants