[C++] date64[ms] comes back as date32[day] after roundtrip to Parquet #15032

Open
Thomas-Hirsch opened this issue Dec 19, 2022 · 1 comment

Thomas-Hirsch commented Dec 19, 2022

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to build a tool to confirm that a given parquet file conforms to an expected schema with our metadata format. There appears to be a bug with how pyarrow converts dates, however.

I have a dummy dataset, called test.csv:
my_int  animal   my_email           my_datetime          my_date
16      NA       [email protected]  2013-01-14 15:54:20  1993-09-27
13      cat      [email protected]  2006-12-16 01:44:21  1989-04-08
13      dog      [email protected]  1972-06-19 11:22:59  2006-10-01
18      NA       [email protected]  2009-07-11 00:45:28  1992-07-24
10      fish     [email protected]  1970-09-11 22:26:13  2011-11-30
13      cat      [email protected]  2018-09-18 16:18:37  1979-01-22
16      dog      [email protected]  2002-08-26 10:25:03  1981-03-27
11      dog      [email protected]  1989-02-26 16:33:37  2012-06-19
19      chicken  [email protected]  1974-07-25 04:41:22  2003-11-06
11      chicken  [email protected]  1992-04-03 09:47:30  1972-07-06

I have a user-generated schema that resolves to this:

my_int: int64
animal: string
my_email: string
my_datetime: timestamp[s]
my_date: date64[ms]

However, the last column seems to be read by pyarrow as date32[day], and won't cast otherwise:

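import pyarrow as pa
import pyarrow.csv  # makes pa.csv available
import pyarrow.parquet as pq

# my_custom_schema is the user-generated schema shown above
table = pa.csv.read_csv("test.csv")
# cast() returns a new table rather than modifying in place, so keep the result
table = table.cast(my_custom_schema)
pq.write_table(table, "test.parquet")
table_arrow_schema = pq.read_schema("test.parquet")
table_arrow_schema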

gives

my_int: int64
animal: string
my_email: string
my_datetime: timestamp[ms]
my_date: date32[day]

Component(s)

Parquet, Python

@wjones127
Member

pyarrow.parquet doesn't round trip all types right now. The only other example I know of is dictionary types: they always come back with int32 indices, regardless of the original index type. See also: https://lists.apache.org/thread/rv29cwf4208jh73s0gyrzpw5l87pf7pb

date64 type only exists for compatibility with systems that use milliseconds to represent dates. That representation doesn't exist in the Parquet format. It's also not a sensible representation of a date, because the logical resolution is a day, so the milliseconds information isn't used.

But it looks like we handle this for nearly every other type, including Large* variants of string, binary, and list, different timestamp resolutions, and unsigned integers. So maybe it's worth fixing these last few types.

Python code to check type roundtripping
import pyarrow as pa
import pyarrow.parquet as pq
from decimal import Decimal

def check_parquet_roundtrip(arr):
    tab = pa.table({"x": arr})
    pq.write_table(tab, "test.parquet")
    schema = pq.read_schema("test.parquet")
    assert schema.field(0).type == arr.type

# These fail (each raises AssertionError when run on its own)
check_parquet_roundtrip(
    pa.array([1, 2, 3], pa.date64())
)

check_parquet_roundtrip(
    pa.array(["a", "b"], pa.dictionary(pa.int8(), pa.string()))
)


# All these work
check_parquet_roundtrip(
    pa.array([1, 2, 3], pa.timestamp('ms'))
)

check_parquet_roundtrip(
    pa.array([1, 2, 3], pa.timestamp('us'))
)

check_parquet_roundtrip(
    pa.array(["a", "b"], pa.dictionary(pa.int32(), pa.string()))
)

check_parquet_roundtrip(
    pa.array([Decimal("10.000")], pa.decimal128(19, 3))
)

check_parquet_roundtrip(
    pa.array([Decimal("10.000")], pa.decimal256(19, 3))
)

check_parquet_roundtrip(
    pa.array(["a", "b"], pa.large_string())
)

check_parquet_roundtrip(
    pa.array([["a", "b"]], pa.large_list(pa.large_string()))
)

check_parquet_roundtrip(
    pa.array([1, 2, 3], pa.uint32())
)

jorisvandenbossche changed the title from "pyarrow won't cast date32 to date64" to "[C++] date64[ms] comes back as date32[day] after roundtrip to Parquet" on Jul 25, 2024
3 participants