
Handle Parquet Files With Inconsistent Timestamp Units #1459

Closed
anliakho2 opened this issue Mar 17, 2022 · 6 comments · Fixed by #1558
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@anliakho2

Describe the bug
If a parquet file is written with timestamps in a time unit other than nanoseconds, reading the file back produces incorrect dates, whereas pandas reads the dates correctly.

To Reproduce
Generate parquet file as follows:
```python
import pandas as pd
import numpy as np

np.random.seed(0)

# create an array of 5 hourly timestamps starting at '2020-01-01'
rng = pd.date_range('2020-01-01', periods=5, freq='H')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})
df.to_parquet('data/myfile.parquet', coerce_timestamps='ms',
              allow_truncated_timestamps=True)
```
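To make the corruption concrete, here is a minimal stdlib-only sketch (not the reader's actual code path) of what happens when a raw i64 stored in milliseconds is reinterpreted as nanoseconds, which is the unit mix-up described in this issue:

```python
from datetime import datetime, timezone

# A timestamp stored as milliseconds since the Unix epoch
ts = datetime(2020, 1, 1, tzinfo=timezone.utc)
millis = int(ts.timestamp() * 1000)  # 1_577_836_800_000

# Correct: interpret the raw i64 as milliseconds
correct = datetime.fromtimestamp(millis / 1_000, tz=timezone.utc)

# Wrong: interpret the same i64 as nanoseconds, as a reader trusting
# the embedded nanosecond schema would
wrong = datetime.fromtimestamp(millis / 1_000_000_000, tz=timezone.utc)

print(correct)  # 2020-01-01 00:00:00+00:00
print(wrong)    # 1970-01-01 00:26:17.836800+00:00
```

The "dates" collapse back toward the 1970 epoch, which matches the distorted timestamps reported below.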

Expected behavior
Data is not corrupted and dates are read back correctly.


@jmahenriques

We experience the same issue (using DataFusion): the parquet metadata states microseconds, but it is inferred as nanoseconds, completely distorting the timestamps.

@tustvold
Contributor

tustvold commented Apr 13, 2022

So digging into this, the issue is that pandas is attaching an arrow schema that specifies nanosecond precision, whilst specifying the following as the parquet column description:

converted_type: TIMESTAMP_MILLIS,
logical_type: Some(
    TIMESTAMP(
        TimestampType {
            is_adjusted_to_u_t_c: false,
            unit: MILLIS(
                MilliSeconds,
            ),
        },
    ),
),

This is pretty wild because not only are the two schemas different, but TIMESTAMP_MILLIS values could overflow if converted to an arrow TimestampNanosecondArray, which uses an i64 to store its values. I'm not really sure why it does this.
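The overflow concern is easy to quantify: an i64 of nanoseconds only reaches the year 2262, while the same i64 interpreted as milliseconds covers a vastly larger range, so perfectly valid TIMESTAMP_MILLIS values cannot all be scaled up to nanoseconds. A quick stdlib-only check:

```python
from datetime import datetime, timezone

I64_MAX = 2**63 - 1

# Latest instant representable as nanoseconds-since-epoch in an i64
max_ns_instant = datetime.fromtimestamp(I64_MAX / 1_000_000_000,
                                        tz=timezone.utc)
print(max_ns_instant.year)  # 2262

# A millisecond value just past that limit overflows when converted
# to nanoseconds (multiplied by 1_000_000)
overflowing_millis = I64_MAX // 1_000_000 + 1
assert overflowing_millis * 1_000_000 > I64_MAX
```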

The issue doesn't occur with fastparquet, which uses the LogicalType support for nanosecond-precision i64 timestamps, but it also doesn't write an arrow schema so...

I think we can work around this; I need to work out exactly how, but imo this is a bug in pyarrow.

Edit: Adding flavor='spark' makes this work, likely because it stores the timestamps as Int96.

Edit 2: It would appear that pyarrow simply ignores the embedded schema - https://issues.apache.org/jira/browse/ARROW-2429
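For context on why the Int96 route sidesteps the problem: by the Impala/Spark convention, an INT96 timestamp packs nanoseconds-within-day into the first 8 bytes and a Julian day number into the last 4, so there is no unit annotation to get wrong. A hypothetical stdlib-only decoder illustrating that layout (a sketch, not the parquet crate's implementation):

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01

def decode_int96(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 timestamp: <nanos-in-day:i64><julian day:i32>."""
    nanos, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_UNIX_EPOCH
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days, microseconds=nanos // 1000))

# Round-trip 2020-01-01T00:00:00Z (day 18262 after the Unix epoch)
raw = struct.pack("<qi", 0, JULIAN_UNIX_EPOCH + 18262)
print(decode_int96(raw))  # 2020-01-01 00:00:00+00:00
```

As noted later in the thread, INT96 is deprecated in the Parquet spec, so this is a workaround rather than a recommendation.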

@tustvold
Contributor

tustvold commented Apr 13, 2022

I've filed https://issues.apache.org/jira/browse/ARROW-16184, as this feels like a bug in pyarrow. I will now work on adapting our schema inference logic to handle this particular case.

@tustvold tustvold changed the title Timestamps with time unit of MICROS or MILLIS are read incorrectly Handle Parquet Files With Inconsistent Timestamp Units Apr 13, 2022
@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Apr 13, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 13, 2022
Fix inference from null logical type (apache#1557)

Replace some `&Option<T>` with `Option<&T>` (apache#1556)
@tustvold
Contributor

I've created #1558, which adds an option to skip decoding the garbled arrow metadata. This seemed like the safest option for allowing people to read these files, without it coming back to bite us in a strange way down the line.

I would recommend using flavor='spark', engine='fastparquet', or version="2.4" when writing these files as a better workaround, but I appreciate that rewriting files may not be an option.

tustvold added a commit that referenced this issue Apr 14, 2022
* Add option to skip decoding arrow metadata from parquet (#1459)

Fix inference from null logical type (#1557)

Replace some `&Option<T>` with `Option<&T>` (#1556)

* Update parquet/src/arrow/arrow_reader.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Fmt

Co-authored-by: Andrew Lamb <[email protected]>
@alamb alamb added the parquet Changes to the parquet crate label Apr 15, 2022
@jorisvandenbossche
Member

Note that you need version="2.6" (not "2.4") for nanosecond support. I would personally recommend this over flavor="spark", because int96 timestamps are deprecated in the Parquet spec.

@tustvold
Contributor

tustvold commented Jun 1, 2022

Following on from #1663 and in particular #1682, these files can be read without issue, as the embedded arrow schema is no longer treated as authoritative. This has been released in arrow 15.
