
Handle Parquet Files With Inconsistent Timestamp Units #1459

Closed
anliakho2 opened this issue Mar 17, 2022 · 6 comments · Fixed by #1558
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@anliakho2

Describe the bug
If a parquet file is written with timestamps in a time unit other than nanoseconds, reading the file back produces incorrect dates, whereas pandas reads the dates correctly.

To Reproduce
Generate parquet file as follows:
```python
import pandas as pd
import numpy as np

np.random.seed(0)

# create an array of 5 hourly timestamps starting at '2020-01-01'
rng = pd.date_range('2020-01-01', periods=5, freq='H')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})
df.to_parquet('data/myfile.parquet', coerce_timestamps='ms',
              allow_truncated_timestamps=True)
```
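To make the corruption concrete, here is a minimal stdlib-only sketch (not the reader's actual code path) of what happens when a raw i64 stored in milliseconds is reinterpreted as nanoseconds, which is the unit mix-up described in this issue:

```python
from datetime import datetime, timezone

# A timestamp stored as milliseconds since the Unix epoch
ts = datetime(2020, 1, 1, tzinfo=timezone.utc)
millis = int(ts.timestamp() * 1000)  # 1_577_836_800_000

# Correct: interpret the raw i64 as milliseconds
correct = datetime.fromtimestamp(millis / 1_000, tz=timezone.utc)

# Wrong: interpret the same i64 as nanoseconds, as a reader trusting
# the embedded nanosecond schema would
wrong = datetime.fromtimestamp(millis / 1_000_000_000, tz=timezone.utc)

print(correct)  # 2020-01-01 00:00:00+00:00
print(wrong)    # 1970-01-01 00:26:17.836800+00:00
```

The "dates" collapse back toward the 1970 epoch, which matches the distorted timestamps reported below.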

Expected behavior
Data is not corrupted and dates are read back correctly.


@jmahenriques

We experience the same issue (using DataFusion): the parquet metadata states microseconds, but it is inferred as nanoseconds, completely distorting the timestamps.

@tustvold
Contributor

tustvold commented Apr 13, 2022

So digging into this, the issue is that pandas is attaching an arrow schema that specifies nanosecond precision, whilst specifying the following as the parquet column description:

converted_type: TIMESTAMP_MILLIS,
logical_type: Some(
    TIMESTAMP(
        TimestampType {
            is_adjusted_to_u_t_c: false,
            unit: MILLIS(
                MilliSeconds,
            ),
        },
    ),
),

This is pretty wild because not only are the two schemas different, but TIMESTAMP_MILLIS values could overflow if converted to an arrow TimestampNanosecondArray, which uses an i64 to store its values. I'm not really sure why it does this.
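The overflow concern is easy to quantify: an i64 of nanoseconds only reaches the year 2262, while the same i64 interpreted as milliseconds covers a vastly larger range, so perfectly valid TIMESTAMP_MILLIS values cannot all be scaled up to nanoseconds. A quick stdlib-only check:

```python
from datetime import datetime, timezone

I64_MAX = 2**63 - 1

# Latest instant representable as nanoseconds-since-epoch in an i64
max_ns_instant = datetime.fromtimestamp(I64_MAX / 1_000_000_000,
                                        tz=timezone.utc)
print(max_ns_instant.year)  # 2262

# A millisecond value just past that limit overflows when converted
# to nanoseconds (multiplied by 1_000_000)
overflowing_millis = I64_MAX // 1_000_000 + 1
assert overflowing_millis * 1_000_000 > I64_MAX
```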

The issue doesn't occur with fastparquet, which uses the LogicalType support for nanosecond-precision i64 timestamps, but it also doesn't write an arrow schema so...

I think we can work around this; I need to work out exactly how, but imo this is a bug in pyarrow.

Edit: Adding flavor='spark' makes this work, likely because it stores the timestamps as Int96.

Edit 2: It would appear that pyarrow simply ignores the embedded schema - https://issues.apache.org/jira/browse/ARROW-2429
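For context on why the Int96 route sidesteps the problem: by the Impala/Spark convention, an INT96 timestamp packs nanoseconds-within-day into the first 8 bytes and a Julian day number into the last 4, so there is no unit annotation to get wrong. A hypothetical stdlib-only decoder illustrating that layout (a sketch, not the parquet crate's implementation):

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01

def decode_int96(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 timestamp: <nanos-in-day:i64><julian day:i32>."""
    nanos, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_UNIX_EPOCH
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days, microseconds=nanos // 1000))

# Round-trip 2020-01-01T00:00:00Z (day 18262 after the Unix epoch)
raw = struct.pack("<qi", 0, JULIAN_UNIX_EPOCH + 18262)
print(decode_int96(raw))  # 2020-01-01 00:00:00+00:00
```

As noted later in the thread, INT96 is deprecated in the Parquet spec, so this is a workaround rather than a recommendation.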

@tustvold
Contributor

tustvold commented Apr 13, 2022

I've filed https://issues.apache.org/jira/browse/ARROW-16184, as this feels like a bug in pyarrow. I will now work on adapting our schema inference logic to handle this particular case.

@tustvold tustvold changed the title Timestamps with time unit of MICROS or MILLIS are read incorrectly Handle Parquet Files With Inconsistent Timestamp Units Apr 13, 2022
@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Apr 13, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 13, 2022
Fix inference from null logical type (apache#1557)

Replace some `&Option<T>` with `Option<&T>` (apache#1556)
@tustvold
Contributor

I've created #1558, which adds an option to skip decoding the garbled arrow metadata. This seemed like the safest option for allowing people to read these files, without it coming back to bite us in a strange way down the line.

I would recommend using flavor='spark', engine='fastparquet', or version="2.4" when writing these files as a better workaround, but I appreciate that rewriting files may not be an option.

tustvold added a commit that referenced this issue Apr 14, 2022
* Add option to skip decoding arrow metadata from parquet (#1459)

Fix inference from null logical type (#1557)

Replace some `&Option<T>` with `Option<&T>` (#1556)

* Update parquet/src/arrow/arrow_reader.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Fmt

Co-authored-by: Andrew Lamb <[email protected]>
@alamb alamb added the parquet Changes to the parquet crate label Apr 15, 2022
@jorisvandenbossche
Member

Note that you need version="2.6" (not "2.4") for nanosecond support. I would personally recommend this over flavor="spark", because int96 timestamps are deprecated in the Parquet spec.

@tustvold
Contributor

tustvold commented Jun 1, 2022

Following on from #1663 and in particular #1682, these files can be read without issue, as the embedded arrow schema is no longer treated as authoritative. This has been released in arrow 15.
