Handle Parquet Files With Inconsistent Timestamp Units #1459
Comments
We experience the same issue (using datafusion): the parquet metadata states microseconds, but the column is inferred as nanoseconds, completely distorting the timestamps.
So digging into this, the issue is that pandas is attaching an arrow schema that specifies nanosecond precision, whilst specifying a non-nanosecond unit in the parquet column description.
This is pretty wild because not only are the two schemas different, but the issue doesn't occur with … I think we can work around this; I need to work out exactly how, but imo this is a bug in pyarrow.
Edit: It would appear that pyarrow simply ignores the embedded schema - https://issues.apache.org/jira/browse/ARROW-2429
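A minimal sketch of one way to inspect this mismatch with the Rust `parquet` crate, by dumping the parquet column descriptions alongside the `ARROW:SCHEMA` key-value entry that pyarrow embeds. The `SerializedFileReader` API shown is the current one and the file path is taken from the reproduction later in this issue, so treat both as illustrative:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path matches the reproduction script further down in this issue.
    let file = File::open("data/myfile.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata().file_metadata();

    // The parquet column descriptions, which carry the coerced timestamp unit.
    for column in meta.schema_descr().columns() {
        println!("{}: {:?}", column.path(), column.logical_type());
    }

    // The arrow schema pyarrow embeds as key-value metadata; this is the
    // schema that still claims nanosecond precision.
    if let Some(kvs) = meta.key_value_metadata() {
        for kv in kvs.iter().filter(|kv| kv.key == "ARROW:SCHEMA") {
            let len = kv.value.as_ref().map(|v| v.len()).unwrap_or(0);
            println!("embedded arrow schema present ({len} bytes, base64 encoded)");
        }
    }
    Ok(())
}
```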
I've filed https://issues.apache.org/jira/browse/ARROW-16184 as this feels like a bug in pyarrow. I will now work on adapting our schema inference logic to handle this particular case.
I've created #1558 which adds an option to skip decoding the garbled arrow metadata; this seemed like the safest option for allowing people to read these files without it coming back to bite us in a strange way down the line. I would recommend using it.
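A minimal sketch of how the option can be used to read such a file with the Rust `parquet` crate, assuming a recent release where the builder API is `ParquetRecordBatchReaderBuilder` and the option is exposed as `ArrowReaderOptions::with_skip_arrow_metadata` (the exact entry points have changed across releases):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data/myfile.parquet")?;

    // Ignore the embedded arrow schema so the timestamp unit is taken from
    // the parquet column description rather than the (nanosecond) arrow metadata.
    let options = ArrowReaderOptions::new().with_skip_arrow_metadata(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
    let reader = builder.build()?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows: {:?}", batch.num_rows(), batch.schema());
    }
    Ok(())
}
```

With the arrow metadata skipped, the schema is inferred purely from the parquet column descriptions, which is what sidesteps the inconsistency described above.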
Note that you need …
Describe the bug
If a parquet file is written with timestamps whose time unit is anything other than `ns`, reading such a file produces incorrect dates, whereas pandas reads the dates back correctly.

To Reproduce
Generate a parquet file as follows:

```python
import pandas as pd
import numpy as np

np.random.seed(0)

# create a range of 5 timestamps starting at '2020-01-01', one per hour
rng = pd.date_range('2020-01-01', periods=5, freq='H')

df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})
df.to_parquet('data/myfile.parquet', coerce_timestamps='ms', allow_truncated_timestamps=True)
```
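Reading the generated file back through the arrow reader is the path on which the incorrect dates were observed. A minimal sketch, assuming the current `ParquetRecordBatchReaderBuilder` API rather than whatever reader interface was current when the issue was filed:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data/myfile.parquet")?;

    // Default options: this is the read path on which the distorted
    // timestamps were reported, since the embedded arrow schema (nanoseconds)
    // disagrees with the parquet column description (milliseconds).
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    for batch in reader {
        let batch = batch?;
        println!("{:?}", batch.column(0));
    }
    Ok(())
}
```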
Expected behavior
Data is not corrupted and dates are read back correctly.