
Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow #455

Closed · Tracked by #3148
houqp opened this issue Jun 13, 2021 · 8 comments

houqp (Member) commented Jun 13, 2021

Describe the bug

Here is a pandas dataframe with a nanosecond-timestamp Date index:

>>> hist.index
DatetimeIndex(['1986-03-13', '1986-03-14', '1986-03-17', '1986-03-18',
               '1986-03-19', '1986-03-20', '1986-03-21', '1986-03-24',
               '1986-03-25', '1986-03-26',
               ...
               '2021-05-28', '2021-06-01', '2021-06-02', '2021-06-03',
               '2021-06-04', '2021-06-07', '2021-06-08', '2021-06-09',
               '2021-06-10', '2021-06-11'],
              dtype='datetime64[ns]', name='Date', length=8885, freq=None)

When storing this dataframe in parquet 1.0 format, pyarrow writes the Date column in microsecond units, and pyarrow loads the Date column back with microsecond precision as well:

>>> from pyarrow.parquet import ParquetFile
>>> pp = ParquetFile("test_data/msft.parquet")
>>> pp.metadata.schema
<pyarrow._parquet.ParquetSchema object at 0x7f720d1bbac0>
required group field_id=0 schema {
  optional double field_id=1 Open;
  optional double field_id=2 High;
  optional double field_id=3 Low;
  optional double field_id=4 Close;
  optional int64 field_id=5 Volume;
  optional double field_id=6 Dividends;
  optional double field_id=7 StockSplits;
  optional int64 field_id=8 Date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
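
For context, here is a minimal sketch of that write-side coercion (the file name ns_test.parquet is hypothetical): parquet 1.0 logical types only cover millisecond and microsecond timestamps, so pyarrow casts the ns column down on write.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny frame with a ns-precision DatetimeIndex named "Date".
df = pd.DataFrame({"Close": [1.0]},
                  index=pd.DatetimeIndex(["2021-06-11"], name="Date"))
table = pa.Table.from_pandas(df)

# Parquet 1.0 cannot represent ns timestamps, so the column is coerced to us.
pq.write_table(table, "ns_test.parquet", version="1.0")
print(pq.ParquetFile("ns_test.parquet").metadata.schema)  # Date: microseconds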

But when the file is loaded with the arrow parquet crate, the column is incorrectly read as a nanosecond timestamp type.

To Reproduce

Here is a sample file to reproduce the issue: https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.

The file can be reproduced with the following Python code:

import yfinance as yf

# Fetch the full MSFT price history; the index is a ns-precision DatetimeIndex.
hist = yf.Ticker('MSFT').history(period="max")
# pandas delegates to pyarrow, which wrote parquet 1.0 by default at the time.
hist.to_parquet('msft.parquet')

Expected behavior

The Date column should be loaded with microsecond precision.

Additional context

The arrow parquet crate handles parquet 2.0 files without any issue.

Initially reported in roapi/roapi#42.

Here is the decoded IPC field from the 'ARROW:schema' metadata for the Date column, as seen by the arrow crate:

Field {
    name: Some(
        "Date",
    ),
    nullable: true,
    type_type: Timestamp,
    type_: Timestamp {
        unit: NANOSECOND,
        timezone: None,
    },
    dictionary: None,
    children: Some(
        [],
    ),
    custom_metadata: None,
}
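
For reference, the serialized IPC schema that the arrow crate decodes lives in the parquet key-value file metadata; here is a sketch of locating it with pyarrow, using the sample file above:

import pyarrow.parquet as pq

# FileMetaData.metadata is the key-value store; the Arrow schema sits under
# the 'ARROW:schema' key as a base64-encoded IPC message.
kv = pq.read_metadata("test_data/msft.parquet").metadata
print(b"ARROW:schema" in kv)  # True; decoding it yields the ns Field above
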
houqp added the bug label Jun 13, 2021
nevi-me (Contributor) commented Jun 14, 2021

I'd have expected the ns timestamp to be written as int96, since that is the legacy nanosecond timestamp format. I'll dig into this as well; perhaps when the format is 1.0 we should coerce the timestamp to a different resolution. I'd also see this as a pyarrow quirk that could be documented somewhere.

@emkornfield do you know what we should do on the Rust side to roundtrip the file correctly?

houqp (Member, Author) commented Jun 14, 2021

pyarrow's parquet writer has an option one can set to write ns timestamps as int96 for parquet 1.0, but it is turned off by default. It is quite strange that the decoded IPC schema comes back as a ns type even though the column is stored as µs.
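
For concreteness, a sketch of that opt-in (use_deprecated_int96_timestamps is the real pyarrow parameter; the file name is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

t = pa.table({"ts": pa.array([0], type=pa.timestamp("ns"))})
# Off by default: opt in to keep ns precision via the legacy int96 encoding.
pq.write_table(t, "int96.parquet", version="1.0",
               use_deprecated_int96_timestamps=True)
print(pq.ParquetFile("int96.parquet").metadata.schema)  # ts stored as int96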

emkornfield (Contributor) commented:

> @emkornfield do you know what we should do on the Rust side to roundtrip the file correctly?

Sorry, I would have to dig into the code to have a better understanding. In general we try to avoid int96 because it is a deprecated type. If this is an issue with pyarrow, can we open a separate bug to track that?

jorgecarleitao (Member) commented:

This may be related to which schema we give priority when converting: the parquet schema or the arrow schema. I would expect pyarrow to write the two consistently, though, so, as @nevi-me mentioned, an arrow schema in ns alongside a parquet schema in µs does seem odd.
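
A sketch of the two schemas in question, using the sample file above; per the later comments, pyarrow's C++ reader follows the parquet schema here rather than the embedded arrow one, so both prints should report microseconds:

import pyarrow.parquet as pq

pf = pq.ParquetFile("test_data/msft.parquet")
print(pf.metadata.schema)  # parquet schema: Date has timeUnit=microseconds
print(pf.schema_arrow)     # arrow-facing schema: Date as timestamp[us]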

emkornfield (Contributor) commented:

https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.cc#L215 is probably the relevant writing code for 1.0, so that might explain the discrepancy; let me see if I can dig up the reader code.

emkornfield (Contributor) commented:

Here is where C++ attempts to reuse the original metadata. It doesn't look like we would use the arrow schema in this case. I'm guessing the conversion back to the Pandas type happens correctly because microseconds would be cast back at the arrow-to-pandas layer.
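
A sketch of that cast, assuming the sample file from the report: the column comes off disk as timestamp[us], and the arrow-to-pandas layer restores the datetime64[ns] index:

import pandas as pd

hist2 = pd.read_parquet("test_data/msft.parquet")
print(hist2.index.dtype)  # datetime64[ns], despite microsecond storage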

emkornfield (Contributor) commented:

So I think there is probably a bug in pyarrow here, since we wouldn't round trip a pure timestamp[ns] type; we don't run into it coming from pandas because the casting just works.
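
A sketch of the pure-arrow roundtrip being described, with no pandas metadata in play (the file name is hypothetical); per this comment, nothing casts the column back to ns after the 1.0 write coerces it:

import pyarrow as pa
import pyarrow.parquet as pq

t = pa.table({"ts": pa.array([0], type=pa.timestamp("ns"))})
pq.write_table(t, "pure_ns.parquet", version="1.0")
print(pq.read_table("pure_ns.parquet").schema)  # ts: timestamp[us], not ns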

tustvold (Contributor) commented:

I believe this is a duplicate of #1459, which was fixed by #1682. Please feel free to reopen if I am mistaken.
