This is an interesting case: pyarrow writes Arrow's nanosecond-precision timestamps with Parquet's microsecond logical type, dividing the values by 1000 accordingly. So the file ends up with:

- Parquet's logical type: microseconds
- Arrow's logical type (in the schema's metadata): nanoseconds

The bug on our end is that we ignore Parquet's logical type when deserializing, which caused us to read Parquet's microseconds as Arrow's nanoseconds without converting them.
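A minimal sketch of that write path with pyarrow (the file name and sample value are mine; `coerce_timestamps="us"` makes the nanosecond-to-microsecond coercion explicit, which pyarrow also applies when nanoseconds are not representable in the target Parquet format version):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow column with nanosecond-precision timestamps.
table = pa.table({"ts": pa.array([1_600_000_000_123_456_000], type=pa.timestamp("ns"))})

# pyarrow divides the values by 1000 and writes them with Parquet's
# microsecond logical type, while the Arrow schema stored in the file
# metadata still says nanoseconds.
pq.write_table(table, "ts.parquet", coerce_timestamps="us")

# Parquet's logical type: should report a microsecond timestamp.
print(pq.read_metadata("ts.parquet").schema)
# Arrow's logical type, restored from the schema metadata: timestamp[ns].
print(pq.read_schema("ts.parquet").field("ts").type)

# pyarrow casts the microsecond values back to nanoseconds on read,
# so its own round trip is lossless here.
print(pq.read_table("ts.parquet").column("ts"))
```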
I had to read it twice, but I think I understand. :D So there are two logical types in a Parquet file? The Parquet logical type that was written, and the Arrow logical type of the destination.
With a simple dataset: if we write and read with pyarrow, the timestamp is correct. If we read and write with arrow2, the timestamp is also correct.
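For illustration, here is a minimal sketch of the rescaling that was missing when the two logical types disagree (the names `UNIT_FACTORS` and `rescale` are mine, not arrow2's actual API):

```python
# Factors to convert each time unit to counts per second.
UNIT_FACTORS = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}

def rescale(values, parquet_unit, arrow_unit):
    """Rescale raw integer timestamps from the unit Parquet declares to the
    unit the Arrow schema expects (e.g. us -> ns multiplies by 1000)."""
    num = UNIT_FACTORS[arrow_unit]
    den = UNIT_FACTORS[parquet_unit]
    if num >= den:
        return [v * (num // den) for v in values]
    return [v // (den // num) for v in values]

# A microsecond value stored by pyarrow must be multiplied by 1000
# when the Arrow schema says nanoseconds, not reinterpreted as-is.
micros = [1_600_000_000_123_456]
assert rescale(micros, "us", "ns") == [1_600_000_000_123_456_000]
```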