unable to write parquet file with UTC timestamp #1932

msalib · 2022-06-23T19:17:05Z

Describe the bug
I cannot figure out how to write a parquet file with a timestamp column that gets encoded as UTC. All my efforts produce files with naive timestamps and no UTC metadata.

To Reproduce

Consider this program: it writes a tiny parquet file to /tmp/q.parquet. But using both pqrs and pandas/pyarrow on the resulting file shows that there is no timezone present -- the metric_date column is a naive timestamp.

use std::sync::Arc;

use arrow::{
    array::{StringArray, TimestampMillisecondArray},
    datatypes::{DataType, Field, Schema, TimeUnit},
    record_batch::RecordBatch,
};
use parquet::{
    arrow::arrow_writer::ArrowWriter,
    file::properties::{WriterProperties, WriterVersion},
};

fn main() {
    //let tz = Some("UTC".to_owned());
    let tz = None;
    let fields = vec![
        Field::new(
            "metric_date",
            DataType::Timestamp(TimeUnit::Millisecond, tz.clone()),
            false,
        ),
        Field::new("my_id", DataType::Utf8, false),
    ];
    let schema = Arc::new(Schema::new(fields));

    let my_ids = Arc::new(StringArray::from(vec!["hi", "there"]));
    let dates = Arc::new(TimestampMillisecondArray::from_vec(
        vec![1234532523, 1234124],
        tz,
    ));
    let batch = RecordBatch::try_new(schema.clone(), vec![dates, my_ids]).unwrap();

    let f = std::fs::File::create("/tmp/q.parquet").unwrap();
    let props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    let mut writer = ArrowWriter::try_new(f, schema, Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    println!("Hello, world!");
}

Additional context
Tested using arrow="16.0.0" and parquet="16.0.0".

The text was updated successfully, but these errors were encountered:

msalib · 2022-06-23T20:29:42Z

Here are some things I've tried (none of them make any difference):

setting tz to None or Some("+00:00".to_owned())
using v1 vs v2
using milliseconds vs nanoseconds vs microseconds

tustvold · 2022-06-23T21:17:42Z

Could you expand a bit on what the expected behaviour is, as honestly cannot find any comprehensive document on how this is supposed to be handled. It's one of the many data model mismatches between arrow and parquet where it isn't really very clearly defined what is "correct" - #1666.

Ultimately Parquet does not have a native mechanism to encode timezone information in its schema, instead opting for something slightly different - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp. The arrow schema is embedded in the parquet file, but as documented in #1663 this cannot be treated as authoritative.

What I can say is the following:

The timezone is being stored in the embedded schema
As of parquet 15.0.0, in particular Fix Parquet Reader's Arrow Schema Inference #1682, parquet-rs roundtrips timezones correctly
pqrs is on parquet 12.0.0 where timezones did not roundtrip correctly
pyarrow appears to ignore the timezone stored within the arrow schema, I don't understand why

msalib · 2022-06-24T01:27:27Z

Sure! For me, expected behavior is that pandas will read a rust-produced parquet file with UTC timestamp columns and recognize that they're UTC. Like this:

import pandas as pd
assert str(pd.read_parquet("/tmp/q.parquet").dtypes.metric_date) == 'datetime64[ns, UTC]'
# and not 'datetime64[ns]'

Thank you so much for explaining that given the nature of the specification, this might not be feasible; I was going crazy. In part, because this used to work (I have a python unit test that invokes rust code and reads parquet files generated by rust). Up to version 14 of parquet+arrow, this worked fine. But as of version 15, the behavior changed.

msalib · 2022-06-24T01:33:42Z

This slightly simplified example shows different behavior when depending on arrow=14.0.0,parquet=14.0.0 versus arrow=15.0.0,parquet=15.0.0. The difference is visible both from pandas and pqrs schema.

use std::sync::Arc;

use arrow::{
    array::{StringArray, TimestampMillisecondArray},
    datatypes::{DataType, Field, Schema, TimeUnit},
    record_batch::RecordBatch,
};
use parquet::arrow::arrow_writer::ArrowWriter;

fn main() {
    let tz = Some("UTC".to_owned());
    let fields = vec![
        Field::new(
            "metric_date",
            DataType::Timestamp(TimeUnit::Millisecond, tz.clone()),
            false,
        ),
        Field::new("my_id", DataType::Utf8, false),
    ];
    let schema = Arc::new(Schema::new(fields));

    let my_ids = Arc::new(StringArray::from(vec!["hi", "there"]));
    let dates = Arc::new(TimestampMillisecondArray::from_vec(
        vec![1234532523, 1234124],
        tz,
    ));
    let batch = RecordBatch::try_new(schema.clone(), vec![dates, my_ids]).unwrap();

    let f = std::fs::File::create("/tmp/q.parquet").unwrap();

    let mut writer = ArrowWriter::try_new(f, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    println!("Hello, world!");
}

Given the unfortunate state of the specification, I understand that the changes in version 15 might be better in many ways and fix all manner of issues, but in this regard, they constitute a regression.

msalib · 2022-06-24T14:27:51Z

@tustvold Thank you so much for fixing this so quickly! I really appreciate it!

We're using rust+parquet+python+serverless for geospatial computing at work and arrow-rs' work has been incredibly helpful!

* Set is_adjusted_to_utc if any timezone set (#1932) * Fix roundtrip

msalib added the bug label Jun 23, 2022

tustvold mentioned this issue Jun 24, 2022

Cast Kernel Ignores Timezone #1936

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 24, 2022

Set adjusted to UTC if UTC timezone (apache#1932)

f8eee22

tustvold mentioned this issue Jun 24, 2022

Set adjusted to UTC if UTC timezone (#1932) #1937

Merged

tustvold closed this as completed in #1937 Jun 24, 2022

tustvold added a commit that referenced this issue Jun 24, 2022

Set adjusted to UTC if UTC timezone (#1932) (#1937)

d52be30

alamb added the parquet Changes to the parquet crate label Jun 24, 2022

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 28, 2022

Set is_adjusted_to_utc if any timezone set (apache#1932)

c614511

tustvold mentioned this issue Jun 28, 2022

Set is_adjusted_to_utc if any timezone set (#1932) #1953

Merged

tustvold added a commit that referenced this issue Jun 29, 2022

Set is_adjusted_to_utc if any timezone set (#1932) (#1953)

420d669

* Set is_adjusted_to_utc if any timezone set (#1932) * Fix roundtrip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unable to write parquet file with UTC timestamp #1932

unable to write parquet file with UTC timestamp #1932

msalib commented Jun 23, 2022

msalib commented Jun 23, 2022

tustvold commented Jun 23, 2022 •

edited

Loading

msalib commented Jun 24, 2022

msalib commented Jun 24, 2022

msalib commented Jun 24, 2022

unable to write parquet file with UTC timestamp #1932

unable to write parquet file with UTC timestamp #1932

Comments

msalib commented Jun 23, 2022

msalib commented Jun 23, 2022

tustvold commented Jun 23, 2022 • edited Loading

msalib commented Jun 24, 2022

msalib commented Jun 24, 2022

msalib commented Jun 24, 2022

tustvold commented Jun 23, 2022 •

edited

Loading