
ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp #10461

Closed
wants to merge 7 commits

Conversation

isichei
Contributor

@isichei isichei commented Jun 7, 2021

Added functionality in the C++ code to allow users to define the Arrow timestamp unit when reading Parquet INT96 types. This avoids the overflow bug when converting INT96 values whose dates are out of bounds for an Arrow nanosecond timestamp.

See the added test `TestArrowReadWrite.DownsampleDeprecatedInt96`, which demonstrates usage and expected results.

Main discussion of the changes is in [JIRA Issue ARROW-12096](https://issues.apache.org/jira/browse/ARROW-12096).
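
For illustration, here is a minimal sketch of how the new option might be used from C++. The property name `set_coerce_int96_timestamp_unit` on `parquet::ArrowReaderProperties` and the builder calls are assumptions for the sake of the example, not something spelled out in this description:

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Sketch only: assumes the option added here is exposed on
// parquet::ArrowReaderProperties as set_coerce_int96_timestamp_unit().
arrow::Status ReadInt96AsMicros(const std::string& path,
                                std::shared_ptr<arrow::Table>* out) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  parquet::ArrowReaderProperties arrow_props;
  // Read deprecated INT96 timestamps as microseconds instead of the default
  // nanoseconds, so far-future dates do not overflow the int64 timestamp.
  arrow_props.set_coerce_int96_timestamp_unit(arrow::TimeUnit::MICRO);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.properties(arrow_props)->Build(&reader));
  return reader->ReadTable(out);
}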

Member

@pitrou pitrou left a comment

Thanks for this PR. Here are a bunch of comments.

ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);
Member

I know you're essentially copying this from the test above, but nowadays we have ArrowFromJSON, which allows test data to be expressed much more easily and tersely (you can grep through the source tree to find examples).

You may also change the test above to use it, at the same time.
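
For reference, a minimal sketch of that approach (presumably `::arrow::ArrayFromJSON` from arrow/testing/gtest_util.h; the values below are placeholders, not the test's actual data):

#include <arrow/testing/gtest_util.h>

// Build a millisecond-timestamp test array directly from JSON instead of
// going through ArrayFromVector (placeholder values).
auto t_ms = ::arrow::timestamp(::arrow::TimeUnit::MILLI);
std::shared_ptr<::arrow::Array> a_ms = ::arrow::ArrayFromJSON(t_ms, "[0, 1000, null]");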

Contributor Author

Have rewritten tests to use a helper function. Hopefully cleaner.

ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*ex_result_s->schema(),
*result_s->schema(),
/*check_metadata=*/false));
ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result_s, *result_s));
Member

You can probably create a smaller helper function, method, or even a lambda, to avoid repeating those three lines below.
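
For example, something along these lines (a sketch; the lambda name is illustrative):

// Wrap the repeated assertions in a small lambda.
auto AssertTablesAndSchemaEqual = [](const ::arrow::Table& expected,
                                     const ::arrow::Table& actual) {
  ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*expected.schema(), *actual.schema(),
                                                     /*check_metadata=*/false));
  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(expected, actual));
};
// Usage: AssertTablesAndSchemaEqual(*ex_result_s, *result_s);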

Contributor Author

Will give it a go! I'm afraid it has been a long time since I wrote any C++ code, so the language is basically new to me at this point - hence the basic repetition in places.

Member

C++11 is quite a bit better than what was available before, if your experience was with C++98 :-)

Contributor Author

See previous comment thread.

@@ -353,7 +353,8 @@ Status TransferBool(RecordReader* reader, MemoryPool* pool, Datum* out) {
}

Status TransferInt96(RecordReader* reader, MemoryPool* pool,
const std::shared_ptr<DataType>& type, Datum* out) {
const std::shared_ptr<DataType>& type, Datum* out,
const ::arrow::TimeUnit::type& int96_arrow_time_unit) {
Member

You do not need to pass TimeUnit::type as a reference, since it's a cheap trivial type. Just pass it by value.
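
That is, something like (a sketch of the suggested signature):

// Pass the enum by value; TimeUnit::type is a cheap trivial type, so a
// const reference buys nothing here.
Status TransferInt96(RecordReader* reader, MemoryPool* pool,
                     const std::shared_ptr<DataType>& type, Datum* out,
                     ::arrow::TimeUnit::type int96_arrow_time_unit);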

Contributor Author

@isichei isichei Jun 13, 2021

Done.

default:
return Status::NotImplemented("TimeUnit not supported");
} break;
default:
Member

This default case doesn't seem useful, unless the compiler requires it?

Contributor Author

I don't think so. I was just copying how others had written switch statements in the existing codebase. Will remove 👍

Contributor Author

Done.

@@ -181,7 +181,8 @@ Result<std::shared_ptr<ArrowType>> FromInt64(const LogicalType& logical_type) {

Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
const LogicalType& logical_type,
int type_length) {
int type_length,
const ::arrow::TimeUnit::type& int96_arrow_time_unit) {
Member

Same comment here, with respect to passing by value vs. reference.

Contributor Author

Done.

@@ -211,14 +212,22 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
}
}

// ARROW-12096 -- Overloading functions with new input (setting default as NANO)
Member

This comment doesn't seem terribly informative. Can you remove it?

Contributor Author

Done.

descriptor.type_length(), ::arrow::TimeUnit::NANO);
}

// ARROW-12096 -- Exposing INT96 arrow type definition fromm parquet reader
Member

Same here.

Contributor Author

Done.

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
const LogicalType& logical_type,
int type_length,
const ::arrow::TimeUnit::type& int96_arrow_time_unit);
Member

I don't think this is the right place to pass int96-specific information. Perhaps this should be done at a higher level (for example in schema.cc?).

Contributor Author

Went back to review this and I'm not sure how to address it.

The only thing I can imagine would be to add to GetTypeForNode (in arrow/schema.cc) and override the standard storage_type if the Parquet physical_type is INT96 and the reader properties are not set to NANO? Let me know if I have misunderstood.

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
const schema::PrimitiveNode& primitive);

// ARROW-12096 Exposing int96 arrow timestamp unit definition
Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
Member

Same here.

Contributor Author

Done.


uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));

return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);
Member

There is some amount of repetition in those four functions that would be nice to avoid, IMHO.

Member

For example:

struct DecodedInt96 {
  uint64_t days_since_epoch;
  uint64_t nanoseconds;
};

static inline DecodedInt96 DecodeInt96Timestamp(const parquet::Int96& i96) {
  // We do the computations in the unsigned domain to avoid undefined behaviour
  // on overflow.
  DecodedInt96 result;
  result.days_since_epoch =
      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
  result.nanoseconds = 0;
  memcpy(&result.nanoseconds, &i96.value, sizeof(uint64_t));
  return result;
}

static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
  const auto decoded = DecodeInt96Timestamp(i96);
  return static_cast<int64_t>(decoded.days_since_epoch * kNanosecondsPerDay +
                              decoded.nanoseconds);
}

Contributor Author

@isichei isichei Jun 7, 2021

Yeah I agree.

I was concerned about changing the original function Int96GetNanoSeconds in case I introduced some unexpected change. Perhaps a halfway house is to replace the three (us, ms and s) INT96 functions with something like:

static inline int64_t Int96GetDownsampledTimestamp(const parquet::Int96& i96,
                                                   const ::arrow::TimeUnit::type unit) {
  // We do the computations in the unsigned domain to avoid undefined behaviour
  // on overflow.
  uint64_t days_since_epoch =
      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
  uint64_t nanoseconds = 0;
  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));

  // Scale both the day count and the intra-day part to the requested unit.
  uint64_t units_per_day;
  uint64_t intraday;
  switch (unit) {
    case ::arrow::TimeUnit::SECOND:
      units_per_day = static_cast<uint64_t>(kSecondsPerDay);
      intraday = nanoseconds / static_cast<uint64_t>(1000000000);
      break;
    case ::arrow::TimeUnit::MILLI:
      units_per_day = static_cast<uint64_t>(kSecondsPerDay) * 1000;
      intraday = nanoseconds / static_cast<uint64_t>(1000000);
      break;
    case ::arrow::TimeUnit::MICRO:
      units_per_day = static_cast<uint64_t>(kSecondsPerDay) * 1000000;
      intraday = nanoseconds / static_cast<uint64_t>(1000);
      break;
    default:  // NANO and anything else: keep full resolution
      units_per_day = static_cast<uint64_t>(kNanosecondsPerDay);
      intraday = nanoseconds;
      break;
  }
  return static_cast<int64_t>(days_since_epoch * units_per_day + intraday);
}

Then there is an if/else in the TransferInt96 function where the default NANO unit calls the unchanged Int96GetNanoSeconds, and otherwise it calls the downsampled version of the function?
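
For illustration, the dispatch could look roughly like this (a sketch; the variable names are assumptions based on the surrounding discussion, not actual code):

// Inside TransferInt96, per decoded value (names assumed for illustration):
if (int96_arrow_time_unit == ::arrow::TimeUnit::NANO) {
  data[i] = Int96GetNanoSeconds(values[i]);
} else {
  data[i] = Int96GetDownsampledTimestamp(values[i], int96_arrow_time_unit);
}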

Member

As you prefer really. As long as the chosen solution avoids repeating the same decoding code, it should be ok.

Contributor Author

Went with your example in the end, as it made far more sense IMO.

@pitrou
Member

pitrou commented Jun 7, 2021

@wesm @emkornfield What do you think about the functionality that is added here? Is it a reasonable burden for us to take on?

@emkornfield
Contributor

@wesm @emkornfield What do you think about the functionality that is added here? Is it a reasonable burden for us to take on?

It seems like a small enough change, so I'm okay with it. In general, though, since Int96 is deprecated, we should be looking very carefully at adding new functionality for it.

@pitrou pitrou changed the title ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp Jun 8, 2021
@isichei isichei marked this pull request as draft June 13, 2021 07:20
@isichei isichei marked this pull request as ready for review June 13, 2021 12:58
@isichei isichei requested a review from pitrou June 13, 2021 19:11
Member

@pitrou pitrou left a comment

Thanks for the updates! I've checked in a couple of minor changes and will merge if CI is green.

@jorisvandenbossche
Member

It will probably be useful to expose this in Python (or R) as well? (The original JIRA report also uses a pyarrow example.)

(To be clear, this can be done in a follow-up; just making sure we then create a JIRA for the follow-up task.)

@pitrou pitrou closed this in 85f192a Jun 15, 2021
@pitrou
Member

pitrou commented Jun 15, 2021

Thank you @isichei !

@isichei isichei deleted the ARROW-12096 branch June 20, 2021 18:28
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Closes apache#10461 from isichei/ARROW-12096

Lead-authored-by: Karik Isichei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
chaoyli pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
The Parquet format may use INT96 to store big datetime values like `9999-12-31 23:59:59`, which leads to overflow when cast to an int64 nanosecond timestamp.
The Parquet reader provides an option to set the unit of INT96 timestamps in apache/arrow#10461.
This PR adds a BE config `parquet_coerce_int96_timestamp_unit` to set the unit used when reading Parquet INT96 timestamps.
With the default value `MICRO`, the maximum MySQL datetime value `9999-12-31 23:59:59.999999` can be handled correctly.
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
chaoyli pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request May 12, 2023
meegoo pushed a commit to StarRocks/starrocks that referenced this pull request May 12, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 22, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request May 22, 2023
Moonm3n pushed a commit to Moonm3n/starrocks that referenced this pull request May 23, 2023
numbernumberone pushed a commit to numbernumberone/starrocks that referenced this pull request May 31, 2023
abc982627271 pushed a commit to abc982627271/starrocks that referenced this pull request Jun 5, 2023