
ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp #10461

Closed
wants to merge 7 commits

Conversation

isichei
Contributor

@isichei isichei commented Jun 7, 2021

Added functionality in the C++ code to allow users to define the Arrow timestamp unit when reading Parquet INT96 types. This avoids the overflow bug when converting INT96 values whose dates are out of bounds for an Arrow nanosecond timestamp.

See the added test `TestArrowReadWrite.DownsampleDeprecatedInt96`, which demonstrates usage and expected results.

Main discussion of the changes is in [JIRA Issue ARROW-12096](https://issues.apache.org/jira/browse/ARROW-12096).
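
For illustration, here is a minimal sketch of how the new option might be used from C++. The property name `set_coerce_int96_timestamp_unit` on `parquet::ArrowReaderProperties` and the builder calls are assumptions for the sake of the example, not something spelled out in this description:

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Sketch only: assumes the option added here is exposed on
// parquet::ArrowReaderProperties as set_coerce_int96_timestamp_unit().
arrow::Status ReadInt96AsMicros(const std::string& path,
                                std::shared_ptr<arrow::Table>* out) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  parquet::ArrowReaderProperties arrow_props;
  // Read deprecated INT96 timestamps as microseconds instead of the default
  // nanoseconds, so far-future dates do not overflow the int64 timestamp.
  arrow_props.set_coerce_int96_timestamp_unit(arrow::TimeUnit::MICRO);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.properties(arrow_props)->Build(&reader));
  return reader->ReadTable(out);
}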

Member

@pitrou pitrou left a comment

Thanks for this PR. Here are a bunch of comments.

ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);
Member

I know you're essentially copying this from the test above, but nowadays we have ArrowFromJSON, which allows test data to be expressed much more easily and tersely (you can grep through the source tree to find examples).

You may also change the test above to use it, at the same time.
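
For reference, a minimal sketch of that approach (presumably `::arrow::ArrayFromJSON` from arrow/testing/gtest_util.h; the values below are placeholders, not the test's actual data):

#include <arrow/testing/gtest_util.h>

// Build a millisecond-timestamp test array directly from JSON instead of
// going through ArrayFromVector (placeholder values).
auto t_ms = ::arrow::timestamp(::arrow::TimeUnit::MILLI);
std::shared_ptr<::arrow::Array> a_ms = ::arrow::ArrayFromJSON(t_ms, "[0, 1000, null]");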

Contributor Author

Have rewritten tests to use a helper function. Hopefully cleaner.

ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*ex_result_s->schema(),
*result_s->schema(),
/*check_metadata=*/false));
ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result_s, *result_s));
Member

You can probably create a smaller helper function, method, or even a lambda, to avoid repeating those three lines below.
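
For example, something along these lines (a sketch; the lambda name is illustrative):

// Wrap the repeated assertions in a small lambda.
auto AssertTablesAndSchemaEqual = [](const ::arrow::Table& expected,
                                     const ::arrow::Table& actual) {
  ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*expected.schema(), *actual.schema(),
                                                     /*check_metadata=*/false));
  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(expected, actual));
};
// Usage: AssertTablesAndSchemaEqual(*ex_result_s, *result_s);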

Contributor Author

Will give it a go! I'm afraid it has been a long time since I wrote any C++ code, so the language is basically new to me at this point - hence the basic repetition in places.

Member

C++11 is quite a bit better than what was available before, if your experience was with C++98 :-)

Contributor Author

See previous comment thread.

@@ -353,7 +353,8 @@ Status TransferBool(RecordReader* reader, MemoryPool* pool, Datum* out) {
}

Status TransferInt96(RecordReader* reader, MemoryPool* pool,
const std::shared_ptr<DataType>& type, Datum* out) {
const std::shared_ptr<DataType>& type, Datum* out,
const ::arrow::TimeUnit::type& int96_arrow_time_unit) {
Member

You do not need to pass TimeUnit::type as a reference, since it's a cheap trivial type. Just pass it by value.
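
That is, something like (a sketch of the suggested signature):

// Pass the enum by value; TimeUnit::type is a cheap trivial type, so a
// const reference buys nothing here.
Status TransferInt96(RecordReader* reader, MemoryPool* pool,
                     const std::shared_ptr<DataType>& type, Datum* out,
                     ::arrow::TimeUnit::type int96_arrow_time_unit);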

Contributor Author

@isichei isichei Jun 13, 2021

Done.

default:
return Status::NotImplemented("TimeUnit not supported");
} break;
default:
Member

This default case doesn't seem useful, unless the compiler requires it?

Contributor Author

I don't think so. I was just copying how others had written switch statements in the existing codebase. Will remove 👍

Contributor Author

Done.

@@ -181,7 +181,8 @@ Result<std::shared_ptr<ArrowType>> FromInt64(const LogicalType& logical_type) {

Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
const LogicalType& logical_type,
int type_length) {
int type_length,
const ::arrow::TimeUnit::type& int96_arrow_time_unit) {
Member

Same comment here, with respect to passing by value vs. reference.

Contributor Author

Done.

@@ -211,14 +212,22 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
}
}

// ARROW-12096 -- Overloading functions with new input (setting default as NANO)
Member

This comment doesn't seem terribly informative. Can you remove it?

Contributor Author

Done.

descriptor.type_length(), ::arrow::TimeUnit::NANO);
}

// ARROW-12096 -- Exposing INT96 arrow type definition fromm parquet reader
Member

Same here.

Contributor Author

Done.

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
const LogicalType& logical_type,
int type_length,
const ::arrow::TimeUnit::type& int96_arrow_time_unit);
Member

I don't think this is the right place to pass int96-specific information. Perhaps this should be done at a higher level (for example in schema.cc?).

Contributor Author

Went back to review this and I'm not sure how to address it.

The only thing I can imagine would be to add to GetTypeForNode (in arrow/schema.cc) and override the standard storage_type if the Parquet physical_type is INT96 and the reader properties are not set to NANO? Let me know if I have misunderstood.

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
const schema::PrimitiveNode& primitive);

// ARROW-12096 Exposing int96 arrow timestamp unit definition
Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
Member

Same here.

Contributor Author

Done.


uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));

return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);
Member

There is some amount of repetition in those four functions that would be nice to avoid, IMHO.

Member

For example:

struct DecodedInt96 {
  uint64_t days_since_epoch;
  uint64_t nanoseconds;
};

static inline DecodedInt96 DecodeInt96Timestamp(const parquet::Int96& i96) {
  // We do the computations in the unsigned domain to avoid undefined behaviour
  // on overflow.
  DecodedInt96 result;
  result.days_since_epoch =
      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
  result.nanoseconds = 0;
  memcpy(&result.nanoseconds, &i96.value, sizeof(uint64_t));
  return result;
}

static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
  const auto decoded = DecodeInt96Timestamp(i96);
  return static_cast<int64_t>(decoded.days_since_epoch * kNanosecondsPerDay +
                              decoded.nanoseconds);
}

Contributor Author

@isichei isichei Jun 7, 2021

Yeah I agree.

I was concerned about changing the original function Int96GetNanoSeconds in case I introduced some unexpected change. Perhaps a halfway house is to replace the three (us, ms and s) INT96 functions with something like:

static inline int64_t Int96GetDownsampledTimestamp(const parquet::Int96& i96,
                                                   const ::arrow::TimeUnit::type unit) {
  // We do the computations in the unsigned domain to avoid undefined behaviour
  // on overflow.
  uint64_t days_since_epoch =
      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
  uint64_t nanoseconds = 0;
  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));

  // Scale both the day count and the intra-day part to the requested unit.
  uint64_t units_per_day;
  uint64_t intraday;
  switch (unit) {
    case ::arrow::TimeUnit::SECOND:
      units_per_day = static_cast<uint64_t>(kSecondsPerDay);
      intraday = nanoseconds / static_cast<uint64_t>(1000000000);
      break;
    case ::arrow::TimeUnit::MILLI:
      units_per_day = static_cast<uint64_t>(kSecondsPerDay) * 1000;
      intraday = nanoseconds / static_cast<uint64_t>(1000000);
      break;
    case ::arrow::TimeUnit::MICRO:
      units_per_day = static_cast<uint64_t>(kSecondsPerDay) * 1000000;
      intraday = nanoseconds / static_cast<uint64_t>(1000);
      break;
    default:  // NANO and anything else: keep full resolution
      units_per_day = static_cast<uint64_t>(kNanosecondsPerDay);
      intraday = nanoseconds;
      break;
  }
  return static_cast<int64_t>(days_since_epoch * units_per_day + intraday);
}

Then there is an if/else in the TransferInt96 function where the default NANO unit calls the unchanged Int96GetNanoSeconds, and otherwise it calls the downsampled version of the function?
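
For illustration, the dispatch could look roughly like this (a sketch; the variable names are assumptions based on the surrounding discussion, not actual code):

// Inside TransferInt96, per decoded value (names assumed for illustration):
if (int96_arrow_time_unit == ::arrow::TimeUnit::NANO) {
  data[i] = Int96GetNanoSeconds(values[i]);
} else {
  data[i] = Int96GetDownsampledTimestamp(values[i], int96_arrow_time_unit);
}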

Member

As you prefer really. As long as the chosen solution avoids repeating the same decoding code, it should be ok.

Contributor Author

Went with your example in the end, as it made far more sense IMO.

@pitrou
Member

pitrou commented Jun 7, 2021

@wesm @emkornfield What do you think about the functionality that is added here? Is it a reasonable burden for us to take on?

@emkornfield
Contributor

@wesm @emkornfield What do you think about the functionality that is added here? Is it a reasonable burden for us to take on?

It seems like a small enough change, so I'm okay with it. In general, though, since Int96 is deprecated, we should be looking very carefully at adding new functionality for it.

@pitrou pitrou changed the title ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp Jun 8, 2021
@isichei isichei marked this pull request as draft June 13, 2021 07:20
@isichei isichei marked this pull request as ready for review June 13, 2021 12:58
@isichei isichei requested a review from pitrou June 13, 2021 19:11
Member

@pitrou pitrou left a comment

Thanks for the updates! I've checked in a couple of minor changes and will merge if CI is green.

@jorisvandenbossche
Member

It will probably be useful to expose this in Python (or R) as well? (The original JIRA report also uses a pyarrow example.)

(To be clear, this can be done in a follow-up; just making sure we then create a JIRA for the follow-up task.)

@pitrou pitrou closed this in 85f192a Jun 15, 2021
@pitrou
Member

pitrou commented Jun 15, 2021

Thank you @isichei !

@isichei isichei deleted the ARROW-12096 branch June 20, 2021 18:28
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Closes apache#10461 from isichei/ARROW-12096

Lead-authored-by: Karik Isichei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
chaoyli pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
The Parquet format may use INT96 to store big datetime values like `9999-12-31 23:59:59`, which leads to overflow when cast to an int64 nanosecond timestamp.
The Parquet reader provides an option to set the unit of INT96 timestamps in apache/arrow#10461.
This PR adds a BE config `parquet_coerce_int96_timestamp_unit` to set the unit used when reading Parquet INT96 timestamps.
With the default value `MICRO`, the maximum MySQL datetime value `9999-12-31 23:59:59.999999` can be handled correctly.
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 10, 2023
chaoyli pushed a commit to StarRocks/starrocks that referenced this pull request May 10, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request May 12, 2023
meegoo pushed a commit to StarRocks/starrocks that referenced this pull request May 12, 2023
rickif added a commit to rickif/starrocks that referenced this pull request May 22, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request May 22, 2023
Moonm3n pushed a commit to Moonm3n/starrocks that referenced this pull request May 23, 2023
numbernumberone pushed a commit to numbernumberone/starrocks that referenced this pull request May 31, 2023
abc982627271 pushed a commit to abc982627271/starrocks that referenced this pull request Jun 5, 2023