Support for plain and dictionary encoded INT64 timestamp in parquet files #8325

mskapilks · 2024-01-10T09:49:36Z

Adds support for UTC-adjusted INT64 timestamps in Parquet files. It can read timestamps annotated with either the current Parquet logical type or the old converted type.
This PR takes inspiration from #4680.

netlify · 2024-01-10T09:49:42Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`187e2c5`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/66d84e78fabff200089f64e5

mskapilks · 2024-01-10T09:50:11Z

@rui-mo

facebook-github-bot · 2024-01-10T09:53:45Z

Hi @mskapilks!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

rui-mo

Thanks for your work! I'm still reading this PR, and have just added several comments on code style. Could you also sign the CLA?

velox/dwio/common/TimestampDecoder.h

velox/type/TimestampConversion.h

velox/dwio/common/TimestampDecoder.h

velox/dwio/parquet/reader/ParquetReader.cpp

mskapilks · 2024-01-12T05:53:02Z

@rui-mo Thanks for your review. Have resolved all the comments

rui-mo · 2024-01-15T01:29:50Z

@yingsu00 @Yuhta Could you help review the INT64 timestamp support in Parquet reader? Thank you.

mskapilks · 2024-01-18T04:31:09Z

Not sure what this failure means in the CI pipeline.

From github.com:facebookincubator/velox
 * [new ref]             refs/pull/8325/head -> origin/pull/8325
Checking out branch
fatal: reference is not a tree: 29afcfa878266626e6ef8956452eb4a8b4be5504

exit status 128

mskapilks · 2024-01-22T06:19:46Z

@yingsu00 @Yuhta Gentle ping for review.

cc: @rui-mo

rui-mo

Thanks for your efforts on the support of timestamp reader. Added several comments.

velox/dwio/parquet/reader/PageReader.cpp

velox/dwio/common/TimestampDecoder.h

mskapilks · 2024-02-07T06:37:38Z

@rui-mo Thank you for your review. I have addressed all the comments.

mskapilks · 2024-02-07T09:06:38Z

Will check on the failure

*** Aborted at 1707296079 (Unix time, try 'date -d @1707296079') ***
*** Signal 6 (SIGABRT) (0x3db07) received by PID 252679 (pthread TID 0x7fa36cab7500) (linux TID 252679) (maybe from PID 252679, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
/bin/bash: line 5: 252679 Aborted                 (core dumped) _build/debug/velox/exec/tests/velox_join_fuzzer_test --seed ${RANDOM} --duration_sec 1800 --logtostderr=1 --minloglevel=0

mskapilks · 2024-02-12T07:38:58Z

> Will check on the failure
>
> *** Aborted at 1707296079 (Unix time, try 'date -d @1707296079') ***
> *** Signal 6 (SIGABRT) (0x3db07) received by PID 252679 (pthread TID 0x7fa36cab7500) (linux TID 252679) (maybe from PID 252679, UID 0) (code: -6), stack trace: ***
> (error retrieving stack trace)
> /bin/bash: line 5: 252679 Aborted                 (core dumped) _build/debug/velox/exec/tests/velox_join_fuzzer_test --seed ${RANDOM} --duration_sec 1800 --logtostderr=1 --minloglevel=0

Passing now in recent build

rui-mo

Thanks!

rui-mo · 2024-03-20T01:39:55Z

velox/dwio/common/TimestampDecoder.h

+        memcpy(&value, &timestamp, sizeof(int128_t));
+        toSkip = visitor.process(value, atEnd);
+      } else {
+        toSkip = visitor.process(


Could you remind me when this path will be executed, and what the expected behavior is here?

Thanks for pointing this out. I don't think it will ever go there, as the type will always be int128_t. Updated the code.

velox/dwio/parquet/reader/PageReader.cpp

rui-mo · 2024-03-20T01:43:31Z

velox/dwio/parquet/reader/PageReader.cpp

+
+            auto precisionUnit = logicalType.TIMESTAMP.unit.__isset.MICROS
+                ? dwio::common::TimestampPrecision::kMicros
+                : dwio::common::TimestampPrecision::kMillis;


Do we need to check __isset.MILLIS before assigning kMillis, and throw for the other units?

There are only three units. For NANOS we throw an error one line above, so one of MILLIS or MICROS will be set.
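The unit selection discussed in this thread can be sketched as below. This is a minimal illustration, not the PR's code: `TimeUnitIsSet`, `TimestampPrecision`, and `precisionFromUnit` are hypothetical stand-ins for the Thrift `__isset` flags and the Velox enum, and the sketch follows the reviewer's suggestion of checking each unit explicitly and throwing for anything else.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the precision enum; not the Velox definition.
enum class TimestampPrecision { kMillis, kMicros };

// Hypothetical stand-in for the Thrift-generated __isset flags of the unit.
struct TimeUnitIsSet {
  bool MILLIS = false;
  bool MICROS = false;
  bool NANOS = false;
};

// Maps the unit flags to a precision, rejecting NANOS and unknown states
// explicitly instead of treating "not MICROS" as MILLIS.
TimestampPrecision precisionFromUnit(const TimeUnitIsSet& isset) {
  // Assumption: the unit behaves like a union, so exactly one flag is set.
  assert(isset.MILLIS + isset.MICROS + isset.NANOS == 1);
  if (isset.MICROS) {
    return TimestampPrecision::kMicros;
  }
  if (isset.MILLIS) {
    return TimestampPrecision::kMillis;
  }
  throw std::runtime_error("Unsupported timestamp unit");
}
```

Checking MILLIS explicitly costs one extra branch and keeps the function correct if a new unit is ever added to the format.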

velox/dwio/common/TimestampDecoder.h

velox/dwio/parquet/reader/ParquetReader.cpp

yingsu00 · 2024-04-04T16:51:10Z

velox/dwio/parquet/reader/PageReader.cpp

+        auto logicalType = type_->logicalType_.value();
+        if (logicalType.__isset.TIMESTAMP) {
+          VELOX_CHECK(
+              logicalType.TIMESTAMP.isAdjustedToUTC,


Should this be VELOX_NYI?

Btw, what do you want to do if isAdjustedToUTC is false?

I haven't thought about it much. My perspective comes from Spark, which I think writes UTC-adjusted timestamps. Once this framework is working, it shouldn't take much effort to add that support in a follow-up PR.

velox/dwio/parquet/reader/TimestampColumnReader.h

yingsu00 · 2024-04-24T21:41:29Z

@mskapilks Hi Kapil, will you be able to rebase and address the comments? There are conflicts.

yingsu00 · 2024-04-26T20:33:01Z

cc @Yuhta

velox/type/TimestampConversion.h

mskapilks · 2024-04-29T08:54:21Z

> @mskapilks Hi Kapil, will you be able to rebase and address the comments? There are conflicts.

Yes will update the PR soon

yingsu00 · 2024-07-05T20:27:41Z

velox/dwio/parquet/reader/TimestampColumnReader.h

+    // Use int128_t as a workaroud. Timestamp type in Velox is comprised of an
+    // int64_t seconds_ field and a uint64_t nanos_ field, a total of 16-byte
+    // length
+    prepareRead<int128_t>(offset, rows, nullptr);


@mskapilks I think a follow-up PR would be better, provided this implementation doesn't run into any problems in use.
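The diff comment above describes the timestamp as an int64_t seconds field plus a uint64_t nanos field, 16 bytes in total. A minimal sketch of how a raw INT64 Parquet value might be split into that pair is shown below. `SimpleTimestamp`, `Unit`, and `fromInt64` are hypothetical names for illustration, not Velox's `Timestamp` class or its conversion functions. The sketch uses floor division so that values before the epoch keep a non-negative nanos part.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte layout mirroring the description in the diff comment.
struct SimpleTimestamp {
  int64_t seconds;
  uint64_t nanos;
};
static_assert(sizeof(SimpleTimestamp) == 16, "expected a 16-byte layout");

// Hypothetical unit enum for the raw INT64 value.
enum class Unit { kMillis, kMicros };

// Splits a raw value (millis or micros since epoch) into seconds and nanos.
SimpleTimestamp fromInt64(int64_t value, Unit unit) {
  const int64_t unitsPerSecond = unit == Unit::kMillis ? 1'000 : 1'000'000;
  const int64_t nanosPerUnit = 1'000'000'000 / unitsPerSecond;
  int64_t seconds = value / unitsPerSecond;
  int64_t remainder = value % unitsPerSecond;
  // C++ division truncates toward zero; adjust to floor for negative values.
  if (remainder < 0) {
    --seconds;
    remainder += unitsPerSecond;
  }
  assert(remainder >= 0 && remainder < unitsPerSecond);
  return {seconds, static_cast<uint64_t>(remainder * nanosPerUnit)};
}
```

For example, `fromInt64(-1, Unit::kMicros)` yields seconds `-1` and nanos `999999000`, rather than a negative fractional part.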

mskapilks · 2024-07-08T10:39:34Z

> @mskapilks @Yuhta I think we should merge this PR before "Support for dictionary encoded INT96 timestamp in parquet files" (#4680). This PR is a more generic implementation, while #4680 only worked for dictionary encodings. #4680 also needs an update to read the unit from the column's Parquet logical type, not from a Hive config. @mskapilks Can you please address the comment and rebase? Thanks!

Thanks for review. I have rebased again and resolved comments.

yingsu00 · 2024-07-09T15:21:49Z

Thanks @mskapilks
@Yuhta Jimmy do you have more comments for this PR? If not can we merge it soon?

Yuhta · 2024-07-09T19:55:19Z

velox/dwio/common/SelectiveColumnReader.cpp

@@ -216,6 +216,9 @@ void SelectiveColumnReader::getIntValues(
          VELOX_FAIL("Unsupported value size: {}", valueSize_);
      }
      break;
+    case TypeKind::TIMESTAMP:


You don't need this

This is needed as the output type is Timestamp

You cannot use getIntValues directly for Timestamp, you need to override TimestampColumnReader::getValues like here (seems you are trying to create the same file):

velox/velox/dwio/parquet/reader/TimestampColumnReader.h

Line 38 in e83fe4c

void getValues(RowSet rows, VectorPtr* result) override {

Yuhta · 2024-07-09T19:58:14Z

velox/dwio/parquet/reader/PageReader.h

@@ -489,6 +496,7 @@ class PageReader {
  std::unique_ptr<StringDecoder> stringDecoder_;
  std::unique_ptr<BooleanDecoder> booleanDecoder_;
  std::unique_ptr<DeltaBpDecoder> deltaBpDecoder_;
+  std::unique_ptr<TimestampDecoder> timestampDecoder_;


I would not put this at the decoder level. This should be a normal int64 decoder, and you should convert the values into Timestamp in the column reader.

I have removed the TimestampDecoder class and moved the logic to the direct decoder.

mskapilks · 2024-07-16T10:16:01Z

@Yuhta @yingsu00 Please review, updated the PR based on previous comments. Thanks

liujiayi771 · 2024-07-17T02:42:49Z

velox/dwio/parquet/reader/PageReader.cpp

+        auto logicalType = type_->logicalType_.value();
+        if (logicalType.__isset.TIMESTAMP) {
+          if (!logicalType.TIMESTAMP.isAdjustedToUTC) {
+            VELOX_NYI("Only UTC adjusted Timestamp is supported.");


Hi @mskapilks. After your PR is merged, I will submit a follow-up PR to support the case when isAdjustedToUTC=false because I have already added the timezone information to the Parquet reader.

@liujiayi771 That's great, thanks 👍

Yuhta · 2024-07-17T22:34:48Z

velox/dwio/common/SelectiveColumnReader.cpp

@@ -216,6 +216,9 @@ void SelectiveColumnReader::getIntValues(
          VELOX_FAIL("Unsupported value size: {}", valueSize_);
      }
      break;
+    case TypeKind::TIMESTAMP:


You cannot use getIntValues directly for Timestamp, you need to override TimestampColumnReader::getValues like here (seems you are trying to create the same file):

velox/velox/dwio/parquet/reader/TimestampColumnReader.h

Line 38 in e83fe4c

void getValues(RowSet rows, VectorPtr* result) override {

Yuhta · 2024-07-17T22:35:47Z

velox/type/TimestampConversion.h

@@ -221,5 +221,4 @@ Timestamp fromDatetime(int64_t daysSinceEpoch, int64_t microsSinceMidnight);
 /// Returns the number of days since epoch for a given timestamp and optional
 /// time zone.
 int32_t toDate(const Timestamp& timestamp, const date::time_zone* timeZone_);
-


Just leave the file unchanged

Yuhta · 2024-07-17T22:39:38Z

velox/dwio/common/DirectDecoder.h

@@ -92,7 +94,24 @@ class DirectDecoder : public IntDecoder<isSigned> {
      } else if constexpr (std::is_same_v<
                               typename Visitor::DataType,
                               int128_t>) {
-        toSkip = visitor.process(super::template readInt<int128_t>(), atEnd);
+        if (precision_.has_value()) {


This does not look correct. As far as I can see, you should not need to change anything at the decoder level (the int96 implementation is a little hacky and you should not follow it). Just use the plain vanilla int64_t decoder, and do all these conversions inside TimestampColumnReader (or rename it to something else to avoid a name clash with the one reading int96).

I also have a PR for timestamp filter support for INT64. I did try moving the conversion inside TimestampColumnReader, but it caused an issue with the timestamp filter, because the filtering was happening on int64 values in the decoder. I was getting "testInt64() is not supported". Maybe I missed something.

Let me wait for the INT96 PR to go in first (since that is almost ready). Some of the effort is common to both, and this avoids conflicts.

You can process the filter in the column reader (there is no way to handle it in the int decoder, since different formats have different timestamp representations). See the example in

velox/velox/dwio/dwrf/reader/SelectiveTimestampColumnReader.cpp

Line 228 in 09e1b0d

void SelectiveTimestampColumnReader::processFilter(

Updated the change based on the suggestion. Let me know your input on this.
I will try to merge the INT64 and INT96 reader classes if feasible.
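The approach suggested in this thread (decode plain int64 values without a filter, then convert and filter in the column reader) can be sketched as follows. This is an illustration under stated assumptions, not the PR's code or the dwrf reader's `processFilter`: `SimpleTimestamp`, `fromMicros`, and `filterDecodedRows` are hypothetical names, the raw values are assumed to be microseconds, and the filter is modeled as a plain callable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a 16-byte timestamp value; not Velox's Timestamp.
struct SimpleTimestamp {
  int64_t seconds;
  uint64_t nanos;
};

// Converts microseconds since epoch, flooring so nanos stays non-negative.
SimpleTimestamp fromMicros(int64_t micros) {
  int64_t seconds = micros / 1'000'000;
  int64_t remainder = micros % 1'000'000;
  if (remainder < 0) {
    --seconds;
    remainder += 1'000'000;
  }
  return {seconds, static_cast<uint64_t>(remainder) * 1'000};
}

// Hypothetical helper: the decoder has already produced every raw value
// (as if an always-true filter were used); the timestamp filter is applied
// here, after conversion. Returns the indices of the rows that pass.
std::vector<size_t> filterDecodedRows(
    const std::vector<int64_t>& rawMicros,
    const std::vector<bool>& isNull,
    const std::function<bool(const SimpleTimestamp&)>& testTimestamp,
    bool nullAllowed) {
  assert(rawMicros.size() == isNull.size());
  std::vector<size_t> passing;
  for (size_t i = 0; i < rawMicros.size(); ++i) {
    // Null rows never reach the timestamp test; they pass only if allowed.
    const bool pass =
        isNull[i] ? nullAllowed : testTimestamp(fromMicros(rawMicros[i]));
    if (pass) {
      passing.push_back(i);
    }
  }
  return passing;
}
```

Keeping the test on converted values means the decoder never needs to know which unit or representation a given file format uses.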

mskapilks · 2024-08-12T09:13:52Z

@yingsu00 @liujiayi771 @Yuhta Can you please take a look.

velox/dwio/parquet/reader/TimestampColumnReader.h

velox/dwio/parquet/reader/ParquetReader.cpp

Yuhta · 2024-08-15T16:42:08Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

Can you also add some tests to https://github.com/facebookincubator/velox/blob/main/velox/dwio/parquet/tests/reader/E2EFilterTest.cpp?

Added the tests. @Yuhta @zhli1142015

But I am not able to make them work.
The null buffer I get after the integers are read does not seem to be correct.

Error

Value of: result->equalValueAt(expectedColumn, i, expectedRow)
  Actual: false
Expected: true
Content mismatch at 3073 column 0: expected: null actual: 2015-06-02T06:50:18.000035000
Google Test trace:

Since we want to process the timestamp filter in TimestampInt64ColumnReader, I am passing an AlwaysTrue filter to get all integer values from Parquet. In some cases disabling the fast path worked, but not in all cases.

Any thoughts on where I should look or how to figure out the issue?

Most likely issue here?

@Yuhta Hi, can you suggest any workaround for this?
Thanks

It's probably something wrong in the way you use readerNulls and resultNulls and row numbers in readHelper. The always true filter should be fine.

velox/dwio/parquet/reader/TimestampColumnReader.h

Yuhta · 2024-08-15T16:46:04Z

velox/dwio/parquet/reader/TimestampColumnReader.h

+    readCommon<IntegerColumnReader, true>(rows);
+
+    auto tsValues =
+        AlignedBuffer::allocate<Timestamp>(numValues_, &memoryPool_);


Can we do it in-place instead of allocating new buffer for each batch?
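One way the in-place suggestion could work is sketched below. It assumes the values buffer is already sized for 16-byte timestamps and that the decoded int64 values are packed at its front; whether the reader's buffers are laid out this way is an assumption, and `SimpleTimestamp`, `fromMicros`, and `widenInPlace` are hypothetical names, not Velox APIs. Converting from the last value to the first ensures no unread 8-byte value is overwritten by a 16-byte result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 16-byte timestamp value; not Velox's Timestamp.
struct SimpleTimestamp {
  int64_t seconds;
  uint64_t nanos;
};

// Converts microseconds since epoch, flooring so nanos stays non-negative.
SimpleTimestamp fromMicros(int64_t micros) {
  int64_t seconds = micros / 1'000'000;
  int64_t remainder = micros % 1'000'000;
  if (remainder < 0) {
    --seconds;
    remainder += 1'000'000;
  }
  return {seconds, static_cast<uint64_t>(remainder) * 1'000};
}

// Widens numValues packed int64 values at the front of buffer into
// numValues 16-byte timestamps occupying the same buffer.
void widenInPlace(std::vector<unsigned char>& buffer, size_t numValues) {
  assert(buffer.size() >= numValues * sizeof(SimpleTimestamp));
  // Walk backwards: slot i is written at byte 16*i, which is at or past
  // every raw value with index < i (those end at byte 8*i).
  for (size_t i = numValues; i-- > 0;) {
    int64_t micros;
    std::memcpy(&micros, buffer.data() + i * sizeof(int64_t), sizeof(micros));
    const SimpleTimestamp ts = fromMicros(micros);
    std::memcpy(buffer.data() + i * sizeof(SimpleTimestamp), &ts, sizeof(ts));
  }
}
```

The trade-off is that the read buffer must be allocated at the wider size up front, in exchange for avoiding a second allocation per batch.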

j7nhai · 2024-08-19T09:08:25Z

@mskapilks Hi, the unit tests at commit 6081415 run fine on my machine, but the unit tests at commit 9166ff6 fail. Can you provide some information on how to debug this?

[ RUN      ] ParquetTableScanTest.sessionTimezone
[       OK ] ParquetTableScanTest.sessionTimezone (11 ms)
[ RUN      ] ParquetTableScanTest.timestampFilter
[       OK ] ParquetTableScanTest.timestampFilter (13 ms)
[ RUN      ] ParquetTableScanTest.timestampPrecisionMicrosecond
[       OK ] ParquetTableScanTest.timestampPrecisionMicrosecond (5 ms)
[ RUN      ] ParquetTableScanTest.timestampINT64millis
*** Aborted at 1724058475 (Unix time, try 'date -d @1724058475') ***
*** Signal 11 (SIGSEGV) (0x0) received by PID 1322556 (pthread TID 0x7fcd795ec700) (linux TID 1322979) (code: address not mapped to object), stack trace: ***
(error retrieving stack trace)
Segmentation fault (core dumped)

mskapilks · 2024-09-04T04:13:14Z

> @mskapilks Hi, the unit tests at commit 6081415 run fine on my machine, but the unit tests at commit 9166ff6 fail. Can you provide some information on how to debug this?
>
> [ RUN      ] ParquetTableScanTest.sessionTimezone
> [       OK ] ParquetTableScanTest.sessionTimezone (11 ms)
> [ RUN      ] ParquetTableScanTest.timestampFilter
> [       OK ] ParquetTableScanTest.timestampFilter (13 ms)
> [ RUN      ] ParquetTableScanTest.timestampPrecisionMicrosecond
> [       OK ] ParquetTableScanTest.timestampPrecisionMicrosecond (5 ms)
> [ RUN      ] ParquetTableScanTest.timestampINT64millis
> *** Aborted at 1724058475 (Unix time, try 'date -d @1724058475') ***
> *** Signal 11 (SIGSEGV) (0x0) received by PID 1322556 (pthread TID 0x7fcd795ec700) (linux TID 1322979) (code: address not mapped to object), stack trace: ***
> (error retrieving stack trace)
> Segmentation fault (core dumped)

Fixed this; I had missed a null check.

mskapilks · 2024-09-04T12:06:02Z

velox/dwio/parquet/tests/reader/E2EFilterTest.cpp

+      },
+      true,
+      {"timestamp_val_0", "timestamp_val_1"},
+      1);


Will update to 20 once the issue is resolved.

majetideepak · 2024-09-04T21:00:30Z

@mskapilks can you please sign the CLA? Thanks.

mskapilks · 2024-09-06T06:36:51Z

> @mskapilks can you please sign the CLA? Thanks.

Done

Yuhta · 2024-09-12T18:17:40Z

velox/dwio/common/tests/utils/DataSetBuilder.h

@@ -50,6 +50,11 @@ class DataSetBuilder {
  // groups. Tests skipping row groups based on row group stats.
  DataSetBuilder& withRowGroupSpecificData(int32_t numRowsPerGroup);

+  DataSetBuilder& adjustTimestampToPrecision(TimestampPrecision precision);
+  void adjustTimestampToPrecision(
+      VectorPtr batch,


VectorPtr& batch

Yuhta · 2024-09-12T18:42:17Z

velox/dwio/parquet/reader/TimestampColumnReader.h

+      case common::FilterKind::kAlwaysTrue:
+        // Simply add all rows to output.
+        for (vector_size_t i = 0; i < numValues_; i++) {
+          addOutputRow(rows[i]);


I don't think you need to do this; we use inputRows_ when there is no filter.

yingsu00 · 2024-09-27T21:14:39Z

@mskapilks Can you please address the comments and rebase? Thanks!

majetideepak · 2024-10-16T15:57:04Z

@mskapilks Do you mind if I push to this PR? We can work on this together. Thanks.

mskapilks · 2024-10-17T06:03:49Z

> @mskapilks Do you mind if I push to this PR? We can work on this together. Thanks.

Sure

mskapilks changed the title ~~Support for INT64 timestamp in parquet files~~ Support for plain and dictionary encoded INT64 timestamp in parquet files Jan 10, 2024

rui-mo reviewed Jan 11, 2024

facebook-github-bot added the CLA Signed label Jan 12, 2024

mskapilks force-pushed the timestamp_int64 branch from 72e67b3 to 29afcfa Compare January 16, 2024 12:26

rui-mo reviewed Feb 5, 2024

velox/dwio/common/TimestampDecoder.h Outdated

mskapilks force-pushed the timestamp_int64 branch from 29afcfa to 630f96c Compare February 7, 2024 06:35

mskapilks force-pushed the timestamp_int64 branch 2 times, most recently from 4173963 to df6ed31 Compare February 12, 2024 05:05

mskapilks requested a review from rui-mo March 18, 2024 04:59

rui-mo reviewed Mar 20, 2024

Yuhta reviewed Mar 20, 2024

velox/dwio/common/TimestampDecoder.h Outdated

yingsu00 self-requested a review April 4, 2024 11:49

yingsu00 reviewed Apr 4, 2024

8dukongjian mentioned this pull request Apr 22, 2024

Summary of Parquet reader Issues #9560

Open

liujiayi771 reviewed Apr 29, 2024

velox/type/TimestampConversion.h Outdated

mskapilks force-pushed the timestamp_int64 branch from a42f007 to b0aa8c6 Compare April 29, 2024 12:32

yingsu00 approved these changes Jul 5, 2024

mskapilks force-pushed the timestamp_int64 branch from 5c9d744 to dff3b7d Compare July 8, 2024 10:38

Yuhta reviewed Jul 9, 2024

liujiayi771 reviewed Jul 17, 2024

Yuhta reviewed Jul 17, 2024

liujiayi771 mentioned this pull request Jul 19, 2024

read iceberg table fail for timestamp type column. apache/incubator-gluten#6514

Open

mskapilks force-pushed the timestamp_int64 branch from 31c6a2e to 6081415 Compare July 22, 2024 11:15

zml1206 mentioned this pull request Aug 13, 2024

[V] Remove complex type fallback for parquet apache/incubator-gluten#6712

Merged

Yuhta reviewed Aug 15, 2024

wang-zhun mentioned this pull request Sep 3, 2024

[CORE] The fallback check for Scan should not be skipped when DPP is present apache/incubator-gluten#7078

Open

mskapilks added 5 commits September 4, 2024 16:19

Support for INT64 timestamp in parquet files

d754622

Resolve comments Fix build PR comments Remove reinterpret_cast Fix compile PR comments Update parquet files Refactor Fix formatting Fix compile PR comment Fix decimal tests Typo Remove timestamp decoder Remove white space Remove import

Fix

74fce2d

Tmp

Refactor

f4764d3

PR Comment

a2f6ade

Add E2E test

Fix rebase

b029524

mskapilks force-pushed the timestamp_int64 branch from 70bc8fe to b029524 Compare September 4, 2024 11:55

mskapilks commented Sep 4, 2024

Remove extra code

187e2c5

Yuhta reviewed Sep 12, 2024
