[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 #21372

dongjoon-hyun · 2018-05-19T18:22:34Z

What changes were proposed in this pull request?

ORC 1.4.4 includes nine fixes. One of them (ORC-306) addresses a Timestamp bug that occurs when the native ORC vectorized reader reads the ORC column vector's sub-vectors time and nanos. ORC-306 fixes this according to the original definition, and this PR updates Spark's interpretation of ORC column vectors accordingly. Note that the hive ORC reader and the ORC MR reader are not affected.

scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+

This PR updates Apache Spark to use ORC 1.4.4.

FULL LIST

ID	TITLE
ORC-281	Fix compiler warnings from clang 5.0
ORC-301	`extractFileTail` should open a file in `try` statement
ORC-304	Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306	Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324	Add support for ARM and PPC arch
ORC-330	Remove unnecessary Hive artifacts from root pom
ORC-332	Add syntax version to orc_proto.proto
ORC-336	Remove avro and parquet dependency management entries
ORC-360	Implement error checking on subtype fields in Java

How was this patch tested?

Pass the Jenkins.

SparkQA · 2018-05-19T18:33:02Z

Test build #90837 has finished for PR 21372 at commit 2d11cdc.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-19T18:36:14Z

Retest this please

SparkQA · 2018-05-19T18:43:02Z

Test build #90838 has finished for PR 21372 at commit 2d11cdc.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-19T18:45:26Z

The master branch failure is due to #21299.

kiszk · 2018-05-20T08:29:49Z

retest this please

SparkQA · 2018-05-20T11:58:18Z

Test build #90852 has finished for PR 21372 at commit 2d11cdc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-20T22:49:32Z

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java

@@ -136,7 +136,7 @@ public int getInt(int rowId) {
  public long getLong(int rowId) {
    int index = getRowIndex(rowId);
    if (isTimestamp) {
-      return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000;
+      return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;


In Apache ORC 1.4.4, ORC-306 fixes this according to the original definition.
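To make the one-line change above concrete, here is a minimal sketch of the arithmetic. It assumes that, after ORC-306, `time` holds the value of `Timestamp.getTime()` (milliseconds, including the millisecond fraction) and `nanos` holds `Timestamp.getNanos()` (the full fractional second). The class and method names are hypothetical and are not part of Spark or ORC.

```java
final class OrcTimestampMicrosSketch {
    /** Old formula: adds the whole fractional second again, so milliseconds are counted twice. */
    static long microsDoubleCounting(long timeMillis, int nanos) {
        return timeMillis * 1000 + nanos / 1000;
    }

    /** New formula: keeps only the microseconds within the current millisecond. */
    static long micros(long timeMillis, int nanos) {
        return timeMillis * 1000 + nanos / 1000 % 1000;
    }

    public static void main(String[] args) {
        long time = 1_526_754_154_123L; // milliseconds, already contains .123
        int nanos = 123_456_789;        // full fractional second, also contains .123

        System.out.println(micros(time, nanos));               // 1526754154123456
        System.out.println(microsDoubleCounting(time, nanos)); // 1526754154246456

        // A pre-epoch value: -1.499211 seconds, i.e. time = -1500 ms, nanos = 500789000
        System.out.println(micros(-1500L, 500_789_000));       // -1499211
    }
}
```

Under this assumption the old formula is off by exactly the millisecond part of `nanos`, which is why the reader-side change in ORC and the `% 1000` in Spark have to land together.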

Add a test case?

Do you know when this issue was introduced in ORC?

The change is in TreeReaderFactory.java. In the Apache ORC project, the prior code dates back to ORC-1, the initial import of the code from Hive two years ago.

Effectively, the writer side is unchanged; only the reader side changed.

OrcHadoopFsRelationSuite covers this change via end-to-end write and read test cases.

Based on my understanding, ORC-306 changes the query result, right?

ORC-306 changes the content of the ORC column vectors exposed on the reader side. Interpreting them is Spark's logic, as we see in this PR.

Are you saying ORC-306 has no external impact?

No, what I mean is, with ORC-306 and this fix, there is no external impact outside Spark. More specifically, outside OrcColumnVector/OrcColumnarBatchReader. In other words, ORC 1.4.4 cannot be used with Apache Spark without this patch.

Java's Timestamp.getTime and Timestamp.getNanos overlap by definition. Previously, ORC didn't stick to the definition.
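The overlap can be checked directly against `java.sql.Timestamp`. The following is a small sketch (the class and helper names are made up for illustration): the millisecond part of the second is visible both through `getTime()` and through `getNanos()`, for post-epoch and pre-epoch values alike.

```java
import java.sql.Timestamp;

final class TimestampOverlapSketch {
    /** Millisecond-of-second as seen through getTime(). */
    static long millisFromGetTime(Timestamp ts) {
        return Math.floorMod(ts.getTime(), 1000L);
    }

    /** Millisecond-of-second as seen through getNanos(). */
    static long millisFromGetNanos(Timestamp ts) {
        return ts.getNanos() / 1_000_000;
    }

    public static void main(String[] args) {
        Timestamp ts = new Timestamp(1_526_754_154_000L);
        ts.setNanos(123_456_789);
        System.out.println(ts.getTime());           // 1526754154123
        System.out.println(ts.getNanos());          // 123456789
        System.out.println(millisFromGetTime(ts));  // 123
        System.out.println(millisFromGetNanos(ts)); // 123

        // Pre-epoch: whole second -2, fraction .500789
        Timestamp old = new Timestamp(-2000L);
        old.setNanos(500_789_000);
        System.out.println(old.getTime());          // -1500
        System.out.println(millisFromGetTime(old)); // 500
    }
}
```

Any code that combines the two getters, or two vectors that mirror them, has to drop the millisecond part from one side.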

SparkQA · 2018-05-21T02:57:48Z

Test build #90869 has finished for PR 21372 at commit 700872d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-05-21T04:07:45Z

You've already checked if we have no performance difference, right?

dongjoon-hyun · 2018-05-21T15:40:25Z

Sure, @maropu . In addition, I reviewed the nine patches, almost trivial ones. I'll update the PR description more.

dongjoon-hyun · 2018-05-22T15:44:27Z

@HyukjinKwon . Could you review this please?

HyukjinKwon

I have been actually watching this. LGTM from my side.

gatorsmile · 2018-05-22T19:51:41Z

A few basic questions about this upgrade.

What are the benefits of these nine trivial patches? If no impact on Spark users, we should not upgrade it; if the new release fixes the bug, we need to add the test cases to verify the fix. Please prove the necessity of the upgrade.

dongjoon-hyun · 2018-05-22T20:35:07Z

@gatorsmile .
Basically, ORC-301 will reduce the chance of ORC file leakage in some cases. I made that patch and merged it a long time ago, but it is only released in this version. Also, ORC-306 fixes a bug in Java Timestamp and its ORC workaround. Please see here for the details of the Java Timestamp bug and the issue with the previous ORC workaround.
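For readers unfamiliar with the pattern that the ORC-301 title refers to, here is a generic sketch of why opening a resource inside a `try` statement reduces the chance of a leaked handle. This is not ORC's code: `TrackedStream`, `readLeaky`, and `readSafe` are hypothetical stand-ins, and the sketch uses try-with-resources, which may differ from what the actual patch does.

```java
import java.io.IOException;

final class TailReaderSketch {
    /** Hypothetical stand-in for an open file handle; counts handles that are still open. */
    static final class TrackedStream implements AutoCloseable {
        static int openCount = 0;

        TrackedStream() { openCount++; }

        int readTail(boolean corrupt) throws IOException {
            if (corrupt) throw new IOException("malformed file tail");
            return 42;
        }

        @Override
        public void close() { openCount--; }
    }

    /** Leaky shape: close() is skipped when readTail throws. */
    static int readLeaky(boolean corrupt) throws IOException {
        TrackedStream in = new TrackedStream();
        int tail = in.readTail(corrupt);
        in.close();
        return tail;
    }

    /** Safer shape: the handle is opened in the try header and is always closed. */
    static int readSafe(boolean corrupt) throws IOException {
        try (TrackedStream in = new TrackedStream()) {
            return in.readTail(corrupt);
        }
    }

    public static void main(String[] args) {
        try { readLeaky(true); } catch (IOException e) { /* expected */ }
        System.out.println("open handles after leaky read: " + TrackedStream.openCount); // 1

        TrackedStream.openCount = 0;
        try { readSafe(true); } catch (IOException e) { /* expected */ }
        System.out.println("open handles after safe read: " + TrackedStream.openCount);  // 0
    }
}
```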

dongjoon-hyun · 2018-05-22T20:39:47Z

For file leakage issues, we have been monitoring the flakiness of SPARK-23458 and SPARK-23390 in our Jenkins environment. Until now, I couldn't reproduce it locally.

dongjoon-hyun · 2018-05-22T20:47:47Z

For Timestamp issue, I'm trying to find some example.

gatorsmile · 2018-05-22T21:05:32Z

https://github.com/dongjoon-hyun/orc/blob/cad48d6b11a65264a5b22c73aa2be9029aa72988/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L520-L522

Regarding the file leakage, I did not see any exception raised from these three lines in our logs. Does that mean ORC swallows the exceptions and attempts to re-open the files?

dongjoon-hyun · 2018-05-22T21:11:09Z

As far as I can see, those three lines do not throw exceptions. Do you mean other lines?

OrcProto.PostScript ps;
OrcProto.FileTail.Builder fileTailBuilder = OrcProto.FileTail.newBuilder();
long modificationTime;

gatorsmile · 2018-05-22T21:15:58Z

I am just trying to find out why ORC-301 resolves the issues of SPARK-23458 and SPARK-23390

dongjoon-hyun · 2018-05-22T21:32:44Z

I didn't say ORC-301 resolves the issues in SPARK-23458 and SPARK-23390.
SPARK-23458 and SPARK-23390 report open file leakages in some unknown situations, don't they?

dongjoon-hyun · 2018-05-22T22:29:51Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala

+      Seq(ts).toDF.write.orc(path.getCanonicalPath)
+      checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+    }
+  }


I added the test case for ORC-306 and updated the PR title.

Does that mean the Hive ORC reader works, but the native ORC reader has the bug?

Please explicitly set the ORC reader to native for this test.

Oh, I missed this comment. The Hive ORC and ORC MR readers don't have this bug because they use the java.sql.Timestamp class to deserialize the value. The bug happens when we directly access the ORC column's sub-vectors, time and nanos.

OrcSourceSuite is dedicated to the native ORC reader. For the hive ORC reader, there is HiveOrcSourceSuite.

SparkQA · 2018-05-23T02:05:33Z

Test build #90999 has finished for PR 21372 at commit 954d1d9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-23T03:21:17Z

The failure is irrelevant to this PR.

org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(It is not a test it is a sbt.testing.SuiteSelector)

dongjoon-hyun · 2018-05-23T03:21:24Z

Retest this please.

SparkQA · 2018-05-23T07:05:02Z

Test build #91013 has finished for PR 21372 at commit 954d1d9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-23T08:51:22Z

It's weird. UnivocityParserSuite is still complaining.

Error Message

java.lang.IllegalStateException: LiveListenerBus is stopped.

Stacktrace

sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.

dongjoon-hyun · 2018-05-23T08:51:28Z

Retest this please

HyukjinKwon · 2018-05-23T08:58:38Z

UnivocityParserSuite failed in my PR too. Shouldn't be related with this.

SparkQA · 2018-05-23T12:31:24Z

Test build #91024 has finished for PR 21372 at commit 954d1d9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-05-23T13:56:04Z

since it has a bug fix, shall we backport it to Spark 2.3?

dongjoon-hyun · 2018-05-23T17:48:58Z

Thank you for the review, @cloud-fan . Sure, if possible.

dongjoon-hyun · 2018-05-23T17:49:05Z

Retest this please.

gatorsmile · 2018-05-23T20:05:33Z

Before we do the merge, could you address the comment: #21372 (comment)?

gatorsmile · 2018-05-23T20:21:28Z

Could you document the bug in both the JIRA and the PR description? Please also mention which ORC reader is affected.

SparkQA · 2018-05-23T21:34:26Z

Test build #91055 has finished for PR 21372 at commit 954d1d9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-23T21:54:13Z

Yep. Both the JIRA and the PR description are updated.

dongjoon-hyun · 2018-05-23T21:54:27Z

Retest this please.

SparkQA · 2018-05-24T02:26:42Z

Test build #91070 has finished for PR 21372 at commit 954d1d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-24T02:50:28Z

Finally! Could you review this again, @HyukjinKwon , @gatorsmile , @cloud-fan ?

cloud-fan · 2018-05-24T03:37:03Z

thanks, merging to master/2.3!

gatorsmile · 2018-05-24T03:37:05Z

LGTM

Thanks! Merged to master/2.3

Author: Dongjoon Hyun
Closes #21372 from dongjoon-hyun/SPARK_ORC144.
(cherry picked from commit 486ecc6)
Signed-off-by: Wenchen Fan

dongjoon-hyun · 2018-05-24T03:42:16Z

Thank you, @cloud-fan , @gatorsmile , @HyukjinKwon .

dongjoon-hyun closed this May 20, 2018

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

700872d

dongjoon-hyun reopened this May 20, 2018

HyukjinKwon approved these changes May 22, 2018

Add test cases.

954d1d9

asfgit closed this in 486ecc6 May 24, 2018

dongjoon-hyun deleted the SPARK_ORC144 branch May 24, 2018 03:42
