Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 #21372

Closed
wants to merge 2 commits into from
Closed

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 #21372

wants to merge 2 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented May 19, 2018

What changes were proposed in this pull request?

ORC 1.4.4 includes nine fixes. One of the issues is about Timestamp bug (ORC-306) which occurs when native ORC vectorized reader reads ORC column vector's sub-vector times and nanos. ORC-306 fixes this according to the original definition and this PR includes the updated interpretation on ORC column vectors. Note that hive ORC reader and ORC MR reader is not affected.

scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+

This PR aims to update Apache Spark to use it.

FULL LIST

ID TITLE
ORC-281 Fix compiler warnings from clang 5.0
ORC-301 extractFileTail should open a file in try statement
ORC-304 Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306 Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324 Add support for ARM and PPC arch
ORC-330 Remove unnecessary Hive artifacts from root pom
ORC-332 Add syntax version to orc_proto.proto
ORC-336 Remove avro and parquet dependency management entries
ORC-360 Implement error checking on subtype fields in Java

How was this patch tested?

Pass the Jenkins.

@SparkQA
Copy link

SparkQA commented May 19, 2018

Test build #90837 has finished for PR 21372 at commit 2d11cdc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@SparkQA
Copy link

SparkQA commented May 19, 2018

Test build #90838 has finished for PR 21372 at commit 2d11cdc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

The master branch failure is due to #21299 .

@kiszk
Copy link
Member

kiszk commented May 20, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 20, 2018

Test build #90852 has finished for PR 21372 at commit 2d11cdc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun reopened this May 20, 2018
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
public long getLong(int rowId) {
int index = getRowIndex(rowId);
if (isTimestamp) {
return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000;
return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Apache ORC 1.4.4, ORC-306 fixes this according to the original definition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know when this issue was introduced in ORC?

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is on TreeReaderFactory.java. From Apache ORC project, the prior code is ORC-1 which was the initial importing code from Hive two years ago.

Effectively, the writer side is the same. Only, reader side is changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrcHadoopFsRelationSuite covers this changes via end-to-end write and read test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding, ORC-306 changes the query result, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORC-306 changes the content of exposed ORC column vectors in reader side. The interpretation is Spark's logic as we see in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying no external impact of ORC-306?

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what I mean is, with ORC-306 and this fix, there is no external impact outside Spark. More specifically, outside OrcColumnVector/OrcColumnarBatchReader. In other words, ORC 1.4.4 cannot be used with Apache Spark without this patch.

Java Timestamp.getTime and Timestamp.getNano has an overlap by definition. Previously, ORC didn't stick to the definition.

@SparkQA
Copy link

SparkQA commented May 21, 2018

Test build #90869 has finished for PR 21372 at commit 700872d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented May 21, 2018

You've already checked if we have no performance difference, right?

@dongjoon-hyun
Copy link
Member Author

Sure, @maropu . In addition, I reviewed the nine patches, almost trivial ones. I'll update the PR description more.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented May 22, 2018

@HyukjinKwon . Could you review this please?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been actually watching this. LGTM from my side.

@gatorsmile
Copy link
Member

A few basic questions about this upgrade.

What are the benefits of these nine trivial patches? If no impact on Spark users, we should not upgrade it; if the new release fixes the bug, we need to add the test cases to verify the fix. Please prove the necessity of the upgrade.

@dongjoon-hyun
Copy link
Member Author

@gatorsmile .
Basically, ORC-301 will reduce the change of ORC file leakage in some cases. I made that patch and merged it long time ago, but it's released at this release. Also, ORC-306 fixes a bug on Java Timestamp and it's ORC workaround. Please see here for the detail of Java Timestamp bug and the issue on previous ORC workaround.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented May 22, 2018

For file leakage issues, we have been monitoring the flakiness of SPARK-23458 and SPARK-23390 in our Jenkins environment. Until now, I couldn't reproduce it locally.

@dongjoon-hyun
Copy link
Member Author

For Timestamp issue, I'm trying to find some example.

@gatorsmile
Copy link
Member

https://github.com/dongjoon-hyun/orc/blob/cad48d6b11a65264a5b22c73aa2be9029aa72988/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L520-L522

Regarding the file leakage, I did not see any exception issued in these three lines from our log? Does that mean ORC eat the exceptions attempt to re-open the files?

@dongjoon-hyun
Copy link
Member Author

For me, those three lines do not throws exceptions. Do you mean another lines?

OrcProto.PostScript ps;
OrcProto.FileTail.Builder fileTailBuilder = OrcProto.FileTail.newBuilder();
long modificationTime;

@gatorsmile
Copy link
Member

I am just trying to find out why ORC-301 resolves the issues of SPARK-23458 and SPARK-23390

@dongjoon-hyun
Copy link
Member Author

I didn't say ORC-301 resolves the issue of SPARK-23458 and SPARK-23390.
SPARK-23458 and SPARK-23390 reports open file leakages in some unknown situations, doesn't it?

Seq(ts).toDF.write.orc(path.getCanonicalPath)
checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the test case for ORC-306 and update the PR title.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean the Hive ORC reader works, but the native ORC reader has the bug?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please explicitly set hive reader to native for this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed this comments. Hive ORC and ORC MR reader doesn't have this bug because it uses java.sql.Timestamp class to unserialize it. This happens when we directly access the ORC column's sub-vectors, times and nanos.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrcSourceSuite is dedicated for native Orc Reader . For hive ORC reader, HiveOrcSourceSuite.

@SparkQA
Copy link

SparkQA commented May 23, 2018

Test build #90999 has finished for PR 21372 at commit 954d1d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

The failure is irrelevant to this PR.

org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(It is not a test it is a sbt.testing.SuiteSelector)

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented May 23, 2018

Test build #91013 has finished for PR 21372 at commit 954d1d9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

It's weird. UnivocityParserSuite is still complaining.

Error Message

java.lang.IllegalStateException: LiveListenerBus is stopped.

Stacktrace

sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@HyukjinKwon
Copy link
Member

UnivocityParserSuite failed in my PR too. Shouldn't be related with this.

@SparkQA
Copy link

SparkQA commented May 23, 2018

Test build #91024 has finished for PR 21372 at commit 954d1d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

since it has a bug fix, shall we backport it to Spark 2.3?

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @cloud-fan . Sure, if possible.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@gatorsmile
Copy link
Member

Before we do the merge, could you address the comment: #21372 (comment)?

@gatorsmile
Copy link
Member

Please document the description of the bug in both JIRA and PR description? Also need to mention which ORC reader is affected.

@SparkQA
Copy link

SparkQA commented May 23, 2018

Test build #91055 has finished for PR 21372 at commit 954d1d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Yep. Both JIRA and PR description is updated.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented May 24, 2018

Test build #91070 has finished for PR 21372 at commit 954d1d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Finally! Could you review this again, @HyukjinKwon , @gatorsmile , @cloud-fan ?

@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3!

@gatorsmile
Copy link
Member

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request May 24, 2018
ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is about `Timestamp` bug (ORC-306) which occurs when `native` ORC vectorized reader reads ORC column vector's sub-vector `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and this PR includes the updated interpretation on ORC column vectors. Note that `hive` ORC reader and ORC MR reader is not affected.

```scala
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
```

This PR aims to update Apache Spark to use it.

**FULL LIST**

ID | TITLE
-- | --
ORC-281 | Fix compiler warnings from clang 5.0
ORC-301 | `extractFileTail` should open a file in `try` statement
ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324 | Add support for ARM and PPC arch
ORC-330 | Remove unnecessary Hive artifacts from root pom
ORC-332 | Add syntax version to orc_proto.proto
ORC-336 | Remove avro and parquet dependency management entries
ORC-360 | Implement error checking on subtype fields in Java

Pass the Jenkins.

Author: Dongjoon Hyun <[email protected]>

Closes #21372 from dongjoon-hyun/SPARK_ORC144.

(cherry picked from commit 486ecc6)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 486ecc6 May 24, 2018
@dongjoon-hyun
Copy link
Member Author

Thank you, @cloud-fan , @gatorsmile , @HyukjinKwon .

@dongjoon-hyun dongjoon-hyun deleted the SPARK_ORC144 branch May 24, 2018 03:42
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is about `Timestamp` bug (ORC-306) which occurs when `native` ORC vectorized reader reads ORC column vector's sub-vector `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and this PR includes the updated interpretation on ORC column vectors. Note that `hive` ORC reader and ORC MR reader is not affected.

```scala
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
```

This PR aims to update Apache Spark to use it.

**FULL LIST**

ID | TITLE
-- | --
ORC-281 | Fix compiler warnings from clang 5.0
ORC-301 | `extractFileTail` should open a file in `try` statement
ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324 | Add support for ARM and PPC arch
ORC-330 | Remove unnecessary Hive artifacts from root pom
ORC-332 | Add syntax version to orc_proto.proto
ORC-336 | Remove avro and parquet dependency management entries
ORC-360 | Implement error checking on subtype fields in Java

Pass the Jenkins.

Author: Dongjoon Hyun <[email protected]>

Closes apache#21372 from dongjoon-hyun/SPARK_ORC144.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants