-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 #21372
Conversation
Test build #90837 has finished for PR 21372 at commit
|
Retest this please |
Test build #90838 has finished for PR 21372 at commit
|
The master branch failure is due to #21299 . |
retest this please |
Test build #90852 has finished for PR 21372 at commit
|
@@ -136,7 +136,7 @@ public int getInt(int rowId) { | |||
public long getLong(int rowId) { | |||
int index = getRowIndex(rowId); | |||
if (isTimestamp) { | |||
return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000; | |||
return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Apache ORC 1.4.4, ORC-306 fixes this according to the original definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know when this issue was introduced in ORC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change is on TreeReaderFactory.java. From Apache ORC project, the prior code is ORC-1 which was the initial importing code from Hive two years ago.
Effectively, the writer side is the same. Only, reader side is changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OrcHadoopFsRelationSuite
covers this changes via end-to-end write and read test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my understanding, ORC-306 changes the query result, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ORC-306 changes the content of exposed ORC column vectors in reader side. The interpretation is Spark's logic as we see in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying no external impact of ORC-306?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what I mean is, with ORC-306 and this fix, there is no external impact outside Spark. More specifically, outside OrcColumnVector
/OrcColumnarBatchReader
. In other words, ORC 1.4.4 cannot be used with Apache Spark without this patch.
Java Timestamp.getTime
and Timestamp.getNano
has an overlap by definition. Previously, ORC didn't stick to the definition.
Test build #90869 has finished for PR 21372 at commit
|
You've already checked if we have no performance difference, right? |
Sure, @maropu . In addition, I reviewed the nine patches, almost trivial ones. I'll update the PR description more. |
@HyukjinKwon . Could you review this please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been actually watching this. LGTM from my side.
A few basic questions about this upgrade. What are the benefits of these nine trivial patches? If no impact on Spark users, we should not upgrade it; if the new release fixes the bug, we need to add the test cases to verify the fix. Please prove the necessity of the upgrade. |
@gatorsmile . |
For file leakage issues, we have been monitoring the flakiness of SPARK-23458 and SPARK-23390 in our Jenkins environment. Until now, I couldn't reproduce it locally. |
For Timestamp issue, I'm trying to find some example. |
Regarding the file leakage, I did not see any exception issued in these three lines from our log? Does that mean ORC eat the exceptions attempt to re-open the files? |
For me, those three lines do not throws exceptions. Do you mean another lines?
|
I am just trying to find out why ORC-301 resolves the issues of SPARK-23458 and SPARK-23390 |
I didn't say ORC-301 resolves the issue of SPARK-23458 and SPARK-23390. |
Seq(ts).toDF.write.orc(path.getCanonicalPath) | ||
checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts)) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test case for ORC-306 and update the PR title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean the Hive ORC reader works, but the native ORC reader has the bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please explicitly set hive reader to native for this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I missed this comments. Hive ORC and ORC MR reader doesn't have this bug because it uses java.sql.Timestamp
class to unserialize it. This happens when we directly access the ORC column's sub-vectors, times
and nanos
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OrcSourceSuite
is dedicated for native
Orc Reader . For hive
ORC reader, HiveOrcSourceSuite
.
Test build #90999 has finished for PR 21372 at commit
|
The failure is irrelevant to this PR.
|
Retest this please. |
Test build #91013 has finished for PR 21372 at commit
|
It's weird.
|
Retest this please |
|
Test build #91024 has finished for PR 21372 at commit
|
since it has a bug fix, shall we backport it to Spark 2.3? |
Thank you for review, @cloud-fan . Sure, if possible. |
Retest this please. |
Before we do the merge, could you address the comment: #21372 (comment)? |
Please document the description of the bug in both JIRA and PR description? Also need to mention which ORC reader is affected. |
Test build #91055 has finished for PR 21372 at commit
|
Yep. Both JIRA and PR description is updated. |
Retest this please. |
Test build #91070 has finished for PR 21372 at commit
|
Finally! Could you review this again, @HyukjinKwon , @gatorsmile , @cloud-fan ? |
thanks, merging to master/2.3! |
LGTM Thanks! Merged to master/2.3 |
ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is about `Timestamp` bug (ORC-306) which occurs when `native` ORC vectorized reader reads ORC column vector's sub-vector `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and this PR includes the updated interpretation on ORC column vectors. Note that `hive` ORC reader and ORC MR reader is not affected. ```scala scala> spark.version res0: String = 2.3.0 scala> spark.sql("set spark.sql.orc.impl=native") scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc") scala> spark.read.orc("/tmp/orc").show(false) +--------------------------+ |value | +--------------------------+ |1900-05-05 12:34:55.000789| +--------------------------+ ``` This PR aims to update Apache Spark to use it. **FULL LIST** ID | TITLE -- | -- ORC-281 | Fix compiler warnings from clang 5.0 ORC-301 | `extractFileTail` should open a file in `try` statement ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp ORC-324 | Add support for ARM and PPC arch ORC-330 | Remove unnecessary Hive artifacts from root pom ORC-332 | Add syntax version to orc_proto.proto ORC-336 | Remove avro and parquet dependency management entries ORC-360 | Implement error checking on subtype fields in Java Pass the Jenkins. Author: Dongjoon Hyun <[email protected]> Closes #21372 from dongjoon-hyun/SPARK_ORC144. (cherry picked from commit 486ecc6) Signed-off-by: Wenchen Fan <[email protected]>
Thank you, @cloud-fan , @gatorsmile , @HyukjinKwon . |
ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is about `Timestamp` bug (ORC-306) which occurs when `native` ORC vectorized reader reads ORC column vector's sub-vector `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and this PR includes the updated interpretation on ORC column vectors. Note that `hive` ORC reader and ORC MR reader is not affected. ```scala scala> spark.version res0: String = 2.3.0 scala> spark.sql("set spark.sql.orc.impl=native") scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc") scala> spark.read.orc("/tmp/orc").show(false) +--------------------------+ |value | +--------------------------+ |1900-05-05 12:34:55.000789| +--------------------------+ ``` This PR aims to update Apache Spark to use it. **FULL LIST** ID | TITLE -- | -- ORC-281 | Fix compiler warnings from clang 5.0 ORC-301 | `extractFileTail` should open a file in `try` statement ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp ORC-324 | Add support for ARM and PPC arch ORC-330 | Remove unnecessary Hive artifacts from root pom ORC-332 | Add syntax version to orc_proto.proto ORC-336 | Remove avro and parquet dependency management entries ORC-360 | Implement error checking on subtype fields in Java Pass the Jenkins. Author: Dongjoon Hyun <[email protected]> Closes apache#21372 from dongjoon-hyun/SPARK_ORC144.
What changes were proposed in this pull request?
ORC 1.4.4 includes nine fixes. One of the issues is about
Timestamp
bug (ORC-306) which occurs whennative
ORC vectorized reader reads ORC column vector's sub-vectortimes
andnanos
. ORC-306 fixes this according to the original definition and this PR includes the updated interpretation on ORC column vectors. Note thathive
ORC reader and ORC MR reader is not affected.This PR aims to update Apache Spark to use it.
FULL LIST
extractFileTail
should open a file intry
statementHow was this patch tested?
Pass the Jenkins.