[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 #21372

Status: Closed · wants to merge 2 commits
4 changes: 2 additions & 2 deletions dev/deps/spark-deps-hadoop-2.6
@@ -157,8 +157,8 @@ objenesis-2.1.jar
 okhttp-3.8.1.jar
 okio-1.13.0.jar
 opencsv-2.3.jar
-orc-core-1.4.3-nohive.jar
-orc-mapreduce-1.4.3-nohive.jar
+orc-core-1.4.4-nohive.jar
+orc-mapreduce-1.4.4-nohive.jar
 oro-2.0.8.jar
 osgi-resource-locator-1.0.1.jar
 paranamer-2.8.jar
4 changes: 2 additions & 2 deletions dev/deps/spark-deps-hadoop-2.7
@@ -158,8 +158,8 @@ objenesis-2.1.jar
 okhttp-3.8.1.jar
 okio-1.13.0.jar
 opencsv-2.3.jar
-orc-core-1.4.3-nohive.jar
-orc-mapreduce-1.4.3-nohive.jar
+orc-core-1.4.4-nohive.jar
+orc-mapreduce-1.4.4-nohive.jar
 oro-2.0.8.jar
 osgi-resource-locator-1.0.1.jar
 paranamer-2.8.jar
4 changes: 2 additions & 2 deletions dev/deps/spark-deps-hadoop-3.1
@@ -176,8 +176,8 @@ okhttp-2.7.5.jar
 okhttp-3.8.1.jar
 okio-1.13.0.jar
 opencsv-2.3.jar
-orc-core-1.4.3-nohive.jar
-orc-mapreduce-1.4.3-nohive.jar
+orc-core-1.4.4-nohive.jar
+orc-mapreduce-1.4.4-nohive.jar
 oro-2.0.8.jar
 osgi-resource-locator-1.0.1.jar
 paranamer-2.8.jar
2 changes: 1 addition & 1 deletion pom.xml
@@ -130,7 +130,7 @@
     <hive.version.short>1.2.1</hive.version.short>
     <derby.version>10.12.1.1</derby.version>
     <parquet.version>1.10.0</parquet.version>
-    <orc.version>1.4.3</orc.version>
+    <orc.version>1.4.4</orc.version>
     <orc.classifier>nohive</orc.classifier>
     <hive.parquet.version>1.6.0</hive.parquet.version>
     <jetty.version>9.3.20.v20170531</jetty.version>
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
     int index = getRowIndex(rowId);
     if (isTimestamp) {
-      return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000;
+      return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
Member Author:

In Apache ORC 1.4.4, ORC-306 fixes this according to the original definition.

Member:

Add a test case?

Member:

Do you know when this issue was introduced in ORC?

Member Author (@dongjoon-hyun, May 21, 2018):

The change is in TreeReaderFactory.java. In the Apache ORC project, the prior code dates back to ORC-1, the initial import from Hive two years ago.

Effectively, the writer side is unchanged; only the reader side changed.

Member Author:

OrcHadoopFsRelationSuite covers these changes via end-to-end write and read test cases.

Member:

Based on my understanding, ORC-306 changes the query result, right?

Member Author:

ORC-306 changes the content of the ORC column vectors exposed on the reader side. Interpreting them is Spark's logic, as we see in this PR.

Member:

Are you saying ORC-306 has no external impact?

Member Author (@dongjoon-hyun, May 22, 2018):

No, what I mean is: with ORC-306 and this fix, there is no external impact outside Spark, or more specifically, outside OrcColumnVector/OrcColumnarBatchReader. In other words, ORC 1.4.4 cannot be used with Apache Spark without this patch.

Java's Timestamp.getTime and Timestamp.getNanos overlap by definition: both include the millisecond part of the fractional second. Previously, ORC didn't stick to that definition.
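The overlap described above can be sketched in plain Java. This is a minimal illustration; the class name and sample value are mine, not from the PR:

```java
import java.sql.Timestamp;

public class TimestampOverlap {
    public static void main(String[] args) {
        // Fractional second of .123789: 123 ms plus 789 us.
        Timestamp ts = Timestamp.valueOf("2018-05-21 12:34:56.123789");

        long millis = ts.getTime(); // epoch millis, already includes the 123 ms
        int nanos = ts.getNanos();  // 123789000: also includes the 123 ms

        // Naive conversion double-counts the millisecond portion:
        long wrongMicros = millis * 1000 + nanos / 1000;
        // Keeping only the sub-millisecond microseconds avoids the overlap:
        long micros = millis * 1000 + nanos / 1000 % 1000;

        System.out.println(wrongMicros - micros); // prints 123000
    }
}
```

The 123000 difference is exactly the double-counted 123 ms, independent of time zone.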

     } else {
       return longData.vector[index];
     }
@@ -497,7 +497,7 @@ private void putValues(
    * Returns the number of micros since epoch from an element of TimestampColumnVector.
    */
   private static long fromTimestampColumnVector(TimestampColumnVector vector, int index) {
-    return vector.time[index] * 1000L + vector.nanos[index] / 1000L;
+    return vector.time[index] * 1000 + (vector.nanos[index] / 1000 % 1000);
   }

   /**
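The corrected arithmetic can be exercised standalone. The sketch below mirrors TimestampColumnVector's layout with plain arrays (time holds epoch milliseconds, nanos holds the full nanosecond-of-second, so their millisecond portions overlap); the names are illustrative, not Spark's:

```java
public class FromTimestampVector {
    // time[i]: epoch millis, already containing the millisecond part of the
    // fraction. nanos[i]: full nanosecond-of-second, containing it again.
    static long toMicros(long[] time, int[] nanos, int index) {
        // Add only the sub-millisecond microseconds from nanos; the
        // millisecond part is already counted in time[index].
        return time[index] * 1000L + nanos[index] / 1000 % 1000;
    }

    public static void main(String[] args) {
        long[] time = {1526905496123L}; // ends in ...123 ms
        int[] nanos = {123789000};      // .123789 s
        System.out.println(toMicros(time, nanos, 0)); // prints 1526905496123789
    }
}
```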
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.execution.datasources.orc

 import java.io.File
+import java.sql.Timestamp
 import java.util.Locale

 import org.apache.orc.OrcConf.COMPRESS
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       }
     }
   }

+  test("SPARK-24322 Fix incorrect workaround for bug in java.sql.Timestamp") {
+    withTempPath { path =>
+      val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+      Seq(ts).toDF.write.orc(path.getCanonicalPath)
+      checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+    }
+  }
Member Author:

I added the test case for ORC-306 and updated the PR title.

Member:

Does that mean the Hive ORC reader works, but the native ORC reader has the bug?

Contributor:

Please explicitly set the ORC reader to native for this test.

Member Author:

Oh, I missed this comment. The Hive ORC and ORC MapReduce readers don't have this bug because they use the java.sql.Timestamp class to deserialize values. The bug only occurs when we directly access the ORC column's sub-vectors, time and nanos.
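Why the java.sql.Timestamp path is immune can be sketched as follows: setNanos replaces the fractional second wholesale, so the millisecond portion carried in the epoch millis is never double-counted. This is illustrative code under that assumption, not the actual Hive reader:

```java
import java.sql.Timestamp;

public class TimestampRoundTrip {
    // Sketch of a reader that materializes java.sql.Timestamp from the two
    // overlapping vector fields; names are illustrative, not from the PR.
    static Timestamp toTimestamp(long millis, int nanos) {
        Timestamp ts = new Timestamp(millis); // sets seconds plus ms fraction
        ts.setNanos(nanos); // overwrites the whole fraction, ms part included
        return ts;
    }

    public static void main(String[] args) {
        Timestamp ts = toTimestamp(123L, 123789000);
        System.out.println(ts.getTime());  // prints 123
        System.out.println(ts.getNanos()); // prints 123789000
    }
}
```

Even though 123 ms appears in both inputs, the round-trip yields a consistent value because setNanos discards the fraction implied by the millis argument.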

Member Author:

OrcSourceSuite is dedicated to the native ORC reader. For the Hive ORC reader, there is HiveOrcSourceSuite.

}

class OrcSourceSuite extends OrcSuite with SharedSQLContext {