
[SPARK-24718][SQL] Timestamp support pushdown to parquet data source #21741

Closed
wants to merge 4 commits into from

Conversation

@wangyum (Member) commented on Jul 10, 2018

What changes were proposed in this pull request?

Add support for pushing down Timestamp filters to the Parquet data source. Only the TIMESTAMP_MICROS and TIMESTAMP_MILLIS types support push-down.
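For context, the new flag works together with the writer-side timestamp type setting. A hedged PySpark usage sketch, not part of this PR: it assumes an existing SparkSession named `spark`, an illustrative path, and a column named `ts`; the two config keys are the ones this PR touches.

```python
# Hedged usage sketch. The config keys come from this PR; the session,
# path, and column name "ts" are illustrative assumptions.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark.conf.set("spark.sql.parquet.filterPushdown.timestamp", "true")

df = spark.read.parquet("/tmp/ts_data")
# With the flag enabled and the file written as TIMESTAMP_MICROS or
# TIMESTAMP_MILLIS, this predicate should show up under PushedFilters
# in the physical plan.
df.where(df.ts > "2018-06-15 00:00:00").explain()
```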

How was this patch tested?

Unit tests and benchmark tests.

"enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
.internal()
.booleanConf
.createWithDefault(true)
@wangyum (Member, Author):

Maybe the default should be false, because PARQUET_OUTPUT_TIMESTAMP_TYPE defaults to INT96.

A contributor replied:

Because we're using the file schema, it doesn't matter what the write configuration is; it only matters what it was when the file was written. If the file has an INT96 timestamp, this should just not push anything down.
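The file-schema-based decision described above can be sketched in plain Python. The function name and type-name strings here are illustrative only; Spark's real check inspects the Parquet file footer's schema, not strings.

```python
# Illustrative sketch of the decision described above: push down only
# when the file stores an INT64-backed timestamp type. Names are
# hypothetical, not Spark's actual API.
PUSHABLE_TIMESTAMP_TYPES = {"TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"}

def can_push_down_timestamp(file_timestamp_type: str) -> bool:
    """Return True only for the INT64-backed Parquet timestamp types."""
    return file_timestamp_type in PUSHABLE_TIMESTAMP_TYPES

# INT96 files simply get no timestamp filter pushed down.
assert can_push_down_timestamp("TIMESTAMP_MICROS")
assert not can_push_down_timestamp("INT96")
```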

@SparkQA commented on Jul 10, 2018

Test build #92796 has finished for PR 21741 at commit b2a9000.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented on Jul 10, 2018

retest this please

@SparkQA commented on Jul 10, 2018

Test build #92799 has finished for PR 21741 at commit b2a9000.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val data = Seq(ts1, ts2, ts3, ts4)

withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
A reviewer (Member) commented:

This case is quite similar to the one below. Should we use a loop for setting SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key to avoid duplicated code?

@wangyum (Member, Author):

I changed to:

    // spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS
    val millisData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123"),
      Timestamp.valueOf("2018-06-15 08:28:53.123"),
      Timestamp.valueOf("2018-06-16 08:28:53.123"),
      Timestamp.valueOf("2018-06-17 08:28:53.123"))
    withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
      ParquetOutputTimestampType.TIMESTAMP_MILLIS.toString) {
      testTimestampPushdown(millisData)
    }

    // spark.sql.parquet.outputTimestampType = TIMESTAMP_MICROS
    val microsData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123456"),
      Timestamp.valueOf("2018-06-15 08:28:53.123456"),
      Timestamp.valueOf("2018-06-16 08:28:53.123456"),
      Timestamp.valueOf("2018-06-17 08:28:53.123456"))
    withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
      ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) {
      testTimestampPushdown(microsData)
    }

We shouldn't use the same data to test the TIMESTAMP_MILLIS and TIMESTAMP_MICROS types:

  1. The TIMESTAMP_MILLIS type will truncate the trailing 456 microseconds if we use microsData.
  2. DateTimeUtils.fromJavaTimestamp(t.asInstanceOf[Timestamp]) can't be exercised at microsecond precision if we use millisData.
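The truncation point in (1) can be seen with plain Python datetimes. This is a standalone illustration, not Spark code; the value mirrors the microsData above.

```python
from datetime import datetime

# A microsecond-precision value like the microsData above.
ts = datetime.strptime("2018-06-14 08:28:53.123456", "%Y-%m-%d %H:%M:%S.%f")

# TIMESTAMP_MILLIS keeps only millisecond precision: the trailing
# 456 microseconds are dropped, so a round-trip would not compare equal.
millis_truncated = ts.replace(microsecond=(ts.microsecond // 1000) * 1000)

assert ts.microsecond == 123456
assert millis_truncated.microsecond == 123000
assert millis_truncated != ts
```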

@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
}
}

test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
A reviewer (Member) commented:

I think we should also test the INT96 timestamp type, and maybe also the case where PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED is disabled.

@SparkQA commented on Jul 12, 2018

Test build #92908 has finished for PR 21741 at commit 5471d79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member) left a comment:

LGTM

@gatorsmile (Member):
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
A contributor commented:

shall we add a new line after the benchmark name? e.g.

Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)):
Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
...

We can send a follow-up PR to fix this entire file.

@wangyum (Member, Author):

OK. I'll send a follow-up PR.

@cloud-fan (Contributor):

LGTM

buildConf("spark.sql.parquet.filterPushdown.timestamp")
.doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
"enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
A reviewer (Member) commented:

Shall we note INT64 here?

@wangyum (Member, Author):

I think end users have a better understanding of TIMESTAMP_MICROS and TIMESTAMP_MILLIS.

@HyukjinKwon (Member) commented on Jul 13, 2018:

... I don't think ordinary users will understand any of them ..

A reviewer (Member) commented:

You need to explain how to use spark.sql.parquet.outputTimestampType to control the Parquet timestamp type Spark uses to write Parquet files.

A contributor commented:

I would just note that push-down doesn't work for INT96 timestamps in the file. It should work for the others.
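For the INT64-backed types, the filter value itself is just an epoch count, which is what makes the comparison pushable. A rough stdlib analogue of turning a timestamp into epoch microseconds follows; Spark's DateTimeUtils.fromJavaTimestamp does this on the JVM side, and this Python version is only an illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_micros(ts: datetime) -> int:
    """Illustrative analogue of DateTimeUtils.fromJavaTimestamp: a
    TIMESTAMP_MICROS value is an INT64 count of microseconds since the
    Unix epoch, so a predicate like "ts > X" reduces to an integer
    comparison that Parquet can evaluate against row-group statistics."""
    # Exact integer division over timedeltas avoids float rounding.
    return (ts - EPOCH) // timedelta(microseconds=1)

a = to_epoch_micros(datetime(2018, 6, 14, 8, 28, 53, 123456, tzinfo=timezone.utc))
b = to_epoch_micros(datetime(2018, 6, 15, 8, 28, 53, 123456, tzinfo=timezone.utc))
assert a < b  # the pushed-down filter is a plain INT64 comparison
```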

@@ -517,7 +585,6 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
}
}


A reviewer (Member) commented:

nit: I would revert this change if you are going to push more changes.

@wangyum (Member, Author):

OK

@HyukjinKwon (Member) left a comment:

LGTM too

@rdblue (Contributor) commented on Jul 13, 2018

+1

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@SparkQA commented on Jul 14, 2018

Test build #93006 has finished for PR 21741 at commit f206457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SerializableConfiguration(@transient var value: Configuration)
  • class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)
  • case class SchemaType(dataType: DataType, nullable: Boolean)
  • implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T])
  • implicit class AvroDataFrameReader(reader: DataFrameReader)
  • class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],
  • abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast
  • case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike

@HyukjinKwon (Member):

Merged to master.

@asfgit closed this in 43e4e85 on Jul 15, 2018
7 participants