[SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. #1819

marmbrus · 2014-08-06T23:40:16Z

This PR adds an experimental flag, `spark.sql.hive.convertMetastoreParquet`, that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and plan them using Spark SQL's native `ParquetTableScan` instead.

SparkQA · 2014-08-06T23:43:43Z

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18078/consoleFull

SparkQA · 2014-08-07T00:34:41Z

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18086/consoleFull

concretevitamin · 2014-08-07T00:47:43Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala

+   * SerDe.
+   */
+  private[spark] def convertMetastoreParquet: Boolean =
+    getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"


I am going to test this PR soon. In the meantime, would it make sense to put this only in SQLConf (with the key string as a field of the singleton object), making that class the central place that stores SQL configs?

marmbrus · 2014-08-07

I have mixed feelings about that. The problem is that this applies only to HiveContexts, so it doesn't really make sense in a SQLContext.

aarondav · 2014-08-07

Sounds like a job for HiveConf extends SQLConf! After all, there's nothing better than confusing users trying to use org.apache.hadoop.hive.conf.HiveConf!

concretevitamin · 2014-08-07

When in doubt, make up longer names: SQLConfigOpts, HiveConfigOpts. But this would only become relevant in the future and should not block this PR.
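The trade-off discussed in this thread can be sketched in plain Scala. This is a minimal illustration, not Spark's actual `SQLConf`: the names `ConfKeys`, `BaseSqlConf`, and `HiveOnlyConf` are hypothetical. It keeps the key string in one singleton while exposing the Hive-specific accessor only from a Hive-side trait, so a plain SQL context never sees it.

```scala
import scala.collection.mutable

// Hypothetical singleton holding config key strings in one place.
object ConfKeys {
  val ConvertMetastoreParquet = "spark.sql.hive.convertMetastoreParquet"
}

// Hypothetical stand-in for a generic SQL config store (not Spark's SQLConf).
trait BaseSqlConf {
  private val settings = mutable.Map.empty[String, String]

  def setConf(key: String, value: String): Unit =
    settings.synchronized { settings(key) = value }

  def getConf(key: String, default: String): String =
    settings.synchronized { settings.getOrElse(key, default) }
}

// Hive-only options live in a separate trait, so a plain SQL context never exposes them.
trait HiveOnlyConf extends BaseSqlConf {
  def convertMetastoreParquet: Boolean =
    getConf(ConfKeys.ConvertMetastoreParquet, "false") == "true"
}
```

With this layout, `new HiveOnlyConf {}` reports `convertMetastoreParquet` as false until the key is set to `"true"`, which matches the default in the diff excerpt above.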

SparkQA · 2014-08-07T00:49:30Z

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18078/consoleFull

SparkQA · 2014-08-07T01:40:15Z

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18086/consoleFull

patmcdonough · 2014-08-07T02:13:10Z

@marmbrus - great to see this. Let's test the Hive 13 syntactic sugar too to make sure it still works (... STORED AS PARQUET)

yhuai · 2014-08-07T17:21:10Z

sql/hive/src/test/scala/org/apache/spark/sql/parquet/ParquetMetastoreSuite.scala

+      .saveAsParquetFile(partDir.getCanonicalPath)
+  }
+
+  sql(s"""


If we execute setup queries in the constructor, will that cause any issues with the mvn tests? It looks similar to what we originally did for HiveTableScanSuite. In that case, we had to use createQueryTest to run setup and execution atomically.

marmbrus · 2014-08-16

I think we are okay as long as we don't use createQueryTest anywhere, since it runs reset(). I can try to move the DDL into each test to be safe though.


SparkQA · 2014-08-08T01:49:30Z

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18168/consoleFull

SparkQA · 2014-08-08T01:50:13Z

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18168/consoleFull

yhuai · 2014-08-08T01:57:13Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala

+
+        val unresolvedProjection = projectList.map(_ transform {
+          // Handle non-partitioning columns
+          case a: AttributeReference if !partitionKeyIds.contains(a.exprId) => UnresolvedAttribute(a.name)


My bad... My IDE was misconfigured on the right margin...
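The excerpt above rewrites the projection list with a partial function, swapping each resolved reference to a non-partitioning column for `UnresolvedAttribute(a.name)`. The pattern can be sketched with a miniature expression tree. Everything in this block (`ProjectionSketch`, `Expr`, `AttributeRef`, `UnresolvedAttr`, `Alias`, `Add`, and `unresolveNonPartitionColumns`) is a hypothetical stand-in for illustration, not Catalyst's actual classes.

```scala
object ProjectionSketch {
  // Hypothetical miniature expression tree; Catalyst's real classes are richer.
  sealed trait Expr {
    // Apply `rule` to this node where it matches, then recurse into the children of the result.
    def transform(rule: PartialFunction[Expr, Expr]): Expr = {
      val afterRule = if (rule.isDefinedAt(this)) rule(this) else this
      afterRule match {
        case Alias(child, name) => Alias(child.transform(rule), name)
        case Add(left, right)   => Add(left.transform(rule), right.transform(rule))
        case leaf               => leaf
      }
    }
  }
  final case class AttributeRef(name: String, exprId: Long) extends Expr
  final case class UnresolvedAttr(name: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Replace every non-partitioning column reference with a by-name placeholder,
  // leaving partition key references untouched.
  def unresolveNonPartitionColumns(
      projectList: Seq[Expr],
      partitionKeyIds: Set[Long]): Seq[Expr] =
    projectList.map(_.transform {
      case a: AttributeRef if !partitionKeyIds.contains(a.exprId) => UnresolvedAttr(a.name)
    })
}
```

Because the rule is a partial function, nodes it does not match (aliases, arithmetic, partition key references) pass through unchanged, and only the matching leaves are replaced.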


SparkQA · 2014-08-08T03:29:28Z

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18179/consoleFull

SparkQA · 2014-08-08T04:44:39Z

QA results for PR 1819:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18179/consoleFull

chenghao-intel · 2014-08-08T05:43:09Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala

+                }
+              })
+
+              hiveContext


Will that cause performance issues if there are lots of partitions?

marmbrus · 2014-08-08

It did, because the hadoopConf was being broadcast over and over again. Hence c0d9b72.

SparkQA · 2014-08-08T20:29:35Z

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18218/consoleFull

SparkQA · 2014-08-08T21:33:46Z

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(originalPlan: SparkPlan) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18218/consoleFull


SparkQA · 2014-08-14T18:30:22Z

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18555/consoleFull

SparkQA · 2014-08-14T19:38:16Z

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(originalPlan: SparkPlan) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18555/consoleFull

SparkQA · 2014-08-16T20:15:18Z

QA tests have started for PR 1819 at commit 570fd9e.

This patch merges cleanly.

SparkQA · 2014-08-16T20:35:21Z

QA tests have started for PR 1819 at commit 41ebc5f.

This patch merges cleanly.

SparkQA · 2014-08-16T20:36:11Z

QA tests have finished for PR 1819 at commit 41ebc5f.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class Serializer
- abstract class SerializerInstance
- abstract class SerializationStream
- abstract class DeserializationStream
- class ShuffleBlockManager(blockManager: BlockManager,
- case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan
- implicit class LogicalPlanHacks(s: SchemaRDD)
- implicit class PhysicalPlanHacks(originalPlan: SparkPlan)
- class FakeParquetSerDe extends SerDe

SparkQA · 2014-08-16T21:24:45Z

QA tests have finished for PR 1819 at commit 570fd9e.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan
- implicit class LogicalPlanHacks(s: SchemaRDD)
- implicit class PhysicalPlanHacks(originalPlan: SparkPlan)

SparkQA · 2014-08-18T02:05:21Z

QA tests have started for PR 1819 at commit 4f3d54f.

This patch merges cleanly.

SparkQA · 2014-08-18T03:12:48Z

QA tests have finished for PR 1819 at commit 4f3d54f.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.


marmbrus · 2014-08-18T19:02:58Z

Jenkins, test this please.

SparkQA · 2014-08-18T19:05:37Z

QA tests have started for PR 1819 at commit 1620079.

This patch merges cleanly.

SparkQA · 2014-08-18T20:15:05Z

QA tests have finished for PR 1819 at commit 1620079.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan
- implicit class LogicalPlanHacks(s: SchemaRDD)
- implicit class PhysicalPlanHacks(originalPlan: SparkPlan)
- class FakeParquetSerDe extends SerDe

marmbrus · 2014-08-18T20:16:34Z

This only failed the thrift server tests. I'm going to merge into master and 1.1

…HiveMetaStore tables.

This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and plan them using Spark SQL's native `ParquetTableScan` instead.

Author: Michael Armbrust
Author: Yin Huai

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition. Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.

(cherry picked from commit 3abd0c1)
Signed-off-by: Michael Armbrust


Initial support for using ParquetTableScan to read HiveMetaStore tables.

212d5cd

Add a test to make sure conversion is actually happening

1161338

concretevitamin reviewed Aug 7, 2014

yhuai reviewed Aug 7, 2014

yhuai and others added 2 commits August 7, 2014 15:51

Partitioning columns can be resolved.

a0baec7

Merge pull request #8 from yhuai/parquetMetastore

8cdc93c


yhuai reviewed Aug 8, 2014

marmbrus added 2 commits August 7, 2014 20:23

Avoid creating a HadoopRDD per partition. Add dirty hacks to retrieve partition values from the InputSplit.

c0d9b72

include parquet hive to tests pass (Remove this later).

ebb267e

marmbrus changed the title ~~[SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.~~ [WIP][SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. Aug 8, 2014

chenghao-intel reviewed Aug 8, 2014

marmbrus mentioned this pull request Aug 8, 2014

[SPARK-2590][SQL] Added option to handle incremental collection, disabled by default #1853

Closed

Fix bug with tree splicing.

4c4dc19

Merge remote-tracking branch 'origin/master' into parquetMetastore

a43e0da

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala

remove hive parquet bundle

41ebc5f

fix style

4f3d54f

marmbrus added 2 commits August 18, 2014 11:07

Merge remote-tracking branch 'origin/master' into parquetMetastore

cc30430

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala

Revert "remove hive parquet bundle"

1620079

This reverts commit 41ebc5f. Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala

marmbrus changed the title ~~[WIP][SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.~~ [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. Aug 18, 2014

asfgit closed this in 3abd0c1 Aug 18, 2014

marmbrus deleted the parquetMetastore branch August 27, 2014 20:55
