Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. #1819

Closed
wants to merge 12 commits into from

Conversation

marmbrus
Copy link
Contributor

@marmbrus marmbrus commented Aug 6, 2014

This PR adds an experimental flag spark.sql.hive.convertMetastoreParquet that when true causes the planner to detects tables that use Hive's Parquet SerDe and instead plans them using Spark SQL's native ParquetTableScan.

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18078/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 7, 2014

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18086/consoleFull

* SerDe.
*/
private[spark] def convertMetastoreParquet: Boolean =
getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to test this PR soon. In the meantime would it make sense to only put this in SQLConf (as well as a field of the key string in the singleton object), making that class the central place that stores SQL configs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have mixed feelings about that. The problem being that this only applies to HiveContexts, so it doesn't really make much sense in a SQLContext.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a job for HiveConf extends SQLConf! After all, there's nothing better than confusing users trying to use org.apache.hadoop.hive.conf.HiveConf!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When in doubt, make up longer names: SQLConfigOpts, HiveConfigOpts. But this is only possibly relevant in the future and should not block this PR.

@SparkQA
Copy link

SparkQA commented Aug 7, 2014

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18078/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 7, 2014

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18086/consoleFull

@patmcdonough
Copy link

@marmbrus - great to see this. Let's test the Hive 13 syntactic sugar too to make sure it still works (... STORED AS PARQUET)

.saveAsParquetFile(partDir.getCanonicalPath)
}

sql(s"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we execute setup queries in the constructor, will we introduce any issue to mvn tests? It looks similar with what we originally did for HiveTableScanSuite. Then, we have to use createQueryTest to atomically run setup and execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are okay as long as we don't use createQueryTest anywhere, since it runs reset(). I can try to move the DDL into each test to be safe though.

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18168/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18168/consoleFull


val unresolvedProjection = projectList.map(_ transform {
// Handle non-partitioning columns
case a: AttributeReference if !partitionKeyIds.contains(a.exprId) => UnresolvedAttribute(a.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad... My IDE was misconfigured on the right margin...

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18179/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA results for PR 1819:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(s: SparkPlan) {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18179/consoleFull

@marmbrus marmbrus changed the title [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. [WIP][SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. Aug 8, 2014
}
})

hiveContext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will that causes performance issue if there are lots of partitions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did due to the hadoopConf getting broadcasted over and over again. Hence: c0d9b72

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18218/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(originalPlan: SparkPlan) {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18218/consoleFull

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA tests have started for PR 1819. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18555/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA results for PR 1819:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {
implicit class LogicalPlanHacks(s: SchemaRDD) {
implicit class PhysicalPlanHacks(originalPlan: SparkPlan) {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18555/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have started for PR 1819 at commit 570fd9e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have started for PR 1819 at commit 41ebc5f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1819 at commit 41ebc5f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Serializer
    • abstract class SerializerInstance
    • abstract class SerializationStream
    • abstract class DeserializationStream
    • class ShuffleBlockManager(blockManager: BlockManager,
    • case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan
    • implicit class LogicalPlanHacks(s: SchemaRDD)
    • implicit class PhysicalPlanHacks(originalPlan: SparkPlan)
    • class FakeParquetSerDe extends SerDe

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1819 at commit 570fd9e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan
    • implicit class LogicalPlanHacks(s: SchemaRDD)
    • implicit class PhysicalPlanHacks(originalPlan: SparkPlan)

@SparkQA
Copy link

SparkQA commented Aug 18, 2014

QA tests have started for PR 1819 at commit 4f3d54f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1819 at commit 4f3d54f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
This reverts commit 41ebc5f.

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala
@marmbrus marmbrus changed the title [WIP][SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. Aug 18, 2014
@marmbrus
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 18, 2014

QA tests have started for PR 1819 at commit 1620079.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1819 at commit 1620079.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan
    • implicit class LogicalPlanHacks(s: SchemaRDD)
    • implicit class PhysicalPlanHacks(originalPlan: SparkPlan)
    • class FakeParquetSerDe extends SerDe

@marmbrus
Copy link
Contributor Author

This only failed the thrift server tests. I'm going to merge into master and 1.1

asfgit pushed a commit that referenced this pull request Aug 18, 2014
…HiveMetaStore tables.

This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that when true causes the planner to detects tables that use Hive's Parquet SerDe and instead plans them using Spark SQL's native `ParquetTableScan`.

Author: Michael Armbrust <[email protected]>
Author: Yin Huai <[email protected]>

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.

(cherry picked from commit 3abd0c1)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 3abd0c1 Aug 18, 2014
@marmbrus marmbrus deleted the parquetMetastore branch August 27, 2014 20:55
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…HiveMetaStore tables.

This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that when true causes the planner to detects tables that use Hive's Parquet SerDe and instead plans them using Spark SQL's native `ParquetTableScan`.

Author: Michael Armbrust <[email protected]>
Author: Yin Huai <[email protected]>

Closes apache#1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request apache#8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
viirya added a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…ch-3.4.0 (apache#1819)

* rdar://112325953: Add Rio pipeline to run iceberg unit tests for branch-3.4.0

* Comment some shadow-test

* For review

* Upgrade Iceberg version

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants