[SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables. #1819
Conversation
QA tests have started for PR 1819. This patch merges cleanly.
 * SerDe.
 */
private[spark] def convertMetastoreParquet: Boolean =
  getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"
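A minimal sketch of how a user would flip this flag, assuming an existing SparkContext; both `setConf` and the SQL `SET` command are standard Spark SQL configuration mechanisms:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
val hiveContext = new HiveContext(sc)

// The flag defaults to "false", so the conversion is opt-in.
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// Equivalently, via a SQL SET command:
hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=true")
```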
I am going to test this PR soon. In the meantime, would it make sense to only put this in SQLConf (as well as a field for the key string in the singleton object), making that class the central place that stores SQL configs?
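A hypothetical sketch of the centralization being suggested; `CONVERT_METASTORE_PARQUET` and the trait layout below are illustrative, not the actual Spark source:

```scala
// Hypothetical: the key string lives in one place, SQLConf's companion
// object, and every accessor reads through it.
object SQLConf {
  val CONVERT_METASTORE_PARQUET = "spark.sql.hive.convertMetastoreParquet"
}

trait SQLConf {
  import java.util.concurrent.ConcurrentHashMap

  protected val settings = new ConcurrentHashMap[String, String]()

  def setConf(key: String, value: String): Unit = { settings.put(key, value) }

  def getConf(key: String, default: String): String = {
    val v = settings.get(key)
    if (v == null) default else v
  }

  // In this sketch the Hive-specific accessor sits here only to show the
  // shared-key pattern; the PR as written keeps it in HiveContext.
  def convertMetastoreParquet: Boolean =
    getConf(SQLConf.CONVERT_METASTORE_PARQUET, "false") == "true"
}
```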
I have mixed feelings about that. The problem is that this only applies to HiveContexts, so it doesn't really make much sense in a SQLContext.
Sounds like a job for `HiveConf extends SQLConf`! After all, there's nothing better than confusing users trying to use `org.apache.hadoop.hive.conf.HiveConf`!
When in doubt, make up longer names: `SQLConfigOpts`, `HiveConfigOpts`. But this is only possibly relevant in the future and should not block this PR.
QA results for PR 1819:
@marmbrus - great to see this. Let's test the Hive 13 syntactic sugar too to make sure it still works.
    .saveAsParquetFile(partDir.getCanonicalPath)
}

sql(s"""
If we execute setup queries in the constructor, will we introduce any issue to mvn tests? It looks similar to what we originally did for HiveTableScanSuite. There, we had to use createQueryTest to atomically run setup and execution.
I think we are okay as long as we don't use createQueryTest anywhere, since it runs reset(). I can try to move the DDL into each test to be safe though.
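A minimal sketch of the "DDL in each test" idea, assuming ScalaTest's FunSuite; the suite name, table name, and the `sql` stand-in are illustrative:

```scala
import org.scalatest.FunSuite

// Hypothetical suite layout: each test issues its own DDL, so nothing
// depends on constructor-time setup that a reset() could wipe out.
class ParquetMetastoreSuiteSketch extends FunSuite {

  // Stand-in for HiveContext.sql; the real suite would run HiveQL.
  private def sql(query: String): Unit = println(s"executing: $query")

  test("conversion happens for a metastore Parquet table") {
    sql("CREATE TABLE IF NOT EXISTS partitioned_parquet (i INT) STORED AS PARQUET")
    try {
      sql("SELECT * FROM partitioned_parquet")
      // assertions on the physical plan would go here
    } finally {
      sql("DROP TABLE IF EXISTS partitioned_parquet")
    }
  }
}
```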
Partitioning columns can be resolved.
QA tests have started for PR 1819. This patch merges cleanly.
QA results for PR 1819:
val unresolvedProjection = projectList.map(_ transform {
  // Handle non-partitioning columns
  case a: AttributeReference if !partitionKeyIds.contains(a.exprId) =>
    UnresolvedAttribute(a.name)
My bad... My IDE was misconfigured on the right margin...
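For readers outside the Catalyst codebase, a self-contained sketch of what the transform above does, using simplified stand-ins for `AttributeReference` and `UnresolvedAttribute` (the real classes carry data types, nullability, and qualifiers, and the real transform recurses into child expressions rather than matching only at the root):

```scala
// Simplified stand-ins for Catalyst expressions.
sealed trait Expr {
  // Apply the rule at the root if it matches; otherwise return the node as-is.
  def transform(rule: PartialFunction[Expr, Expr]): Expr =
    rule.applyOrElse(this, identity[Expr])
}
case class AttributeRef(name: String, exprId: Long) extends Expr
case class UnresolvedAttr(name: String) extends Expr

val partitionKeyIds = Set(1L)
val projectList: Seq[Expr] = Seq(AttributeRef("key", 1L), AttributeRef("value", 2L))

// Non-partition columns are stripped back to unresolved attributes so they
// re-resolve against the ParquetTableScan's output; partition columns keep
// their original expression ids.
val unresolvedProjection = projectList.map(_ transform {
  case a: AttributeRef if !partitionKeyIds.contains(a.exprId) => UnresolvedAttr(a.name)
})
// unresolvedProjection == Seq(AttributeRef("key", 1L), UnresolvedAttr("value"))
```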
Avoid creating a HadoopRDD per partition. Add dirty hacks to retrieve partition values from the InputSplit.
QA tests have started for PR 1819. This patch merges cleanly.
QA results for PR 1819:
  }
})

hiveContext
Will that cause performance issues if there are lots of partitions?
It did, due to the hadoopConf getting broadcast over and over again. Hence: c0d9b72
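A hedged sketch of the overhead in question: every RDD created from a Hadoop input path ships its own copy of the Hadoop configuration when jobs run over it, so building one RDD per partition multiplies that cost, while a single RDD over all partition paths pays it once. The paths below are made up; `textFile` accepting a comma-separated path list is standard Spark behavior:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

// Hypothetical partition directories of a partitioned table.
val partitionPaths = Seq("/data/table/part=1", "/data/table/part=2", "/data/table/part=3")

// Costly pattern: one RDD per partition; each RDD carries and ships its
// own copy of the Hadoop configuration.
val perPartition = partitionPaths.map(path => sc.textFile(path))
val unioned = sc.union(perPartition)

// Cheaper pattern: a single RDD over all partition paths, so the
// configuration is shipped once.
val single = sc.textFile(partitionPaths.mkString(","))
```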
QA tests have started for PR 1819. This patch merges cleanly.
QA results for PR 1819:
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
QA tests have started for PR 1819. This patch merges cleanly.
QA results for PR 1819:
QA tests have started for PR 1819 at commit
QA tests have finished for PR 1819 at commit
QA tests have started for PR 1819 at commit
QA tests have finished for PR 1819 at commit
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
This reverts commit 41ebc5f. Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala
Jenkins, test this please.
QA tests have started for PR 1819 at commit
QA tests have finished for PR 1819 at commit
This only failed the thrift server tests. I'm going to merge into master and 1.1.
[SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.

This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.

Author: Michael Armbrust <[email protected]>
Author: Yin Huai <[email protected]>

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition. Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.

(cherry picked from commit 3abd0c1)
Signed-off-by: Michael Armbrust <[email protected]>
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.
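To tie it together, a minimal end-to-end sketch; the `logs` table is assumed to already exist in the metastore with Hive's Parquet SerDe:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
val hiveContext = new HiveContext(sc)

// Opt in to the native scan for metastore Parquet tables.
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

val query = hiveContext.sql("SELECT COUNT(*) FROM logs")

// With the flag on, the physical plan should contain a ParquetTableScan
// instead of a HiveTableScan over the Parquet SerDe.
println(query.queryExecution.executedPlan)
println(query.collect().mkString("\n"))
```

Inspecting `queryExecution.executedPlan` mirrors what the PR's test does to confirm the conversion actually happened, rather than trusting the flag alone.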