-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14014] [SQL] Replace existing catalog with SessionCatalog #11836
Conversation
commit ad43a5f Author: Andrew Or <[email protected]> Date: Wed Mar 16 14:35:02 2016 -0700 Expand test scope + clean up test code commit 08969cd Author: Andrew Or <[email protected]> Date: Wed Mar 16 13:21:50 2016 -0700 Fix tests commit 6d9fa2f Author: Andrew Or <[email protected]> Date: Wed Mar 16 12:31:52 2016 -0700 Keep track of current database in SessionCatalog This allows us to not pass it into every single method like we used to before this commit. commit ff1c2c4 Author: Andrew Or <[email protected]> Date: Tue Mar 15 19:42:22 2016 -0700 Add TODO commit 8c84dd8 Author: Andrew Or <[email protected]> Date: Tue Mar 15 19:41:30 2016 -0700 Implement tests for functions commit 3da16fb Author: Andrew Or <[email protected]> Date: Tue Mar 15 19:04:03 2016 -0700 Implement tests for table partitions commit 7947445 Author: Andrew Or <[email protected]> Date: Tue Mar 15 18:52:30 2016 -0700 Implement tests for databases and tables commit 2f5121b Author: Andrew Or <[email protected]> Date: Tue Mar 15 16:59:38 2016 -0700 Fix infinite loop (woops) commit d3f252d Author: Andrew Or <[email protected]> Date: Tue Mar 15 16:12:55 2016 -0700 Refactor CatalogTestCases to make methods accessible commit caa4013 Author: Andrew Or <[email protected]> Date: Tue Mar 15 15:44:23 2016 -0700 Clean up duplicate code in Table/FunctionIdentifier commit 90ccdbb Author: Andrew Or <[email protected]> Date: Tue Mar 15 15:33:30 2016 -0700 Fix style commit 5587a49 Author: Andrew Or <[email protected]> Date: Tue Mar 15 15:32:38 2016 -0700 Implement SessionCatalog using ExternalCatalog commit 196f7ce Author: Andrew Or <[email protected]> Date: Tue Mar 15 14:39:22 2016 -0700 Document and clean up function methods commit 6d530a9 Author: Andrew Or <[email protected]> Date: Tue Mar 15 14:38:50 2016 -0700 Fix tests commit 2118212 Author: Andrew Or <[email protected]> Date: Tue Mar 15 14:33:20 2016 -0700 Refactor CatalogFunction to use FunctionIdentifier commit dd1fbae Author: Andrew Or <[email protected]> Date: Tue Mar 15 14:22:37 2016 -0700 Refactor CatalogTable to use TableIdentifier This is a standalone commit such that in the future we can split it out into a separate patch if preferrable. commit 39a153c Author: Andrew Or <[email protected]> Date: Tue Mar 15 13:53:42 2016 -0700 Take into account current database in table methods commit 5bf695c Author: Andrew Or <[email protected]> Date: Mon Mar 14 17:14:59 2016 -0700 Do the same for functions and partitions commit 1d12578 Author: Andrew Or <[email protected]> Date: Mon Mar 14 16:27:11 2016 -0700 Clean up table method signatures + add comments commit 98c8a3b Author: Andrew Or <[email protected]> Date: Thu Mar 10 16:35:35 2016 -0800 Merge in @yhuai's changes
We need to be able to pass in ExternalCatalog in the constructor of SQLContext and subclasses because these should be persistent across sessions. Unfortunately without significant refactoring in the HiveContext and TestHive code we cannot make this simple change happen.
This failed because SessionCatalog does not implement refreshTable. This is a bigger problem because SessionCatalog has no notion of caching tables in the first place and so it doesn't really make sense to implement refreshTable. More refactoring involving HiveMetastoreCatalog is required to make this work.
This commit deletes the trait analysis.Catalog and all of its subclasses, with one notable exception: HiveMetastoreCatalog is kept because a lot of existing functionality (like caching data source tables) are still needed. All other occurrences are now replaced with SessionCatalog. Unfortunately, because HiveMetastoreCatalog is a massive sprawl of unmaintainable code, there is no clean way to integrate it nicely with the new HiveCatalog. The path of least resistance, then, route previous usages of HiveMetastoreCatalog through HiveCatalog. This requires some whacky initialization order hacks because HMC takes in HiveContext but HiveContext takes in HiveCatalog.
The biggest change here is moving HiveMetastoreCatalog from HiveCatalog (the external one) to HiveSessionCatalog (the session specific one). This is needed because HMC depends on a lot of session specific things for, e.g. creating data source tables. This was failing tests that do things with multiple sessions, i.e. HiveQuerySuite.
There were some issues with case sensitivity analysis and error messages not being exactly as expected. The latter is now relaxed where possible.
Test build #53579 has finished for PR 11836 at commit
|
c0f2733
to
c439280
Compare
34948a5
to
f089e2b
Compare
Test build #53582 has finished for PR 11836 at commit
|
Test build #53587 has finished for PR 11836 at commit
|
Test build #53585 has finished for PR 11836 at commit
|
} | ||
|
||
private def existsPartition(db: String, table: String, spec: TablePartitionSpec): Boolean = { | ||
private def partitionExists(db: String, table: String, spec: TablePartitionSpec): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the future let's separate renames likes this from large prs.
The changes look pretty good. We'd need to make Jenkins pass though. |
all good, let's try this again test this please |
Test build #2664 has finished for PR 11836 at commit
|
looks really close ... |
Test build #2662 has finished for PR 11836 at commit
|
Test build #2665 has finished for PR 11836 at commit
|
Test build #53900 has finished for PR 11836 at commit
|
Test build #53906 has finished for PR 11836 at commit
|
Ah, we also need to update the python test after we changed the behavior of list tables in a non-existent db. |
Looks like the conflict is introduced by https://github.com/apache/spark/pull/11889/files. Should be pretty minor. But, let me see. |
looks like the conflicts are very easy to resolve. |
dir | ||
} | ||
|
||
private lazy val temporaryConfig = newTemporaryConfiguration(useInMemoryDerby = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why we have to remove
TestHiveContext.hiveClientConfigurations(hiveconf, warehousePath, scratchDirPath)
.foreach { case (k, v) => metadataHive.runSqlHive(s"SET $k=$v") }
in 16a54ba. temporaryConfig was originally a lazy val. But, if we use hiveClientConfigurations, we are actually generate new confs every time we call reset.
ea1645b
to
350bd2e
Compare
Test build #53952 has finished for PR 11836 at commit
|
Test build #53955 has finished for PR 11836 at commit
|
Merging in master - thanks! |
Test build #53956 has finished for PR 11836 at commit
|
## What changes were proposed in this pull request? `SessionCatalog`, introduced in apache#11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely: - SPARK-14013: Properly implement temporary functions in `SessionCatalog` - SPARK-13879: Decide which DDL/DML commands to support natively in Spark - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`. - SPARK-?????: Merge SQL/HiveContext ## How was this patch tested? This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`. Author: Andrew Or <[email protected]> Author: Yin Huai <[email protected]> Closes apache#11836 from andrewor14/use-session-catalog.
## What changes were proposed in this pull request? This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests. ## How was this patch tested? See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`. Author: Andrew Or <[email protected]> Closes #11938 from andrewor14/session-catalog-again.
What changes were proposed in this pull request?
SessionCatalog
, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations toExternalCatalog
. This functionality overlaps a lot with the existinganalysis.Catalog
.As of this commit,
SessionCatalog
andExternalCatalog
will no longer be dead code. There are still things that need to be done after this patch, namely:SessionCatalog
SessionCatalog
.How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are
SessionCatalogSuite
andExternalCatalogSuite
.