[SPARK-19667][SQL]create table with hiveenabled in default database use warehouse path instead of the location of default database #17001
Conversation
Test build #73171 has finished for PR 17001 at commit
Test build #73201 has finished for PR 17001 at commit
I'd like to treat this as a workaround; the location of the default database is still invalid in the other cluster. We can make this logic clearer and more consistent: the default database should not have a location, and when we try to get the location of the default DB, we should use the warehouse path.
Agreed, I'll handle the logic in create/get database in HiveClientImpl.
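(A minimal, self-contained sketch of this idea, assuming simplified stand-ins rather than the real HiveClientImpl API: when the default database is read back, its stored location is ignored and the current warehouse path is substituted.)

// Hedged sketch only: CatalogDatabase, loadFromMetastore and warehousePath below are
// simplified stand-ins, not the actual Spark / Hive client types.
case class CatalogDatabase(
    name: String,
    description: String,
    locationUri: String,
    properties: Map[String, String])

object DefaultDbLocationSketch {
  val DefaultDatabase = "default"

  // `loadFromMetastore` stands in for whatever fetches the stored database definition;
  // `warehousePath` stands in for the value of spark.sql.warehouse.dir.
  def getDatabase(
      name: String,
      loadFromMetastore: String => CatalogDatabase,
      warehousePath: String): CatalogDatabase = {
    val db = loadFromMetastore(name)
    // For the default database, ignore the stored location and use the warehouse path.
    if (name == DefaultDatabase) db.copy(locationUri = warehousePath) else db
  }
}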
Test build #73260 has finished for PR 17001 at commit
Test build #73261 has finished for PR 17001 at commit
client.createDatabase(
  new HiveDatabase(
    database.name,
    database.description,
-   database.locationUri,
+   if (database.name == SessionCatalog.DEFAULT_DATABASE) "" else database.locationUri,
If it is empty, the metastore will set it for us, right?
Sorry, actually it will throw an exception; my local default database had already been created, so it did not hit the exception. I will just replace the default database location when reloading from the metastore, and drop the logic that sets the location to an empty string when creating the database.
Are you able to find the specific Hive JIRA for this?
Personally, I think we should improve the test case. Instead of doing it in HiveDDLSuite, we can do it in HiveSparkSubmitSuite.scala. Basically, when using the same metastore, you just need to verify whether the table location is dependent on
great, I will take a look at it~
Test build #73263 has finished for PR 17001 at commit
Test build #73270 has finished for PR 17001 at commit
Locally, BucketedWriteWithoutHiveSupportSuite passes; let me find out why it failed in Jenkins.
Test build #73273 has finished for PR 17001 at commit
Test build #73274 has finished for PR 17001 at commit
Test build #73656 has finished for PR 17001 at commit
@@ -30,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions.Expression
 *
 * Implementations should throw [[NoSuchDatabaseException]] when databases don't exist.
 */
-abstract class ExternalCatalog {
+abstract class ExternalCatalog(conf: SparkConf, hadoopConf: Configuration) {
How about we just pass in a defaultDB: CatalogDatabase? Then we don't need to add the protected def warehousePath: String.

I think conf/hadoopConf is more useful; later logic can use them, and the subclasses also have these two confs.

We still have conf/hadoopConf in InMemoryCatalog and HiveExternalCatalog; we can just add one more parameter.

If we pass a defaultDB, it seems like we introduce an instance of defaultDB, as we discussed above.

But it will only be used in getDatabase, and we can save a metastore call to get the default database.

ok~ let me fix it~

@cloud-fan I found that if we add a parameter defaultDB to ExternalCatalog and its subclasses InMemoryCatalog and HiveExternalCatalog, this change will cause a lot of related code to be modified, such as test cases and other places where InMemoryCatalog and HiveExternalCatalog are created.

For example: currently all the parameters of InMemoryCatalog have their own default values:

class InMemoryCatalog(conf: SparkConf = new SparkConf, hadoopConfig: Configuration = new Configuration)

so we can create it without any parameters. But if we add a defaultDB, we would have to construct a defaultDB in the parameter list, while we cannot create a legal defaultDB because we cannot get the warehouse path for it, like this:

class InMemoryCatalog(conf: SparkConf = new SparkConf, hadoopConfig: Configuration = new Configuration, defaultDB: CatalogDatabase = CatalogDatabase("default", "", "${can not get the warehouse path}", Map.empty))

If we don't provide a default value for defaultDB in the parameter list, this will cause even more code changes, which I think is not proper.

What about keeping the provided def warehousePath in ExternalCatalog, and adding

lazy val defaultDB = {
  val qualifiedWarehousePath = SessionCatalog.makeQualifiedPath(warehousePath, hadoopConf).toString
  CatalogDatabase("default", "", qualifiedWarehousePath, Map.empty)
}

This can also avoid calling getDatabase.
I have modified the code by adding

lazy val defaultDB = {
  val qualifiedWarehousePath = SessionCatalog.makeQualifiedPath(warehousePath, hadoopConf).toString
  CatalogDatabase("default", "", qualifiedWarehousePath, Map.empty)
}

in ExternalCatalog. If it is not OK, I will revert it, thanks~
Test build #73755 has finished for PR 17001 at commit
Test build #73761 has finished for PR 17001 at commit
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  try {
    val warehousePath = s"file:${spark.sharedState.warehousePath.stripSuffix("/")}"
I am making this modification.
Test build #73827 has started for PR 17001 at commit
@@ -74,7 +88,17 @@ abstract class ExternalCatalog {
 */
def alterDatabase(dbDefinition: CatalogDatabase): Unit

-def getDatabase(db: String): CatalogDatabase
+final def getDatabase(db: String): CatalogDatabase = {
+  val database = getDatabaseInternal(db)
put this in the else branch
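(To make the shape of that suggestion concrete, here is a minimal, self-contained sketch of the pattern being discussed: a cached default-database definition built from the warehouse path, and a final getDatabase that only calls getDatabaseInternal in the else branch. The trait, the plain-string CatalogDatabase and the literal "default" below are simplified stand-ins, not the exact Spark code.)

// Hedged sketch of the ExternalCatalog pattern under discussion; names simplified.
// (Same illustrative CatalogDatabase stand-in as in the earlier sketch.)
case class CatalogDatabase(
    name: String,
    description: String,
    locationUri: String,
    properties: Map[String, String])

trait SimpleExternalCatalog {
  // Stand-in for the warehouse path derived from conf/hadoopConf.
  protected def warehousePath: String
  // Subclasses (e.g. an in-memory or Hive-backed catalog) implement this.
  protected def getDatabaseInternal(db: String): CatalogDatabase

  // Cached default DB whose location is always the warehouse path.
  lazy val defaultDB: CatalogDatabase =
    CatalogDatabase("default", "", warehousePath, Map.empty)

  final def getDatabase(db: String): CatalogDatabase = {
    if (db == "default") {
      defaultDB                 // no metastore call for the default database
    } else {
      getDatabaseInternal(db)   // only non-default databases reach the underlying catalog
    }
  }
}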
Test build #73834 has finished for PR 17001 at commit
Test build #73838 has finished for PR 17001 at commit
retest this please
Test build #74170 has finished for PR 17001 at commit
spark.sql("CREATE TABLE t4(e string)") | ||
val table4 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t4")) | ||
// the table created in the database which created in this job, it will use the location | ||
// of the database. |
->
The table created in the non-default database (created in this job) is under the database location.
assert(new Path(table2.location) != fs.makeQualified(
  new Path(warehousePath, "not_default.db/t2")))

spark.sql("CREATE DATABASE not_default_1")
-> non_default_db1
spark.sql("CREATE TABLE t2(c string)") | ||
val table2 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t2")) | ||
// the table in not default database created here in this job, it will use the location | ||
// of the database as its location, not the warehouse path in this job |
->
The table created in the non-default database (created in another job) is under the database location.
// the location when it's created.
assert(new Path(table1.location) != fs.makeQualified(
  new Path(warehousePath, "not_default.db/t1")))
assert(!new File(warehousePath.toString, "not_default.db/t1").exists())
This scenario (lines 993-1000) does not need to be tested, IMO. Most of the test cases already cover it.
spark.sql("CREATE TABLE t3(d string)") | ||
val table3 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t3")) | ||
// the table in default database created here in this job, it will use the warehouse path | ||
// of this job as its location |
->
When a job creates a table in the default database, the table location is under the warehouse path
that is configured for the local job.
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
// the table in default database created in job(SPARK_19667_CREATE_TABLE) above,
// which has different warehouse path from this job, its location still equals to
// the location when it's created.
How about?
For the table created by another job in the default database, the location of this table is not changed, even if the current job has a different warehouse path.
val warehousePath = new Path(spark.sharedState.warehousePath)
val fs = warehousePath.getFileSystem(spark.sessionState.newHadoopConf())
val defaultDB = spark.sessionState.catalog.getDatabaseMetadata("default")
// default database use warehouse path as its location
default
-> The default
use warehouse path
-> uses the warehouse path
A general suggestion on the table names in the test case: we can name the database based on the database type (e.g.,
@gatorsmile thanks for your suggestion~
any update? ping @windpiger
What changes were proposed in this pull request?
Currently, when we create a managed table with Hive enabled in the default database, Spark uses the location of the default database as the table's location. This is fine with a non-shared metastore.

However, consider a metastore shared between different clusters. For example:
there is a Hive metastore in Cluster-A, and the metastore uses a remote MySQL instance as its backing database; when the default database is created in the metastore, its location is a path in Cluster-A.
Then we set up another cluster, Cluster-B, which uses the same remote MySQL instance for its metastore, so the default database definition in Cluster-B is downloaded from MySQL, and its location is still the path in Cluster-A.
Then, when we create a table in the default database in Cluster-B, it throws an UnknownHost exception for Cluster-A.

In Hive 2.0.0, it is allowed to create a table in a default database shared between clusters, while this is not allowed in other databases; it is special-cased for default.
As Spark users, we want the same behavior as Hive, so that we can create tables in the default database with a shared MySQL-backed metastore.
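(A rough illustration of the intended behavior, not the actual Spark code; the object name, helper and paths below are made up: for a managed table in the default database, the table location should be derived from the current cluster's warehouse path rather than from the location recorded for default in the shared metastore.)

// Illustrative sketch only: derive a default-database table location from the
// current job's warehouse path, ignoring the location stored in the shared metastore.
object DefaultDbTablePathSketch {
  def defaultTablePath(tableName: String, currentWarehousePath: String): String =
    s"${currentWarehousePath.stripSuffix("/")}/$tableName"
}

// e.g. defaultTablePath("t1", "hdfs://cluster-b/user/hive/warehouse")
//   returns "hdfs://cluster-b/user/hive/warehouse/t1"
// rather than a path under hdfs://cluster-a/... taken from the shared metastore's
// stored location for the default database.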
How was this patch tested?
unit test added