[SPARK-19667][SQL]create table with hiveenabled in default database use warehouse path instead of the location of default database #17001
Conversation
Test build #73171 has finished for PR 17001 at commit
Test build #73201 has finished for PR 17001 at commit
I'd like to treat this as a workaround; the location of the default database is still invalid in the other cluster. We can make this logic clearer and more consistent: the default database should not have a location, and when we try to get the location of the default DB, we should use the warehouse path.
Agreed, I'll handle the logic in create/get database in HiveClientImpl.
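(A minimal, self-contained sketch of this idea, assuming simplified stand-ins rather than the real HiveClientImpl API: when the default database is read back, its stored location is ignored and the current warehouse path is substituted.)

// Hedged sketch only: CatalogDatabase, loadFromMetastore and warehousePath below are
// simplified stand-ins, not the actual Spark / Hive client types.
case class CatalogDatabase(
    name: String,
    description: String,
    locationUri: String,
    properties: Map[String, String])

object DefaultDbLocationSketch {
  val DefaultDatabase = "default"

  // `loadFromMetastore` stands in for whatever fetches the stored database definition;
  // `warehousePath` stands in for the value of spark.sql.warehouse.dir.
  def getDatabase(
      name: String,
      loadFromMetastore: String => CatalogDatabase,
      warehousePath: String): CatalogDatabase = {
    val db = loadFromMetastore(name)
    // For the default database, ignore the stored location and use the warehouse path.
    if (name == DefaultDatabase) db.copy(locationUri = warehousePath) else db
  }
}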
Test build #73260 has finished for PR 17001 at commit
Test build #73261 has finished for PR 17001 at commit
client.createDatabase(
  new HiveDatabase(
    database.name,
    database.description,
-   database.locationUri,
+   if (database.name == SessionCatalog.DEFAULT_DATABASE) "" else database.locationUri,
If it is empty, the metastore will set it for us, right?
Sorry, actually it will throw an exception; my local default database had already been created, so it did not hit the exception. I will just replace the default database location when reloading from the metastore, and drop the logic that sets the location to an empty string when creating the database.
Are you able to find the specific Hive JIRA for this?
Personally, I think we should improve the test case. Instead of doing it in HiveDDLSuite, we can do it in HiveSparkSubmitSuite.scala. Basically, when using the same metastore, you just need to verify whether the table location is dependent on
great, I will take a look at it~
Test build #73263 has finished for PR 17001 at commit
Test build #73270 has finished for PR 17001 at commit
Locally, BucketedWriteWithoutHiveSupportSuite passes; let me find out why it failed in Jenkins.
Test build #73273 has finished for PR 17001 at commit
Test build #73274 has finished for PR 17001 at commit
Test build #73656 has finished for PR 17001 at commit
@@ -30,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions.Expression
 *
 * Implementations should throw [[NoSuchDatabaseException]] when databases don't exist.
 */
-abstract class ExternalCatalog {
+abstract class ExternalCatalog(conf: SparkConf, hadoopConf: Configuration) {
How about we just pass in a defaultDB: CatalogDatabase? Then we don't need to add the protected def warehousePath: String.

I think conf/hadoopConf is more useful; later logic can use them, and the subclasses also have these two confs.

We still have conf/hadoopConf in InMemoryCatalog and HiveExternalCatalog; we can just add one more parameter.

If we pass a defaultDB, it seems like we introduce an instance of defaultDB, as we discussed above.

But it will only be used in getDatabase, and we can save a metastore call to get the default database.

ok~ let me fix it~

@cloud-fan I found that if we add a parameter defaultDB to ExternalCatalog and its subclasses InMemoryCatalog and HiveExternalCatalog, this change will cause a lot of related code to be modified, such as test cases and other places where InMemoryCatalog and HiveExternalCatalog are created.

For example: currently all the parameters of InMemoryCatalog have their own default values:

class InMemoryCatalog(conf: SparkConf = new SparkConf, hadoopConfig: Configuration = new Configuration)

so we can create it without any parameters. But if we add a defaultDB, we would have to construct a defaultDB in the parameter list, while we cannot create a legal defaultDB because we cannot get the warehouse path for it, like this:

class InMemoryCatalog(conf: SparkConf = new SparkConf, hadoopConfig: Configuration = new Configuration, defaultDB: CatalogDatabase = CatalogDatabase("default", "", "${can not get the warehouse path}", Map.empty))

If we don't provide a default value for defaultDB in the parameter list, this will cause even more code changes, which I think is not proper.

What about keeping the provided def warehousePath in ExternalCatalog, and adding

lazy val defaultDB = {
  val qualifiedWarehousePath = SessionCatalog.makeQualifiedPath(warehousePath, hadoopConf).toString
  CatalogDatabase("default", "", qualifiedWarehousePath, Map.empty)
}

This can also avoid calling getDatabase.
I have modified the code by adding

lazy val defaultDB = {
  val qualifiedWarehousePath = SessionCatalog.makeQualifiedPath(warehousePath, hadoopConf).toString
  CatalogDatabase("default", "", qualifiedWarehousePath, Map.empty)
}

in ExternalCatalog. If it is not OK, I will revert it, thanks~
Test build #73755 has finished for PR 17001 at commit
Test build #73761 has finished for PR 17001 at commit
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  try {
    val warehousePath = s"file:${spark.sharedState.warehousePath.stripSuffix("/")}"
I am making this modification.
Test build #73827 has started for PR 17001 at commit
@@ -74,7 +88,17 @@ abstract class ExternalCatalog {
 */
def alterDatabase(dbDefinition: CatalogDatabase): Unit

-def getDatabase(db: String): CatalogDatabase
+final def getDatabase(db: String): CatalogDatabase = {
+  val database = getDatabaseInternal(db)
put this in the else branch
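(To make the shape of that suggestion concrete, here is a minimal, self-contained sketch of the pattern being discussed: a cached default-database definition built from the warehouse path, and a final getDatabase that only calls getDatabaseInternal in the else branch. The trait, the plain-string CatalogDatabase and the literal "default" below are simplified stand-ins, not the exact Spark code.)

// Hedged sketch of the ExternalCatalog pattern under discussion; names simplified.
// (Same illustrative CatalogDatabase stand-in as in the earlier sketch.)
case class CatalogDatabase(
    name: String,
    description: String,
    locationUri: String,
    properties: Map[String, String])

trait SimpleExternalCatalog {
  // Stand-in for the warehouse path derived from conf/hadoopConf.
  protected def warehousePath: String
  // Subclasses (e.g. an in-memory or Hive-backed catalog) implement this.
  protected def getDatabaseInternal(db: String): CatalogDatabase

  // Cached default DB whose location is always the warehouse path.
  lazy val defaultDB: CatalogDatabase =
    CatalogDatabase("default", "", warehousePath, Map.empty)

  final def getDatabase(db: String): CatalogDatabase = {
    if (db == "default") {
      defaultDB                 // no metastore call for the default database
    } else {
      getDatabaseInternal(db)   // only non-default databases reach the underlying catalog
    }
  }
}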
Test build #73834 has finished for PR 17001 at commit
Test build #73838 has finished for PR 17001 at commit
retest this please
Test build #74170 has finished for PR 17001 at commit
spark.sql("CREATE TABLE t4(e string)") | ||
val table4 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t4")) | ||
// the table created in the database which created in this job, it will use the location | ||
// of the database. |
->
The table created in the non-default database (created in this job) is under the database location.
assert(new Path(table2.location) != fs.makeQualified(
  new Path(warehousePath, "not_default.db/t2")))

spark.sql("CREATE DATABASE not_default_1")
-> non_default_db1
spark.sql("CREATE TABLE t2(c string)") | ||
val table2 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t2")) | ||
// the table in not default database created here in this job, it will use the location | ||
// of the database as its location, not the warehouse path in this job |
->
The table created in the non-default database (created in another job) is under the database location.
// the location when it's created.
assert(new Path(table1.location) != fs.makeQualified(
  new Path(warehousePath, "not_default.db/t1")))
assert(!new File(warehousePath.toString, "not_default.db/t1").exists())
This scenario (lines 993-1000) does not need to be tested, IMO. Most of the test cases already cover it.
spark.sql("CREATE TABLE t3(d string)") | ||
val table3 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t3")) | ||
// the table in default database created here in this job, it will use the warehouse path | ||
// of this job as its location |
->
When a job creates a table in the default database, the table location is under the warehouse path
that is configured for the local job.
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
// the table in default database created in job(SPARK_19667_CREATE_TABLE) above,
// which has different warehouse path from this job, its location still equals to
// the location when it's created.
How about?
For the table created by another job in the default database, the location of this table is not changed, even if the current job has a different warehouse path.
val warehousePath = new Path(spark.sharedState.warehousePath)
val fs = warehousePath.getFileSystem(spark.sessionState.newHadoopConf())
val defaultDB = spark.sessionState.catalog.getDatabaseMetadata("default")
// default database use warehouse path as its location
default
-> The default
use warehouse path
-> uses the warehouse path
A general suggestion on the table names in the test case: we can name the database based on the database type (e.g.,
@gatorsmile thanks for your suggestion~
any update? ping @windpiger
What changes were proposed in this pull request?
Currently, when we create a managed table with Hive enabled in the default database, Spark uses the location of the default database as the table's location. This is fine with a non-shared metastore.

However, consider a metastore shared between different clusters. For example:
there is a Hive metastore in Cluster-A, and the metastore uses a remote MySQL instance as its backing database; when the default database is created in the metastore, its location is a path in Cluster-A.
Then we set up another cluster, Cluster-B, which uses the same remote MySQL instance for its metastore, so the default database definition in Cluster-B is downloaded from MySQL, and its location is still the path in Cluster-A.
Then, when we create a table in the default database in Cluster-B, it throws an UnknownHost exception for Cluster-A.

In Hive 2.0.0, it is allowed to create a table in a default database shared between clusters, while this is not allowed in other databases; it is special-cased for default.
As Spark users, we want the same behavior as Hive, so that we can create tables in the default database with a shared MySQL-backed metastore.
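(A rough illustration of the intended behavior, not the actual Spark code; the object name, helper and paths below are made up: for a managed table in the default database, the table location should be derived from the current cluster's warehouse path rather than from the location recorded for default in the shared metastore.)

// Illustrative sketch only: derive a default-database table location from the
// current job's warehouse path, ignoring the location stored in the shared metastore.
object DefaultDbTablePathSketch {
  def defaultTablePath(tableName: String, currentWarehousePath: String): String =
    s"${currentWarehousePath.stripSuffix("/")}/$tableName"
}

// e.g. defaultTablePath("t1", "hdfs://cluster-b/user/hive/warehouse")
//   returns "hdfs://cluster-b/user/hive/warehouse/t1"
// rather than a path under hdfs://cluster-a/... taken from the shared metastore's
// stored location for the default database.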
How was this patch tested?
unit test added