-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20594][SQL]The staging directory should be a child directory starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory. #17858
Conversation
…ging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without start with "."
In this case, Hive will create the staging directory under the table directory, and when moving staging directory to table directory, Hive will still empty the table directory, but will exclude the staging directory which start with "." or "_"
|
Do you really need to force this? or, is it just that any path relative to the output dir has to be a hidden directory starting with "." or "_"? For example, right now this prevents me from making the staging dir "/foo/bar" but I don't see a reason to disallow that. |
This sounds a bug in Hive metastore. Could you try the same thing in Hive? Do you hit the same error? Let us see how Hive behaves and then we can decide what is the best way to handle it. Thanks! BTW, you need to create a test case. For example, |
if this is a hive bug, this patch seems a valid workaround for Spark SQL. |
yes i tried the same thing in Hive(version 2.10), got the same error: 2017-05-08T13:48:04,635 ERROR ql.Driver (:()) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://nameservice/hive/test_table1/test_hive_2017-05-08_13-47-40_660_5235248825413690559-1/-ext-10000 to destination hdfs://nameservice/hive/test_table1` |
@@ -222,7 +222,7 @@ case class InsertIntoHiveTable( | |||
val externalCatalog = sparkSession.sharedState.externalCatalog | |||
val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version | |||
val hadoopConf = sessionState.newHadoopConf() | |||
val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") | |||
val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") + "/.hive-staging" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?
// SPARK-20594 After Hive 1.1, Hive will create the staging directory under the table directory,
// and when moving staging directory to table directory, Hive will still empty the table
// directory, but will exclude the staging directory, which start with "." or "_".
val stagingDir =
new Path(hadoopConf.get("hive.exec.stagingdir", ".hive-staging"), ".hive-staging").toString
This will not pass the test cases, because we only deleted the child directory |
ok to test |
Test build #76566 has finished for PR 17858 at commit
|
@gatorsmile it seems my mistake, i will try to fix this. |
… with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory.
// SPARK-20594: The staging directory should be a child directory starts with "." to avoid | ||
// being deleted if we set hive.exec.stagingdir under the table directory. | ||
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs) | ||
&& !stagingPathName.stripPrefix(inputPathName).startsWith(".")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just to hide the issue and make the test cases passed, right?
We need to drop the created staging directory no matter what is the value users set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry i do not follow your logic. Correct me if I'm wrong, but isn't the logic of dropping the created staging directory was already there before with fs.deleteOnExit(dir)
?
As @cloud-fan said this patch seems a valid workaround in Spark SQL for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs.deleteOnExit(dir)
deletes dir
, but the parent directory is still there.
Test build #76617 has finished for PR 17858 at commit
|
Test build #76644 has finished for PR 17858 at commit
|
Test build #76655 has finished for PR 17858 at commit
|
// SPARK-20594: The staging directory should be a child directory starts with "." to avoid | ||
// being deleted if we set hive.exec.stagingdir under the table directory. | ||
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs) | ||
&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs.deleteOnExit(dir)
deletes dir
, but the parent directory is still there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dir with executionId
is the staging dir which seems to me should be deleted exactly.
BTW, if we set hive.exec.stagingdir=/test/a/b
, the dir /test/a
will also not be deleted by the logic before.
Should i need to delete the parent directory only for this patch? and it seems not safe to do this since the parent directory could be used by other processes. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. This fix does not make things worse. I might accept it. cc @cloud-fan
if (inputPathName.indexOf(stagingDir) == -1) { | ||
new Path(inputPathName, stagingDir).toString | ||
} else { | ||
inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length) | ||
} | ||
|
||
// SPARK-20594: The staging directory should be a child directory starts with "." to avoid | ||
// being deleted if we set hive.exec.stagingdir under the table directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?
SPARK-20594: This is a walk-around fix to resolve a Hive bug. Hive requires that the staging directory needs to avoid being deleted when users set
hive.exec.stagingdir
under the table directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i will update it . Thanks!
|to avoid being deleted if we set hive.exec.stagingdir under the table directory | ||
|without start with "."""".stripMargin) { | ||
|
||
dropTables("test_table", "test_table1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withTable("test_table", "test_table1") {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right. :)
tableNames.foreach { name => | ||
sql(s"DROP TABLE IF EXISTS $name") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed. You can call withTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
sql("CREATE TABLE test_table (key int, value string)") | ||
|
||
// Add some data. | ||
testData.write.mode(SaveMode.Append).insertInto("test_table") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify the above two lines by spark.range(1).write.saveAsTable("test_table")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, as i tested we must create table rather than simplify the above two lines by spark.range(1).write.saveAsTable("test_table")
.
Test build #76744 has finished for PR 17858 at commit
|
sql("CREATE TABLE test_table1 (key int)") | ||
|
||
// Set hive.exec.stagingdir under the table directory without start with ".". | ||
sql("set hive.exec.stagingdir=./test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you set it back after this test case.
&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) { | ||
logDebug(s"The staging dir '$stagingPathName' should be a child directory starts " + | ||
s"with '.' to avoid being deleted if we set hive.exec.stagingdir under the table " + | ||
s"directory.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: please remove the above TWO string Interpolation s
// staging directory needs to avoid being deleted when users set hive.exec.stagingdir | ||
// under the table directory. | ||
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs) | ||
&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please move &&
in line 110
sql("set hive.exec.stagingdir=./test") | ||
|
||
// Now overwrite. | ||
sql("INSERT OVERWRITE TABLE test_table1 SELECT * FROM test_table") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need test_table
, right?
INSERT OVERWRITE TABLE test_table1 SELECT 1
Test build #76765 has finished for PR 17858 at commit
|
test( | ||
"""SPARK-20594: This is a walk-around fix to resolve a Hive bug. Hive requires that the | ||
|staging directory needs to avoid being deleted when users set hive.exec.stagingdir | ||
|under the table directory.""".stripMargin) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without your fix, this test case still passes. Could you please check it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, as i tested again, we must create table rather than simplify by spark.range(1).write.saveAsTable("test_table")
. Thanks again. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, because that is not to create a Hive table. How about simplifying the test case to?
test("SPARK-20594: hive.exec.stagingdir was deleted by Hive") {
// Set hive.exec.stagingdir under the table directory without start with ".".
withSQLConf("hive.exec.stagingdir" -> "./test") {
withTable("test_table") {
sql("CREATE TABLE test_table (key int)")
sql("INSERT OVERWRITE TABLE test_table SELECT 1")
checkAnswer(sql("SELECT * FROM test_table"), Row(1))
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good choice. thanks for your time!
Test build #76806 has finished for PR 17858 at commit
|
Test build #76839 has finished for PR 17858 at commit
|
LGTM |
…starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory. JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594 ## What changes were proposed in this pull request? The staging directory should be a child directory starts with "." to avoid being deleted before moving staging directory to table directory if we set hive.exec.stagingdir under the table directory. ## How was this patch tested? Added unit tests Author: zuotingbing <[email protected]> Closes #17858 from zuotingbing/spark-stagingdir. (cherry picked from commit e3d2022) Signed-off-by: Xiao Li <[email protected]>
Thanks! Merging to master/2.2 |
Thank you all. Delete the branch. |
…starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory. JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594 ## What changes were proposed in this pull request? The staging directory should be a child directory starts with "." to avoid being deleted before moving staging directory to table directory if we set hive.exec.stagingdir under the table directory. ## How was this patch tested? Added unit tests Author: zuotingbing <[email protected]> Closes apache#17858 from zuotingbing/spark-stagingdir.
…starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory. JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594 ## What changes were proposed in this pull request? The staging directory should be a child directory starts with "." to avoid being deleted before moving staging directory to table directory if we set hive.exec.stagingdir under the table directory. ## How was this patch tested? Added unit tests Author: zuotingbing <[email protected]> Closes apache#17858 from zuotingbing/spark-stagingdir.
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594
What changes were proposed in this pull request?
The staging directory should be a child directory starts with "." to avoid being deleted before moving staging directory to table directory if we set hive.exec.stagingdir under the table directory.
How was this patch tested?
Added unit tests