-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20594][SQL]The staging directory should be a child directory starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory. #17858
Changes from 6 commits
c154f3a
de938ed
6b22d3e
6b1b153
2a542e4
9f41436
bf1b4ec
4e1b6a0
639d63a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,7 +17,7 @@ | |
|
||
package org.apache.spark.sql.hive.execution | ||
|
||
import java.io.IOException | ||
import java.io.{File, IOException} | ||
import java.net.URI | ||
import java.text.SimpleDateFormat | ||
import java.util.{Date, Locale, Random} | ||
|
@@ -97,12 +97,24 @@ case class InsertIntoHiveTable( | |
val inputPathUri: URI = inputPath.toUri | ||
val inputPathName: String = inputPathUri.getPath | ||
val fs: FileSystem = inputPath.getFileSystem(hadoopConf) | ||
val stagingPathName: String = | ||
var stagingPathName: String = | ||
if (inputPathName.indexOf(stagingDir) == -1) { | ||
new Path(inputPathName, stagingDir).toString | ||
} else { | ||
inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length) | ||
} | ||
|
||
// SPARK-20594: This is a walk-around fix to resolve a Hive bug. Hive requires that the | ||
// staging directory needs to avoid being deleted when users set hive.exec.stagingdir | ||
// under the table directory. | ||
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs) | ||
&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: Please move |
||
logDebug(s"The staging dir '$stagingPathName' should be a child directory starts " + | ||
s"with '.' to avoid being deleted if we set hive.exec.stagingdir under the table " + | ||
s"directory.") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: please remove the above TWO string Interpolation |
||
stagingPathName = new Path(inputPathName, ".hive-staging").toString | ||
} | ||
|
||
val dir: Path = | ||
fs.makeQualified( | ||
new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID)) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -494,4 +494,34 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef | |
spark.table("t").write.insertInto(tableName) | ||
} | ||
} | ||
|
||
test( | ||
"""SPARK-20594: This is a walk-around fix to resolve a Hive bug. Hive requires that the | ||
|staging directory needs to avoid being deleted when users set hive.exec.stagingdir | ||
|under the table directory.""".stripMargin) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Without your fix, this test case still passes. Could you please check it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, as i tested again, we must create table rather than simplify by There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. uh, because that is not to create a Hive table. How about simplifying the test case to? test("SPARK-20594: hive.exec.stagingdir was deleted by Hive") {
// Set hive.exec.stagingdir under the table directory without start with ".".
withSQLConf("hive.exec.stagingdir" -> "./test") {
withTable("test_table") {
sql("CREATE TABLE test_table (key int)")
sql("INSERT OVERWRITE TABLE test_table SELECT 1")
checkAnswer(sql("SELECT * FROM test_table"), Row(1))
}
}
} There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good choice. thanks for your time! |
||
|
||
withTable("test_table", "test_table1") { | ||
spark.range(1).write.saveAsTable("test_table") | ||
|
||
// Make sure the table has also been updated. | ||
checkAnswer( | ||
sql("SELECT * FROM test_table"), | ||
Row(0) | ||
) | ||
|
||
sql("CREATE TABLE test_table1 (key int)") | ||
|
||
// Set hive.exec.stagingdir under the table directory without start with ".". | ||
sql("set hive.exec.stagingdir=./test") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you set it back after this test case. |
||
|
||
// Now overwrite. | ||
sql("INSERT OVERWRITE TABLE test_table1 SELECT * FROM test_table") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We do not need
|
||
|
||
// Make sure the table has also been updated. | ||
checkAnswer( | ||
sql("SELECT * FROM test_table1"), | ||
Row(0) | ||
) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs.deleteOnExit(dir)
deletesdir
, but the parent directory is still there.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dir with
executionId
is the staging dir which seems to me should be deleted exactly.BTW, if we
set hive.exec.stagingdir=/test/a/b
, the dir/test/a
will also not be deleted by the logic before.Should i need to delete the parent directory only for this patch? and it seems not safe to do this since the parent directory could be used by other processes. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. This fix does not make things worse. I might accept it. cc @cloud-fan