[SPARK-20594][SQL]The staging directory should be a child directory starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory. #17858

Closed
wants to merge 9 commits into from

@zuotingbing zuotingbing commented May 4, 2017

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594

What changes were proposed in this pull request?

The staging directory should be a child directory whose name starts with "." so that it is not deleted before it is moved to the table directory when hive.exec.stagingdir is set under the table directory.

How was this patch tested?

Added unit tests

…ging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without start with "."
@zuotingbing zuotingbing changed the title May 4, 2017
from: [SPARK-20594]The staging directory should be appended with ".hive-staging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without start with "."
to: [SPARK-20594][SQL]The staging directory should be appended with ".hive-staging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without start with "."
@zuotingbing
Author

In this case, Hive creates the staging directory under the table directory. When it moves the staging directory into the table directory, Hive still empties the table directory first, but it excludes any entry whose name starts with "." or "_":

    public static final PathFilter HIDDEN_FILES_PATH_FILTER = new PathFilter() {
      public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
      }
    };
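
To make the effect of this filter concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `HiddenFilterSketch`, its method names, and the sample directory entries are hypothetical; only the name test mirrors the quoted `HIDDEN_FILES_PATH_FILTER`. It assumes, as described above, that the entries accepted by the filter are the ones Hive removes when it empties the table directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Hadoop PathFilter quoted above; it works on plain names.
class HiddenFilterSketch {

    // Same test as HIDDEN_FILES_PATH_FILTER.accept: true for non-hidden names.
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Entries that would be cleared from the table directory
    // (assumption: accepted by the filter means cleared).
    static List<String> cleared(List<String> children) {
        List<String> out = new ArrayList<>();
        for (String child : children) {
            if (accept(child)) {
                out.add(child);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
            "part-00000", ".hive-staging_hive_1", "_SUCCESS", "test_hive_1");
        System.out.println(cleared(children)); // prints [part-00000, test_hive_1]
    }
}
```

A staging directory named like `test_hive_1` is accepted by the filter and would therefore be cleared together with the old data files, while the `.`-prefixed one is skipped.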

@srowen
Member

srowen commented May 4, 2017

Do you really need to force this? Or is it just that any path relative to the output dir has to be a hidden directory starting with "." or "_"? For example, right now this prevents me from making the staging dir "/foo/bar", but I don't see a reason to disallow that.

@gatorsmile
Member

This sounds like a bug in the Hive metastore. Could you try the same thing in Hive? Do you hit the same error? Let us see how Hive behaves, and then we can decide on the best way to handle it. Thanks!

BTW, you need to create a test case, for example in InsertIntoHiveTableSuite.scala.

@cloud-fan
Contributor

If this is a Hive bug, this patch seems like a valid workaround for Spark SQL.

@zuotingbing
Author

zuotingbing commented May 8, 2017

Yes, I tried the same thing in Hive (version 2.10) and got the same error:
`2017-05-08T13:48:04,634 ERROR exec.Task (:()) - Failed with exception Unable to move source hdfs://nameservice/hive/test_table1/test_hive_2017-05-08_13-47-40_660_5235248825413690559-1/-ext-10000 to destination hdfs://nameservice/hive/test_table1
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice/hive/test_table1/test_hive_2017-05-08_13-47-40_660_5235248825413690559-1/-ext-10000 to destination hdfs://nameservice/hive/test_table1
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2959)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3198)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1805)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:355)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1917)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1586)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1331)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1092)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1080)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.FileNotFoundException: File hdfs://nameservice/hive/test_table1/test_hive_2017-05-08_13-47-40_660_5235248825413690559-1/-ext-10000 does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:697)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2896)
... 22 more

2017-05-08T13:48:04,635 ERROR ql.Driver (:()) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://nameservice/hive/test_table1/test_hive_2017-05-08_13-47-40_660_5235248825413690559-1/-ext-10000 to destination hdfs://nameservice/hive/test_table1`
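
The failure sequence behind this trace can be reproduced in miniature on a local file system. The sketch below is a model built on assumptions, not Hive's code: `StagingMoveSketch` and its helper `replaceFiles` are hypothetical, and the helper simply deletes every non-hidden child of the table directory before trying to move the staged output in, which is the order of operations described earlier in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-file-system model of the move step; not Hive's implementation.
class StagingMoveSketch {

    // Hidden entries (leading "." or "_") are the ones the cleanup skips.
    static boolean isHidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith(".") || name.startsWith("_");
    }

    // Deletes a directory tree, children before parents.
    static void deleteTree(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Empties tableDir except for hidden children, then moves stagedFile into it.
    // Returns false when the staged file was wiped out by the cleanup.
    static boolean replaceFiles(Path tableDir, Path stagedFile) throws IOException {
        List<Path> children;
        try (Stream<Path> listing = Files.list(tableDir)) {
            children = listing.collect(Collectors.toList());
        }
        for (Path child : children) {
            if (!isHidden(child)) {
                deleteTree(child);
            }
        }
        if (!Files.exists(stagedFile)) {
            return false; // corresponds to the FileNotFoundException in the trace
        }
        Files.move(stagedFile, tableDir.resolve(stagedFile.getFileName()));
        return true;
    }

    // Stages one file under tableDir/<stagingName> and attempts the move.
    static boolean run(String stagingName) {
        try {
            Path tableDir = Files.createTempDirectory("test_table1");
            try {
                Path staging = Files.createDirectory(tableDir.resolve(stagingName));
                Path staged = Files.write(staging.resolve("part-00000"), new byte[] {1});
                return replaceFiles(tableDir, staged);
            } finally {
                deleteTree(tableDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("test_hive_1"));     // non-hidden staging directory
        System.out.println(run(".hive-staging_1")); // hidden staging directory
    }
}
```

Under these assumptions the non-hidden staging directory is removed by the cleanup, so the move finds no source, while the hidden one survives and the move succeeds.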

@@ -222,7 +222,7 @@ case class InsertIntoHiveTable(
val externalCatalog = sparkSession.sharedState.externalCatalog
val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version
val hadoopConf = sessionState.newHadoopConf()
- val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
+ val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") + "/.hive-staging"
Member

How about?

    // SPARK-20594 After Hive 1.1, Hive will create the staging directory under the table directory,
    // and when moving the staging directory to the table directory, Hive will still empty the table
    // directory, but will exclude a staging directory whose name starts with "." or "_".
    val stagingDir =
      new Path(hadoopConf.get("hive.exec.stagingdir", ".hive-staging"), ".hive-staging").toString
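
For illustration, the effect of always appending a hidden child can be sketched at the string level. This is a rough approximation: `AppendHiddenChildSketch` and `withHiddenChild` are hypothetical names, and the string join only imitates what Hadoop's `new Path(parent, child).toString` does for simple inputs.

```java
// Hypothetical string-level imitation of new Path(configured, ".hive-staging").toString.
class AppendHiddenChildSketch {

    static final String HIDDEN_CHILD = ".hive-staging";

    // Appends the hidden child to whatever value the user configured.
    static String withHiddenChild(String configured) {
        // Drop a single trailing slash (but keep the root "/") before joining.
        String parent = configured.endsWith("/") && configured.length() > 1
            ? configured.substring(0, configured.length() - 1)
            : configured;
        return parent.equals("/") ? "/" + HIDDEN_CHILD : parent + "/" + HIDDEN_CHILD;
    }

    public static void main(String[] args) {
        System.out.println(withHiddenChild("./test"));        // relative, under the table dir
        System.out.println(withHiddenChild(".hive-staging")); // the default value
        System.out.println(withHiddenChild("/foo/bar"));      // absolute, outside the table dir
    }
}
```

The sketch shows that this variant changes every configured value, including the default and absolute paths that were never at risk, and that the configured parent directory becomes an extra level that also has to be cleaned up.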

@gatorsmile
Member

gatorsmile commented May 8, 2017

This will not pass the test cases, because we only deleted the child directory .hive-staging of the stagingDir. We should clean it up. Let me trigger the test to show you the test results.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented May 8, 2017

Test build #76566 has finished for PR 17858 at commit de938ed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zuotingbing
Author

@gatorsmile It seems this was my mistake. I will try to fix it.

… with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory.
@zuotingbing zuotingbing changed the title May 9, 2017
from: [SPARK-20594][SQL]The staging directory should be appended with ".hive-staging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without start with "."
to: [SPARK-20594][SQL]The staging directory should be a child directory starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory.
// SPARK-20594: The staging directory should be a child directory starts with "." to avoid
// being deleted if we set hive.exec.stagingdir under the table directory.
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs)
&& !stagingPathName.stripPrefix(inputPathName).startsWith(".")) {
Member

This just hides the issue and makes the test cases pass, right?

We need to drop the created staging directory no matter what value users set.

Author

Sorry, I do not follow your logic. Correct me if I'm wrong, but wasn't the logic for dropping the created staging directory already there, with fs.deleteOnExit(dir)?
As @cloud-fan said, this patch seems to be a valid workaround in Spark SQL for this case.

Member

fs.deleteOnExit(dir) deletes dir, but the parent directory is still there.
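
The point about the parent directory can be shown with a small local-file-system sketch. It is only an analogy: Hadoop's `FileSystem.deleteOnExit` marks a path for deletion when the file system is closed, whereas the hypothetical `ParentRemainsSketch` below deletes the child immediately with `java.nio.file.Files`. The directory names are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical analogy: deleting only the staging child leaves its parent behind.
class ParentRemainsSketch {

    // Creates root/test/.hive-staging, deletes only the child, and reports
    // whether the parent "test" directory is still present afterwards.
    static boolean parentSurvivesChildDelete() {
        try {
            Path root = Files.createTempDirectory("staging-root");
            Path parent = Files.createDirectory(root.resolve("test"));
            Path child = Files.createDirectory(parent.resolve(".hive-staging"));

            Files.delete(child); // stands in for the cleanup of the staging dir itself

            boolean childGone = !Files.exists(child);
            boolean parentStillThere = Files.exists(parent);

            // Clean up what the sketch created.
            Files.delete(parent);
            Files.delete(root);
            return childGone && parentStillThere;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parentSurvivesChildDelete());
    }
}
```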

@SparkQA

SparkQA commented May 9, 2017

Test build #76617 has finished for PR 17858 at commit 6b22d3e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 9, 2017

Test build #76644 has finished for PR 17858 at commit 6b1b153.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 9, 2017

Test build #76655 has finished for PR 17858 at commit 2a542e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// SPARK-20594: The staging directory should be a child directory starts with "." to avoid
// being deleted if we set hive.exec.stagingdir under the table directory.
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs)
&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) {
Member

@gatorsmile gatorsmile May 9, 2017

fs.deleteOnExit(dir) deletes dir, but the parent directory is still there.

Author

@zuotingbing zuotingbing May 10, 2017

The dir with the executionId is the staging dir, and it seems to me that it is exactly the directory that should be deleted.
BTW, if we set hive.exec.stagingdir=/test/a/b, the dir /test/a is also not deleted by the previous logic.
Do I need to delete the parent directory just for this patch? It does not seem safe to do so, since the parent directory could be used by other processes. Thanks.

Member

@gatorsmile gatorsmile May 10, 2017

Ok. This fix does not make things worse. I might accept it. cc @cloud-fan
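
The condition under review in this thread can be sketched with plain strings. Several things here are assumptions: `StagingPathCheckSketch` and its methods are hypothetical, `isUnder` is a simplified stand-in for Hive's `FileUtils.isSubDir`, "/" replaces `File.separator`, and the fallback in `resolve` (switching to a `.hive-staging` child of the table directory) is an assumed body for the `if`, which is not shown in the quoted diff. The first variant imitates the earlier revision that did not strip the separator.

```java
// Hypothetical string-level model of the staging path check discussed above.
class StagingPathCheckSketch {

    static final String SEP = "/"; // assumption: stands in for File.separator

    // Java counterpart of Scala's String.stripPrefix.
    static String stripPrefix(String s, String prefix) {
        return s.startsWith(prefix) ? s.substring(prefix.length()) : s;
    }

    // Simplified stand-in for FileUtils.isSubDir (assumption, not Hive's code).
    static boolean isUnder(String path, String parent) {
        String withSep = parent.endsWith(SEP) ? parent : parent + SEP;
        return path.equals(parent) || path.startsWith(withSep);
    }

    // Earlier revision: the remainder still begins with the separator.
    static boolean needsHiddenChildV1(String staging, String table) {
        return isUnder(staging, table)
            && !stripPrefix(staging, table).startsWith(".");
    }

    // Later revision: strip the separator before testing for a leading ".".
    static boolean needsHiddenChild(String staging, String table) {
        return isUnder(staging, table)
            && !stripPrefix(stripPrefix(staging, table), SEP).startsWith(".");
    }

    // Assumed fallback: use a hidden child of the table directory instead.
    static String resolve(String staging, String table) {
        return needsHiddenChild(staging, table) ? table + SEP + ".hive-staging" : staging;
    }

    public static void main(String[] args) {
        String table = "/hive/test_table1";
        System.out.println(resolve(table + "/test", table));
        System.out.println(resolve(table + "/.hive-staging", table));
        System.out.println(resolve("/foo/bar", table));
    }
}
```

In this model the earlier variant flags even `/hive/test_table1/.hive-staging`, because the remainder `/.hive-staging` starts with the separator rather than with ".". A staging directory outside the table directory, such as `/foo/bar`, is left untouched by both variants.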

if (inputPathName.indexOf(stagingDir) == -1) {
new Path(inputPathName, stagingDir).toString
} else {
inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
}

// SPARK-20594: The staging directory should be a child directory starts with "." to avoid
// being deleted if we set hive.exec.stagingdir under the table directory.
Member

How about?

SPARK-20594: This is a walk-around fix to resolve a Hive bug. Hive requires that the staging directory needs to avoid being deleted when users set hive.exec.stagingdir under the table directory.

Author

OK, I will update it. Thanks!

|to avoid being deleted if we set hive.exec.stagingdir under the table directory
|without start with "."""".stripMargin) {

dropTables("test_table", "test_table1")
Member

 withTable("test_table", "test_table1") {
    ...
  }

Author

Yes, you are right. :)

tableNames.foreach { name =>
sql(s"DROP TABLE IF EXISTS $name")
}
}
Member

This is not needed. You can call withTable.
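
The reason a helper like `withTable` makes a hand-written drop loop unnecessary is that it wraps the test body and drops the tables afterwards, even if the body throws. Below is a rough Java analogue of that pattern, not Spark's Scala implementation: `WithTableSketch`, its `sql` method, and the `executed` log are hypothetical, and `sql` only records statements instead of running them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Java analogue of a withTable-style test helper.
class WithTableSketch {

    // Records statements instead of executing them (assumption for the sketch).
    static final List<String> executed = new ArrayList<>();

    static void sql(String statement) {
        executed.add(statement);
    }

    // Runs the body, then drops every named table, even when the body throws.
    static void withTable(List<String> tableNames, Runnable body) {
        try {
            body.run();
        } finally {
            for (String name : tableNames) {
                sql("DROP TABLE IF EXISTS " + name);
            }
        }
    }

    public static void main(String[] args) {
        withTable(Arrays.asList("test_table", "test_table1"), () -> {
            sql("CREATE TABLE test_table (key int, value string)");
            sql("CREATE TABLE test_table1 (key int)");
        });
        for (String statement : executed) {
            System.out.println(statement);
        }
    }
}
```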

Author

ok

sql("CREATE TABLE test_table (key int, value string)")

// Add some data.
testData.write.mode(SaveMode.Append).insertInto("test_table")
Member

You can simplify the two lines above to spark.range(1).write.saveAsTable("test_table")

Author

No. As I tested, we must create the table explicitly rather than simplify the two lines above to spark.range(1).write.saveAsTable("test_table").

@SparkQA

SparkQA commented May 10, 2017

Test build #76744 has finished for PR 17858 at commit 9f41436.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql("CREATE TABLE test_table1 (key int)")

// Set hive.exec.stagingdir under the table directory without start with ".".
sql("set hive.exec.stagingdir=./test")
Member

Could you set it back after this test case?
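
One common way to guarantee that a setting is restored after a test is to scope the change to a block. The following is a minimal Java sketch of that idea under stated assumptions: `WithConfSketch`, `withConf`, and the in-memory `conf` map are hypothetical and merely stand in for a session configuration; they are not Spark APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical scoped-configuration helper: set a value, run the body, restore.
class WithConfSketch {

    // In-memory stand-in for a session configuration (assumption).
    static final Map<String, String> conf = new HashMap<>();

    static void withConf(String key, String value, Runnable body) {
        boolean hadKey = conf.containsKey(key);
        String previous = conf.get(key);
        conf.put(key, value);
        try {
            body.run();
        } finally {
            // Restore the earlier value, or remove the key if it was unset before.
            if (hadKey) {
                conf.put(key, previous);
            } else {
                conf.remove(key);
            }
        }
    }

    public static void main(String[] args) {
        withConf("hive.exec.stagingdir", "./test", () ->
            System.out.println("inside: " + conf.get("hive.exec.stagingdir")));
        System.out.println("after: " + conf.get("hive.exec.stagingdir"));
    }
}
```

Because the restore happens in a `finally` block, later tests see the original value even when the body fails.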

&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) {
logDebug(s"The staging dir '$stagingPathName' should be a child directory starts " +
s"with '.' to avoid being deleted if we set hive.exec.stagingdir under the table " +
s"directory.")
Member

Nit: please remove the string interpolation prefix s from the last two lines above; they contain no interpolated variables.

// staging directory needs to avoid being deleted when users set hive.exec.stagingdir
// under the table directory.
if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs)
&& !stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")) {
Member

Nit: Please move && to line 110

sql("set hive.exec.stagingdir=./test")

// Now overwrite.
sql("INSERT OVERWRITE TABLE test_table1 SELECT * FROM test_table")
Member

We do not need test_table, right?

INSERT OVERWRITE TABLE test_table1 SELECT 1

@SparkQA

SparkQA commented May 11, 2017

Test build #76765 has finished for PR 17858 at commit bf1b4ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test(
"""SPARK-20594: This is a walk-around fix to resolve a Hive bug. Hive requires that the
|staging directory needs to avoid being deleted when users set hive.exec.stagingdir
|under the table directory.""".stripMargin) {
Member

Without your fix, this test case still passes. Could you please check it?

Author

@zuotingbing zuotingbing May 11, 2017

Hmm, I tested again: we must create the table explicitly rather than simplify with spark.range(1).write.saveAsTable("test_table"). Thanks again. :)

Member

Uh, that is because saveAsTable does not create a Hive table. How about simplifying the test case to the following?

  test("SPARK-20594: hive.exec.stagingdir was deleted by Hive") {
    // Set hive.exec.stagingdir under the table directory without start with ".".
    withSQLConf("hive.exec.stagingdir" -> "./test") {
      withTable("test_table") {
        sql("CREATE TABLE test_table (key int)")
        sql("INSERT OVERWRITE TABLE test_table SELECT 1")
        checkAnswer(sql("SELECT * FROM test_table"), Row(1))
      }
    }
  }

Author

Good choice. Thanks for your time!

@SparkQA

SparkQA commented May 11, 2017

Test build #76806 has finished for PR 17858 at commit 4e1b6a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 12, 2017

Test build #76839 has finished for PR 17858 at commit 639d63a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

asfgit pushed a commit that referenced this pull request May 12, 2017
…starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory.

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594

## What changes were proposed in this pull request?

The staging directory should be a child directory starts with "." to avoid being deleted before moving staging directory to table directory if we set hive.exec.stagingdir under the table directory.

## How was this patch tested?

Added unit tests

Author: zuotingbing <[email protected]>

Closes #17858 from zuotingbing/spark-stagingdir.

(cherry picked from commit e3d2022)
Signed-off-by: Xiao Li <[email protected]>
@gatorsmile
Member

Thanks! Merging to master/2.2

@asfgit asfgit closed this in e3d2022 May 12, 2017
@zuotingbing
Author

Thank you all. Delete the branch.

@zuotingbing zuotingbing deleted the spark-stagingdir branch May 15, 2017 01:51
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017

5 participants