
[SPARK-26936][SQL] Fix bug of insert overwrite local dir can not create temporary path in local staging directory #23841

Closed
wants to merge 2 commits into from

Conversation

@beliefer (Contributor) commented Feb 20, 2019

What changes were proposed in this pull request?

The environment of my cluster is as follows:

OS:Linux version 2.6.32-220.7.1.el6.x86_64 ([email protected]) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 7 00:52:02 GMT 2012
Hadoop: 2.7.2
Spark: 2.3.0 or 3.0.0(master branch)
Hive: 1.2.1

My Spark application runs in yarn-client deploy mode.

If I execute the SQL insert overwrite local directory '/home/test/call_center/' select * from call_center, a HiveException appears as follows:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000/_temporary/0/_temporary/attempt_20190219173233_0002_m_000000_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_000011) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
Currently, Spark SQL generates a local temporary path in the local staging directory. The scheme of this local temporary path starts with file:, so the HiveException appears.
This PR changes the local temporary path to an HDFS temporary path, and uses a DistributedFileSystem instance to copy the data from the HDFS temporary path to the local directory.
If Spark runs in local deploy mode, 'insert overwrite local directory' works fine.
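For illustration only, here is a minimal sketch of the flow described above, written against the standard Hadoop FileSystem API; the paths and the inline structure are hypothetical and this is not the PR's exact code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
// Hypothetical staging dir on distributed storage, reachable from every executor.
val tmpPath = new Path("hdfs://namenode:9000/user/test/.hive-staging/-ext-10000")
// Local target directory named in the INSERT OVERWRITE LOCAL DIRECTORY statement.
val writeToPath = new Path("file:///home/test/call_center")

val dfs = tmpPath.getFileSystem(hadoopConf)   // DistributedFileSystem
val localFs = FileSystem.getLocal(hadoopConf)

// ... executors write the query result under tmpPath ...

// Driver side: ensure the local directory exists, then copy each staged file down
// from distributed storage and clean up the staging directory.
localFs.mkdirs(writeToPath)
dfs.listStatus(tmpPath).foreach { status =>
  FileUtil.copy(dfs, status.getPath, localFs, writeToPath,
    /* deleteSource = */ false, /* overwrite = */ true, hadoopConf)
}
dfs.delete(tmpPath, true)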

How was this patch tested?

A UT cannot cover yarn-client mode. The fix was tested in my production environment.

@maropu (Member) commented Feb 20, 2019

Can you add tests before running tests in Jenkins?

@beliefer (Contributor, Author) commented Feb 21, 2019

> Can you add tests before running tests in Jenkins?

Existing unit tests cover the behavior, and I added a unit test for creating a local directory that does not exist.

@maropu (Member) commented Feb 22, 2019

ok to test

@SparkQA commented Feb 22, 2019

Test build #102610 has finished for PR 23841 at commit 8799a86.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2019

Test build #102612 has finished for PR 23841 at commit 747296e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2019

Test build #102614 has finished for PR 23841 at commit 19698df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2019

Test build #102642 has finished for PR 23841 at commit 41215c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author) commented Feb 26, 2019

@maropu Please review this PR again, thanks!


- val tmpPath = getExternalTmpPath(sparkSession, hadoopConf, writeToPath)
+ // The temporary path must be a HDFS path, not a local path.
+ val tmpPath = getExternalTmpPath(sparkSession, hadoopConf, qualifiedPath)
Member
In case of inserts from non-hive tables, we still need to use a non-local path?

@beliefer (Contributor, Author), Feb 26, 2019

If the target path is local, we still need to use a non-local path.

val path = dir.toURI.getPath
val notExistsPath = s"${path}/src/result"

sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${notExistsPath}' SELECT * FROM src where key < 10")
Member

nit: ${notExistsPath} -> $notExistsPath

Contributor Author

OK, I have adjusted it.


sql(
s"""
|INSERT OVERWRITE LOCAL DIRECTORY '${notExistsPath}'
Member

ditto

Contributor Author

OK, I have adjusted it.

|CREATE TEMPORARY VIEW orc_source
|USING org.apache.spark.sql.hive.orc
|OPTIONS (
| PATH '${notExistsPath}'
Member

ditto

Contributor Author

OK, I have adjusted it.


checkAnswer(
sql("select * from orc_source"),
sql("select * from src where key < 10"))
Member

You should set literals in the expected answer.

Contributor Author

This code follows the style of other existing UTs.

@@ -581,6 +581,38 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}

test("insert overwrite to not exists dir from hive metastore table") {
Member

What's the difference from the "insert overwrite to not exist local dir" test?

Contributor Author

One tests whether the result is correct; the other tests whether the non-existent path gets created.

@maropu (Member), Feb 26, 2019

Can you gather into a single test?

Contributor Author

Yes, I can remove the "insert overwrite to not exist local dir" test and retain the other.

@maropu (Member) commented Feb 26, 2019

cc: @dongjoon-hyun

@SparkQA commented Feb 26, 2019

Test build #102786 has finished for PR 23841 at commit 8d81fb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2019

Test build #102788 has finished for PR 23841 at commit 90633d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Feb 26, 2019

I realized that the test you added passes on master without your fix... Can you check again?
Is your example query like this?

$ls /tmp/noexistdir
ls: /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
scala> spark.table("t").explain
== Physical Plan ==
Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")

$ls /tmp/noexistdir/t/
_SUCCESS  part-00000-bbea4213-071a-49b4-aac8-8510e7263d45-c000

@beliefer (Contributor, Author) commented Feb 27, 2019

Does the parent path /tmp/noexistdir exist or not? Please make sure that both the parent path and the child path do not exist.
What is the deploy mode your query runs in?

@SparkQA commented Feb 27, 2019

Test build #102810 has finished for PR 23841 at commit 8bb8e56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Feb 27, 2019

My test case was that /tmp/noexistdir didn't exist first.

> What is the deploy mode your query runs in?

./bin/spark-shell --master=local[*]

If you have any precondition for the failure, can you update the PR description?

@beliefer (Contributor, Author)

OK, I have supplemented the deploy mode in the PR description.

@beliefer (Contributor, Author) commented Feb 27, 2019

I have run the SQL you provided in local[*] deploy mode and the inconsistent behavior still appears.
The Spark version under test is 2.3.0.

ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.table("t").explain
== Physical Plan ==
HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+                                                                       
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")
res1: org.apache.spark.sql.DataFrame = [] 

ls /tmp/noexistdir/t/
/tmp/noexistdir/t

vi /tmp/noexistdir/t
  1 

'/tmp/noexistdir/t' is not a directory but a file.
Normally, the temp path contains a result file and a '_SUCCESS' file.
I suspect the result file and the '_SUCCESS' file overwrite each other through the following code:
fs.listStatus(tmpPath).foreach { tmpFile => fs.rename(tmpFile.getPath, writeToPath) }
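As a reading aid for this suspicion (not part of the PR; paths and setup are hypothetical), a small sketch of what that loop does when writeToPath is not an existing directory: every staged file is renamed onto the same target path, so at most one file can end up there.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.getLocal(new Configuration())
val tmpPath = new Path("/tmp/demo-staging")       // hypothetical staging directory
val writeToPath = new Path("/tmp/noexistdir/t")   // target path; its parent does not exist yet

fs.mkdirs(tmpPath)
fs.create(new Path(tmpPath, "_SUCCESS")).close()
fs.create(new Path(tmpPath, "part-00000")).close()

// Same loop shape as in the command: each staged file is renamed to writeToPath.
// Because writeToPath is not an existing directory, ChecksumFileSystem renames each
// source onto that exact path instead of moving it into a directory, so
// /tmp/noexistdir/t can only end up as a single file, not a directory of results.
fs.listStatus(tmpPath).foreach { tmpFile =>
  fs.rename(tmpFile.getPath, writeToPath)
}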

@beliefer (Contributor, Author) commented Feb 27, 2019

Then I pulled the master branch, compiled it, and deployed it on my Hadoop cluster. I get the inconsistent behavior again.
The Spark version under test is 3.0.0.

ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Spark context Web UI available at http://10.198.66.204:55326
Spark context available as 'sc' (master = local[*], app id = local-1551259036573).
Spark session available as 'spark'.
Welcome to spark version 3.0.0-SNAPSHOT
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("""select * from t""").show
+---+---+                                                                       
| c0| c1|
+---+---+
|  1|  1|
+---+---+


scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")
res1: org.apache.spark.sql.DataFrame = []                                       

scala> 
ll /tmp/noexistdir/t
-rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
vi /tmp/noexistdir/t
  1

/tmp/noexistdir/t is a file too.

@beliefer (Contributor, Author)

The original code of InsertIntoHiveDirCommand.scala does not create the parent path on the local FileSystem.

@beliefer (Contributor, Author)

I have updated the PR and the associated JIRA.

@beliefer (Contributor, Author)

@maropu Could you review the PR again? Thanks.

@beliefer (Contributor, Author) commented Feb 28, 2019

I have checked the source of Hadoop's LocalFileSystem. LocalFileSystem does not implement the rename method; it extends ChecksumFileSystem, and the latter implements rename.
The rename method of ChecksumFileSystem is as follows:

  public boolean rename(Path src, Path dst) throws IOException {
    if (fs.isDirectory(src)) {
      return fs.rename(src, dst);
    } else {
      if (fs.isDirectory(dst)) {
        dst = new Path(dst, src.getName());
      }

      boolean value = fs.rename(src, dst);
      if (!value)
        return false;

      Path srcCheckFile = getChecksumFile(src);
      Path dstCheckFile = getChecksumFile(dst);
      if (fs.exists(srcCheckFile)) { //try to rename checksum
        value = fs.rename(srcCheckFile, dstCheckFile);
      } else if (fs.exists(dstCheckFile)) {
        // no src checksum, so remove dst checksum
        value = fs.delete(dstCheckFile, true); 
      }

      return value;
    }
  }

If the target path is a directory, ChecksumFileSystem moves the source file into the target path.
If the target path is not a directory, ChecksumFileSystem renames the source file to the target file.
The variable named fs is a RawLocalFileSystem, and RawLocalFileSystem ultimately calls the rename method of UNIXFileSystem or WinNTFileSystem.

@maropu (Member) commented Mar 1, 2019

Does this issue only happen in yarn-client mode?

@beliefer (Contributor, Author) commented Mar 1, 2019

> Does this issue only happen in yarn-client mode?

This SQL first fails with org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/ in yarn-client mode.
This SQL shows the inconsistent behavior in local mode.

@maropu (Member) commented Mar 1, 2019

You need to first check the test you added in this pr... it still passes without your fix.

@beliefer (Contributor, Author) commented Mar 1, 2019

> You need to first check the test you added in this pr... it still passes without your fix.

Maybe the UT runs locally and depends on the operating system or other environmental differences.
I have checked the test framework and found that the UT indeed runs in local mode.

object TestHive
  extends TestHiveContext(
    new SparkContext(
      System.getProperty("spark.sql.test.master", "local[1]"),
      "TestSQLContext",
      new SparkConf()
        .set("spark.sql.test", "")
        .set(SQLConf.CODEGEN_FALLBACK.key, "false")
        .set("spark.sql.hive.metastore.barrierPrefixes",
          "org.apache.spark.sql.hive.execution.PairSerDe")
        .set("spark.sql.warehouse.dir", TestHiveContext.makeWarehouseDir().toURI.getPath)
        // SPARK-8910
        .set("spark.ui.enabled", "false")
        .set("spark.unsafe.exceptionOnMemoryLeak", "true")))

@SparkQA commented Mar 1, 2019

Test build #102907 has finished for PR 23841 at commit 0efa4cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2019

Test build #102911 has finished for PR 23841 at commit ce23ec7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Mar 26, 2019

You're changing all code paths though, ones that work. I am not clear that's valid. Why not work out why this happens only in yarn-client mode?

@beliefer (Contributor, Author)

> You're changing all code paths though, ones that work. I am not clear that's valid. Why not work out why this happens only in yarn-client mode?

I have pasted the full stack trace below:
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 36 more Caused by: org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000/_temporary/0/_temporary/attempt_20190219173233_0002_m_000000_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_000011) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) ... 8 more Caused by: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000/_temporary/0/_temporary/attempt_20190219173233_0002_m_000000_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_000011) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:821) at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246) ... 16 more

@beliefer (Contributor, Author) commented Mar 26, 2019

@srowen
According to the error:
FileFormatWriter creates a SingleDirectoryWriteTask and calls its execute method.
The SingleDirectoryWriteTask runs on one of the nodes in the YARN cluster.
The original code uses a temp path starting with file:/, but that path is not valid on the executor's node; the temp path is only valid on the driver node.

@cloud-fan (Contributor)
Will we hit this bug when we deploy spark in cluster? Seems to me it's not specific to yarn.

@beliefer (Contributor, Author) commented Mar 26, 2019

> Will we hit this bug when we deploy spark in cluster? Seems to me it's not specific to yarn.

Yes. If Spark runs in yarn-client deploy mode, this bug occurs.

@srowen (Member) commented Mar 26, 2019

That makes more sense if this isn't YARN-specific, but isn't this still using a local path as if it's remote, or am I misreading?

@beliefer (Contributor, Author)

> That makes more sense if this isn't YARN-specific, but isn't this still using a local path as if it's remote, or am I misreading?

The SQL starting with insert overwrite local directory means the target path is local.
There is an intermediate step that writes the result data to a temp path first. After that step, the result data is moved from the temp path to the local target path.
The original code uses a local path as the temp path.
This PR uses a distributed path as the temp path.

@srowen (Member) commented Mar 27, 2019

OK it's probably that I don't know this code well. Maybe my question is this: when isLocal == true, storage.locationUri is a local path, so targetPath points to a local directory. It is now used as a path on distributed storage here. It may not exist or be usable that way. Is that OK? It may happen to work but seems a little wrong. Or else I am misunderstanding how the temp dir gets created and it's not a big deal.

@meteorchenwu

It is great. I have merged it into my production environment.

@beliefer (Contributor, Author) commented Mar 28, 2019

> OK it's probably that I don't know this code well. Maybe my question is this: when isLocal == true, storage.locationUri is a local path, so targetPath points to a local directory. It is now used as a path on distributed storage here. It may not exist or be usable that way. Is that OK? It may happen to work but seems a little wrong. Or else I am misunderstanding how the temp dir gets created and it's not a big deal.

Yes, when isLocal == true, storage.locationUri is a local path, so targetPath points to a local directory. It is not used as a path on distributed storage here.
You mean targetPath may not exist or be usable that way, so this PR creates the local directory explicitly.
Let me explain another way:
The Driver and the Executors run on different nodes (machines). The SQL starting with insert overwrite local directory writes data to a temp path first. The original code uses a local temp path; that temp path is valid on the node running the Driver but invalid on the nodes running the Executors.
So this PR changes the local temp path to a distributed storage path that can be recognized by the Driver and all the Executors.

@srowen (Member) commented Mar 28, 2019

Yes I get all that, but here:
val qualifiedPath = FileUtils.makeQualified(targetPath, hadoopConf)
targetPath is local. Why would you know it's a valid path on distributed storage?

@beliefer (Contributor, Author)

> It is great. I have merged it into my production environment.

I added some code to guarantee robustness.

@beliefer (Contributor, Author) commented Mar 28, 2019

> Yes I get all that, but here:
> val qualifiedPath = FileUtils.makeQualified(targetPath, hadoopConf)
> targetPath is local. Why would you know it's a valid path on distributed storage?

In yarn-client mode, qualifiedPath is a distributed path although targetPath is a local path.
In local mode, qualifiedPath is a local path.
I have debugged and logged these variables in our production environment.

The variables in our production environment are as follows:
targetPath /home/xitong/gja_test/call_center
qualifiedPath hdfs://namenode.********:9000/home/xitong/gja_test/call_center

@SparkQA commented Mar 28, 2019

Test build #104033 has finished for PR 23841 at commit 0913a45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Mar 28, 2019

After the change, qualifiedPath is constructed as:
val qualifiedPath = FileUtils.makeQualified(targetPath, hadoopConf)
Before it was qualified w.r.t. the local file system.
Doesn't this become a path on the distributed file system? that's the point of this change. But, if targetPath is "/tmp/foo", this assumes it's valid to use "/tmp/foo" on the distributed storage. That doesn't seem technically right. Am I missing something so far?

@beliefer (Contributor, Author) commented Mar 29, 2019

> After the change, qualifiedPath is constructed as:
> val qualifiedPath = FileUtils.makeQualified(targetPath, hadoopConf)
> Before it was qualified w.r.t. the local file system.
> Doesn't this become a path on the distributed file system? that's the point of this change. But, if targetPath is "/tmp/foo", this assumes it's valid to use "/tmp/foo" on the distributed storage. That doesn't seem technically right. Am I missing something so far?

The key issue is not targetPath but tmpPath.
Before this PR's change, the variables in our production environment in local mode were as follows:
targetPath /home/***/gja_test/call_center
qualifiedPath(not used in fact) file:/home/***/gja_test/call_center
writeToPath file:/home/***/gja_test/call_center
tmpPath file:/home/***/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000
tmpPath is constructed from writeToPath. Both paths are on the Driver node.

Before this PR's change, the variables in our production environment in yarn-client mode were as follows:
targetPath /home/***/gja_test/call_center
qualifiedPath(not used in fact) hdfs://namenode.***:9000/home/***/gja_test/call_center
writeToPath file:/home/***/gja_test/call_center
tmpPath file:/home/***/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000
tmpPath is constructed from writeToPath. Both paths are on the Driver node, so the Executors cannot recognize tmpPath.

After this PR's change, the variables in our production environment in yarn-client mode are as follows:
targetPath /home/***/gja_test/call_center
qualifiedPath hdfs://namenode.***:9000/home/***/gja_test/call_center
writeToPath file:/home/***/gja_test/call_center
tmpPath hdfs://namenode.***:9000/home/***/hive/stagingdir_hive_2019-02-19_17-18-29_777_5647666931466990865-1/-ext-10000
tmpPath is constructed from qualifiedPath. writeToPath belongs to the Driver, and tmpPath belongs to distributed storage.

@beliefer (Contributor, Author)

> After the change, qualifiedPath is constructed as:
> val qualifiedPath = FileUtils.makeQualified(targetPath, hadoopConf)
> Before it was qualified w.r.t. the local file system.
> Doesn't this become a path on the distributed file system? that's the point of this change. But, if targetPath is "/tmp/foo", this assumes it's valid to use "/tmp/foo" on the distributed storage. That doesn't seem technically right. Am I missing something so far?

If targetPath is "/tmp/foo", this assumes it's valid to use "hdfs://tmp/foo" as the tmpPath on distributed storage.
But the original code uses "file:/tmp/foo" as the tmpPath.

@srowen (Member) commented Mar 29, 2019

Yeah, I think that's a potential problem. We don't know whether that same path is valid on a distributed store. I agree there's a problem to fix here. Is there any standard tmp dir you can instead write to in this case rather than reusing that local path as a distributed path?

@beliefer (Contributor, Author) commented Mar 29, 2019

> Yeah, I think that's a potential problem. We don't know whether that same path is valid on a distributed store. I agree there's a problem to fix here. Is there any standard tmp dir you can instead write to in this case rather than reusing that local path as a distributed path?

The purpose of executing this command is to write the data to a local path, so targetPath should be a local path. writeToPath, created with localFileSystem.makeQualified(targetPath), determines the final path that stores the result data; since localFileSystem is a LocalFileSystem, writeToPath is the locally qualified targetPath. qualifiedPath is always a distributed path in yarn-client deploy mode. tmpPath, created by getExternalTmpPath from qualifiedPath, serves as temporary storage for the result data. I think getExternalTmpPath generates a standard tmp dir: if qualifiedPath is a local path, the tmpPath it creates is a local path too; if qualifiedPath is a distributed path, the tmpPath is a distributed path too. qualifiedPath is not the same as writeToPath or targetPath, so the local path is not reused.
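As a reading aid only, reconstructed from the variable names used in this thread rather than the actual InsertIntoHiveDirCommand source, the path relationships might be sketched as a hypothetical helper:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hive.common.FileUtils

// Hypothetical helper, only to illustrate how the paths in this thread relate;
// the real logic lives inside InsertIntoHiveDirCommand.
def resolvePaths(targetDir: String, isLocal: Boolean, hadoopConf: Configuration): (Path, Path, FileSystem) = {
  val targetPath = new Path(targetDir)   // e.g. /home/xitong/gja_test/call_center
  // Qualified against the default FileSystem: hdfs://... in yarn-client mode, file:/... in local mode.
  val qualifiedPath = FileUtils.makeQualified(targetPath, hadoopConf)
  // The final destination: the driver-local file system when LOCAL DIRECTORY is used.
  val (writeToPath, fs) =
    if (isLocal) {
      val localFileSystem = FileSystem.getLocal(hadoopConf)
      (localFileSystem.makeQualified(targetPath), localFileSystem)
    } else {
      (qualifiedPath, qualifiedPath.getFileSystem(hadoopConf))
    }
  // After this PR, the staging path is derived from qualifiedPath rather than writeToPath,
  // e.g. tmpPath = getExternalTmpPath(sparkSession, hadoopConf, qualifiedPath), so in
  // yarn-client mode it lands on distributed storage that every executor can reach.
  (qualifiedPath, writeToPath, fs)
}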

@srowen (Member) commented Mar 29, 2019

OK, if you're saying this change doesn't cause the temp path to be some function of the (local) write path, then I get it. If it just comes up with a standard tmp location that is now on distributed storage, I get it.

@beliefer (Contributor, Author)

> OK, if you're saying this change doesn't cause the temp path to be some function of the (local) write path, then I get it. If it just comes up with a standard tmp location that is now on distributed storage, I get it.

Yes, it is.

@srowen (Member) commented Apr 1, 2019

@maropu @cloud-fan does this sound reasonable to you? given the extended discussion above and the last couple comments here

@beliefer (Contributor, Author) commented Apr 3, 2019

@maropu @cloud-fan srowen and I have had a lot of discussion above; could you take a look at the last couple of comments?

@beliefer (Contributor, Author) commented Apr 4, 2019

@cloud-fan Please look at this PR, thanks!

@beliefer (Contributor, Author) commented Apr 4, 2019

> @maropu @cloud-fan does this sound reasonable to you? given the extended discussion above and the last couple comments here

This PR has been going on for more than 40 days. Please help me advance it. Thanks.

@srowen (Member) commented Apr 5, 2019

Given the discussion, my understanding, passed tests, and reports that this fixes the problem in a prod env, I'm merging this to master.

@srowen srowen closed this in 979bb90 Apr 5, 2019
@beliefer (Contributor, Author) commented Apr 6, 2019

> Given the discussion, my understanding, passed tests, and reports that this fixes the problem in a prod env, I'm merging this to master.

Thank you very much. Your rigor and attitude throughout the discussion are very helpful to contributors.
