[SPARK-26936][SQL] Fix bug of insert overwrite local dir can not create temporary path in local staging directory #23841
Conversation
Can you add tests before running tests in Jenkins?
ok to test
Test build #102610 has finished for PR 23841 at commit
Test build #102612 has finished for PR 23841 at commit
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala
Test build #102614 has finished for PR 23841 at commit
Test build #102642 has finished for PR 23841 at commit
@maropu Please review this PR again, thanks!
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala
val tmpPath = getExternalTmpPath(sparkSession, hadoopConf, writeToPath)
// The temporary path must be a HDFS path, not a local path.
val tmpPath = getExternalTmpPath(sparkSession, hadoopConf, qualifiedPath)
In case of inserts from non-hive tables, we still need to use a non-local path?
If the target path is local, we still need to use a non-local path.
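A minimal sketch of the idea being discussed, assuming the staging location should be resolved against the session's default (typically HDFS) filesystem rather than against the local target path; the helper name below is hypothetical and not the PR's actual code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: qualify a staging directory against the default
// filesystem so its scheme comes from fs.defaultFS (e.g. hdfs://), not file://.
def qualifiedStagingPath(hadoopConf: Configuration, stagingDirName: String): Path = {
  val defaultFs = FileSystem.get(hadoopConf) // filesystem behind fs.defaultFS
  defaultFs.makeQualified(new Path(stagingDirName))
}
```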
val path = dir.toURI.getPath
val notExistsPath = s"${path}/src/result"
sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${notExistsPath}' SELECT * FROM src where key < 10")
nit: ${notExistsPath} -> $notExistsPath
OK, I have adjusted it.
sql(
  s"""
     |INSERT OVERWRITE LOCAL DIRECTORY '${notExistsPath}'
ditto
OK, I have adjusted it.
     |CREATE TEMPORARY VIEW orc_source
     |USING org.apache.spark.sql.hive.orc
     |OPTIONS (
     |  PATH '${notExistsPath}'
ditto
OK, I have adjusted it.
checkAnswer(
  sql("select * from orc_source"),
  sql("select * from src where key < 10"))
You should set literals in the expected answer.
This code follows the pattern used in other unit tests.
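A hedged sketch of the reviewer's suggestion, comparing against literal rows instead of a second query; the row values below are illustrative, not taken from the real src table:

```scala
import org.apache.spark.sql.Row

// Sketch only: literal expected rows, so a bug in the read path cannot
// silently affect both sides of the comparison.
checkAnswer(
  sql("select * from orc_source"),
  Seq(Row(0, "val_0"), Row(4, "val_4"), Row(9, "val_9"))) // hypothetical values
```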
@@ -581,6 +581,38 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
  }
}

test("insert overwrite to not exists dir from hive metastore table") {
What's the difference from the "insert overwrite to not exist local dir" test?
One is used to test whether the result is correct; the other is used to test whether the non-existent path gets created.
Can you gather them into a single test?
Yes, I can remove the "insert overwrite to not exist local dir" test and retain the other.
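A hedged sketch of how the two checks could be gathered into one test, assuming the suite's withTempDir helper and the Hive src test table; the test name and the STORED AS orc choice are illustrative:

```scala
test("insert overwrite to a non-existent local dir (sketch)") {
  withTempDir { dir =>
    val notExistsPath = s"${dir.toURI.getPath}/src/result"
    sql(
      s"""
         |INSERT OVERWRITE LOCAL DIRECTORY '$notExistsPath'
         |STORED AS orc
         |SELECT * FROM src WHERE key < 10
       """.stripMargin)
    // 1) the previously missing directory should now exist ...
    assert(new java.io.File(notExistsPath).exists())
    // 2) ... and the written data should match the source query
    checkAnswer(
      spark.read.orc(notExistsPath),
      sql("SELECT * FROM src WHERE key < 10"))
  }
}
```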
cc: @dongjoon-hyun
Test build #102786 has finished for PR 23841 at commit
Test build #102788 has finished for PR 23841 at commit
I realized that the test you added passed in the master without your fix... Can you check again?
The parent path
Test build #102810 has finished for PR 23841 at commit
My test case was that
./bin/spark-master --master=local[*] If you have any precondition for the failure, can you update the PR description?
OK, I have added the deploy mode to the PR description.
I have run the SQL you provided in local[*] deploy mode and the inconsistent behavior still appears.
'/tmp/noexistdir/t' is not a directory but a file.
Then I pulled the master branch, compiled it, and deployed it on my Hadoop cluster. I got the inconsistent behavior again.
The
The original code of
I have updated the PR and the associated JIRA.
@maropu Could you review the PR again? Thanks.
I have checked the source of Hadoop.
If the target path is a directory,
Does this issue only happen in yarn-client mode?
This SQL will trigger
You need to first check the test you added in this PR... it still passes without your fix.
Maybe the UT runs locally and depends on the operating system or other environment differences.
Test build #102907 has finished for PR 23841 at commit
Test build #102911 has finished for PR 23841 at commit
You're changing all code paths though, ones that work. I am not clear that's valid. Why not work out why this happens only in yarn-client mode?
I paste the full stack trace as follows:
@srowen
Will we hit this bug when we deploy Spark in cluster mode? Seems to me it's not specific to YARN.
Yes, if Spark runs in cluster deploy mode, we will hit this bug.
That makes more sense if this isn't YARN-specific, but isn't this still using a local path as if it's remote, or am I misreading?
The SQL starts with
OK it's probably that I don't know this code well. Maybe my question is this: when
It is great. I have merged it into my production environment.
Yes, when
Yes I get all that, but here:
I added some code to guarantee robustness.
The variables in our production environment are as follows:
Test build #104033 has finished for PR 23841 at commit
After the change,
The key issue is not the
Before the change of this PR, the variables in our production environment in
After the change of this PR, the variables in our production environment in
Yeah, I think that's a potential problem. We don't know whether that same path is valid on a distributed store. I agree there's a problem to fix here. Is there any standard tmp dir you can instead write to in this case rather than reusing that local path as a distributed path?
The purpose of executing this command is to write the data to a local path. So the
OK, if you're saying this change doesn't cause the temp path to be some function of the (local) write path, then I get it. If it just comes up with a standard tmp location that is now on distributed storage, I get it.
Yes, it is.
@maropu @cloud-fan does this sound reasonable to you, given the extended discussion above and the last couple of comments here?
@maropu @cloud-fan srowen and I have had a lot of discussion above; can you take a look at the last couple of comments?
@cloud-fan Please take a look at this PR, thanks!
This PR has been open for more than 40 days. Please help me move it forward. Thanks.
Given the discussion, my understanding, passed tests, and reports that this fixes the problem in a prod env, I'm merging this to master.
Thank you very much. Your rigor and attitude towards the discussion are very helpful to contributors.
What changes were proposed in this pull request?
The environment of my cluster is as follows:
My Spark runs in yarn-client deploy mode.
If I execute the SQL
insert overwrite local directory '/home/test/call_center/' select * from call_center
a HiveException will appear as follows:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000/_temporary/0/_temporary/attempt_20190219173233_0002_m_000000_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_000011)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
Currently Spark SQL generates a local temporary path in the local staging directory. The scheme of this local temporary path starts with file, so the HiveException appears. This PR changes the local temporary path to an HDFS temporary path and uses a DistributedFileSystem instance to copy the data from the HDFS temporary path to the local directory.
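A minimal sketch of the flow described above, not the PR's exact code: results are staged on the default (distributed) filesystem and then copied down to the requested local directory; the method and parameter names are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: pull staged results from the distributed filesystem down to
// the local target directory, deleting the staging data after the copy.
def moveStagingToLocalDir(hadoopConf: Configuration,
                          hdfsTmpPath: Path,
                          localTargetDir: Path): Unit = {
  val fs: FileSystem = hdfsTmpPath.getFileSystem(hadoopConf) // typically a DistributedFileSystem
  fs.copyToLocalFile(/* delSrc = */ true, hdfsTmpPath, localTargetDir)
}
```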
If Spark runs in local deploy mode, 'insert overwrite local directory' works fine.
How was this patch tested?
The UT cannot cover yarn-client mode. The fix was tested in my production environment.