Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21384] [YARN] Spark + YARN fails with LocalFileSystem as default FS #19141

Closed
wants to merge 2 commits into from

Conversation

devaraj-kavali
Copy link

@devaraj-kavali devaraj-kavali commented Sep 6, 2017

What changes were proposed in this pull request?

When the libraries temp directory(i.e. spark_libs.zip dir) file system and staging dir(destination) file systems are the same then the spark_libs.zip is not copying to the staging directory. But after making this decision the libraries zip file is getting deleted immediately and becoming unavailable for the Node Manager's localization.

With this change, client copies the files to remote always when the source scheme is "file".

How was this patch tested?

I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.

@jerryshao
Copy link
Contributor

Can you please describe your usage scenario and steps to reproduce your issue, from my understanding. Did you configure your default FS to a local FS?

@jerryshao
Copy link
Contributor

Also looks like this is not a Spark 2.2 issue, would you please fix the PR title be more accurate about the problem?

@devaraj-kavali devaraj-kavali changed the title [SPARK-21384] [YARN] Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails [SPARK-21384] [YARN] Spark + YARN fails with LocalFileSystem as default FS Sep 6, 2017
@devaraj-kavali
Copy link
Author

Thanks @jerryshao for looking into this PR.

Can you please describe your usage scenario and steps to reproduce your issue, from my understanding. Did you configure your default FS to a local FS?

Yea, this can be reproduced with one node YARN cluster and LocalFileSystem as default FS. All the spark applications fail with YARN with LocalFileSystem. This issue can be avoided by setting any one of the spark.yarn.jars / spark.yarn.archive configurations.

Also looks like this is not a Spark 2.2 issue, would you please fix the PR title be more accurate about the problem?

I have updated the PR.

@jerryshao
Copy link
Contributor

I see, thanks for the explanation.

@jerryshao
Copy link
Contributor

jerryshao commented Sep 7, 2017

OK to test.

(I may not have the permission to trigger Jenkins test 😞 , let me try it)

@@ -565,7 +565,6 @@ private[spark] class Client(
distribute(jarsArchive.toURI.getPath,
resType = LocalResourceType.ARCHIVE,
destName = Some(LOCALIZED_LIB_DIR))
jarsArchive.delete()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're undoing the fix for SPARK-20741. If this is causing a problem and you want to fix it, you need to make it so that you don't do this only when the specific scenario that's causing the problem happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Marcelo, this is a valid concern, we should not avoid such regression here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vanzin for the pointer. It was my mistake, I missed the change reason while looking at the history of the file.

I still see that SPARK-20741 has fixed the issue partially, it leaves __spark_conf__*.zip file to delete as part of shutdownhook.

I see these approaches to fix it further,

  1. Delete __spark_conf__.zip and __spark_libs__.zip files after completing the application similar to cleanupStagingDir.
    (Or)
  2. Add a configuration whether to delete __spark_conf__.zip and __spark_libs__.zip files after copying to dest dir, so that users can decide whether to delete these immediately or as part of process exit. In case of SPARK-20741, this new configuration can be enabled to delete these files immediately.

@vanzin & @jerryshao Please let me know your thoughts on this or if you have any other way to do this. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if your scenario and SPARK-20741's scenario are both encountered? Looks like your approach above cannot be worked.

I'm wondering if we can copy or move this spark_libs.zip temp file to another non-temp file and add that file to the dist cache. That non-temp file will not be deleted and can be overwritten during another launching, so we will always have only one copy.

Besides, I think we have several workarounds to handle this issue like spark.yarn.jars or spark.yarn.archive, so looks like this corner case is not so necessary to fix (just my thinking, normally people will not use local FS in a real cluster).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jerryshao for the comment.

What if your scenario and SPARK-20741's scenario are both encountered? Looks like your approach above cannot be worked.

Can you provide some information why you think it doesn't work? If we delete the spark_libs.zip after completing the application(similar to staging dir deletion), it would not stack up till the process exit which solves SPARK-20741 and also becomes available during the execution for this current issue.

I'm wondering if we can copy or move this spark_libs.zip temp file to another non-temp file and add that file to the dist cache. That non-temp file will not be deleted and can be overwritten during another launching, so we will always have only one copy.

If there are multiple jobs submitted/running concurrently, we would be overwriting the existing with the latest spark_libs.zip which may lead to apps failure during the copy-in-progress and also would become ambiguous to delete the file by which application.

Besides, I think we have several workarounds to handle this issue like spark.yarn.jars or spark.yarn.archive, so looks like this corner case is not so necessary to fix (just my thinking, normally people will not use local FS in a real cluster).

I agree, this is a corner case and can be handled with workaround.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about this again, I think you're right. But I'm not sure if the program will be crashed or not if we delete the dependencies in the run-time.

Copy link
Contributor

@vanzin vanzin Sep 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a configuration is rarely the right fix for anything. You shouldn't have to choose to have the correct behavior or not.

You can probably fix this by always uploading things to the distributed cache for resources with a "file:" scheme. This will penalize those running things with a local default FS, but it will work. And the 99.9% of the world that does not do that will not be affected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vanzin for the comment, I will update the PR as per your suggestion.

@vanzin
Copy link
Contributor

vanzin commented Sep 19, 2017

ok to test

@SparkQA
Copy link

SparkQA commented Sep 19, 2017

Test build #81953 has finished for PR 19141 at commit d2d13fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Sep 20, 2017

LGTM, merging to master.

@vanzin
Copy link
Contributor

vanzin commented Sep 20, 2017

(Also merging to 2.2.)

asfgit pushed a commit that referenced this pull request Sep 20, 2017
…t FS

## What changes were proposed in this pull request?

When the libraries temp directory(i.e. __spark_libs__*.zip dir) file system and staging dir(destination) file systems are the same then the __spark_libs__*.zip is not copying to the staging directory. But after making this decision the libraries zip file is getting deleted immediately and becoming unavailable for the Node Manager's localization.

With this change, client copies the files to remote always when the source scheme is "file".

## How was this patch tested?

I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.

Author: Devaraj K <[email protected]>

Closes #19141 from devaraj-kavali/SPARK-21384.

(cherry picked from commit 55d5fa7)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in 55d5fa7 Sep 20, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…t FS

## What changes were proposed in this pull request?

When the libraries temp directory(i.e. __spark_libs__*.zip dir) file system and staging dir(destination) file systems are the same then the __spark_libs__*.zip is not copying to the staging directory. But after making this decision the libraries zip file is getting deleted immediately and becoming unavailable for the Node Manager's localization.

With this change, client copies the files to remote always when the source scheme is "file".

## How was this patch tested?

I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.

Author: Devaraj K <[email protected]>

Closes apache#19141 from devaraj-kavali/SPARK-21384.

(cherry picked from commit 55d5fa7)
Signed-off-by: Marcelo Vanzin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants