
[SPARK-2933] [yarn] Refactor and cleanup Yarn AM code. #2020

Closed
wants to merge 22 commits

Conversation


@vanzin vanzin commented Aug 18, 2014

This change modifies the Yarn module so that all the logic related
to running the ApplicationMaster is localized. Previously there were
4 different classes with mostly identical code; now we have:

  • A single, shared ApplicationMaster class, which can operate in both
    client and cluster mode, and replaces the old ApplicationMaster
    (for cluster mode) and ExecutorLauncher (for client mode).

The benefit here is that all execution modes for all supported Yarn
versions use the same shared code for monitoring executor allocation,
setting up configuration, and monitoring the process's lifecycle.

  • A new YarnRMClient interface, which defines basic RM functionality needed
    by the ApplicationMaster. This interface has concrete implementations for
    each supported Yarn version.
  • A new YarnAllocator interface, which just abstracts the existing interface
    of the YarnAllocationHandler class. This is to avoid having to touch the
    allocator code too much in this change, although it might benefit from a
    similar effort in the future.

The end result is code that is much easier to understand, with far less
duplication, making it easier to fix bugs, add features, and test everything
knowing that all supported versions will behave the same.
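To make the shape of the refactoring concrete, here is a minimal sketch of how the pieces could fit together. The trait names come from the description above, but the method names and signatures (register, unregister, allocateResources, run) are illustrative assumptions, not the exact interfaces introduced by this patch.

```scala
// Hypothetical, simplified sketch of the refactored structure.
trait YarnAllocator {
  def allocateResources(): Unit        // request/receive executor containers from the RM
  def getNumExecutorsFailed: Int
}

trait YarnRMClient {
  // Register the AM with the ResourceManager; implemented once per Yarn version.
  def register(uiAddress: String): YarnAllocator
  def unregister(finalStatus: String, diagnostics: String = ""): Unit
}

// A single ApplicationMaster shared by client and cluster mode and by all
// supported Yarn versions; only the YarnRMClient implementation differs.
class ApplicationMaster(rmClient: YarnRMClient, isClusterMode: Boolean) {
  // Flipped by a finish/shutdown path, omitted in this sketch.
  @volatile private var finished = false

  def run(): Unit = {
    if (isClusterMode) {
      // cluster mode: start the user's driver inside this process
    } else {
      // client mode (the old ExecutorLauncher): only manage executors
    }
    val allocator = rmClient.register(uiAddress = "")
    while (!finished) {                // reporter loop: keep executor count at target
      allocator.allocateResources()
      Thread.sleep(1000)
    }
    rmClient.unregister("SUCCEEDED")
  }
}
```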


vanzin commented Aug 18, 2014

Man, I just merged with master and there's already a conflict... will update shortly.

Marcelo Vanzin added 12 commits August 18, 2014 14:14
- Made some tweaks to the YarnAllocator interface to cover both APIs more
  easily. There's still a lot of cleanup possible on that front, but I'll
  leave that as a separate task.
- This avoids the NPEs that would happen if code just kept going.
- Makes code a little cleaner and easier to follow.

SparkQA commented Aug 18, 2014

QA tests have started for PR 2020 at commit 5100474.

  • This patch does not merge cleanly!


vanzin commented Aug 18, 2014

I tried to keep the allocator changes minimal; a lot of it is just removing dead code and moving shared code to a shared place, but I did make some changes to make the alpha and stable interfaces work similarly.

@tgravescs, it would be great if you guys could test the alpha changes; they compile, but I don't have a 0.23.x cluster to test them on. That would be greatly appreciated!

I tested both client and cluster mode in stable yarn (CDH 5.1 = 2.3.0), including killing the AM and killing executors while jobs were running.

@tgravescs
Contributor

I'll take a look and hopefully try it out tomorrow. You didn't happen to change the client mode to be an unmanaged AM, did you?


SparkQA commented Aug 18, 2014

QA tests have finished for PR 2020 at commit 5100474.

  • This patch fails unit tests.
  • This patch does not merge cleanly!


vanzin commented Aug 18, 2014

No, I don't believe I changed anything in that regard. Goal was to keep all the current functionality intact.


vanzin commented Aug 18, 2014

Jenkins, test this please.


SparkQA commented Aug 18, 2014

QA tests have started for PR 2020 at commit 557fdeb.

  • This patch merges cleanly.


SparkQA commented Aug 18, 2014

QA tests have finished for PR 2020 at commit 557fdeb.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

resourceManager: AMRMProtocol,
appAttemptId: ApplicationAttemptId,
args: ApplicationMasterArguments,
map: collection.Map[String, collection.Set[SplitInfo]])
Contributor


It might be nice to have a more descriptive name than map.

@tgravescs
Contributor

I took one pass through this and made some minor comments. I also ran a few small tests on 0.23 and they worked ok.

I want to take another more detailed pass over this though.

@tgravescs
Contributor

For some reason GitHub isn't letting me comment on the files anymore, so here are a couple more:

allocateExecutors() isn't being called.
In ClientBase, amClass isn't used.

Marcelo Vanzin added 2 commits August 19, 2014 10:09
Conflicts:
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
@@ -19,7 +19,7 @@ package org.apache.spark.scheduler.cluster
 
 import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
 import org.apache.spark.{SparkException, Logging, SparkContext}
-import org.apache.spark.deploy.yarn.{Client, ClientArguments, ExecutorLauncher, YarnSparkHadoopUtil}
+import org.apache.spark.deploy.yarn.{ApplicationMaster, Client, ClientArguments, YarnSparkHadoopUtil}
Contributor


I don't see that ApplicationMaster is used here.


SparkQA commented Aug 25, 2014

QA tests have started for PR 2020 at commit 0f5142c.

  • This patch merges cleanly.


SparkQA commented Aug 25, 2014

QA tests have finished for PR 2020 at commit 0f5142c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

// Note: this needs to happen before allocateExecutors.
Contributor


nit: this comment isn't valid anymore.

Option(diagnostics).map(msg => s" (diag message: $msg)").getOrElse(""))
finished = true
finalStatus = status
reporterThread.interrupt()
Contributor


This is being called from the reporterThread in the case of checkNumExecutorsFailed. So interrupting that thread prevents it from finishing its cleanup, like setting the diagnostics message and calling sc.stop().

On Hadoop 0.23 this results in the application hanging.

Contributor Author


Hmmm. Works fine in stable, and interrupting "self" should be ok in general. But I guess it's safer to not do it (and avoid the sleep in the reporter thread when the AM is finished).
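For reference, a minimal sketch of the pattern that avoids this problem: the finish path only sets a flag, and the reporter loop checks it, so the remaining cleanup still runs even when finish() is triggered from the reporter thread itself. The class and helper names below are assumptions for illustration, not the actual ApplicationMaster code.

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Illustrative sketch only; names and structure are assumed.
class ReporterThreadSketch {
  @volatile private var finished = false
  @volatile private var finalStatus = FinalApplicationStatus.UNDEFINED

  private def checkNumExecutorsFailed(): Unit = {
    // ...may call finish(FinalApplicationStatus.FAILED, "max executor failures reached")
  }

  def launchReporterThread(intervalMs: Long): Thread = {
    val t = new Thread {
      override def run(): Unit = {
        while (!finished) {
          checkNumExecutorsFailed()   // can invoke finish() on this very thread
          Thread.sleep(intervalMs)
        }
        // The loop exits on its own, so any cleanup that follows (setting the
        // diagnostics message, sc.stop(), unregistering from the RM) still runs.
      }
    }
    t.setDaemon(true)
    t.start()
    t
  }

  def finish(status: FinalApplicationStatus, diagnostics: String = ""): Unit = {
    finalStatus = status
    finished = true
    // Note: no reporterThread.interrupt() here. If finish() is called from the
    // reporter thread itself, interrupting it would abort the remaining cleanup.
  }
}
```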

Contributor


I was testing on stable too, and it didn't report the diagnostics properly or call sc.stop().

Contributor


FYI: the way I tested this was to use a 32-bit JDK (jdk32) but try to allocate executors with >4G of memory.

Contributor Author


Fixed and re-tested (and re-merged). I generally run this in a loop to try this out:

jps | grep CoarseGrained | awk '{print $1}' | xargs kill

@tgravescs
Contributor

Other than the one comment, I think this looks good.

Marcelo Vanzin added 2 commits August 26, 2014 12:43
Conflicts:
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala

SparkQA commented Aug 26, 2014

QA tests have started for PR 2020 at commit ff389ed.

  • This patch does not merge cleanly!


SparkQA commented Aug 26, 2014

QA tests have finished for PR 2020 at commit ff389ed.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@tgravescs
Contributor

Jenkins, test this please


SparkQA commented Aug 27, 2014

QA tests have started for PR 2020 at commit 3bbf3e7.

  • This patch merges cleanly.


SparkQA commented Aug 27, 2014

QA tests have finished for PR 2020 at commit 3bbf3e7.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

Ok, this looks good. +1. Thanks for taking this on, @vanzin!

@asfgit asfgit closed this in b92d823 Aug 27, 2014
@tgravescs
Contributor

I committed this just to master.

@vanzin vanzin deleted the SPARK-2933 branch August 27, 2014 20:51
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Marcelo Vanzin <[email protected]>

Closes apache#2020 from vanzin/SPARK-2933 and squashes the following commits:

3bbf3e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
ff389ed [Marcelo Vanzin] Do not interrupt reporter thread from within itself.
3a8ed37 [Marcelo Vanzin] Remote stale comment.
0f5142c [Marcelo Vanzin] Review feedback.
41f8c8a [Marcelo Vanzin] Fix app status reporting.
c0794be [Marcelo Vanzin] Correctly clean up staging directory.
92770cc [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
ecaf332 [Marcelo Vanzin] Small fix to shutdown code.
f02d3f8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
f581122 [Marcelo Vanzin] Review feedback.
557fdeb [Marcelo Vanzin] Cleanup a couple more constants.
be6068d [Marcelo Vanzin] Restore shutdown hook to clean up staging dir.
5150993 [Marcelo Vanzin] Some more cleanup.
b6289ab [Marcelo Vanzin] Move cluster/client code to separate methods.
ecb23cd [Marcelo Vanzin] More trivial cleanup.
34f1e63 [Marcelo Vanzin] Fix some questionable error handling.
5657c7d [Marcelo Vanzin] Finish app if SparkContext initialization times out.
0e4be3d [Marcelo Vanzin] Keep "ExecutorLauncher" as the main class for client-mode AM.
91beabb [Marcelo Vanzin] Fix UI filter registration.
8c72239 [Marcelo Vanzin] Trivial cleanups.
99a52d5 [Marcelo Vanzin] Changes to the yarn-alpha project to use common AM code.
848ca6d [Marcelo Vanzin] [SPARK-2933] [yarn] Refactor and cleanup Yarn AM code.
asfgit pushed a commit that referenced this pull request Sep 23, 2014
This is part of a broader effort to clean up the Yarn integration code after #2020.

The high-level changes in this PR include:
- Removing duplicate code, especially across the alpha and stable APIs
- Simplifying unnecessarily complex method signatures and hierarchies
- Renaming unclear variable and method names
- Organizing the logging output produced when the user runs Spark on Yarn
- Extensively adding documentation
- Privatizing classes where possible

I have tested the stable API on a Hadoop 2.4 cluster. I tested submitting a jar that references classes in other jars in both client and cluster mode. I also made changes in the alpha API, though I do not have access to an alpha cluster. I have verified that it compiles, but it would be ideal if others can help test it.

For those interested in some examples in detail, please read on.

--------------------------------------------------------------------------------------------------------

***Appendix***

- The loop to `getApplicationReport` from the RM is duplicated in 4 places: in the stable `Client`, the alpha `Client`, and twice in `YarnClientSchedulerBackend`. We should not have different loops for client and cluster deploy modes (see the sketch after this list).
- There are many fragmented small helper methods that are only used once and should just be inlined. For instance, `ClientBase#getLocalPath` returns `null` on certain conditions, and its only caller `ClientBase#addFileToClasspath` checks whether the value returned is `null`. We could just have the caller check on that same condition to avoid passing `null`s around.
- In `YarnSparkHadoopUtil#addToEnvironment`, we take in an argument `classpathSeparator` that always has the same value upstream (i.e. `File.pathSeparator`). This argument is now removed from the signature and all callers of this method upstream.
- `ClientBase#copyRemoteFile` is now renamed to `copyFileToRemote`. It was unclear whether we are copying a remote file to our local file system, or copying a locally visible file to a remote file system. Also, even the content of the method has inaccurately named variables. We use `val remoteFs` to signify the file system of the locally visible file and `val fs` to signify the remote, destination file system. These are now renamed `srcFs` and `destFs` respectively.
- We currently log the AM container's environment and resource mappings directly as Scala collections. This is incredibly hard to read and probably too verbose for the average Spark user. In other modes (e.g. standalone), we also don't log the launch commands by default, so the logging level of this information is now set to `DEBUG`.
- None of these classes (`Client`, `ClientBase`, `YarnSparkHadoopUtil` etc.) is intended to be used by a Spark application (the user should go through Spark submit instead). At the very least they should be `private[spark]`.
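Taking the first item in this appendix as an example, a single shared polling helper could look roughly like the sketch below. The object name, method name, and exact signature are illustrative assumptions, not the code actually added by this change.

```scala
import org.apache.hadoop.yarn.api.records.{ApplicationId, ApplicationReport, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient

object ReportPollingSketch {
  /** Poll the RM until the application is finished (or, optionally, running). */
  def monitorApplication(
      yarnClient: YarnClient,
      appId: ApplicationId,
      returnOnRunning: Boolean = false,
      intervalMs: Long = 1000): ApplicationReport = {
    while (true) {
      Thread.sleep(intervalMs)
      val report = yarnClient.getApplicationReport(appId)
      val state = report.getYarnApplicationState
      val terminal =
        state == YarnApplicationState.FINISHED ||
        state == YarnApplicationState.FAILED ||
        state == YarnApplicationState.KILLED
      if (terminal || (returnOnRunning && state == YarnApplicationState.RUNNING)) {
        return report
      }
    }
    // Unreachable, but keeps the declared return type satisfied.
    throw new IllegalStateException("unreachable")
  }
}
```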

Author: Andrew Or <[email protected]>

Closes #2350 from andrewor14/yarn-cleanup and squashes the following commits:

39e8c7b [Andrew Or] Address review comments
6619f9b [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
2ca6d64 [Andrew Or] Improve logging in application monitor
a3b9693 [Andrew Or] Minor changes
7dd6298 [Andrew Or] Simplify ClientBase#monitorApplication
547487c [Andrew Or] Provide default values for null application report entries
a0ad1e9 [Andrew Or] Fix class not found error
1590141 [Andrew Or] Address review comments
45ccdea [Andrew Or] Remove usages of getAMMemory
d8e33b6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
ed0b42d [Andrew Or] Fix alpha compilation error
c0587b4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
6d74888 [Andrew Or] Minor comment changes
6573c1d [Andrew Or] Clean up, simplify and document code for setting classpaths
e4779b6 [Andrew Or] Clean up log messages + variable naming in ClientBase
8766d37 [Andrew Or] Heavily add documentation to Client* classes + various clean-ups
6c94d79 [Andrew Or] Various cleanups in ClientBase and ClientArguments
ef7069a [Andrew Or] Clean up YarnClientSchedulerBackend more
6de9072 [Andrew Or] Guard against potential NPE in debug logging mode
fabe4c4 [Andrew Or] Reuse more code in YarnClientSchedulerBackend
3f941dc [Andrew Or] First cut at simplifying the Client (stable and alpha)
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
### What changes were proposed in this pull request?
Currently, if any "Xmx" string is present in the driver Java options, the job submission fails, even when it is not related to the max heap setting.
For example, this command fails:
`./bin/spark-submit --class org.apache.spark.examples.SparkPi --conf "spark.driver.extraJavaOptions=-Dtest=Xmx"  examples/jars/spark-examples_2.12-3.4.1.jar 100`

### Why are the changes needed?
The command should only fail if the `-Xmx` argument is actually set, not if it is part of another parameter or string.
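In other words, the validation should look for a real `-Xmx` token rather than the substring "Xmx" anywhere in the options string. A minimal sketch of that idea (the helper name and placement are hypothetical, not the actual code changed by this PR):

```scala
object JavaOptionsCheckSketch {
  // True only when an actual -Xmx flag is present (e.g. "-Xmx4g"),
  // not when "Xmx" merely appears inside another option such as "-Dtest=Xmx".
  def setsMaxHeap(driverJavaOptions: String): Boolean =
    driverJavaOptions.split("\\s+").exists(_.startsWith("-Xmx"))
}

// JavaOptionsCheckSketch.setsMaxHeap("-Dtest=Xmx")     // false
// JavaOptionsCheckSketch.setsMaxHeap("-Xmx2g -Dx=1")   // true
```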

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test.

Closes apache#41806 from ashangit/nfraison/SPARK-44242.

Authored-by: Nicolas Fraison <[email protected]>

Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
Co-authored-by: Nicolas Fraison <[email protected]>