[SPARK-611] Display executor thread dumps in web UI #2944

JoshRosen · 2014-10-26T03:09:57Z

This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI.

The thread dumps are collected using Thread.getAllStackTraces(). To allow remote thread dumps to be triggered from the web UI, I added a new ExecutorActor that runs inside of the Executor actor system and responds to RPCs from the driver. The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor. Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.
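To make the host/port lookup concrete, here is a minimal sketch of how an address for a remote actor could be assembled once the host and port of an executor's actor system are known. The `akka.tcp://system@host:port/user/name` layout is Akka's standard remote path format, and `sparkExecutor` is the actor system name that appears in the CoarseGrainedExecutorBackend diff in this thread. The actor name `ExecutorActor` under `/user` and the helper object are assumptions for illustration, not the patch's actual code.

```scala
object ExecutorActorAddress {
  // Actor system name as it appears in the CoarseGrainedExecutorBackend diff.
  val executorActorSystemName = "sparkExecutor"

  // Assumed actor name; the name actually registered by the patch may differ.
  val assumedActorName = "ExecutorActor"

  /** Builds an Akka remote actor path string from a host and port. */
  def actorPath(host: String, port: Int): String =
    s"akka.tcp://$executorActorSystemName@$host:$port/user/$assumedActorName"
}
```

A driver-side caller would resolve such a path to an actor reference through its own actor system; that step needs the Akka dependency and is omitted here.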

Screenshots:

This patch allows executor thread dumps to be viewed in the Spark web UI. Thread dumps obtained from Thread.getAllStackTraces() are piggybacked on the periodic executor -> driver heartbeats. JobProgressListener stores these heartbeats for display in the UI. One current limitation is that the driver thread dumps are not viewable except when running in local mode.

JoshRosen · 2014-10-26T03:13:18Z

core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala

@@ -47,7 +47,7 @@ private[spark] class LocalActor(

  private var freeCores = totalCores

-  private val localExecutorId = "localhost"
+  private val localExecutorId = "<driver>"


While working on this, I noticed that the links to view the driver thread dumps weren't working. In local mode, it looks like we weren't using the same name for the local executor id in this Executor and in the driver's SparkEnv, so I fixed that here.

SparkQA · 2014-10-26T03:14:57Z

Test build #22228 has started for PR 2944 at commit 8c10216.

This patch merges cleanly.

JoshRosen · 2014-10-26T03:22:13Z

One subtle issue that I've run into is that the driver always runs a block manager but only runs an Executor in local mode. So, the "executors" tab in the web UI is slightly misleading when running in a cluster mode, since the driver doesn't run a regular executor. It would be nice to have driver / application thread dumps in this UI, too, so I wonder if there's a clean way to fix this.

JoshRosen · 2014-10-26T03:27:34Z

Executor IDs are strings, so I should probably check whether they'll need to be url-encoded; I guess this depends on which components create these strings.

SparkQA · 2014-10-26T03:47:48Z

Test build #22228 has finished for PR 2944 at commit 8c10216.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")

AmplabJenkins · 2014-10-26T03:47:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22228/
Test FAILed.

SparkQA · 2014-10-27T19:04:55Z

Test build #22301 has started for PR 2944 at commit cc3e6b3.

This patch merges cleanly.

AmplabJenkins · 2014-10-27T20:37:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22301/
Test FAILed.

shaneknapp · 2014-10-27T21:01:46Z

jenkins, test this please

SparkQA · 2014-10-27T21:02:36Z

Test build #22305 has started for PR 2944 at commit cc3e6b3.

This patch merges cleanly.

SparkQA · 2014-10-27T22:29:06Z

Test build #22305 has finished for PR 2944 at commit cc3e6b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SparkListenerExecutorThreadDump(
- class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")

AmplabJenkins · 2014-10-27T22:29:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22305/
Test PASSed.

andrewor14 · 2014-10-28T19:42:46Z

Wow, awesome!!

andrewor14 · 2014-10-28T19:43:12Z

This is even easier to read than the raw jstack output

shivaram · 2014-10-28T19:44:17Z

@JoshRosen This is super awesome!

JoshRosen · 2014-10-28T21:29:11Z

It looks like executorIds are assigned by the cluster manager, so in principle they could be arbitrary strings but in practice they seem to not contain special characters that would need special escaping (such as spaces). The application master log viewer (LogPage.scala) doesn't perform any URL-encoding or escaping of executorIds, so I'm not sure that we need to do it here.
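If encoding did turn out to be necessary, a minimal sketch using the JDK's `URLEncoder` might look like the following. The link layout `threadDump/?executorId=...` and the helper names are hypothetical and only illustrate the idea. Note that the `<driver>` identifier from the LocalBackend change above is one ID that does contain characters an encoder would rewrite.

```scala
import java.net.{URLDecoder, URLEncoder}

object ExecutorIdEncoding {
  /** Builds a (hypothetical) thread dump link with the executor ID URL-encoded. */
  def threadDumpLink(executorId: String): String =
    "threadDump/?executorId=" + URLEncoder.encode(executorId, "UTF-8")

  /** Reverses the encoding when the page reads the request parameter by hand. */
  def decode(param: String): String =
    URLDecoder.decode(param, "UTF-8")
}
```

Encoding and decoding are symmetric, so IDs made only of letters and digits pass through unchanged.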

shivaram · 2014-10-28T21:43:25Z

Do you know how large the thread dump typically is? I'm concerned this might make the heartbeat too large.

shivaram · 2014-10-28T21:49:39Z

The other idea I had was that we could just open a port on the executor and have a web UI on it. This could also display the executor's stderr (which is very painful to get to right now) and have links to get thread stack traces, memory histograms, etc. However, this might be a larger change.

JoshRosen · 2014-10-28T22:25:21Z

@shivaram That's a good point RE: the size of the thread dumps. I can now imagine problems where a thread-leak in an executor causes the heartbeat to become huge and leads to a job failure when the heartbeat exceeds the Akka frame size.
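One rough way to get a feel for the numbers is to render a dump as text and compare its byte size against a frame limit. The sketch below does that for the current JVM. The 10 MB limit is an assumed value for illustration only (the real limit depends on the configured Akka frame size), and the text rendering is not the serialization format Spark actually sends.

```scala
import java.nio.charset.StandardCharsets

object DumpSizeCheck {
  // Assumed limit for illustration only; not read from any Spark configuration.
  val assumedFrameLimitBytes: Long = 10L * 1024 * 1024

  /** Renders the stack traces of all live threads in this JVM as plain text. */
  def renderDump(): String = {
    val sb = new StringBuilder
    val it = Thread.getAllStackTraces.entrySet.iterator
    while (it.hasNext) {
      val entry = it.next()
      sb.append(entry.getKey.getName).append('\n')
      entry.getValue.foreach(frame => sb.append("  at ").append(frame.toString).append('\n'))
    }
    sb.toString
  }

  /** Size of a text payload in bytes when encoded as UTF-8. */
  def sizeInBytes(text: String): Long =
    text.getBytes(StandardCharsets.UTF_8).length.toLong

  /** True if the payload would fit under the given limit. */
  def fitsInFrame(text: String, limitBytes: Long = assumedFrameLimitBytes): Boolean =
    sizeInBytes(text) <= limitBytes
}
```

Because the size grows with the number of live threads, a thread leak would push this figure up on every heartbeat, which is the failure mode described above.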

SparkQA · 2014-10-28T22:29:55Z

Test build #22383 has started for PR 2944 at commit 2b8bdf3.

This patch merges cleanly.

JoshRosen · 2014-10-28T22:44:55Z

I like the idea of running a separate UI server on the executor, but this seems like a much more involved change that will take a lot more design review. For example, we'd have to consider how those web UIs will be secured, which ports they will bind to, etc.

As a shorter-term fix, how about de-coupling the thread dumps from the heartbeats so that huge thread dumps won't cause heartbeats to be lost? If we do this, I might be able to add a driver -> executor RPC path to allow thread-dumps to be triggered from the web UI.

shivaram · 2014-10-28T22:56:58Z

Yes - I think having a separate RPC sounds good for now.

JoshRosen · 2014-10-28T22:57:33Z

Upon closer inspection, there's not a general driver -> executor RPC path that I can use to send arbitrary Akka messages to executors. To keep this PR simple and narrow in scope, I'm just going to add a separate RPC.

SparkQA · 2014-10-28T23:51:22Z

Test build #22383 has finished for PR 2944 at commit 2b8bdf3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SparkListenerExecutorThreadDump(
- class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")

AmplabJenkins · 2014-10-28T23:51:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22383/
Test PASSed.

SparkQA · 2014-10-31T18:43:47Z

Test build #22616 has finished for PR 2944 at commit 19707b0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
- case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
- class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")

AmplabJenkins · 2014-10-31T18:43:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22616/
Test FAILed.

andrewor14 · 2014-10-31T21:46:30Z

retest this please

SparkQA · 2014-10-31T21:50:21Z

Test build #22644 has started for PR 2944 at commit 19707b0.

This patch merges cleanly.

SparkQA · 2014-10-31T23:03:42Z

Test build #22644 has finished for PR 2944 at commit 19707b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
- case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
- class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")

AmplabJenkins · 2014-10-31T23:03:45Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22644/
Test PASSed.


JoshRosen · 2014-11-02T05:24:18Z

I've pushed a new commit to fix the merge conflict here. Could someone review this latest revision?

SparkQA · 2014-11-02T05:24:42Z

Test build #22744 has started for PR 2944 at commit f719266.

This patch merges cleanly.

SparkQA · 2014-11-02T06:42:49Z

Test build #22744 has finished for PR 2944 at commit f719266.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
- case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
- class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")
- class DecimalType(DataType):
- case class UnscaledValue(child: Expression) extends UnaryExpression
- case class MakeDecimal(child: Expression, precision: Int, scale: Int) extends UnaryExpression
- case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true)
- case class PrecisionInfo(precision: Int, scale: Int)
- case class DecimalType(precisionInfo: Option[PrecisionInfo]) extends FractionalType
- final class Decimal extends Ordered[Decimal] with Serializable
- trait DecimalIsConflicted extends Numeric[Decimal]

AmplabJenkins · 2014-11-02T06:42:52Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22744/
Test PASSed.

andrewor14 · 2014-11-03T19:21:17Z

core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

@@ -131,7 +131,8 @@ private[spark] object CoarseGrainedExecutorBackend extends Logging {
      // Create a new ActorSystem using driver's Spark properties to run the backend.
      val driverConf = new SparkConf().setAll(props)
      val (actorSystem, boundPort) = AkkaUtils.createActorSystem(
-        "sparkExecutor", hostname, port, driverConf, new SecurityManager(driverConf))
+        SparkEnv.executorActorSystemName,


Nice, we had this variable but we just never used it.

andrewor14 · 2014-11-03T19:59:14Z

Hey I just left a few relatively minor comments but the overall approach looks good. It's great to see that we're not doing this through the block manager interface. In the long run we may want to move more of the existing stuff into ExecutorActor so we don't do everything through the block manager actors.

JoshRosen · 2014-11-03T21:51:37Z

core/src/main/scala/org/apache/spark/util/Utils.scala

+    Thread.getAllStackTraces.toArray.sortBy(_._1.getId).map {
+      case (thread, stackElements) =>
+        val stackTrace = stackElements.map(_.toString).mkString("\n")
+        ThreadStackTrace(thread.getId, thread.getName, thread.getState.toString, stackTrace)


Just spotted a consistency issue here: thread.getState is grabbed at a different time than the stack elements.
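One way to avoid reading the state and the stack at different moments is to take both from a single `java.lang.management.ThreadInfo` object, which the JDK returns from `ThreadMXBean.dumpAllThreads`. The sketch below shows the idea. The `ThreadSnapshot` case class and its field names are assumptions for illustration and are not the `ThreadStackTrace` class from this patch.

```scala
import java.lang.management.ManagementFactory

// Hypothetical record type for illustration; not the class used in the patch.
final case class ThreadSnapshot(id: Long, name: String, state: String, stackTrace: String)

object ThreadDumpSketch {
  /** Returns one snapshot per live thread, sorted by thread ID. */
  def dumpAllThreads(): Seq[ThreadSnapshot] = {
    val bean = ManagementFactory.getThreadMXBean
    // Arguments are (lockedMonitors, lockedSynchronizers); false skips lock details.
    bean.dumpAllThreads(false, false).toSeq
      .filter(_ != null)
      .sortBy(_.getThreadId)
      .map { info =>
        // State and stack trace are both read from the same ThreadInfo object.
        ThreadSnapshot(
          info.getThreadId,
          info.getThreadName,
          info.getThreadState.toString,
          info.getStackTrace.map(_.toString).mkString("\n"))
      }
  }
}
```

Calling `ThreadDumpSketch.dumpAllThreads()` in any JVM returns at least the calling thread.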


JoshRosen · 2014-11-03T22:48:24Z

Alright, I took another pass on this:

- Rename ThreadDumpPage -> ExecutorThreadDumpPage
- Make page private[ui]
- Make TriggerThreadDump into a case object
- Rename fields in ThreadStackTrace
- Use ThreadMXBean to obtain thread dumps instead of Thread.getAllStackTraces()
- Remove documentation of spark.ui.threadDumpsEnabled configuration, but leave the option as an internal configuration
- Guard against exceptions in SparkContext.getExecutorThreadDump()
- Disable thread dump page and button in history server.
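The exception guard mentioned in the list above can be sketched as a small wrapper that turns a failed remote fetch into `None`, so the UI can show a "could not fetch" message instead of an error page. The helper name and the `Option` return shape are assumptions for illustration; this is not the body of `SparkContext.getExecutorThreadDump()`.

```scala
import scala.util.control.NonFatal

object SafeThreadDump {
  /** Evaluates `fetch` and returns None instead of propagating a non-fatal failure. */
  def getOrNone[T](fetch: => T): Option[T] =
    try Some(fetch)
    catch { case NonFatal(_) => None }
}
```

Matching on `NonFatal` leaves errors such as `OutOfMemoryError` untouched, since it is usually better to let those propagate.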

SparkQA · 2014-11-03T22:49:59Z

Test build #22831 has started for PR 2944 at commit 3c21a5d.

This patch merges cleanly.

SparkQA · 2014-11-04T00:10:43Z

Test build #22831 has finished for PR 2944 at commit 3c21a5d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
- case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster

AmplabJenkins · 2014-11-04T00:10:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22831/
Test PASSed.

andrewor14 · 2014-11-04T00:19:53Z

LGTM, I can't wait to use this feature myself!

andrewor14 · 2014-11-04T02:18:06Z

Ok I'm merging this.

This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI.

The thread dumps are collected using Thread.getAllStackTraces(). To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver. The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor. Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.

Screenshots:

![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)

Author: Josh Rosen <[email protected]>

Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following commits:

- 3c21a5d [Josh Rosen] Address review comments:
- 880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
- f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
- 19707b0 [Josh Rosen] Add one comment.
- 127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
- b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
- 3dfc2d4 [Josh Rosen] Add missing file.
- bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
- f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
- dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
- 4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
- 2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode.
- cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
- 87b8b65 [Josh Rosen] Add new listener event for thread dumps.
- 8c10216 [Josh Rosen] Add missing file.
- 0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI

JoshRosen added 2 commits October 25, 2014 19:59

Add missing file.

8c10216

JoshRosen reviewed Oct 26, 2014

JoshRosen added 2 commits October 27, 2014 11:55

Add new listener event for thread dumps.

87b8b65

Fix test code in DAGSchedulerSuite.

cc3e6b3

Enable thread dumps from the driver when running in non-local mode.

2b8bdf3

Merge remote-tracking branch 'origin/master' into jstack-in-web-ui

f719266

Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala

andrewor14 reviewed Nov 3, 2014

Merge remote-tracking branch 'origin/master' into jstack-in-web-ui

880f7f7

JoshRosen reviewed Nov 3, 2014

asfgit closed this in 4f035dd Nov 4, 2014
