[SPARK-11035][core] Add in-process Spark app launcher. #19591

Closed
wants to merge 5 commits into master from vanzin:SPARK-11035

Conversation

@vanzin
Contributor

vanzin commented Oct 27, 2017

This change adds a new launcher that allows applications to be run
in a separate thread in the same process as the calling code. To
achieve that, some code from the child process implementation was
moved to abstract classes that implement the common functionality,
and the new launcher inherits from those.

The new launcher was added as a new class, instead of being implemented
as a new option on the existing SparkLauncher, to avoid ambiguous
APIs. For example, SparkLauncher has ways to set the child app's
environment, modify SPARK_HOME, or control the logging of the
child process, none of which apply to in-process apps.

The in-process launcher has limitations: it needs Spark in the
context class loader of the calling thread, and it's bound by
Spark's current limitation of a single client-mode application
per JVM. It also relies on the recently added SparkApplication
trait to make sure different apps don't mess up each other's
configuration, so config isolation is currently limited to cluster mode.

I also chose to keep the same socket-based communication for in-process
apps, even though it might be possible to avoid it in that mode. That
helps both implementations share more code.

Tested with new and existing unit tests, and with a simple app that
uses the launcher; also made sure the app ran fine with an older launcher
jar to check binary compatibility.
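
For context, here is a minimal sketch of how a caller might use the new
launcher (the master, jar path, and main class below are hypothetical; the
handle API is the existing SparkAppHandle):

import org.apache.spark.launcher.InProcessLauncher;
import org.apache.spark.launcher.SparkAppHandle;

public class InProcessExample {
  public static void main(String[] args) throws Exception {
    // Runs the app on a thread inside this JVM; Spark must be visible
    // through this thread's context class loader. Cluster mode avoids
    // the single-client-mode-app-per-JVM limitation described above.
    SparkAppHandle handle = new InProcessLauncher()
      .setMaster("yarn")                   // assumed cluster manager
      .setDeployMode("cluster")
      .setAppResource("/path/to/app.jar")  // hypothetical jar
      .setMainClass("com.example.MyApp")   // hypothetical main class
      .startApplication();

    // Poll the handle as with SparkLauncher; a state listener could be
    // passed to startApplication() instead.
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
  }
}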

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83137 has finished for PR 19591 at commit 928a052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class AbstractAppHandle implements SparkAppHandle
  • public abstract class AbstractLauncher<T extends AbstractLauncher>
  • class ChildProcAppHandle extends AbstractAppHandle
  • class InProcessAppHandle extends AbstractAppHandle
  • public class InProcessLauncher extends AbstractLauncher<InProcessLauncher>
  • public class SparkLauncher extends AbstractLauncher<SparkLauncher>
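
The AbstractLauncher<T extends AbstractLauncher> signature above is the
self-typed builder pattern; a minimal sketch (illustrative names, not the
actual Spark code) of why the type parameter is there:

abstract class Launcher<T extends Launcher<T>> {
  private String mainClass;

  @SuppressWarnings("unchecked")
  public T setMainClass(String mainClass) {
    this.mainClass = mainClass;
    return (T) this;  // safe as long as each subclass binds T to itself
  }
}

final class MyLauncher extends Launcher<MyLauncher> {
  public MyLauncher setExtraOption(String opt) {
    return this;  // subclass-specific setters still chain
  }
}

// Chained calls keep the concrete type:
//   new MyLauncher().setMainClass("com.example.Main").setExtraOption("x");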

@justinuang

Really looking forward to this PR! For our use case, it will reduce our spark launch times by ~4 seconds.

@vanzin
Contributor Author

vanzin commented Nov 2, 2017

A note about the implementation: since this executes SparkSubmit under the covers, a call to the new InProcessLauncher can end up exiting the current JVM when there's an error in the user-provided args. That's not optimal, but to avoid growing the current PR too much I'll leave the proper fix for a separate change.
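
Until that separate change lands, a caller worried about this could trap
the exit itself. A sketch of one common workaround (an assumption on my
part, not part of this PR): install a SecurityManager that turns
System.exit() into an exception.

import java.security.Permission;

// Hypothetical guard, not part of this PR: convert System.exit() calls
// made by SparkSubmit into an exception the caller can catch.
class ExitTrap extends SecurityManager {
  @Override
  public void checkPermission(Permission perm) {
    // Allow everything else; only exit attempts are intercepted.
  }

  @Override
  public void checkExit(int status) {
    throw new SecurityException("Intercepted System.exit(" + status + ")");
  }
}

// Assumed usage: System.setSecurityManager(new ExitTrap()) before calling
// InProcessLauncher.startApplication(), removing it afterwards.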

@SparkQA

SparkQA commented Nov 10, 2017

Test build #83655 has finished for PR 19591 at commit 8496024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Nov 13, 2017

I'll leave this up a little longer to see if anyone volunteers to review, otherwise I'll ping some random people.

@vanzin
Contributor Author

vanzin commented Nov 16, 2017

@tgravescs (who reviewed the original change for this bug), @srowen @jerryshao

@tgravescs
Contributor

Ack, will try to get to this tomorrow.

@vanzin
Contributor Author

vanzin commented Nov 30, 2017

Ping.

@vanzin
Contributor Author

vanzin commented Dec 4, 2017

retest this please

@SparkQA

SparkQA commented Dec 4, 2017

Test build #84442 has finished for PR 19591 at commit 8496024.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 5, 2017

retest this please

@SparkQA

SparkQA commented Dec 5, 2017

Test build #84498 has finished for PR 19591 at commit 8496024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2017

Test build #84629 has finished for PR 19591 at commit 8496024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2017

Test build #84657 has finished for PR 19591 at commit 8013766.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 8, 2017

retest this please

@SparkQA

SparkQA commented Dec 9, 2017

Test build #84666 has finished for PR 19591 at commit 8013766.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 9, 2017

Looks like a legitimate flaky test. Will take a look.

@SparkQA

SparkQA commented Dec 9, 2017

Test build #84677 has finished for PR 19591 at commit ee4098b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 11, 2017

retest this please

@SparkQA

SparkQA commented Dec 11, 2017

Test build #84721 has finished for PR 19591 at commit ee4098b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 12, 2017

retest this please

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84783 has finished for PR 19591 at commit ee4098b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 12, 2017

Failure looks unrelated... retest this please

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84792 has finished for PR 19591 at commit ee4098b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezasafi
Contributor

rezasafi left a comment

This PR is generally well-written and easy to understand. Just a single comment:

if (builder.isClientMode(builder.getEffectiveConfig())) {
  LOG.warning("It's not recommended to run client-mode applications using InProcessLauncher.");
}

Contributor

Maybe just a LOG.debug showing that an in-process app has started would be useful.

Contributor Author

You'll already get a ton of logs from SparkSubmit (or an exception if it doesn't run).

@vanzin
Contributor Author

vanzin commented Dec 13, 2017

retest this please

@tgravescs
Contributor

@vanzin, sorry, I've been swamped and haven't had a chance to get to this. It's still on my list, but I can't guarantee a time frame.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84877 has finished for PR 19591 at commit ee4098b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito left a comment

Overall this looks good; mostly I just have small questions to make sure I understand what's going on. I'll do another pass tomorrow; I need to look more closely at the tests.

I assume by this:

since this is executing SparkSubmit under the covers, it's possible to call the new InProcessLauncher and cause it to exit the current JVM because there's an error with the user-provided args. That's not optimal, but to avoid growing the current PR too much I'll leave the proper fix for that to a separate change.

you're talking about removing the System.exit() in SparkSubmit?

class InProcessAppHandle extends AbstractAppHandle {

  private static final Logger LOG = Logger.getLogger(ChildProcAppHandle.class.getName());
  private static final ThreadFactory THREAD_FACTORY = new NamedThreadFactory("spark-app-%d");

Contributor

How about including builder.appName in the thread name? It might be useful if you're trying to monitor a bunch of threads.

(though you'd also be launching in cluster mode in that case, so the thread wouldn't be doing much ...)
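
A rough sketch of this suggestion (field and helper names here are
assumptions, not the PR's code):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical naming scheme: tag each app thread with the app's name so
// thread dumps with several in-process apps stay readable.
final class AppThreads {
  private static final AtomicLong THREAD_IDS = new AtomicLong();

  static Thread newAppThread(Runnable app, String appName) {
    Thread t = new Thread(app);
    t.setName(String.format("spark-app-%d: '%s'", THREAD_IDS.incrementAndGet(), appName));
    return t;
  }
}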

return;
}

State currState = getState();

Contributor

This was added just because currState is no longer accessible, right? You're not particularly trying to grab the state before the call to disconnect()? Might be clearer to move it after; otherwise, on first read it looks like it's trying to grab the state before it's modified by disconnect().

LOG.warning("kill() may leave the underlying app running in in-process mode.");
disconnect();

// Interrupt the thread. This is not guaranteed to kill the app, though.

Contributor

In cluster mode, shouldn't this be pretty safe? If not, wouldn't it be a bug in spark-submit?

Contributor Author

It depends on the implementation of the client that does the submission, not spark-submit, but it should be safe and you could consider it a bug if it doesn't work.

Contributor

Sorry, I don't understand. I don't see why the client that does the submission would matter; I thought you'd have problems if the interrupt was caught and swallowed by spark-submit.

Contributor Author

SparkSubmit just runs some other class, in this case, the class that submits the app in cluster mode (or the user class in client mode). And that class could swallow these interrupts, just as SparkSubmit also could (but doesn't?).

Contributor

OK, just a misunderstanding then; that's what I thought I was saying in the first place.

Anyway, I guess we can leave the warning here for now in all cases and see if we get any reports from users in cluster mode...
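
For reference, a simplified sketch of what interrupt-based kill() amounts
to for an in-process app (names assumed; the real handle code differs):

// Interrupting is best effort: if the submission code or the user code
// swallows the InterruptedException, the app keeps running.
final class InProcessKillSketch {
  private Thread appThread;  // the thread running the Spark app

  synchronized void kill() {
    if (appThread != null) {
      appThread.interrupt();
    }
  }
}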

synchronized (InProcessAppHandle.this) {
  if (!isDisposed()) {
    State currState = getState();
    disconnect();

Contributor

same comment here on ordering of getState() & disconnect()

int ival = b >= 0 ? b : Byte.MAX_VALUE - b;
if (ival < 0x10) {
  sb.append("0");
while (true) {

Contributor

Checking my understanding: is there an extra bug fix here? Even in the old code, if by chance two apps had the same secret, you'd end up losing one handle?

Contributor Author

Not really, it's mostly moving the logic that existed before (look for while (server.pending.containsKey(secret))).
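
For readers without the diff open, a sketch of that pre-existing logic
(the map type and secret size are assumptions):

import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Keep regenerating a random hex secret until it is not already a key in
// the pending map, so two apps can never share a secret.
final class SecretSketch {
  private final SecureRandom random = new SecureRandom();
  private final Map<String, Object> pending = new ConcurrentHashMap<>();

  String createSecret() {
    while (true) {
      byte[] raw = new byte[16];
      random.nextBytes(raw);
      StringBuilder sb = new StringBuilder();
      for (byte b : raw) {
        sb.append(String.format("%02x", b));
      }
      String secret = sb.toString();
      if (!pending.containsKey(secret)) {
        return secret;
      }
    }
  }
}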

- void unregister(ChildProcAppHandle handle) {
-   pending.remove(handle.getSecret());
+ void unregister(AbstractAppHandle handle) {
+   for (Map.Entry<String, AbstractAppHandle> e : pending.entrySet()) {

Contributor

squito commented Dec 21, 2017

You could add a System.identityHashCode-based "secret" to InProcessAppHandle to keep the old version, though this is fine too.

Contributor

Never mind, stupid idea.

Contributor

OK, one more try:

You are generating a secret whether it's in-process or a child process, so why not store that secret in AbstractAppHandle? This isn't really a big deal, since the efficiency difference doesn't matter, but the one-liner pending.remove(handle.getSecret()) is also easier to follow.

Also, could you rename pending to secretToPendingApps so it's clearer what the key is?

Contributor Author

I think I had something like that at some point, but it was more code than the current version... there's a little bit of a chicken-and-egg problem between handles and secrets, and keeping them separate simplified things a bit, at least for me.

I'll do the rename.
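
For clarity, a self-contained sketch of the identity-based unregister being
discussed (completing the truncated diff above, with the agreed rename;
types simplified):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-process handles don't carry their own secret, so the map is scanned
// for the exact handle instance instead of removing by key.
final class LauncherServerSketch {
  private final Map<String, Object> secretToPendingApps = new ConcurrentHashMap<>();

  void unregister(Object handle) {
    for (Map.Entry<String, Object> e : secretToPendingApps.entrySet()) {
      if (e.getValue() == handle) {
        secretToPendingApps.remove(e.getKey());
        break;
      }
    }
  }
}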

@@ -49,6 +49,15 @@
* </pre>
*
* <p>
* Applications can also be launched in-process by using
* {@link org.apache.spark.launcher.InProcessLauncher} instead. Launching applications in-process

Contributor

The comment above about "there is only one entry point" is a bit incorrect now.

@squito
Contributor

squito left a comment

OK, after a fresh look I think this is pretty much fine. A couple of questions and suggestions for code clarity.

@vanzin
Contributor Author

vanzin commented Dec 21, 2017

you're talking about removing the System.exit() in SparkSubmit?

Yes, and also potentially changing error messages currently printed to the terminal into exceptions when running through the launcher.

@squito
Contributor

squito commented Dec 21, 2017

lgtm

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85279 has finished for PR 19591 at commit 5dd5b5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Dec 28, 2017

merged to master

@asfgit asfgit closed this in cfcd746 Dec 28, 2017
@vanzin vanzin deleted the SPARK-11035 branch January 5, 2018 22:34