Spark on Kubernetes - basic scheduler backend #498
Conversation
I think this is excellent content-wise for an initial push. We could have considered moving the executor failure handling to a separate commit, but I'm not strongly opinionated either way.
import org.apache.spark.{SparkConf, SparkException}
import org.apache.spark.internal.Logging

object ConfigurationUtils extends Logging {
Mark with private[spark]
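For example, a minimal sketch of the suggested change, assuming the object lives under org.apache.spark.deploy.kubernetes:

```scala
package org.apache.spark.deploy.kubernetes

import org.apache.spark.internal.Logging

// private[spark] keeps the helper visible to everything under
// org.apache.spark while hiding it from user code.
private[spark] object ConfigurationUtils extends Logging
```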
import org.apache.spark.internal.Logging

object ConfigurationUtils extends Logging {
  def parseKeyValuePairs(
Where do we use this, at least in the context of this PR?
requestExecutorsService)

private val driverPod = try {
  kubernetesClient.pods().inNamespace(kubernetesNamespace).
Might prefer to stylize a bit differently:
kubernetesClient.pods()
  .inNamespace(kubernetesNamespace)
  .withName(kubernetesDriverPodName)
  .get
// by the executor pod watcher. If the loss reason was discovered by the watcher,
// inform the parent class with removeExecutor.
val disconnectedPodsByExecutorIdPendingRemovalCopy =
  Map.empty ++ disconnectedPodsByExecutorIdPendingRemoval
I think there might be a more idiomatic way to copy here. Can you try disconnectedPodsByExecutorIdPendingRemoval.toMap? As an experiment, explicitly set the type of disconnectedPodsByExecutorIdPendingRemovalCopy to Map[String, Pod]. Basically, as long as this collection is not mutable, we'll be safe.
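For reference, a minimal self-contained sketch of the suggested idiom:

```scala
import scala.collection.mutable

import io.fabric8.kubernetes.api.model.Pod

val disconnectedPodsByExecutorIdPendingRemoval = mutable.Map.empty[String, Pod]

// .toMap materializes an immutable snapshot, so later mutations of the
// source map cannot leak into the copy; the explicit type ascription
// verifies at compile time that the copy is the immutable Map.
val disconnectedPodsByExecutorIdPendingRemovalCopy: Map[String, Pod] =
  disconnectedPodsByExecutorIdPendingRemoval.toMap
```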
import org.apache.spark.deploy.kubernetes.constants
import org.apache.spark.network.netty.SparkTransportConf

class ExecutorPodFactoryImplSuite extends SparkFunSuite with BeforeAndAfter {
Nit: ExecutorPodFactorySuite. Our convention so far has been to name the test after the trait rather than the impl, since there's only ever one impl for each trait. But this might change upstream.
Checkstyle on Java is what's causing the build failures.
Compared this with branch-2.2-kubernetes. Looks reasonable to me. Just one minor comment below. PTAL.
private val EXECUTOR_PODS_BY_IPS_LOCK = new Object
// Indexed by executor IP addrs and guarded by EXECUTOR_PODS_BY_IPS_LOCK
private val executorPodsByIPs = new mutable.HashMap[String, Pod]
private val podsWithKnownExitReasons: concurrent.Map[String, ExecutorExited] =
@foxish @varunkatta This part seems a bit outdated compared with branch-2.2-kubernetes, which uses a ConcurrentHashMap on this line.
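A sketch of the ConcurrentHashMap-backed version (the enclosing class name and package are made up for illustration; the package must sit under org.apache.spark because ExecutorExited is private[spark]):

```scala
package org.apache.spark.scheduler.cluster.kubernetes

import java.util.concurrent.ConcurrentHashMap

import scala.collection.JavaConverters._
import scala.collection.concurrent

import org.apache.spark.scheduler.ExecutorExited

class ExitReasonTrackingSketch {
  // asScala wraps the ConcurrentHashMap as a scala.collection.concurrent.Map,
  // making reads and writes thread-safe without an external lock.
  private val podsWithKnownExitReasons: concurrent.Map[String, ExecutorExited] =
    new ConcurrentHashMap[String, ExecutorExited]().asScala
}
```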
Addressed comments. Thanks @mccheah and @kimoonkim!
Unit tests look good. There's not much else that can be tested at this point. If there are no further comments, I'll turn this into a PR against upstream tomorrow at 5pm.
Sounds good. BTW, is there a reason for the private[spark] scope? I think it would be better to start with a more constrained scope, like deploy or even k8s. But we could see if anyone complains :)
@felixcheung The relevant parts of #491 are already covered in this PR.
So I see - this is good to go.
...Once the build is fixed, of course =)
It should; for example, sql has private[sql] and mesos has private[mesos]:
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
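Illustrative only (the object names here are made up): Scala's package-qualified private modifiers give progressively narrower visibility, which is what the deploy/k8s suggestion is about:

```scala
package org.apache.spark.deploy.k8s

// Visible anywhere under org.apache.spark:
private[spark] object SparkWideHelper

// Visible only under org.apache.spark.deploy and its subpackages:
private[deploy] object DeployScopedHelper

// Visible only within org.apache.spark.deploy.k8s:
private[k8s] object K8sScopedHelper
```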
Force-pushed from 8b57934 to f6fdd6a.
 * options for different components.
 */
private[spark] object SparkKubernetesClientFactory {
nit: Clean up this whitespace, or at least be consistent.
This is being reviewed in apache#19468
- Move Kubernetes client calls out of synchronized blocks to prevent locking with HTTP connection lag
- Fix a bug where pods that fail to launch through the API are not retried
- Remove the map from executor pod name to executor ID by using the pod's labels to get the same information without having to track extra state
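A minimal sketch of the first bullet's pattern (all names here are illustrative, not the PR's actual code): snapshot shared state while holding the lock, then make the slow HTTP call to the API server outside of it.

```scala
import scala.collection.mutable

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.KubernetesClient

class ExecutorPodCleanupSketch(kubernetesClient: KubernetesClient) {
  private val RUNNING_EXECUTOR_PODS_LOCK = new Object
  private val runningExecutorPods = mutable.Map.empty[String, Pod]

  def deleteAllExecutorPods(): Unit = {
    // Copy and clear the shared map while holding the lock...
    val podsToDelete = RUNNING_EXECUTOR_PODS_LOCK.synchronized {
      val snapshot = runningExecutorPods.values.toList
      runningExecutorPods.clear()
      snapshot
    }
    // ...then issue the Kubernetes API calls without the lock, so a laggy
    // HTTP connection cannot block other threads that need it.
    podsToDelete.foreach(pod => kubernetesClient.pods().delete(pod))
  }
}
```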
Force-pushed from df03462 to 0ab9ca7.
Force-pushed from e5a6a67 to cb12fec.
Could you check the integration test? It seems to have failed to build.
@felixcheung we shouldn't be running integration tests when we submit upstream. Also, shouldn't we instead be tracking apache#19717, and thus this PR be closed?
This PR is tracking the changes in the upstream one. It's the same branch used for both PRs, and is simply for tracking.
Closing this one as the upstream PR has been merged.
Continuing #492
Stripped out a lot of extraneous things to create this. Our first PR upstream will likely be similar to this. (Note that it is created against the master branch, which is up to date.)
Following PRs will have:
etc
TODO before we can retarget this to apache/spark:master:
cc @ash211 @mccheah @apache-spark-on-k8s/contributors