[SPARK-12699][SPARKR] R driver process should start in a clean state #10652

Closed · wants to merge 3 commits

Conversation

felixcheung (Member)

Currently the R worker process is launched with the --vanilla option, which brings it up in a clean state (without an init profile or workspace data; see https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html). However, the R process for the Spark driver is not.

We should do the same for the driver because:

  1. It would make the driver consistent with the R worker process - for instance, a library might otherwise be loaded in the driver but not in the worker.
  2. SparkR depends on .libPaths() and .First(), so it could, for example, be broken by something in the user's workspace.

Here are the proposed changes (a sketch of the resulting command line follows below):

  1. When starting the sparkR shell (except that saving/restoring the workspace is still allowed, since the driver/shell is local)
  2. When launching the R driver in cluster mode
  3. In cluster mode, when calling R to install a shipped R package

This is discussed in PR #10171
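
For illustration, here is a minimal Scala sketch of what assembling such a clean-state R command line could look like. The names (RCleanStateSketch, buildRCommand) are hypothetical and not the actual Spark code, and the exact executable and arguments used by RRunner or the sparkR shell may differ.

```scala
object RCleanStateSketch {
  // Startup flags used in this PR to bring R up in a (mostly) clean state:
  // no saved workspace, no site file, no environment file, no restored data.
  private val cleanStateFlags =
    Seq("--no-save", "--no-site-file", "--no-environ", "--no-restore")

  // Hypothetical helper: prepend the clean-state flags to an R invocation.
  def buildRCommand(rExecutable: String, extraArgs: String*): Seq[String] =
    rExecutable +: (cleanStateFlags ++ extraArgs)

  def main(args: Array[String]): Unit = {
    // Prints: R --no-save --no-site-file --no-environ --no-restore --file=my_job.R
    println(buildRCommand("R", "--file=my_job.R").mkString(" "))
  }
}
```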

@shivaram @sun-rui

@felixcheung (Member, Author)

jenkins, retest this please

@@ -36,7 +36,8 @@ private[deploy] object RPackageUtils extends Logging {
   private final val hasRPackage = "Spark-HasRPackage"

   /** Base of the shell command used in order to install R packages. */
-  private final val baseInstallCmd = Seq("R", "CMD", "INSTALL", "-l")
+  private final val baseInstallCmd = Seq("R", "--no-save", "--no-site-file", "--no-environ",
+    "--no-restore", "CMD", "INSTALL", "-l")
Contributor

I guess these options do not make sense for R package installation?

Member Author

I actually think they do - it's easier to install the package in a clean state than to debug a job that failed because the package failed to install.

Contributor

This is just an installation that does not start an R session, so these options won't be used?

Member Author

It actually would load the same site file, saved session, etc. when launching R with R CMD - look for "R CMD" in https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html
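
To make that concrete, here is a hypothetical sketch of how the new baseInstallCmd could be turned into the full install command line; libDir and pkgPath are illustrative placeholders, and the actual assembly in RPackageUtils may differ.

```scala
object InstallCmdSketch {
  // Mirrors the new baseInstallCmd from the diff above; the clean-state flags
  // are passed to the R front end before "CMD INSTALL".
  private val baseInstallCmd = Seq("R", "--no-save", "--no-site-file", "--no-environ",
    "--no-restore", "CMD", "INSTALL", "-l")

  def main(args: Array[String]): Unit = {
    // Illustrative paths only; the real values come from the job's staged files.
    val libDir = "/tmp/sparkr-libs"
    val pkgPath = "/tmp/shipped-r-package"
    val fullCmd = baseInstallCmd ++ Seq(libDir, pkgPath)
    // Prints:
    // R --no-save --no-site-file --no-environ --no-restore CMD INSTALL -l /tmp/sparkr-libs /tmp/shipped-r-package
    println(fullCmd.mkString(" "))
  }
}
```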

@shivaram (Contributor)

So I'm not completely sure this is a good idea. Users might have their own R environment setup scripts in their home directory (the site-file or init-file from the R docs you linked to) that they expect to work on the driver side. The executor side is much more limited in terms of what code runs there (i.e. it is invisible to the user), so I don't think the same expectations apply to it?

@felixcheung (Member, Author)

The driver could also be running in YARN cluster mode, in which case a clean state might make sense?
To me this is just about reducing the level of variability, and this was brought up in PR #10171.
I could also change this to apply the flags only to the driver in cluster mode, but not to the sparkR shell.

@shivaram (Contributor)

Yeah doing it just for the cluster mode driver seems fine to me.

@SparkQA commented Jan 20, 2016

Test build #49735 has finished for PR 10652 at commit 78eb194.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor) commented Jan 20, 2016

RRunner is not only for running the driver on a cluster; it is also used for running an R script locally in client mode.

@felixcheung (Member, Author)

@sun-rui do you mean spark-submit foo.R?

@sun-rui (Contributor) commented Jan 20, 2016

@felixcheung, yes, something like that

@felixcheung (Member, Author)

I don't know if there is a way to distinguish that.
It could be spark-submit, or the SparkSubmit class could be invoked from Oozie with the job running in YARN client mode, in which case the driver is actually running on a worker node - possibly the same node that is running executors.

I guess we could explicitly bypass this when the cluster manager is LOCAL?

@sun-rui (Contributor) commented Jan 20, 2016

It is possible to get the deploy mode from "spark.submit.deployMode" and check whether it is "client". You can take a look at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L49
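
A minimal sketch of that kind of check, assuming the property can be read from the JVM system properties; the helper name isClientDeployMode is hypothetical, and the actual code in RUtils.scala may obtain the setting differently (e.g. via SparkConf).

```scala
object DeployModeSketch {
  // Defaults to "client" when the property is not set, per the suggestion above.
  def isClientDeployMode: Boolean =
    sys.props.getOrElse("spark.submit.deployMode", "client") == "client"

  def main(args: Array[String]): Unit = {
    println(s"client deploy mode? $isClientDeployMode")
  }
}
```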

@felixcheung (Member, Author)

I realize that; my point is that even in client mode the driver could be running on a worker machine, as in the case where the Spark job is submitted from another YARN app.

To elaborate, one possible source of issues is running a Spark job in YARN client mode from a workflow engine (e.g. Oozie). In such a case, the driver/client actually runs on an arbitrary worker node of the cluster.

If we think picking up a random profile that way is OK, then I guess I could change it to add the flags only when deployMode is cluster.
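
If the flags do end up being applied only in cluster mode, the conditional assembly could look roughly like the following hedged sketch; deployMode would come from spark.submit.deployMode as discussed above, and all names here are illustrative rather than the actual Spark code.

```scala
object ConditionalFlagsSketch {
  private val cleanStateFlags =
    Seq("--no-save", "--no-site-file", "--no-environ", "--no-restore")

  // Add the clean-state flags only when the driver runs inside the cluster,
  // so a local client-mode driver still picks up the user's own profile.
  def driverCommand(rExecutable: String, deployMode: String, extraArgs: String*): Seq[String] = {
    val flags = if (deployMode == "cluster") cleanStateFlags else Seq.empty[String]
    rExecutable +: (flags ++ extraArgs)
  }

  def main(args: Array[String]): Unit = {
    // "cluster" gets the flags, "client" does not.
    println(driverCommand("R", "cluster", "--file=driver.R").mkString(" "))
    println(driverCommand("R", "client", "--file=driver.R").mkString(" "))
  }
}
```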
