[SPARK-13587] [PYSPARK] Support virtualenv in pyspark #13599
Conversation
Test build #60293 has finished for PR 13599 at commit
Test build #60294 has finished for PR 13599 at commit
private[spark] class PythonWorkerFactory(pythonExec: String,
    envVars: Map[String, String],
    conf: SparkConf)
Maybe as below:
private[spark] class PythonWorkerFactory(
pythonExec: String,
envVars: Map[String, String],
conf: SparkConf)
@davies @JoshRosen Do you have time to review it? @stibbons has another PR #14180 based on this to support wheelhouse, but I'm afraid putting these together would be too big to review. So it would be better to finish this PR first and then work on wheelhouse support.
Test build #64616 has finished for PR 13599 at commit
Test build #64617 has finished for PR 13599 at commit
@zjffdu any news?
ping @davies @JoshRosen, do you have time to review this approach of virtualenv support for PySpark? It would be very much appreciated.
One point I'm reminded of looking at this: the lack of tests for important functionality is probably not acceptable. I understand testing this in Jenkins could be complicated, but a test suite that requires a YARN resource and is marked as skipped for Jenkins would still be better than nothing.
Test build #91437 has finished for PR 13599 at commit
retest this please
Test build #91438 has finished for PR 13599 at commit
Hi, thanks for all the work on this! I see requirements.txt mentioned here and there and, browsing this and other JIRAs, it seems to be the proposed way to specify dependencies in PySpark. As you probably know, the community has rallied around Pipfiles as a replacement for requirements.txt. This has a few upsides (including a lock file), the main one being that the reference implementation (Pipenv) allows for installing packages into a new virtualenv directly, without having to activate it or run other commands. So that combines dependency management, reproducibility, and environment isolation. (Also, if one doesn't want said packages to be installed in a venv, there's an argument to install them system-wide.) I'm not proposing this PR gets extended to support Pipfiles, I just wanted to ask if this has been considered and is on the roadmap, since it seems to be the successor to requirements.txt. (We stumbled upon this as we were thinking of moving to Kubernetes and didn't know how dependencies were handled there [they aren't, yet, see #21092]. We could install dependencies in our target Docker images using Pipfiles, but submitting a Pipfile with our individual jobs would be a much cleaner solution.) Thanks!
Thanks for the interest in this PR and the info about Pipfiles.
@holdenk and @zjffdu, I believe manual tests are okay if it's difficult to write a test. We can manually test and expose this as an experimental feature too. BTW, I believe we can still have some tests to check if, for example, at least the string is properly constructed - https://github.com/apache/spark/pull/13599/files#r175670974? I think that could be enough for now. Somehow I happened to look into this multiple times over the past few years, and I think it's better to go ahead than just blocking here.
@JoshRosen, I heard that you took a look at this before. Do you have any concerns to address, maybe?
@@ -218,6 +218,115 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries
to executors.

# VirtualEnv for PySpark
@zjffdu, mind if I ask you to describe this as an experimental feature that is very likely to be unstable and still evolving?
Thanks @HyukjinKwon, the doc is updated.
Will take a closer look within a few days.
docs/submitting-applications.md
Here are several ways to install packages

{% highlight python %}
sc.install_packages("numpy")  # install the latest numpy
Seems there are tabs here. Shall we replace them with spaces?
in 2 scenarios:
* Batch mode (submit spark app via spark-submit)
* Interactive mode (PySpark shell or other third party Spark Notebook)
Ah, maybe we can leave a note at the end instead of adding it in the title.
Note that this is an experimental feature added in Spark 2.4.0 and may evolve in future versions.
python/pyspark/context.py
Install python packages on all executors and driver through pip. pip will be installed
by default no matter using native virtualenv or conda. So it is guaranteed that pip is
available if virtualenv is enabled.
:param packages: string for single package or a list of string for multiple packages
Shall we add:
.. versionadded:: 2.3.0
.. note:: Experimental
python/pyspark/context.py
else:
    self._conf.set("spark.pyspark.virtualenv.packages", ":".join(packages))

import functools
Can we move this up within this function?
Test build #91519 has finished for PR 13599 at commit
Test build #91892 has finished for PR 13599 at commit
Test build #91891 has finished for PR 13599 at commit
Thank you so much for this!!!
I would like to see appropriate changes to account for the two cluster managers that are in need of such a solution. For Kubernetes, there need to be some considerations made towards options for baking the requirements.txt file into the Docker image itself, and maybe setting a conf value instead of user.home. I also think it would be best if touching the SchedulerBackend could be avoided, as the change doesn't necessarily carry over to Kubernetes, but I don't see a work-around at the moment. Furthermore, this work is missing unit and integration tests that I think are important to show completeness and correctness. @holdenk to comment. I also included some nits.
If it is alright, I might add some commits to this PR to allow for Kubernetes support (including the appropriate integration tests).
require(virtualEnvType == "native" || virtualEnvType == "conda",
  s"VirtualEnvType: $virtualEnvType is not supported.")
require(new File(virtualEnvBinPath).exists(),
  s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't exist.")
In addition, how are we handling the case of an existing s"$virtualEnvBinPath/$virtualEnvName"?
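A minimal sketch of one way that case could be guarded; the variable names mirror the quoted snippet, but the fail-fast policy (as opposed to reusing the environment) is only an illustration, not necessarily what the PR does.

import java.io.File

object ExistingEnvCheckSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative values; in the PR these come from the Spark configuration.
    val virtualEnvBinPath = "/usr/bin/virtualenv"
    val virtualEnvName = "virtualenv_app_0001"

    val target = new File(s"$virtualEnvBinPath/$virtualEnvName")
    // One possible policy: fail fast with an actionable message instead of
    // silently overwriting or reusing a directory we did not create.
    require(!target.exists(),
      s"$target already exists; remove it or pick a different virtualenv name.")
  }
}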
// 2. created outside yarn container. Spark need to create temp directory and clean it after app
// finish.
// - driver of PySpark shell
// - driver of yarn-client mode
In the case of Kubernetes:
This will be created in the base spark-py Docker image, which is shared between the driver and executors, and the containers will be cleaned up upon termination of the job via owner labels (for the executors) and the k8s API server (for the driver).
As such (hopefully with client-mode support being completed soon), the logic below should hold as well.
Is this work going to be cluster-manager agnostic? Or is this supposed to only support YARN? I would like to see this be applicable to all first-class cluster-management systems.
I can help with appending to this PR: k8s support and the appropriate integration tests.
// Use the absolute path of requirement file in the following cases
// 1. driver of pyspark shell
// 2. driver of yarn-client mode
// otherwise just use filename as it would be downloaded to the working directory of Executor
In the Kubernetes world, I might want to use a requirements.txt file that is stored locally in the base Docker image, regardless of client or cluster mode. Is that something that you think should be supported? Maybe a config variable spark.pyspark.virtualenv.kubernetes.localrequirements that points to a file stored as local:///var/files/requirements.txt, for example.
Furthermore, when we introduce a Resource Staging Server that allows us to stage files locally, this setting will be interchangeable between something that is locally baked in vs. staged.
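If that route were taken, wiring it up might look roughly like the sketch below. Note that spark.pyspark.virtualenv.kubernetes.localrequirements is only the key proposed in this thread, not an existing Spark setting.

import org.apache.spark.SparkConf

object LocalRequirementsConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.pyspark.virtualenv.enabled", "true")
      // Hypothetical key from this review comment; the local:// scheme marks a
      // file already present inside the Docker image.
      .set("spark.pyspark.virtualenv.kubernetes.localrequirements",
        "local:///var/files/requirements.txt")

    println(conf.get("spark.pyspark.virtualenv.kubernetes.localrequirements"))
  }
}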
// 2. requirement file is not specified. (Interactive mode).
// In this case `spark.pyspark.virtualenv.python_version` must be specified.

if (pysparkRequirements.isDefined) {
Please re-write as a .map(...).getOrElse(...); if (....isDefined) is not idiomatic Scala.
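A tiny, self-contained illustration of the pattern being asked for; the Option value and the command strings are stand-ins, not the PR's actual code.

object MapGetOrElseSketch {
  def main(args: Array[String]): Unit = {
    val pysparkRequirements: Option[String] = Some("requirements.txt")

    // Derive a value from the Option and supply a default, instead of
    // branching on .isDefined and then calling .get.
    val pipCommand: String = pysparkRequirements
      .map(req => s"pip install -r $req")
      .getOrElse("pip --version") // the fallback is purely illustrative

    println(pipCommand)
  }
}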
"install", "-r", pysparkRequirements.get)) | ||
} | ||
// install additional packages | ||
if (initPythonPackages.isDefined) { |
Use .foreach, not isDefined (see the sketch below).
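For the side-effect-only case quoted above, foreach avoids both isDefined and get. A minimal sketch with a stand-in command runner:

object ForeachOptionSketch {
  // Stand-in for the PR's real command runner.
  def execCommand(cmd: List[String]): Unit = println(cmd.mkString(" "))

  def main(args: Array[String]): Unit = {
    val initPythonPackages: Option[String] = Some("numpy:pandas")

    // Runs the body only when the Option is non-empty; no .isDefined / .get.
    initPythonPackages.foreach { packages =>
      execCommand(List("pip", "install") ++ packages.split(":").toList)
    }
  }
}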
// requirement file for native is not mandatory, run this only when requirement file
// is specified.
execCommand(List(virtualPythonExec, "-m", "pip",
  "--cache-dir", System.getProperty("user.home"),
How does this logic carry across cluster managers? Has this been considered for Mesos use cases? In Kubernetes this should be fine, but we should also have some documentation about this somewhere and test it with an integration test.
* Create virtualenv using native virtualenv or conda
*
*/
def setupVirtualEnv(): String = {
Might be better to pass args into this function so that it could be properly unit-tested (a sketch follows). It seems that there are no unit tests for this class, so that seems to be a necessary addition.
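A hedged sketch of the idea: lift the inputs into parameters and keep the command construction pure, so a unit test can assert on the returned command without touching a real cluster. The names and flags below are illustrative, not the PR's actual signature.

// Illustrative only; not the PR's VirtualEnvFactory API.
final case class VirtualEnvSettings(
    virtualEnvType: String, // "native" or "conda"
    virtualEnvBinPath: String,
    requirementsFile: Option[String])

object SetupVirtualEnvSketch {
  // Pure function: easy to unit-test by asserting on the returned command.
  def createEnvCommand(settings: VirtualEnvSettings, envDir: String): Seq[String] =
    settings.virtualEnvType match {
      case "conda"  => Seq(settings.virtualEnvBinPath, "create", "--prefix", envDir, "--yes")
      case "native" => Seq(settings.virtualEnvBinPath, envDir)
      case other    => throw new IllegalArgumentException(s"Unsupported virtualenv type: $other")
    }

  def main(args: Array[String]): Unit = {
    val cmd = createEnvCommand(
      VirtualEnvSettings("native", "/usr/bin/virtualenv", Some("requirements.txt")),
      envDir = "/tmp/spark-virtualenv")
    // A unit test would simply assert on this sequence.
    println(cmd.mkString(" "))
  }
}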
.orElse(sparkConf.get(PYSPARK_PYTHON))
.orElse(sys.env.get("PYSPARK_DRIVER_PYTHON"))
.orElse(sys.env.get("PYSPARK_PYTHON"))
.getOrElse("python")

if (sparkConf.getBoolean("spark.pyspark.virtualenv.enabled", false)) {
  val virtualEnvFactory = new VirtualEnvFactory(pythonExec, sparkConf, true)
  pythonExec = virtualEnvFactory.setupVirtualEnv()
+1
if (args.isPython) {
  if (clusterManager != YARN &&
      args.sparkProperties.getOrElse("spark.pyspark.virtualenv.enabled", "false") == "true") {
    printErrorAndExit("virtualenv is only supported in yarn mode")
+1 for Kubernetes
""" | ||
Install python packages on all executors and driver through pip. pip will be installed | ||
by default no matter using native virtualenv or conda. So it is guaranteed that pip is | ||
available if virtualenv is enabled. |
This will only be the case in Kubernetes if you specify the spark-py image. So this will need to be expanded per cluster manager.
Is there any work being done on this PR at this point in time?
Hi. I think that until we get a core developer really interested in packaging Python dependencies correctly, this PR won't evolve a lot.
I'm interested in us fixing this, especially after yesterday, when I spent several hours working with workaround hacks. But I want us to do something that is not YARN-specific and does not involve a large slowdown on worker creation.
@holdenk What we're doing in some of our products currently is that we require users to create their Python environments up front and store them on a file system that is accessible to all physical nodes. This is partly for performance and partly because our compute nodes don't have external network connectivity, i.e. we can't resolve dependencies from our workers. Then, when we spin up containers, we volume-mount the appropriate file system into our containers and have logic in our entry point scripts that activates the relevant environment prior to starting Spark, Dask Distributed, or whatever Python job we're actually launching. We're doing this with Spark standalone clusters currently, but I expect much the same approach would work for Kubernetes and other resource managers.
if (isLauncher ||
    (isDriver && conf.get("spark.submit.deployMode") == "client")) {
  val virtualenvBasedir = Files.createTempDir()
  virtualenvBasedir.deleteOnExit()
the temporary directory is not being deleted on exit.
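For context, java.io.File.deleteOnExit() only removes a directory that is empty at JVM shutdown, so once the virtualenv has been created inside it the call is effectively a no-op. A minimal sketch of one alternative using a shutdown hook that deletes recursively (inside Spark itself, the existing shutdown-hook/cleanup utilities would be the natural home for this):

import java.io.File
import java.nio.file.{Files => JFiles}
import scala.util.Try

object TempDirCleanupSketch {
  // File.delete() cannot remove a non-empty directory, so walk the tree first.
  def deleteRecursively(file: File): Unit = {
    if (file.isDirectory) {
      Option(file.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    file.delete()
  }

  def main(args: Array[String]): Unit = {
    val virtualenvBasedir = JFiles.createTempDirectory("spark-virtualenv-").toFile

    // Runs at JVM shutdown even if the directory is non-empty by then.
    Runtime.getRuntime.addShutdownHook(new Thread(() => {
      Try(deleteRecursively(virtualenvBasedir))
      ()
    }))

    println(s"virtualenv base dir: $virtualenvBasedir")
  }
}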
Since there haven't been any updates for ~6 months, and it seems people haven't really agreed on the scope of the feature, I'm closing the PR. It can be reopened by just updating the branch.
Hi,
It's supported via a different approach now. See https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
Yes, but this is a less dynamic approach, since the user is expected to pre-pack all packages (an ~800 MB file) instead of supplying a 10-line text file.
Sure, but no approach is going to be perfect. If you want dynamic package resolution, your best option would be to run in containers and build that capability into your containers' entry points somehow. Even then this has drawbacks, in that every driver/executor needs to download and install packages at startup, which can create big increases in startup time and/or lead to application failure if the driver/executors take different amounts of time to start up, leading to connection timeouts. The dynamic approach also fails for air-gapped environments (very common in my $dayjob). With the "official" Spark approach you only have to pay that cost once, and can do so somewhere with the necessary network connectivity to download all the packages you need. Once your environment is packaged, that cost is paid and you can re-use it as many times as you need. Your only real gotcha is that the OS environment where you build the environment needs to match the OS environment where you want to run it, or you can find you have OS-dependent packages that won't work.
What changes were proposed in this pull request?
Support virtualenv in pyspark as described in SPARK-13587
https://docs.google.com/document/d/1KB9RYW8_bSeOzwVqZFc_zy_vXqqqctwrU5TROP_16Ds/edit?usp=sharing
How was this patch tested?
Manually verified on CentOS and macOS, but not on Windows yet.
Here are the scenarios I have verified (a rough configuration sketch follows the list):
Run conda on yarn-client mode
Run virtualenv on yarn-client mode
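For readers skimming the thread, a rough sketch of how the feature is driven: everything hangs off spark.pyspark.virtualenv.* configuration keys. Only keys quoted elsewhere in this conversation are shown; the values are illustrative, and the full set of options is in the design doc linked above.

import org.apache.spark.SparkConf

object VirtualenvEnableSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("pyspark-virtualenv-example")
      .set("spark.pyspark.virtualenv.enabled", "true")          // turn the feature on
      .set("spark.pyspark.virtualenv.python_version", "3.6")    // required in interactive mode
      .set("spark.pyspark.virtualenv.packages", "numpy:pandas") // ':'-separated package list

    conf.getAll
      .filter { case (k, _) => k.startsWith("spark.pyspark.virtualenv") }
      .foreach { case (k, v) => println(s"$k=$v") }
  }
}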