[SPARK-13587] [PYSPARK] Support virtualenv in pyspark #13599

Closed
wants to merge 1 commit into from

Conversation

zjffdu
Contributor

@zjffdu zjffdu commented Jun 10, 2016

What changes were proposed in this pull request?

Support virtualenv in PySpark, as described in SPARK-13587:
https://docs.google.com/document/d/1KB9RYW8_bSeOzwVqZFc_zy_vXqqqctwrU5TROP_16Ds/edit?usp=sharing

How was this patch tested?

Manually verified on CentOS and macOS, but not yet on Windows.

Here are the scenarios I have verified:

Run conda in yarn-client mode:

bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/conda"  ~/work/virtualenv/spark.py

Run virtualenv in yarn-client mode:

bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=native" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/requirements.txt" --conf "spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/virtualenv"  ~/work/virtualenv/spark.py

@SparkQA

SparkQA commented Jun 10, 2016

Test build #60293 has finished for PR 13599 at commit 344c769.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 10, 2016

Test build #60294 has finished for PR 13599 at commit ef03af9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private[spark] class PythonWorkerFactory(pythonExec: String,
envVars: Map[String, String],
conf: SparkConf)
Member

Maybe as below:

private[spark] class PythonWorkerFactory(
    pythonExec: String,
    envVars: Map[String, String],
    conf: SparkConf)

@zjffdu
Contributor Author

zjffdu commented Aug 23, 2016

@davies @JoshRosen Do you have time to review it? @stibbons has another PR #14180 based on this one to support wheelhouse, but I'm afraid that putting these together would be too big to review. So it would be better to finish this PR first and then work on wheelhouse support.

@zjffdu zjffdu force-pushed the SPARK-13587 branch 2 times, most recently from 6334b0b to 31cb42c on August 30, 2016 02:06
@SparkQA

SparkQA commented Aug 30, 2016

Test build #64616 has finished for PR 13599 at commit 5c1a183.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2016

Test build #64617 has finished for PR 13599 at commit 31cb42c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor

gsemet commented Sep 6, 2016

@zjffdu any news?

@zjffdu
Contributor Author

zjffdu commented Sep 7, 2016

Ping @davies @JoshRosen, do you have time to review this approach to virtualenv support for PySpark? It would be much appreciated.

@holdenk
Contributor

holdenk commented May 30, 2018

One point I'm reminded of looking at this: the lack of tests for important functionality is probably not acceptable. I understand that testing this in Jenkins could be complicated, but a test suite that requires a YARN resource and is marked as skipped for Jenkins would still be better than nothing.
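
For illustration, a minimal ScalaTest sketch of such a skippable suite, assuming a hypothetical SPARK_TEST_YARN_CLUSTER environment variable that signals a reachable YARN cluster (none of these names come from the PR):

import org.scalatest.funsuite.AnyFunSuite

// Hypothetical suite: skipped via assume() when no YARN cluster is available
// (e.g. on Jenkins), but runnable manually against a real cluster.
class VirtualenvYarnSuite extends AnyFunSuite {
  private val yarnAvailable = sys.env.contains("SPARK_TEST_YARN_CLUSTER")

  test("PySpark job with spark.pyspark.virtualenv.enabled=true runs on YARN") {
    assume(yarnAvailable, "no YARN cluster available; skipping")
    // Here one would spark-submit a small PySpark script with the virtualenv
    // confs from this PR and assert that it exits successfully.
  }
}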

@SparkQA

SparkQA commented Jun 4, 2018

Test build #91437 has finished for PR 13599 at commit 9f32a2f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
  • class DriverEndpoint(override val rpcEnv: RpcEnv)

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jun 4, 2018

Test build #91438 has finished for PR 13599 at commit 9f32a2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
  • class DriverEndpoint(override val rpcEnv: RpcEnv)

@kokes

kokes commented Jun 7, 2018

Hi, thanks for all the work on this! I see requirements.txt mentioned here and there and, browsing this and other JIRAs, it seems to be the proposed way to specify dependencies in PySpark. As you probably know, the community has rallied around Pipfiles as a replacement for requirements.txt.

This has a few upsides (including a lock file), the main one being that the reference implementation (Pipenv) allows for installing packages into a new virtualenv directly, without having to activate it or run other commands. So that combines dependency management, reproducibility, and environment isolation.

(Also, if one doesn't want said packages to be installed in a venv, there's an argument to install them system-wide.)

I'm not proposing this PR gets extended to support Pipfiles, I just wanted to ask if this has been considered and is on the roadmap, since it seems to be the successor to requirements.txt.

(We stumbled upon this as we were thinking of moving to Kubernetes and didn't know how dependencies were handled there [they aren't, yet, see #21092]. We could install dependencies in our target Docker images using Pipfiles, but submitting a Pipfile with our individual jobs would be a much cleaner solution.)

Thanks!

@zjffdu
Contributor Author

zjffdu commented Jun 7, 2018

Thanks for the interest in this PR and the info about Pipfiles. I think we could support that after this PR gets merged, so that we can provide users with more options for virtualenv based on their environment.

@HyukjinKwon
Member

@holdenk and @zjffdu, I believe manual tests are a-okay if it's difficult to write a test. We can manually test and expose this as an experimental feature too.

BTW, I believe we can still have some tests to check that, for example, at least the command string is properly constructed - https://github.com/apache/spark/pull/13599/files#r175670974? I think that could be enough for now. Somehow I happened to look into this multiple times over the past few years, and I think it's better to go ahead than to just block here.
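
As an example, a minimal sketch of such a check, assuming the pip command construction were factored out into a small helper; the helper name below is hypothetical, not the PR's actual API:

import org.scalatest.funsuite.AnyFunSuite

// Hypothetical helper standing in for the pip command that VirtualEnvFactory
// builds inline in the PR.
object PipCommandBuilder {
  def build(virtualPythonExec: String, cacheDir: String, requirements: String): Seq[String] =
    Seq(virtualPythonExec, "-m", "pip", "--cache-dir", cacheDir, "install", "-r", requirements)
}

class PipCommandBuilderSuite extends AnyFunSuite {
  test("pip install command is constructed from the requirements file") {
    val cmd = PipCommandBuilder.build("/tmp/env/bin/python", "/home/user", "requirements.txt")
    assert(cmd === Seq("/tmp/env/bin/python", "-m", "pip",
      "--cache-dir", "/home/user", "install", "-r", "requirements.txt"))
  }
}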

@HyukjinKwon
Member

@JoshRosen, I heard that you took a look at this before. Do you have any concerns to address, maybe?

@@ -218,6 +218,115 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries
to executors.

# VirtualEnv for PySpark
Member

@zjffdu, mind if I ask you to describe this as an experimental feature that is very likely to be unstable and is still evolving?

Contributor Author

Thanks @HyukjinKwon, the doc is updated.

Member

@HyukjinKwon HyukjinKwon left a comment

Will take a closer look within a few days.

Here are several ways to install packages

{% highlight python %}
sc.install_packages("numpy") # install the latest numpy
Member

Seems there are tabs here. Shall we replace them with spaces?

in 2 scenarios:
* Batch mode (submit spark app via spark-submit)
* Interactive mode (PySpark shell or other third party Spark Notebook)

Member

Ah, maybe we can leave a note at the end instead of adding it in the title.

Note that this is an experimental feature added in Spark 2.4.0 and may evolve in future versions.

Install Python packages on all executors and the driver through pip. pip will be installed
by default whether native virtualenv or conda is used, so it is guaranteed that pip is
available if virtualenv is enabled.
:param packages: a string for a single package, or a list of strings for multiple packages
Member

Shall we add:

.. versionadded:: 2.3.0
.. note:: Experimental

else:
self._conf.set("spark.pyspark.virtualenv.packages", ":".join(packages))

import functools
Member

Can we move this up within this function?

@SparkQA

SparkQA commented Jun 7, 2018

Test build #91519 has finished for PR 13599 at commit 3c02852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
  • class DriverEndpoint(override val rpcEnv: RpcEnv)

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91892 has finished for PR 13599 at commit d708997.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
  • class DriverEndpoint(override val rpcEnv: RpcEnv)

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91891 has finished for PR 13599 at commit 3fb67cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
  • class DriverEndpoint(override val rpcEnv: RpcEnv)

Contributor

@ifilonenko ifilonenko left a comment

Thank you so much for this!!!
I would like to see appropriate changes to account for the 2 cluster managers that are in need of such a solution. For Kubernetes, there need to be some considerations made towards options for baking the requirements.txt file into the Docker image itself and maybe setting a conf value instead of user.home. I also think that if touching the SchedulerBackend could be avoided, as the change doesn't necessarily carry over to Kubernetes, that would be best, but I don't see a workaround atm. Furthermore, this work is missing unit and integration tests, which I think are important to show completeness and correctness. @holdenk to comment. I also included some nits.

If it is alright, I might add some commits to this PR to allow for Kubernetes support (including the appropriate integration tests)

require(virtualEnvType == "native" || virtualEnvType == "conda",
s"VirtualEnvType: $virtualEnvType is not supported." )
require(new File(virtualEnvBinPath).exists(),
s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't exist.")
Contributor

In addition, how are we handling the case of an existing s"$virtualEnvBinPath/$virtualEnvName"?

// 2. created outside yarn container. Spark need to create temp directory and clean it after app
// finish.
// - driver of PySpark shell
// - driver of yarn-client mode
Contributor

In the case of Kubernetes:

This will be created in the base spark-py Docker image, which is shared between the driver and executors, and the containers will be cleaned up upon termination of the job via owner labels (for the executors) and the k8s API server (for the driver).

As such, (hopefully with client-mode support being completed soon), the below logic should hold as well.

Is this work going to be cluster-manager agnostic? Or is this supposed to support only YARN? I would like to see this be applicable to all first-class cluster-management systems.

I can help with appending to this PR: k8s support and the appropriate integration tests.

// Use the absolute path of requirement file in the following cases
// 1. driver of pyspark shell
// 2. driver of yarn-client mode
// otherwise just use filename as it would be downloaded to the working directory of Executor
Contributor

In the Kubernetes world, I might want to use a requirements.txt file that is stored locally in the base Docker image, regardless of client or cluster mode. Is that something you think should be supported? Maybe a config variable spark.pyspark.virtualenv.kubernetes.localrequirements that points to a file stored as local:///var/files/requirements.txt, for example.

Furthermore, when we introduce a Resource Staging Server that allows us to stage files locally, this setting will be interchangeable between something that is locally baked in vs. staged.

// 2. requirement file is not specified. (Interactive mode).
// In this case `spark.pyspark.virtualenv.python_version` must be specified.

if (pysparkRequirements.isDefined) {
Contributor

Please rewrite this as a .map(...).getOrElse(...), since if (x.isDefined) is not idiomatic Scala.
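
A rough sketch of that suggestion, reusing the names visible in the surrounding diff; the empty branch stands in for the interactive-mode case, which the PR handles separately:

pysparkRequirements.map { requirements =>
  // a requirements file was provided: install from it
  execCommand(List(virtualPythonExec, "-m", "pip",
    "--cache-dir", System.getProperty("user.home"),
    "install", "-r", requirements))
}.getOrElse {
  // interactive mode: no requirements file, so rely on
  // spark.pyspark.virtualenv.python_version instead
}

If nothing needs to happen in the None case, a plain .foreach over the Option expresses the same thing.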

"install", "-r", pysparkRequirements.get))
}
// install additional packages
if (initPythonPackages.isDefined) {
Contributor

Use .foreach rather than isDefined.
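
A rough sketch of that change, assuming initPythonPackages is an Option[String] holding the ":"-separated package list that the Python side stores in spark.pyspark.virtualenv.packages:

// install additional packages, if any were requested
initPythonPackages.foreach { packages =>
  execCommand(List(virtualPythonExec, "-m", "pip",
    "--cache-dir", System.getProperty("user.home"),
    "install") ++ packages.split(":"))
}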

// requirement file for native is not mandatory, run this only when requirement file
// is specified.
execCommand(List(virtualPythonExec, "-m", "pip",
"--cache-dir", System.getProperty("user.home"),
Contributor

How does this logic carry across cluster managers? Has this been considered for Mesos use cases? In Kubernetes this should be fine, but we should also have some documentation about this somewhere and test it with an integration test.

* Create virtualenv using native virtualenv or conda
*
*/
def setupVirtualEnv(): String = {
Contributor

Might be better to pass args into this function so that it could be properly unit-tested. It seems that there are no unit tests for this class, so that seems to be a necessary addition.
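
For illustration, one way to follow that suggestion is to pull the environment-creation command into a pure function whose inputs are explicit arguments, so it can be unit-tested without a SparkConf or a real virtualenv/conda binary. The object name and exact flags below are illustrative, not the PR's actual API:

import java.io.File

object VirtualEnvCommands {
  // Returning the command (instead of running it) is what makes this easy to
  // unit-test.
  def createEnvCommand(
      virtualEnvType: String,     // "native" or "conda"
      virtualEnvBinPath: String,  // path to the virtualenv or conda executable
      envDir: File,
      pythonVersion: String): Seq[String] = virtualEnvType match {
    case "native" =>
      Seq(virtualEnvBinPath, "-p", s"python$pythonVersion", envDir.getAbsolutePath)
    case "conda" =>
      Seq(virtualEnvBinPath, "create", "--prefix", envDir.getAbsolutePath,
        s"python=$pythonVersion", "--yes")
    case other =>
      throw new IllegalArgumentException(s"Unsupported virtualenv type: $other")
  }
}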

.orElse(sparkConf.get(PYSPARK_PYTHON))
.orElse(sys.env.get("PYSPARK_DRIVER_PYTHON"))
.orElse(sys.env.get("PYSPARK_PYTHON"))
.getOrElse("python")

if (sparkConf.getBoolean("spark.pyspark.virtualenv.enabled", false)) {
val virtualEnvFactory = new VirtualEnvFactory(pythonExec, sparkConf, true)
pythonExec = virtualEnvFactory.setupVirtualEnv()
Contributor

+1

if (args.isPython) {
if (clusterManager != YARN &&
args.sparkProperties.getOrElse("spark.pyspark.virtualenv.enabled", "false") == "true") {
printErrorAndExit("virtualenv is only supported in yarn mode")
Contributor

+1 for Kubernetes

"""
Install Python packages on all executors and the driver through pip. pip will be installed
by default whether native virtualenv or conda is used, so it is guaranteed that pip is
available if virtualenv is enabled.
Contributor

This will only be the case in Kubernetes if you specify the spark-py image, so this will need to be expanded per cluster manager.

@ifilonenko
Contributor

Is there any work being done on this PR at this point in time?

@gsemet
Contributor

gsemet commented Jul 3, 2018

Hi. I think that until we get a core developer really interested in packaging Python dependencies correctly, this PR won't evolve a lot.

@holdenk
Contributor

holdenk commented Jul 13, 2018

I'm interested in us fixing this, especially after yesterday, when I spent several hours working with workaround hacks. But I want us to do something that is not YARN-specific and does not involve a large slowdown in worker creation.

@rvesse
Member

rvesse commented Jul 17, 2018

@holdenk What we're doing in some of our products currently is that we require that users create their Python environments up front and that they be stored on a file system that is accessible to all physical nodes. This is partly for performance and partly because our compute nodes don't have external network connectivity, i.e. we can't resolve dependencies from our workers.

Then when we spin up containers we volume mount the appropriate file system into our containers and have logic in our entry point scripts that activates the relevant environment prior to starting Spark, Dask Distributed or whatever Python job we're actually launching.

We're doing this with Spark standalone clusters currently but I expect much the same approach would work for Kubernetes and other resource managers.

if (isLauncher ||
(isDriver && conf.get("spark.submit.deployMode") == "client")) {
val virtualenvBasedir = Files.createTempDir()
virtualenvBasedir.deleteOnExit()


the temporary directory is not being deleted on exit.
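
For context: File.deleteOnExit() only removes a directory that is still empty at JVM shutdown, and the created virtualenv fills this one with files. A minimal sketch of an alternative, using a shutdown hook that deletes recursively (Spark's own Utils.createTempDir registers created directories for cleanup in a similar way):

import java.nio.file.Files
import org.apache.commons.io.FileUtils

// Create the base directory and register a hook that removes it, with all of
// its contents, when the JVM exits.
val virtualenvBasedir = Files.createTempDirectory("virtualenv").toFile
sys.addShutdownHook {
  FileUtils.deleteQuietly(virtualenvBasedir)
}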

@vanzin
Contributor

vanzin commented Jan 25, 2019

Since there haven't been any updates for ~6 months, and it seems people haven't really agreed on the scope of the feature, I'm closing the PR. It can be reopened by just updating the branch.

@vanzin vanzin closed this Jan 25, 2019
@Amitg1

Amitg1 commented Apr 25, 2021

Hi,
What's the status of this feature? It's not clear whether it has been implemented or not.

@HyukjinKwon
Member

It's supported via a different approach now. See https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html

@Amitg1

Amitg1 commented Apr 25, 2021

Yes, but this is a less dynamic approach, since the user is expected to pre-pack all packages (an ~800 MB file) instead of supplying a 10-line text file.

@rvesse
Member

rvesse commented Apr 26, 2021

Sure, but no approach is going to be perfect.

If you want dynamic package resolution, your best option would be to run in containers and build that capability into your containers' entry points somehow. Even then this has drawbacks: every driver/executor needs to download and install packages at startup, which can greatly increase startup time and/or lead to application failure if the driver/executors take different amounts of time to start up, leading to connection timeouts. The dynamic approach also fails for air-gapped environments (very common in my $dayjob).

With the "official" Spark approach you only have to pay that cost once, and can do so somewhere with the necessary network connectivity to download all the packages you need. Once your environment is packaged that cost is paid and you can re-use it as many times as you need.

Your only real gotcha is that the OS environment where you build the environment needs to match the OS environment where you want to run it, or you may find you have OS-dependent packages that won't work.
