
[SPARK-16367][PYSPARK] Support for deploying Anaconda and Virtualenv environments in Spark Executors #14180

Closed
wants to merge 8 commits

Conversation

gsemet
Contributor

@gsemet gsemet commented Jul 13, 2016

What changes were proposed in this pull request?

Support automatic deployment of Anaconda or virtualenv environments on executors, and add wheel and wheelhouse support to PySpark, based on SPARK-13587 (see #13599).

Full description in SPARK-16367

The developer writes their Python code and describes its dependencies (numpy, pandas, requests, ...) in a requirements.txt file. When the job is deployed on the executors (and on the driver in cluster mode), the same environment is automatically set up using pip (or conda).
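For illustration, a minimal requirements.txt matching the packages named above (the pinned versions are taken from the test environment shown later in this thread; any pins would do):

    numpy==1.11.1
    pandas==0.18.1
    requests==2.11.1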

Spark submit looks like:

    bin/spark-submit                                                              \
        --master spark://sparkcluster:7077                                        \
        --deploy-mode client                                                      \
        --verbose                                                                 \
        --files requirements.txt,/path/to/mypackage.tar.gz                        \
        --conf "spark.pyspark.virtualenv.enabled=true"                            \
        --conf "spark.pyspark.virtualenv.system_site_packages=false"              \
        --conf "spark.eventLog.enabled=true"                                      \
        --conf "spark.eventLog.dir=/tmp/spark-events"                             \
        --conf "spark.pyspark.virtualenv.bin.path=virtualenv"                     \
        --conf "spark.pyspark.virtualenv.use_index=true"                          \
        --conf "spark.pyspark.virtualenv.requirements=requirements.txt"           \
        --conf "spark.pyspark.virtualenv.trusted_host=internal.pypi.mirror"       \
        --conf "spark.pyspark.virtualenv.upgrade_pip=true"                        \
        --conf "spark.pyspark.virtualenv.install_package=mypackage.tag.gz"        \
        --conf "spark.pyspark.virtualenv.index_url=https://internal.pypi.mirror/" \
        /path/to/runner.py

Content of runner.py:

import mypackage
mypackage.main()

(this runner script is needed because --class only works for Java/Scala, not for Python)

mypackage is built using the classic python setup.py sdist. It could be a wheel as well, even though distributing a source package for the job is safer.
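As an illustration, a minimal sketch of what mypackage's setup.py might look like (the name, version, and dependency list are hypothetical, mirroring the description above):

    # setup.py -- minimal packaging sketch for the hypothetical mypackage
    from setuptools import find_packages, setup

    setup(
        name="mypackage",
        version="0.0.1",
        packages=find_packages(),
        # Runtime dependencies; these should match requirements.txt.
        install_requires=["numpy", "pandas", "requests"],
    )

Running python setup.py sdist then produces dist/mypackage-0.0.1.tar.gz, the archive that --files and spark.pyspark.virtualenv.install_package point at.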

I have described this approach in a bit more detail here

As a bonus, this PR adds support for "wheelhouse" deployment, where all dependencies are packaged into a single archive so that no network connection is needed when Spark executors set up the environment.
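For reference, a wheelhouse can be built locally with pip's documented wheel subcommand (a sketch; it assumes the build host matches the executors' platform and Python version, otherwise binary wheels will not be usable there):

    # Build wheels for every requirement into ./wheelhouse, then archive them.
    pip wheel -r requirements.txt --wheel-dir wheelhouse/
    zip -r wheelhouse.zip wheelhouse/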

How was this patch tested?

Manually tested on Ubuntu with Spark Standalone. Seems to work on a YARN cluster as well.

@gsemet gsemet force-pushed the wheelhouse_support branch from 791d2bc to a378b8b on July 13, 2016 16:19
indent_size = 4

[*.scala]
indent_size = 2
Contributor

Unnecessary file?

Contributor Author

I saw there are differences in indentation between Scala and Python files; this allows most editors (Sublime, Atom, ...) to adapt automatically.
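For context, a minimal .editorconfig expressing that convention (a sketch reconstructed from the fragment above; the actual file in the PR may differ):

    root = true

    [*.py]
    indent_style = space
    indent_size = 4

    [*.scala]
    indent_style = space
    indent_size = 2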

Contributor

This might be better to do as a separate PR, since I could foresee us having some issues with the tabs configuration and wanting to do a revert (or vice versa).

@zjffdu
Contributor

zjffdu commented Aug 2, 2016

@stibbons, sorry for the late review. Let's work together on this PR in the next few days if that is OK for you.

@gsemet
Contributor Author

gsemet commented Aug 2, 2016

Yes, I'd be glad to! It is not fully ready yet; I still need to figure out how the script is launched in each situation.

@zjffdu
Contributor

zjffdu commented Aug 9, 2016

ping @stibbons Any updates?

@gsemet
Contributor Author

gsemet commented Aug 9, 2016

Yes, I am back from vacation! I can work on it now :)

@gsemet
Contributor Author

gsemet commented Aug 9, 2016

Opened #14567 with PEP 8 fixes, import reorganisation, and editorconfig.

@gsemet gsemet changed the title Wheelhouse and VirtualEnv support [SPARK-16367] Wheelhouse and VirtualEnv support Aug 10, 2016
@gsemet gsemet changed the title [SPARK-16367] Wheelhouse and VirtualEnv support [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support Aug 10, 2016
@gsemet gsemet force-pushed the wheelhouse_support branch from a378b8b to a37b9ee on August 10, 2016 11:26
@gsemet
Contributor Author

gsemet commented Aug 10, 2016

Rebased, without import reorg and editorconfig files. Still not fully validated.

@zjffdu
Contributor

zjffdu commented Aug 18, 2016

@stibbons Do you have time to continue this work ?

@gsemet
Contributor Author

gsemet commented Aug 18, 2016

Actually I was waiting for #14567 to be reviewed and merged :(

I might have some questions on how Spark deploys Python scripts on YARN or Mesos, if you know how that works.

@zjffdu
Contributor

zjffdu commented Aug 18, 2016

I can help if you have any questions regarding Spark on YARN. For Mesos, since not so many people use it, we may put it in another ticket.

@gsemet
Contributor Author

gsemet commented Aug 18, 2016

We are implementing Mesos here (it may take a while). While not so many people use it, on paper it looks great ;)

Please mail me at gaetan[a t]xeberon.net if that is easier for you (it is for me); this patch does not do the job completely for the moment :(

# where pip might not be installed
from pip.commands.install import InstallCommand as pip_InstallCommand
pip_InstallCommand().main(args=pip_args)

Contributor

Why install wheel files here? Shouldn't this be done in PythonWorkerFactory.scala?

Contributor Author

I need to dig a bit further, but there is some code that isn't executed in client mode (i.e. when the driver runs on the developer's machine).
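As an aside on the snippet under review: pip.commands.install was pip's internal, unstable API (it was removed in pip 10). A sketch of an equivalent invocation that relies only on pip's command-line interface (an illustration, not what this patch does):

    # Run pip as a subprocess instead of importing its internal API.
    import subprocess
    import sys

    def pip_install(pip_args):
        # Executes: <current python> -m pip install <args>; raises on failure.
        subprocess.check_call([sys.executable, "-m", "pip", "install"] + list(pip_args))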

@zjffdu
Contributor

zjffdu commented Aug 23, 2016

@stibbons After a second review of your PR, I have one concern: supporting both wheelhouse and virtualenv in one PR may be too big to review; it might be better to do it in 2 PRs. I will try to get some feedback on #13599 first, but will continue to look at this PR.

@gsemet
Contributor Author

gsemet commented Aug 23, 2016

It makes sense. If you manage to get this merged I can rebase with only my diff.

Too bad we cannot stack pull requests on GitHub :(

@gsemet gsemet force-pushed the wheelhouse_support branch from a37b9ee to c9bb018 on August 24, 2016 14:58
@gsemet
Contributor Author

gsemet commented Aug 24, 2016

This new version is meant to be rebased after #13599 is merged.

Here is my current state:

  • only Standalone and YARN supported. Mesos not supported
  • only tested with virtualenv/pip on standalone installation. Conda not tested
  • wheelhouse deployment works (i.e. all dependencies can be packaged into a single zip file and automatically and quickly installed on workers)
  • for example, deploying a package with numpy + pandas + scikit-learn is fast once the installation has been done at least once on all workers, and if the wheelhouse provides wheels for all required versions, pip installs everything very quickly and without an internet connection

I'd like to have the same ability to specify the entry point in Python that we have in Java/Scala with the --class argument of spark-submit, e.g. something like the generic runner sketched below. What is your opinion (this would be part of a future PR, of course)?
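For discussion, a hypothetical generic runner approximating a Python --class option (the module and function names are passed as arguments; this is a sketch, not part of the patch):

    # generic_runner.py -- hypothetical stand-in for a Python --class option
    import importlib
    import sys

    if __name__ == "__main__":
        module_name, func_name = sys.argv[1], sys.argv[2]
        # Resolve and call the requested entry point,
        # e.g. "python generic_runner.py mypackage main".
        entry_point = getattr(importlib.import_module(module_name), func_name)
        entry_point()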

@gsemet gsemet force-pushed the wheelhouse_support branch 3 times, most recently from e0f55aa to db65b61 on August 30, 2016 15:11
@gsemet
Contributor Author

gsemet commented Aug 30, 2016

Status for the "standalone install, client deployment" test:

  • virtualenv creation and pip install from a PyPI repository: OK (1 min 30 execution)
  • wheelhouse (PyPI repository): KO, because 'cffi' refuses the built wheel. Not related to this patch, but requires more effort on documentation.

The test deploys a job that depends on many popular PyPI packages, among them pandas, Theano, scikit-learn, ...:

$ pip freeze
alabaster==0.7.9
arrow==0.8.0
astroid==1.4.8
attrs==16.1.0
autopep8==1.2.4
Babel==2.3.4
backports.functools-lru-cache==1.2.1
BeautifulSoup==3.2.1
bolt-python==0.7.1
boto==2.42.0
cffi==1.7.0
click==6.6
configparser==3.5.0
cryptography==1.5
cycler==0.10.0
dask==0.11.0
decorator==4.0.10
docutils==0.12
enum34==1.1.6
findspark==1.1.0
first==2.0.1
flake8==3.0.4
funcsigs==1.0.2
futures==3.0.5
hypothesis==3.4.2
idna==2.1
imagesize==0.7.1
ipaddress==1.0.16
isort==4.2.5
Jinja2==2.8
jira==1.0.3
kerberos-sspi===0.1
lazy-object-proxy==1.2.2
linecache2==1.0.0
MarkupSafe==0.23
matplotlib==1.5.2
mccabe==0.5.2
mock==2.0.0
mpmath==0.19
ndg-httpsclient==0.4.2
networkx==1.11
nltk==3.2.1
nose==1.3.7
numpy==1.11.1
oauthlib==1.1.2
panda==0.3.1
pandas==0.18.1
pathlib==1.0.1
pbr==1.10.0
pep8==1.7.0
Pillow==3.3.1
pip-tools==1.7.0
py==1.4.31
pyasn1==0.1.9
pyasn1-modules==0.0.8
PyBrain==0.3
pycodestyle==2.0.0
pycparser==2.14
pycrypto==2.6.1
pyflakes==1.2.3
PyGithub==1.26.0
Pygments==2.1.3
pylint==1.6.4
pyOpenSSL==16.1.0
pyparsing==2.1.8
pytest==3.0.1
python-dateutil==2.5.3
python-ntlm==1.1.0
pytz==2016.6.1
PyYAML==3.12
-e git+ssh://internal.server.com/a/project/name@1af5c148f8f2d55f6f26a067a822f722528e13b9#egg=qsi_jobs
requests==2.11.1
requests-aws4auth==0.9
requests-kerberos===0.6.1
requests-oauthlib==0.6.2
requests-toolbelt==0.7.0
scikit-image==0.12.3
scikit-learn==0.17.1
scipy==0.18.0
service-identity==16.0.0
singledispatch==3.4.0.3
six==1.10.0
sklearn-pandas==1.1.0
snowballstemmer==1.2.1
spark-testing-base==0.0.7.post2
Sphinx==1.4.6
suds==0.4
sympy==1.0
Theano==0.8.2
thunder-python==1.4.2
tifffile==0.9.2
tlslite==0.4.9
toolz==0.8.0
traceback2==1.4.0
Unidecode==0.4.19
unittest2==1.1.0
urllib3==1.16
wrapt==1.10.8
-e git+ssh://internal.server.com/a/dependency/project/name@7f4a7623aa219743e9b96b228b4cd86fe9bc5595#egg=dependency_projectname
yapf==0.11.1

Execution of this installation takes 1 min on each executor, thanks to pip and wheels being downloaded from our internal pypi.python.org mirror:

16/08/30 17:07:47 DEBUG PythonWorkerFactory: Running command: virtualenv_app-20160830170740-0000_0/bin/pip install -r requirements.txt --index-url https://internal.pypmirror/artifactory/api/pypi/pypi-prod/simple --trusted-host internal.pypmirror qsi-jobs-0.0.1.dev15.tar.gz
16/08/30 17:08:58 DEBUG PythonWorkerFactory: Starting daemon with pythonExec: virtualenv_app-20160830170740-0000_0/bin/python
pandas imported!pandas imported!

pandas imported!pandas imported!

"pandas imported!" means "import panda" worked from the executor (4 times means each executor worked), that has been started outside of any virtualenv.

Works pretty well :)

Important: in client deploy mode, the driver should be executed from a virtual environment containing all the PyPI packages that the driver will need.
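A minimal sketch of preparing such a driver-side environment before calling spark-submit (the paths are hypothetical; the spark-submit arguments mirror the example at the top of this PR):

    # Create and activate the driver's environment, then submit from inside it.
    virtualenv /path/to/driver-env
    source /path/to/driver-env/bin/activate
    pip install -r requirements.txt
    bin/spark-submit --master spark://sparkcluster:7077 --deploy-mode client /path/to/runner.py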

@gsemet
Contributor Author

gsemet commented Sep 22, 2016

Rebased.

zjffdu and others added 8 commits October 5, 2016 14:16
…kSession"

This reverts commit 2ab64b41137374b935f939d919fec7cb2f56cd63.
- Merge of apache#13599 ("virtualenv in pyspark", Bug SPARK-13587)
- and apache#5408 ("wheel package support for PySpark", bug SPARK-6764)
- Documentation updated
- only Standalone and YARN supported. Mesos not supported
- only tested with virtualenv/pip. Conda not tested
   - client deployment + pip install w/ index: ok (1 min 30 exec)
   - client deployment + wheelhouse w/o index: ko
     (cffi refuses the built wheel)

Signed-off-by: Gaetan Semet <[email protected]>
@gsemet gsemet force-pushed the wheelhouse_support branch from bfd5a09 to cf977e4 on October 5, 2016 12:16
@holdenk
Contributor

holdenk commented Oct 8, 2016

I was going through the old PySpark JIRAs and there is one about an unexpected pandas failure, which could be semi-related (e.g. good virtual env support with a reasonable requirements file could help avoid that) - but I still don't see the reviewer interest required to take this PR forward (I'm really sorry).

@gsemet gsemet changed the title [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support [SPARK-16367][PYSPARK] Support for deploying Anaconda and Virtualenv environments in Spark Executors Oct 12, 2016
@gatorsmile
Member

cc @holdenk Any thoughts about this PR?

@holdenk
Contributor

holdenk commented Jun 17, 2017

So I think we need some support for virtualenv/anaconda, but this feels a little overly complicated as the first step (and mostly untested) -- maybe start by simply supporting running from a specific virtualenv/conda env that is already set up?

What are your thoughts, @gatorsmile / @davies? If there is a consensus with the other people working on Python, I'm happy to do some more reviews :)
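For context, the simplest form of that idea relies on the documented PYSPARK_PYTHON variable pointing at a pre-built environment (a sketch; it assumes the same interpreter path exists on every node, and the env path and app name are hypothetical):

    # Use an existing virtualenv/conda env on all nodes; no deployment step.
    export PYSPARK_PYTHON=/opt/envs/myenv/bin/python
    bin/spark-submit --master yarn --deploy-mode client app.py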

@zjffdu
Contributor

zjffdu commented Jun 17, 2017

Here's my approach, #13599, for virtualenv and conda support; comments and reviews are welcome:

https://docs.google.com/document/d/1EGNEf4vFmpGXSd2DPOLu_HL23Xhw9aWKeUrzzxsEbQs/edit?usp=sharing

@gatorsmile
Member

cc @ueshin

@gatorsmile
Member

ping @ueshin Should we continue this PR?

@jiangxb1987
Contributor

gentle ping @ueshin

@ueshin
Member

ueshin commented Jan 8, 2018

@gatorsmile @jiangxb1987 Maybe we should review and merge #13599 first, because this PR is based on it.

@gsemet
Contributor Author

gsemet commented Jan 8, 2018

Hello. It's been a long time, and this probably needs a full rework. Maybe we need to take a step back and have a talk between the several people interested in this feature to see what is most suitable for the Spark project. I work a lot on Python packaging nowadays, so I have a pretty good idea of the different distribution solutions we have for Python (anaconda, pip/virtualenv, now Pipfile), and not only of merely generating a Python package and throwing it into the wild; I mean ensuring my package works in the targeted environment. pyexecutable is also a solution even though it is more complex, and wheelhouse + some tricks might also do the job for Spark. Ultimately, the goal is to have something cool and easy to use for PySpark users who want to distribute any kind of work without having to ask the IT guys to install this numpy version on the cluster.

@HyukjinKwon
Member

Can we move the discussion into #13599, in one place? BTW, I prefer the simplest approach first. Let's go ahead unless you think we are unable to improve incrementally along the lines of #13599. If that's the case, a well-formed doc with some arguments would be helpful to open up the discussion.

@AmplabJenkins

Can one of the admins verify this patch?

@CollinRea

Is this still being looked into? Or has wheel support possibly been added in another PR?

@gsemet
Contributor Author

gsemet commented Mar 19, 2019

Today the best option would be to package into the "pex" format.
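A minimal sketch of that route (the file names are hypothetical; pex's -r/-o flags and Spark's PYSPARK_PYTHON variable are documented, but shipping patterns vary by cluster manager):

    # Bundle the application's dependencies into a single executable archive.
    pip install pex
    pex -r requirements.txt -o app.pex
    # Ship the archive and use it as the executors' Python interpreter.
    export PYSPARK_PYTHON=./app.pex
    bin/spark-submit --files app.pex app.py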

@holdenk
Contributor

holdenk commented Apr 1, 2019

Or using a custom Docker image with Spark on Kubernetes. We could document these on the website?

@gsemet
Contributor Author

gsemet commented Apr 3, 2019

Yes, Spark on Kubernetes seems cool; I never had the chance to try it! Is there a Helm chart?

@vanzin
Contributor

vanzin commented Apr 29, 2019

This doesn't seem active, so I'm closing the PR for now.

@vanzin vanzin closed this Apr 29, 2019