
[SPARK-16367][PYSPARK] Support for deploying Anaconda and Virtualenv environments in Spark Executors #14180

Closed
wants to merge 8 commits

Conversation

gsemet
Contributor

@gsemet gsemet commented Jul 13, 2016

What changes were proposed in this pull request?

Support automatic deployment of Anaconda or virtualenv environments on executors, and add wheel and wheelhouse support to PySpark, based on SPARK-13587 (see #13599).

Full description in SPARK-16367

The developer writes their Python code and describes its dependencies (numpy, pandas, requests, ...) in a requirements.txt file. When the job is deployed on the executors (and on the driver in cluster mode), the same environment is automatically set up using pip (or conda).
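For illustration, a minimal requirements.txt matching the packages named above (the pinned versions are taken from the test environment shown later in this thread; any pins would do):

    numpy==1.11.1
    pandas==0.18.1
    requests==2.11.1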

Spark submit looks like:

    bin/spark-submit                                                              \
        --master spark://sparkcluster:7077                                        \
        --deploy-mode client                                                      \
        --verbose                                                                 \
        --files requirements.txt,/path/to/mypackage.tar.gz                        \
        --conf "spark.pyspark.virtualenv.enabled=true"                            \
        --conf "spark.pyspark.virtualenv.system_site_packages=false"              \
        --conf "spark.eventLog.enabled=true"                                      \
        --conf "spark.eventLog.dir=/tmp/spark-events"                             \
        --conf "spark.pyspark.virtualenv.bin.path=virtualenv"                     \
        --conf "spark.pyspark.virtualenv.use_index=true"                          \
        --conf "spark.pyspark.virtualenv.requirements=requirements.txt"           \
        --conf "spark.pyspark.virtualenv.trusted_host=internal.pypi.mirror"       \
        --conf "spark.pyspark.virtualenv.upgrade_pip=true"                        \
        --conf "spark.pyspark.virtualenv.install_package=mypackage.tag.gz"        \
        --conf "spark.pyspark.virtualenv.index_url=https://internal.pypi.mirror/" \
        /path/to/runner.py

Content of runner.py:

import mypackage
mypackage.main()

(this runner script is needed because --class only works for Java/Scala, not for Python)

mypackage is built using the classic python setup.py sdist. It could be a wheel as well, even though distributing a source package for the job is safer.
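As an illustration, a minimal sketch of what mypackage's setup.py might look like (the name, version, and dependency list are hypothetical, mirroring the description above):

    # setup.py -- minimal packaging sketch for the hypothetical mypackage
    from setuptools import find_packages, setup

    setup(
        name="mypackage",
        version="0.0.1",
        packages=find_packages(),
        # Runtime dependencies; these should match requirements.txt.
        install_requires=["numpy", "pandas", "requests"],
    )

Running python setup.py sdist then produces dist/mypackage-0.0.1.tar.gz, the archive that --files and spark.pyspark.virtualenv.install_package point at.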

I have described this approach in a bit more detail here

As a bonus, this PR adds support for "wheelhouse" deployment, where all dependencies are packaged into a single archive so that no network connection is needed when Spark executors set up the environment.
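For reference, a wheelhouse can be built locally with pip's documented wheel subcommand (a sketch; it assumes the build host matches the executors' platform and Python version, otherwise binary wheels will not be usable there):

    # Build wheels for every requirement into ./wheelhouse, then archive them.
    pip wheel -r requirements.txt --wheel-dir wheelhouse/
    zip -r wheelhouse.zip wheelhouse/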

How was this patch tested?

Manually tested on Ubuntu with Spark Standalone. Seems to work on a YARN cluster as well.

@gsemet gsemet force-pushed the wheelhouse_support branch from 791d2bc to a378b8b on July 13, 2016 16:19
indent_size = 4

[*.scala]
indent_size = 2
Contributor

Unnecessary file?

Contributor Author

I saw there are differences in indentation between Scala and Python files; this allows most editors (Sublime, Atom, ...) to adapt automatically.
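For context, a minimal .editorconfig expressing that convention (a sketch reconstructed from the fragment above; the actual file in the PR may differ):

    root = true

    [*.py]
    indent_style = space
    indent_size = 4

    [*.scala]
    indent_style = space
    indent_size = 2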

Contributor

This might be better to do as a separate PR, since I could foresee us having some issues with the tabs configuration and wanting to do a revert (or vice versa).

@zjffdu
Contributor

zjffdu commented Aug 2, 2016

@stibbons, sorry for the late review. Let's work together on this PR in the next few days if that is OK for you.

@gsemet
Contributor Author

gsemet commented Aug 2, 2016

Yes, I'd be glad to! It is not fully ready yet; I still need to figure out how the script is launched in each situation.

@zjffdu
Contributor

zjffdu commented Aug 9, 2016

ping @stibbons Any updates?

@gsemet
Contributor Author

gsemet commented Aug 9, 2016

Yes, I am back from vacation! I can work on it now :)

@gsemet
Contributor Author

gsemet commented Aug 9, 2016

Opened #14567 with PEP 8 fixes, import reorganisation, and editorconfig.

@gsemet gsemet changed the title Wheelhouse and VirtualEnv support [SPARK-16367] Wheelhouse and VirtualEnv support Aug 10, 2016
@gsemet gsemet changed the title [SPARK-16367] Wheelhouse and VirtualEnv support [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support Aug 10, 2016
@gsemet gsemet force-pushed the wheelhouse_support branch from a378b8b to a37b9ee on August 10, 2016 11:26
@gsemet
Contributor Author

gsemet commented Aug 10, 2016

Rebased, without import reorg and editorconfig files. Still not fully validated.

@zjffdu
Contributor

zjffdu commented Aug 18, 2016

@stibbons Do you have time to continue this work ?

@gsemet
Contributor Author

gsemet commented Aug 18, 2016

Actually I was waiting for #14567 to be reviewed and merged :(

I might have some questions on how Spark deploys Python scripts on YARN or Mesos, if you know how that works.

@zjffdu
Contributor

zjffdu commented Aug 18, 2016

I can help if you have any questions regarding Spark on YARN. For Mesos, since not so many people use it, we may put it in another ticket.

@gsemet
Contributor Author

gsemet commented Aug 18, 2016

We are implementing Mesos here (it may take a while). While not so many people use it, on paper it looks great ;)

Please mail me at gaetan[a t]xeberon.net if that is easier for you (it is for me); this patch does not do the job completely for the moment :(

# where pip might not be installed
from pip.commands.install import InstallCommand as pip_InstallCommand
pip_InstallCommand().main(args=pip_args)

Contributor

Why install wheel files here? Shouldn't this be done in PythonWorkerFactory.scala?

Contributor Author

I need to dig a bit further, but there is some code that isn't executed in client mode (i.e. when the driver runs on the developer's machine).
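As an aside on the snippet under review: pip.commands.install was pip's internal, unstable API (it was removed in pip 10). A sketch of an equivalent invocation that relies only on pip's command-line interface (an illustration, not what this patch does):

    # Run pip as a subprocess instead of importing its internal API.
    import subprocess
    import sys

    def pip_install(pip_args):
        # Executes: <current python> -m pip install <args>; raises on failure.
        subprocess.check_call([sys.executable, "-m", "pip", "install"] + list(pip_args))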

@zjffdu
Contributor

zjffdu commented Aug 23, 2016

@stibbons After a second review of your PR, I have one concern: supporting both wheelhouse and virtualenv in one PR may be too big to review; it might be better to do it in 2 PRs. I will try to get some feedback on #13599 first, but will continue to look at this PR.

@gsemet
Contributor Author

gsemet commented Aug 23, 2016

It makes sense. If you manage to get this merged I can rebase with only my diff.

Too bad we cannot stack pull requests on GitHub :(

@gsemet gsemet force-pushed the wheelhouse_support branch from a37b9ee to c9bb018 on August 24, 2016 14:58
@gsemet
Contributor Author

gsemet commented Aug 24, 2016

This new version is meant to be rebased after #13599 is merged.

Here is my current state:

  • only Standalone and YARN supported. Mesos not supported
  • only tested with virtualenv/pip on standalone installation. Conda not tested
  • wheelhouse deployment works (i.e. all dependencies can be packaged into a single zip file and automatically and quickly installed on workers)
  • for example, deploying a package with numpy + pandas + scikit-learn is fast once the installation has been done at least once on all workers, and if the wheelhouse provides wheels for all required versions, pip installs everything very quickly and without an internet connection

I'd like to have the same ability to specify the entry point in Python that we have in Java/Scala with the --class argument of spark-submit, e.g. something like the generic runner sketched below. What is your opinion (this would be part of a future PR, of course)?
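For discussion, a hypothetical generic runner approximating a Python --class option (the module and function names are passed as arguments; this is a sketch, not part of the patch):

    # generic_runner.py -- hypothetical stand-in for a Python --class option
    import importlib
    import sys

    if __name__ == "__main__":
        module_name, func_name = sys.argv[1], sys.argv[2]
        # Resolve and call the requested entry point,
        # e.g. "python generic_runner.py mypackage main".
        entry_point = getattr(importlib.import_module(module_name), func_name)
        entry_point()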

@gsemet gsemet force-pushed the wheelhouse_support branch 3 times, most recently from e0f55aa to db65b61 on August 30, 2016 15:11
@gsemet
Contributor Author

gsemet commented Aug 30, 2016

Status for the "standalone install, client deployment" test:

  • virtualenv creation and pip install from a PyPI repository: OK (1 min 30 execution)
  • wheelhouse (PyPI repository): KO, because 'cffi' refuses the built wheel. Not related to this patch, but requires more effort on documentation.

The test deploys a job that depends on many popular PyPI packages, among them pandas, Theano, scikit-learn, ...:

$ pip freeze
alabaster==0.7.9
arrow==0.8.0
astroid==1.4.8
attrs==16.1.0
autopep8==1.2.4
Babel==2.3.4
backports.functools-lru-cache==1.2.1
BeautifulSoup==3.2.1
bolt-python==0.7.1
boto==2.42.0
cffi==1.7.0
click==6.6
configparser==3.5.0
cryptography==1.5
cycler==0.10.0
dask==0.11.0
decorator==4.0.10
docutils==0.12
enum34==1.1.6
findspark==1.1.0
first==2.0.1
flake8==3.0.4
funcsigs==1.0.2
futures==3.0.5
hypothesis==3.4.2
idna==2.1
imagesize==0.7.1
ipaddress==1.0.16
isort==4.2.5
Jinja2==2.8
jira==1.0.3
kerberos-sspi===0.1
lazy-object-proxy==1.2.2
linecache2==1.0.0
MarkupSafe==0.23
matplotlib==1.5.2
mccabe==0.5.2
mock==2.0.0
mpmath==0.19
ndg-httpsclient==0.4.2
networkx==1.11
nltk==3.2.1
nose==1.3.7
numpy==1.11.1
oauthlib==1.1.2
panda==0.3.1
pandas==0.18.1
pathlib==1.0.1
pbr==1.10.0
pep8==1.7.0
Pillow==3.3.1
pip-tools==1.7.0
py==1.4.31
pyasn1==0.1.9
pyasn1-modules==0.0.8
PyBrain==0.3
pycodestyle==2.0.0
pycparser==2.14
pycrypto==2.6.1
pyflakes==1.2.3
PyGithub==1.26.0
Pygments==2.1.3
pylint==1.6.4
pyOpenSSL==16.1.0
pyparsing==2.1.8
pytest==3.0.1
python-dateutil==2.5.3
python-ntlm==1.1.0
pytz==2016.6.1
PyYAML==3.12
-e git+ssh://internal.server.com/a/project/name@1af5c148f8f2d55f6f26a067a822f722528e13b9#egg=qsi_jobs
requests==2.11.1
requests-aws4auth==0.9
requests-kerberos===0.6.1
requests-oauthlib==0.6.2
requests-toolbelt==0.7.0
scikit-image==0.12.3
scikit-learn==0.17.1
scipy==0.18.0
service-identity==16.0.0
singledispatch==3.4.0.3
six==1.10.0
sklearn-pandas==1.1.0
snowballstemmer==1.2.1
spark-testing-base==0.0.7.post2
Sphinx==1.4.6
suds==0.4
sympy==1.0
Theano==0.8.2
thunder-python==1.4.2
tifffile==0.9.2
tlslite==0.4.9
toolz==0.8.0
traceback2==1.4.0
Unidecode==0.4.19
unittest2==1.1.0
urllib3==1.16
wrapt==1.10.8
-e git+ssh://internal.server.com/a/dependency/project/name@7f4a7623aa219743e9b96b228b4cd86fe9bc5595#egg=dependency_projectname
yapf==0.11.1

Execution of this installation takes 1 min on each executor, thanks to pip and wheels being downloaded from our internal pypi.python.org mirror:

16/08/30 17:07:47 DEBUG PythonWorkerFactory: Running command: virtualenv_app-20160830170740-0000_0/bin/pip install -r requirements.txt --index-url https://internal.pypmirror/artifactory/api/pypi/pypi-prod/simple --trusted-host internal.pypmirror qsi-jobs-0.0.1.dev15.tar.gz
16/08/30 17:08:58 DEBUG PythonWorkerFactory: Starting daemon with pythonExec: virtualenv_app-20160830170740-0000_0/bin/python
pandas imported!pandas imported!

pandas imported!pandas imported!

"pandas imported!" means "import panda" worked from the executor (4 times means each executor worked), that has been started outside of any virtualenv.

Works pretty well :)

Important: in client deploy mode, the driver should be executed from a virtual environment containing all the PyPI packages that the driver will need.
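A minimal sketch of preparing such a driver-side environment before calling spark-submit (the paths are hypothetical; the spark-submit arguments mirror the example at the top of this PR):

    # Create and activate the driver's environment, then submit from inside it.
    virtualenv /path/to/driver-env
    source /path/to/driver-env/bin/activate
    pip install -r requirements.txt
    bin/spark-submit --master spark://sparkcluster:7077 --deploy-mode client /path/to/runner.py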

@gsemet
Contributor Author

gsemet commented Sep 22, 2016

Rebased.

zjffdu and others added 8 commits October 5, 2016 14:16
…kSession"

This reverts commit 2ab64b41137374b935f939d919fec7cb2f56cd63.
- Merge of apache#13599 ("virtualenv in pyspark", Bug SPARK-13587)
- and apache#5408 ("wheel package support for PySpark", bug SPARK-6764)
- Documentation updated
- only Standalone and YARN supported. Mesos not supported
- only tested with virtualenv/pip. Conda not tested
   - client deployment + pip install w/ index: ok (1 min 30 exec)
   - client deployment + wheelhouse w/o index: ko
     (cffi refuses the built wheel)

Signed-off-by: Gaetan Semet <[email protected]>
@gsemet gsemet force-pushed the wheelhouse_support branch from bfd5a09 to cf977e4 on October 5, 2016 12:16
@holdenk
Contributor

holdenk commented Oct 8, 2016

I was going through the old PySpark JIRAs and there is one about an unexpected pandas failure, which could be semi-related (e.g. good virtual env support with a reasonable requirements file could help avoid that) - but I still don't see the reviewer interest required to take this PR forward (I'm really sorry).

@gsemet gsemet changed the title [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support [SPARK-16367][PYSPARK] Support for deploying Anaconda and Virtualenv environments in Spark Executors Oct 12, 2016
@gatorsmile
Member

cc @holdenk Any thoughts about this PR?

@holdenk
Contributor

holdenk commented Jun 17, 2017

So I think we need some support for virtualenv/anaconda, but this feels a little overly complicated as the first step (and mostly untested) -- maybe start by simply supporting running from a specific virtualenv/conda env that is already set up?

What are your thoughts, @gatorsmile / @davies? If there is a consensus with the other people working on Python, I'm happy to do some more reviews :)
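For context, the simplest form of that idea relies on the documented PYSPARK_PYTHON variable pointing at a pre-built environment (a sketch; it assumes the same interpreter path exists on every node, and the env path and app name are hypothetical):

    # Use an existing virtualenv/conda env on all nodes; no deployment step.
    export PYSPARK_PYTHON=/opt/envs/myenv/bin/python
    bin/spark-submit --master yarn --deploy-mode client app.py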

@zjffdu
Contributor

zjffdu commented Jun 17, 2017

Here's my approach, #13599, for virtualenv and conda support; comments and reviews are welcome:

https://docs.google.com/document/d/1EGNEf4vFmpGXSd2DPOLu_HL23Xhw9aWKeUrzzxsEbQs/edit?usp=sharing

@gatorsmile
Member

cc @ueshin

@gatorsmile
Member

ping @ueshin Should we continue this PR?

@jiangxb1987
Contributor

gentle ping @ueshin

@ueshin
Member

ueshin commented Jan 8, 2018

@gatorsmile @jiangxb1987 Maybe we should review and merge #13599 first, because this PR is based on it.

@gsemet
Contributor Author

gsemet commented Jan 8, 2018

Hello. It's been a long time, and this probably needs a full rework. Maybe we need to take a step back and have a talk between the several people interested in this feature to see what is most suitable for the Spark project. I work a lot on Python packaging nowadays, so I have a pretty good idea of the different distribution solutions we have for Python (anaconda, pip/virtualenv, now Pipfile), and not only of merely generating a Python package and throwing it into the wild; I mean ensuring my package works in the targeted environment. pyexecutable is also a solution even though it is more complex, and wheelhouse + some tricks might also do the job for Spark. Ultimately, the goal is to have something cool and easy to use for PySpark users who want to distribute any kind of work without having to ask the IT guys to install this numpy version on the cluster.

@HyukjinKwon
Member

Can we move the discussion into #13599, in one place? BTW, I prefer the simplest approach first. Let's go ahead unless you think we are unable to improve incrementally along the lines of #13599. If that's the case, a well-formed doc with some arguments would be helpful to open up the discussion.

@AmplabJenkins

Can one of the admins verify this patch?

@CollinRea

Is this still being looked into? Or has wheel support possibly been added in another PR?

@gsemet
Contributor Author

gsemet commented Mar 19, 2019

Today the best option would be to package into the "pex" format.
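A minimal sketch of that route (the file names are hypothetical; pex's -r/-o flags and Spark's PYSPARK_PYTHON variable are documented, but shipping patterns vary by cluster manager):

    # Bundle the application's dependencies into a single executable archive.
    pip install pex
    pex -r requirements.txt -o app.pex
    # Ship the archive and use it as the executors' Python interpreter.
    export PYSPARK_PYTHON=./app.pex
    bin/spark-submit --files app.pex app.py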

@holdenk
Contributor

holdenk commented Apr 1, 2019

Or using a custom Docker image with Spark on Kubernetes. We could document these on the website?

@gsemet
Contributor Author

gsemet commented Apr 3, 2019

Yes, Spark on Kubernetes seems cool; I never had the chance to try it! Is there a Helm chart?

@vanzin
Contributor

vanzin commented Apr 29, 2019

This doesn't seem active, so I'm closing the PR for now.

@vanzin vanzin closed this Apr 29, 2019