[SPARK-16367][PYSPARK] Support for deploying Anaconda and Virtualenv environments in Spark Executors #14180
Conversation
Force-pushed from 791d2bc to a378b8b.
indent_size = 4

[*.scala]
indent_size = 2
Unnecessary file?
I saw there are differences in the indentation of Scala and Python files; this allows most editors (Sublime, Atom, ...) to adapt automatically.
This might be better to do as a separate PR, since I could foresee us having some issue with the tabs configuration and wanting to do a revert (or vice versa).
@stibbons, sorry for the late review. Let's work together on this PR in the next few days if that is OK for you.
Yes, I'll be glad! It is not fully ready yet; I still need to figure out how the script is launched in each situation.
ping @stibbons Any updates?
Yes, I am back from vacation! I can work on it now :)
Opened #14567 with PEP 8 fixes, import reorganisation and editorconfig.
Force-pushed from a378b8b to a37b9ee.
Rebased, without the import reorg and editorconfig files. Still not fully validated.
@stibbons Do you have time to continue this work?
Actually I was waiting for #14567 to be reviewed and merged :( I might have some questions on how Spark deploys Python scripts on YARN or Mesos, if you know how that works.
I can help if you have any question regarding Spark on YARN. For Mesos, since not so many people use it, we may put it in another ticket.
We are implementing Mesos here (it may take a while). While not so many people use it, on paper it looks great ;) Please mail me at gaetan[a t]xeberon.net if that is easier for you (it is for me); this patch does not do the job completely for the moment :(
# where pip might not be installed
from pip.commands.install import InstallCommand as pip_InstallCommand
pip_InstallCommand().main(args=pip_args)
Why install wheel files here? Shouldn't that be done in PythonWorkerFactory.scala?
I need to dig a bit further, but there is some code that isn't executed in client mode (i.e., running the driver on the developer's machine).
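As a side note on the snippet above: pip.commands.install is pip's private API (it later moved under pip._internal in pip 10), so a more future-proof way to install the shipped wheels programmatically is to go through python -m pip. A minimal sketch, not what this patch currently does:

```python
import subprocess
import sys

def pip_install(pip_args):
    """Run `pip install <pip_args>` with the interpreter running this worker.

    Going through `python -m pip` avoids importing pip's private internals,
    which have changed across pip releases.
    """
    subprocess.check_call([sys.executable, "-m", "pip", "install"] + list(pip_args))

# Hypothetical usage, mirroring the wheelhouse idea:
# pip_install(["--no-index", "--find-links=wheelhouse", "-r", "requirements.txt"])
```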
It makes sense. If you manage to get this merged I can rebase with only my diff. Too bad we cannot stack pull requests on GitHub :(
Force-pushed from a37b9ee to c9bb018.
This new version is meant to be rebased after #13599 is merged. Here is my current state:
I'd like to have the same ability to specify the entry point in Python that we have in Java/Scala with the --class option.
Force-pushed from e0f55aa to db65b61.
Status for the test "standalone install, client deployment":
The test deploys a job that depends on many popular PyPI packages, among them Pandas, Theano, Scikit-learn, ...
Execution of this installation takes 1 min on each executor, thanks to pip and the wheels being downloaded from our internal pypi.python.org mirror.
"pandas imported!" means "import pandas" worked on the executor (seeing it 4 times means each executor worked), even though the executors were started outside of any virtualenv. Works pretty well :) Important: in client deployment mode, the driver should be executed from a virtual environment that already contains all the PyPI packages the driver needs.
Force-pushed from 9fff24a to 63264ae.
Rebased.
Force-pushed from 63264ae to bfd5a09.
…kSession" This reverts commit 2ab64b41137374b935f939d919fec7cb2f56cd63.
- Merge of apache#13599 ("virtualenv in pyspark", bug SPARK-13587) and apache#5408 ("wheel package support for PySpark", bug SPARK-6764)
- Documentation updated
- Only Standalone and YARN supported; Mesos not supported
- Only tested with virtualenv/pip; Conda not tested
- Client deployment + pip install with index: ok (1 min 30 per executor)
- Client deployment + wheelhouse without index: ko (cffi refuses the built wheel)

Signed-off-by: Gaetan Semet <[email protected]>
Force-pushed from bfd5a09 to cf977e4.
I was going through the old PySpark JIRAs and there is one about an unexpected pandas failure, which could be semi-related (e.g. good virtualenv support with a reasonable requirements file could help avoid that), but I still don't see the reviewer interest required to take this PR forward (I'm really sorry).
cc @holdenk Any thoughts about this PR?
So I think we need some support for virtualenv/anaconda, but this feels a little overly complicated as a first step (and mostly untested). Maybe start by simply supporting running from a specific virtualenv/conda env that is already set up? What are your thoughts @gatorsmile / @davies? If there is a consensus with the other people working on Python, I'm happy to do some more reviews :)
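For reference, running from a specific environment that is already installed on every node is possible today through the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON settings; a rough sketch with placeholder paths:

```bash
# Assumes /opt/envs/myenv already exists, with the same packages, on the driver
# and on every worker node (paths and file names are placeholders).
export PYSPARK_DRIVER_PYTHON=/opt/envs/myenv/bin/python
export PYSPARK_PYTHON=/opt/envs/myenv/bin/python

spark-submit my_job.py
```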
Here's my approach for virtualenv and conda support: #13599. Any comments and reviews are welcome: https://docs.google.com/document/d/1EGNEf4vFmpGXSd2DPOLu_HL23Xhw9aWKeUrzzxsEbQs/edit?usp=sharing
cc @ueshin
ping @ueshin Should we continue this PR?
gentle ping @ueshin
@gatorsmile @jiangxb1987 Maybe we should review and merge #13599 first, because this PR is based on it.
Hello. It has been a long time; this probably needs a full rework. Maybe we should take a step back and have a talk between the several people interested in this feature to see what is most suitable for the Spark project. I work a lot on Python packaging nowadays, so I have a pretty good idea of the different distribution solutions we have for Python (Anaconda, pip/virtualenv, now Pipfile), and not merely generating a Python package and throwing it into the wild: I mean ensuring my package works in the targeted environment. pyexecutable is also a solution even though it is more complex, and wheelhouse + some tricks might also do the job for Spark. Ultimately, the goal is to have something cool and easy to use for PySpark users who want to distribute any kind of work without having to ask the IT guys to install a specific numpy version on the cluster.
Can one of the admins verify this patch?
Is this still being looked into? Or has wheel support possibly been added in another PR?
Today the best option would be to package into the « pex » format.
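Roughly, the pex route amounts to bundling all the dependencies into one executable archive and using it as the executors' interpreter; a sketch assuming the pex tool and a requirements.txt are available, with placeholder file names:

```bash
# Build a self-contained environment archive (driver, executors and the build
# machine must share the same OS/architecture and Python version).
pip install pex
pex -r requirements.txt -o my_env.pex

# Ship the archive with the job and use it as the executors' interpreter.
export PYSPARK_DRIVER_PYTHON=python     # the driver keeps its local interpreter
export PYSPARK_PYTHON=./my_env.pex
spark-submit --files my_env.pex my_job.py
```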
Or using a custom Docker image with Spark on Kubernetes. We could document these on the website?
Yes, Spark on Kubernetes seems cool, I never had the chance to try it! Is there a Helm chart?
This doesn't seem active, so closing the PR for now.
What changes were proposed in this pull request?
Support automatic deployment of Anaconda or virtualenv environments on the executors, and add wheel and wheelhouse support to PySpark, based on SPARK-13587 (see #13599).
Full description in SPARK-16367
The developer writes their Python code and describes the dependencies of this script (numpy, pandas, requests, ...) inside a requirements.txt file. When the job is deployed on the executors (and when the driver is deployed in cluster mode), the same environment is automatically set up using pip (or conda).

The spark-submit invocation looks like:
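Presumably something along these lines, reusing the spark.pyspark.virtualenv.* configuration names proposed in #13599; the exact options this PR adds for wheel/wheelhouse support may differ, and the master URL, paths and file names are placeholders:

```bash
# Hypothetical invocation: the spark.pyspark.virtualenv.* property names come
# from #13599; this PR's wheel/wheelhouse-specific options may look different.
spark-submit \
  --master spark://master:7077 \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
  runner.py
```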
Content of runner.py (this runner script is needed because --class does not work for Python, only Java):
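A minimal sketch of what such a runner could look like; the mypackage.job module and the main function are made-up names:

```python
# runner.py: thin entry point handed to spark-submit, since --class cannot name
# a Python function. The real work lives in mypackage, whose dependencies
# (numpy, pandas, requests, ...) come from the environment deployed above.
from mypackage.job import main  # hypothetical module/function names

if __name__ == "__main__":
    main()
```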
mypackage is built using the classic python setup.py sdist. It could be a wheel as well, even though distributing a source package for the job is safer. I have described this approach in a bit more detail here.
As a bonus, this PR adds support for "wheelhouse" deployment, where all dependencies are packaged into a single archive so that no network connection is needed when the Spark executors deploy the environment.
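With stock pip, building and consuming such an archive looks roughly like the following; this illustrates the general wheelhouse idea rather than the exact layout this PR expects:

```bash
# Pre-build wheels for every dependency (and for mypackage itself) into one
# directory, then archive it so it can be shipped with the job.
pip wheel -r requirements.txt -w wheelhouse/
pip wheel . -w wheelhouse/
zip -r wheelhouse.zip wheelhouse/

# Executors can later install everything without any index or network access:
pip install --no-index --find-links=wheelhouse/ -r requirements.txt
```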
How was this patch tested?
Manually tested on Ubuntu with Spark Standalone. Seems to work on a YARN cluster as well.