
Better document that the Spark step launchers need all project dependencies installed #11476

Open
slopp opened this issue Jan 3, 2023 · 2 comments
Labels
area: docs Related to documentation in general

Comments

slopp commented Jan 3, 2023

What's the issue or suggestion?

The example code at https://docs.dagster.io/integrations/spark#submitting-pyspark-ops-on-emr leaves out a key requirement: all of the project's Python dependencies need to be installed on the cluster.

For Databricks this can be done in the step launcher config. The API config docs are, unfortunately, rather verbose and do not provide an easy example of the syntax for doing this. Here is an example:

https://github.com/dagster-io/hooli-data-eng-pipelines/blob/master/hooli_data_eng/resources/databricks.py#L29-L56
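For reference, the shape of that config looks roughly like the sketch below. All values here are hypothetical placeholders, and the nesting is adapted from the hooli-data-eng example; the authoritative schema is the `databricks_pyspark_step_launcher` API docs. The key point is the `libraries` list, which tells Databricks which PyPI packages to install on the cluster:

```python
from dagster_databricks import databricks_pyspark_step_launcher

# A minimal sketch, assuming the dagster-databricks config schema used in the
# hooli-data-eng example. Cluster sizes, versions, and paths are illustrative.
step_launcher = databricks_pyspark_step_launcher.configured({
    "run_config": {
        "run_name": "launch_step",
        "cluster": {
            "new": {
                "size": {"num_workers": 1},
                "spark_version": "11.2.x-scala2.12",
                "nodes": {"node_types": {"node_type_id": "i3.xlarge"}},
            }
        },
        # The important part: every Python dependency of the Dagster project
        # must be listed here so Databricks installs it on the cluster.
        "libraries": [
            {"pypi": {"package": "dagster-aws"}},
            {"pypi": {"package": "pandas"}},
        ],
    },
    "databricks_host": {"env": "DATABRICKS_HOST"},
    "databricks_token": {"env": "DATABRICKS_TOKEN"},
    "local_pipeline_package_path": ".",
    "staging_prefix": "/dbfs/tmp",
})
```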

For EMR, all of the Python dependencies in the Dagster project (setup.py and requirements.txt) need to be installed manually. Normally this installation is done with a bootstrap.sh script, as documented in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html
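A minimal bootstrap.sh along these lines might look like the following. The bucket name and paths are hypothetical, and this assumes you have uploaded requirements.txt (and optionally a built sdist of the project) to S3 ahead of cluster creation:

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: runs on every node at cluster start.
set -euo pipefail

# Pull the dependency list from a pre-uploaded location (illustrative bucket).
aws s3 cp s3://my-bucket/bootstrap/requirements.txt /tmp/requirements.txt
sudo python3 -m pip install -r /tmp/requirements.txt

# Optionally install the Dagster project package itself (from setup.py),
# if it is not shipped some other way:
aws s3 cp s3://my-bucket/bootstrap/my_dagster_project.tar.gz /tmp/
sudo python3 -m pip install /tmp/my_dagster_project.tar.gz
```

The script is registered as a bootstrap action when the cluster is created, so it must be in S3 before the step launcher spins the cluster up.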

Currently it is painful to figure this out. Users report iterating through run launches that fail with cryptic log errors (you need to view stderr to see the actual message), then fixing the errors package by package, run by run 😱

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.


ei-grad commented Jan 4, 2023

> For EMR, all the python dependencies in the dagster project (setup.py and requirements.txt) need to be installed manually.

Isn't the deploy_local_job_package option of emr_pyspark_step_launcher supposed to do this automatically, by uploading the package to S3 and passing it via --py-files?


ei-grad commented Jan 4, 2023

Oh, it's about the dependencies, sorry for the inconvenience.

For me the usual solution is to use conda-pack plus S3 to build and distribute environments, together with the spark.archives option. Mentioning the job environment's Python dependencies and linking to https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html, or even a complete example involving conda-pack, would be good for the Dagster Spark docs.
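For concreteness, that workflow looks roughly like this. Environment names, the bucket, and the entry script are illustrative; the `spark.archives` `#environment` alias and the `./environment/bin/python` interpreter path follow the PySpark packaging guide linked above:

```shell
# Build the environment locally (on a Linux image matching the cluster),
# pack it with conda-pack, and upload the archive to S3.
conda create -y -n job_env python=3.9
conda run -n job_env pip install -r requirements.txt conda-pack
conda run -n job_env conda pack -n job_env -o job_env.tar.gz
aws s3 cp job_env.tar.gz s3://my-bucket/envs/job_env.tar.gz

# Point Spark at the packed environment. Spark downloads the archive and
# unpacks it into ./environment on each executor.
spark-submit \
  --conf spark.archives=s3://my-bucket/envs/job_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  my_job.py
```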
