sryza changed the title from "Better document that the spark step launcheres needs all project dependencies installed" to "Better document that the spark step launchers needs all project dependencies installed" on Jan 3, 2023
Oh, it's about the dependencies, sorry for the inconvenience.
For me the usual solution is to use conda-pack + S3 to build and distribute environments, and to use the spark.archives option. Mentioning the job environment's Python dependencies and linking to https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html, or even a complete example involving conda-pack, would be good for the Dagster Spark docs.
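That workflow might look roughly like the following. This is a deployment sketch, not a copy-pasteable recipe: the environment name, bucket, and job file are all placeholders, and it assumes conda, conda-pack, and the AWS CLI are available.

```shell
# Build a relocatable conda env containing the project's dependencies,
# ship it to S3, and point Spark at it via spark.archives.
# All names and paths below are placeholders.
conda create -y -n dagster_env python=3.10
conda run -n dagster_env pip install -r requirements.txt
pip install conda-pack
conda pack -n dagster_env -o dagster_env.tar.gz
aws s3 cp dagster_env.tar.gz s3://my-bucket/envs/dagster_env.tar.gz

# Spark downloads and unpacks the archive on every node; the packed
# interpreter is then used for the Python workers (see the Spark
# python_packaging user guide linked above).
spark-submit \
  --conf spark.archives=s3://my-bucket/envs/dagster_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  my_job.py
```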
What's the issue or suggestion?
The example code at https://docs.dagster.io/integrations/spark#submitting-pyspark-ops-on-emr leaves out a key requirement: all of the project's Python dependencies need to be installed on the cluster.
For Databricks this can be done in step launcher config. The API config docs are, unfortunately, rather verbose and do not provide an easy example of the syntax for how to do this. Here is an example:
https://github.com/dagster-io/hooli-data-eng-pipelines/blob/master/hooli_data_eng/resources/databricks.py#L29-L56
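As a rough sketch of the shape of that config (the package names, cluster sizing, and staging path below are made-up placeholders, not the values from the linked file), the step launcher's run_config takes a libraries list that mirrors the Databricks Jobs API library spec:

```python
# Hypothetical config dict for dagster-databricks' databricks_pyspark_step_launcher.
# Every field value here is a placeholder; see the linked hooli-data-eng file
# for a real example.
step_launcher_config = {
    "run_config": {
        "run_name": "dagster_step",
        "cluster": {
            "new": {
                "size": {"num_workers": 1},
                "spark_version": "11.3.x-scala2.12",
                "nodes": {"node_types": {"node_type_id": "i3.xlarge"}},
            }
        },
        # Every package the project imports inside its ops must be listed
        # here, or imports will fail on the Databricks cluster.
        "libraries": [
            {"pypi": {"package": "dagster-aws"}},
            {"pypi": {"package": "pandas"}},
            {"pypi": {"package": "my-project-package"}},
        ],
    },
    "databricks_host": {"env": "DATABRICKS_HOST"},
    "databricks_token": {"env": "DATABRICKS_TOKEN"},
    "local_pipeline_package_path": ".",
    "staging_prefix": "/dbfs/tmp/dagster-staging",
}

# Sanity check: every library entry names a PyPI package.
assert all("pypi" in lib for lib in step_launcher_config["run_config"]["libraries"])
```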
For EMR, all the Python dependencies in the Dagster project (setup.py and requirements.txt) need to be installed manually. Normally this installation would be done in a bootstrap.sh bootstrap action, as documented in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html

Currently it is painful to figure this out: users report iterating through run launches that fail with cryptic log errors (you have to view stderr to see the actual message) and then fixing the missing packages one by one, run by run 😱
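A minimal bootstrap.sh along those lines just installs the project's requirements on every node. The S3 path is hypothetical, and this assumes requirements.txt was uploaded there ahead of cluster creation:

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: install the Dagster project's Python
# dependencies on every node so that Spark steps can import them.
set -euo pipefail

# Assumes requirements.txt (and any local project package) were uploaded
# to this S3 prefix before the cluster was created.
aws s3 cp s3://my-bucket/bootstrap/requirements.txt /tmp/requirements.txt
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install -r /tmp/requirements.txt
```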
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.