Add dagster_databricks package for Databricks integration #2468
Conversation
I've tidied this up, finished it off (for now, pending a soon-to-be-PR'd dagster-azure package) and tested it a bunch of times on a Databricks cluster I have access to, both saving to S3 and to Azure Data Lake Storage. It seems to work pretty well!

There are some slightly hairy parts at the minute around getting libraries installed remotely when using the external step launcher. Normally we could specify libraries in the […]

There are also some (generally commented-out) references to dagster-azure, which I intend to sort out as soon as possible, but I'll need to add a bunch of tests and send another PR first!
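For reference, a minimal sketch of the kind of Databricks run config the truncated sentence is presumably about. The `libraries` field is part of the Databricks Runs Submit API, which installs the listed packages when the cluster starts; everything else below (cluster sizing, package names) is illustrative rather than this package's exact schema:

```python
# Sketch of a Databricks run config; values are illustrative only.
run_config = {
    'run_name': 'dagster-step',
    'new_cluster': {
        'spark_version': '6.5.x-scala2.11',
        'node_type_id': 'Standard_DS3_v2',
        'num_workers': 2,
    },
    # Dependencies would normally be declared here for Databricks to install
    # at cluster startup; the external step launcher has to arrange this
    # itself, which is the hairy part mentioned above.
    'libraries': [
        {'pypi': {'package': 'dagster'}},
        {'pypi': {'package': 'dagster-pyspark'}},
    ],
}
```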
This is a really impressive and comprehensive piece of work. Thanks a ton for contributing.
I left a few comments inside, mostly on stylistic stuff. I think the biggest open question I have is on how the storage piece fits in. Have you looked into whether it would make sense to expose the s3/dbfs referenced in the PR as an intermediate store?
```diff
-prod_mode = ModeDefinition(
-    name='prod',
+prod_emr_mode = ModeDefinition(
+    name='prod_emr',
```
👍
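A sketch of how the renamed mode can sit alongside a Databricks counterpart; the step launcher names are the packages' own exports, while the mode and resource wiring here is illustrative:

```python
from dagster import ModeDefinition
from dagster_aws.emr import emr_pyspark_step_launcher
from dagster_databricks import databricks_pyspark_step_launcher

# The same resource key is used in both modes, so solids stay unchanged
# across EMR and Databricks; only the launcher (and its config) differs.
prod_emr_mode = ModeDefinition(
    name='prod_emr',
    resource_defs={'pyspark_step_launcher': emr_pyspark_step_launcher},
)
prod_databricks_mode = ModeDefinition(
    name='prod_databricks',
    resource_defs={'pyspark_step_launcher': databricks_pyspark_step_launcher},
)
```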
(Resolved review threads on dagster_databricks/__init__.py, configs.py, databricks.py, databricks_step_main.py, and dagster_databricks_tests/test_pyspark.py.)
```python
    infile, self._dbfs_path(run_id, step_key, self._main_file_name())
)
# ...
if True:
```
With EMR, we've basically punted this to the user, though I suppose that's more difficult with Databricks because the job is often launching the cluster itself instead of relying on a user to launch and configure it separately.
@natekupp - I'm curious whether you have thoughts on the right thing to do here. I don't love this, but also don't have a better option in mind.
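For what it's worth, the Databricks Runs Submit API does accept either a `new_cluster` spec or an `existing_cluster_id`, so an EMR-style "user owns the cluster" split is at least expressible; a sketch with made-up values:

```python
# Option A: the step launcher creates the cluster per run, and must therefore
# also handle library installation itself.
launcher_creates_cluster = {
    'new_cluster': {
        'spark_version': '6.5.x-scala2.11',
        'num_workers': 2,
    },
}

# Option B: punt to the user, EMR-style, by targeting a cluster they have
# already launched and configured (including any installed libraries).
# The cluster ID below is made up.
user_owns_cluster = {
    'existing_cluster_id': '0101-000000-abcd123',
}
```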
(Resolved review thread on dagster_databricks/utils.py.)
Most of this was more or less copied from your work on the EMR subpackage, so thanks for that! 🙂
I've pushed fixes for the stylistic stuff, thanks for the comments. Let me know if you'd like the changes squashed.

Storage is definitely a concern. I'm not convinced it's worth adding another storage system for DBFS, since Databricks strongly recommend using either S3 or Azure anyway, plus those need to be accessed in different ways depending on whether they're accessed from inside or outside the cluster. I've mentioned the use of an intermediate store in a separate comment above. Note that the launcher does require and use a storage system (the simple_pyspark example uses S3); it just also requires additional credentials in Databricks so the storage can be mounted secretly.
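To illustrate the "mounted secretly" part: code running on the cluster can pull storage credentials from a Databricks secret scope instead of having them written into plain config. A sketch of the cluster-side pattern, with made-up scope and key names (`dbutils` and `sc` are provided by the Databricks runtime, not imported):

```python
# Runs on the Databricks cluster. Scope and key names are hypothetical.
access_key = dbutils.secrets.get(scope='dagster', key='aws-access-key-id')
secret_key = dbutils.secrets.get(scope='dagster', key='aws-secret-access-key')

# Hand the credentials to Hadoop's S3A filesystem so Spark can read/write S3.
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', access_key)
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', secret_key)
```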
(Further resolved review threads on dagster_databricks/databricks_step_main.py and configs.py.)
@sd2k you should be able to address the latest build errors by updating snapshots for the failing tests, i.e. with […]
This package is closely modeled on the `dagster_aws.emr` subpackage and provides the `databricks_pyspark_step_launcher` resource and the `DatabricksRunJobSolidDefinition` solid for running Databricks jobs.
Specifically:
- triple single quotes instead of triple double quotes for docstrings
- single quotes instead of double quotes everywhere else
- one-line docstrings where possible; start on the same line everywhere else
- rename `is_terminal` to `has_terminated`
- use `databricks_run_id` instead of `run_id` for clarity
- make `DatabricksJobRunner.client` a property
- remove unnecessary blank lines
…atabricks storage
Sorry for all the failing tests; various rebases onto master caused issues with the snapshots failing! I've fixed all the failures from the latest Buildkite run now 🤞
Adds a `dagster_databricks` package to integrate with Databricks. Works in a similar manner to `dagster_aws.emr`. Closes #2458.
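A rough usage sketch under the same assumptions as the mode example earlier in the thread; the solid body and names are illustrative, not part of the package:

```python
from dagster import ModeDefinition, pipeline, solid
from dagster_databricks import databricks_pyspark_step_launcher
from dagster_pyspark import pyspark_resource

# Same wiring as the mode sketch above.
prod_databricks_mode = ModeDefinition(
    name='prod_databricks',
    resource_defs={
        'pyspark_step_launcher': databricks_pyspark_step_launcher,
        'pyspark': pyspark_resource,
    },
)

@solid(required_resource_keys={'pyspark', 'pyspark_step_launcher'})
def make_df(context):
    # With the step launcher resource present, this body executes remotely on
    # the Databricks cluster rather than in the local process.
    return context.resources.pyspark.spark_session.range(100)

@pipeline(mode_defs=[prod_databricks_mode])
def example_pipeline():
    make_df()
```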