
Add dagster_databricks package for Databricks integration #2468

Merged: 25 commits merged from the dagster-databricks branch into dagster-io:master on Jun 9, 2020

Conversation

@sd2k sd2k commented May 16, 2020

Adds a dagster_databricks package to integrate with Databricks. Works in a similar manner to dagster_aws.emr.

Closes #2458.
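To make the launcher pattern concrete, here is a minimal, self-contained sketch of the external step launcher flow this PR implements: serialize a step, upload it to a DBFS staging path, submit a Databricks run, and wait for it to finish. The client class, method names, and staging path below are invented stand-ins for illustration, not the package's actual API.

```python
# Hypothetical sketch of the external step launcher flow: pickle a step,
# stage it on DBFS, submit a Databricks run, poll until it terminates.
# StubDatabricksClient is an in-memory fake, not the real Databricks API.
import pickle


class StubDatabricksClient:
    """Stands in for a Databricks REST client; completes runs immediately."""

    def __init__(self):
        self.dbfs = {}  # fake DBFS: path -> bytes
        self._next_run_id = 0

    def put_file(self, path, data):
        self.dbfs[path] = data

    def submit_run(self, task):
        self._next_run_id += 1
        return self._next_run_id

    def get_run_state(self, databricks_run_id):
        return "TERMINATED"


def launch_step(client, run_id, step_key, step_payload):
    """Upload a pickled step to a DBFS path and submit it as a Databricks run."""
    path = "/dagster_staging/{}/{}/step.pkl".format(run_id, step_key)
    client.put_file(path, pickle.dumps(step_payload))
    databricks_run_id = client.submit_run({"python_file": path})
    while client.get_run_state(databricks_run_id) != "TERMINATED":
        pass  # a real launcher would sleep and log between polls
    return databricks_run_id
```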

@sd2k sd2k force-pushed the dagster-databricks branch from 994491d to 8e3f607 Compare May 18, 2020 20:16
@sd2k sd2k marked this pull request as ready for review May 18, 2020 20:17
sd2k commented May 18, 2020

I've tidied this up, finished it off (for now, pending a soon-to-be-PR'd dagster-azure package) and tested it a bunch of times on a Databricks cluster I have access to, both saving to S3 and Azure Data Lake Storage. It seems to work pretty well!

There are some slightly hairy parts at the minute around getting libraries installed remotely when using the external step launcher. Normally we could specify libraries in the libraries field of the run config, but because dagster-databricks hasn't been released yet it needs to be uploaded manually. I also ran into some conflicts with nightly versions of the other dagster packages, so I ended up doing most of this work based off 0.7.13 and just rebased onto master for this PR.
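The workaround described above might look roughly like this: released dependencies go into the run's libraries list as PyPI entries (the format the Databricks runs API accepts), while the unreleased dagster-databricks package is referenced as a manually uploaded wheel on DBFS. The helper function and wheel path are illustrative, not part of this PR.

```python
# Illustrative helper: build the `libraries` list for a Databricks run,
# mixing released PyPI packages with a manually uploaded wheel for a
# package (like dagster-databricks here) that has no release yet.
# The DBFS wheel path is hypothetical.

def build_libraries(pypi_packages, uploaded_wheels):
    libraries = [{"pypi": {"package": pkg}} for pkg in pypi_packages]
    libraries += [{"whl": path} for path in uploaded_wheels]
    return libraries


libs = build_libraries(
    pypi_packages=["dagster==0.7.13", "dagster-pyspark==0.7.13"],
    uploaded_wheels=["dbfs:/FileStore/wheels/dagster_databricks-0.0.0-py3-none-any.whl"],
)
```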

There are also some (generally commented out) references to dagster-azure, which I intend to sort out as soon as possible, but I'll need to add a bunch of tests and send another PR first!

@sryza sryza left a comment

This is a really impressive and comprehensive piece of work. Thanks a ton for contributing.

I left a few comments inside, mostly on stylistic stuff. I think the biggest open question I have is on how the storage piece fits in. Have you looked into whether it would make sense to expose the s3/dbfs referenced in the PR as an intermediate store?

prod_mode = ModeDefinition(
    name='prod',
prod_emr_mode = ModeDefinition(
    name='prod_emr',

👍

infile, self._dbfs_path(run_id, step_key, self._main_file_name())
)

if True:

With EMR, we've basically punted this to the user, though I suppose that's more difficult with Databricks because the job is often launching the cluster itself instead of relying on a user to launch and configure it separately.

@natekupp - I'm curious whether you have thoughts on the right thing to do here. I don't love this, but also don't have a better option in mind.
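One possible shape for the configuration question raised here (purely a sketch, not what the PR settled on) is to guard the default library installation behind a boolean in the launcher's config instead of a hard-coded `if True:`. The config key name below is invented for illustration.

```python
# Sketch of one way to avoid a hard-coded `if True:` guard: let the user
# opt out of installing default libraries via launcher config. The
# `install_default_libraries` key is invented for illustration.

DEFAULT_LIBRARIES = ["dagster", "dagster-pyspark"]


def libraries_to_install(config):
    """Merge user-requested libraries with defaults, unless opted out."""
    libraries = list(config.get("libraries", []))
    if config.get("install_default_libraries", True):
        libraries += [lib for lib in DEFAULT_LIBRARIES if lib not in libraries]
    return libraries
```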


sd2k commented May 19, 2020

This is a really impressive and comprehensive piece of work. Thanks a ton for contributing.

Most of this was more or less copied from your work on the EMR subpackage so thanks for that! 🙂

I left a few comments inside, mostly on stylistic stuff. I think the biggest open question I have is on how the storage piece fits in. Have you looked into whether it would make sense to expose the s3/dbfs referenced in the PR as an intermediate store?

I've pushed fixes for the stylistic stuff, thanks for the comments. Let me know if you'd like the changes squashed.

Storage is definitely a concern. I'm not convinced it's worth adding another storage system for DBFS, since Databricks strongly recommend using either S3 or Azure anyway; plus, those systems need to be accessed in different ways depending on whether they're accessed from inside or outside the cluster. I've mentioned the use of an intermediate store in a separate comment above. Note that the launcher does require and use a storage system (the simple_pyspark example uses S3); it just also requires additional credentials in Databricks so the storage can be mounted using secrets.
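The inside/outside distinction mentioned above can be illustrated with a tiny helper: the same S3 object is addressed via a DBFS mount from inside the cluster but via an s3:// URI from outside. The mount point below is an assumption for illustration, not something this PR defines.

```python
# Illustrative only: the same S3 object has two addresses, depending on
# whether code runs inside a Databricks cluster (DBFS mount) or outside
# (plain s3:// URI). The mount point is a hypothetical example.

def storage_uri(bucket, key, inside_cluster, mount_point="/mnt/dagster"):
    if inside_cluster:
        return "dbfs:{}/{}".format(mount_point, key)
    return "s3://{}/{}".format(bucket, key)
```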

@sd2k sd2k force-pushed the dagster-databricks branch from 8f33cb0 to e7aa970 Compare June 4, 2020 13:17

sryza commented Jun 6, 2020

@sd2k you should be able to address the latest build errors by updating snapshots for the failing tests - i.e. with pytest --snapshot-update examples/dagster_examples_tests/graphql_tests/test_examples_presets_graphql.py

sd2k added 17 commits June 8, 2020 09:00
This package is closely modeled off the dagster_aws.emr subpackage and
provides the databricks_pyspark_step_launcher resource and the
DatabricksRunJobSolidDefinition solid for running Databricks jobs.
Specifically:

- triple single quotes instead of triple double quotes for docstrings
- single quotes instead of double quotes everywhere else
- oneline docstrings where possible; start on same line everywhere else
- rename 'is_terminal' to 'has_terminated'
- use 'databricks_run_id' instead of 'run_id' for clarity
- make DatabricksJobRunner.client a property
- remove unnecessary blank lines
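The has_terminated rename in the commits above suggests a polling loop shaped roughly like the sketch below. The fake client and helper names are simplified stand-ins for the real DatabricksJobRunner; the terminal states mirror the Databricks API's TERMINATED, SKIPPED, and INTERNAL_ERROR lifecycle values.

```python
# Simplified stand-in for the polling behaviour implied by the
# `has_terminated` rename: loop on the run's lifecycle state until the
# Databricks run reaches a terminal state.

TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


class FakeRunStates:
    """Yields a fixed sequence of lifecycle states for one run."""

    def __init__(self, states):
        self._states = iter(states)

    def get_state(self, databricks_run_id):
        return next(self._states)


def has_terminated(state):
    return state in TERMINAL_STATES


def wait_for_run(client, databricks_run_id):
    """Poll the run state until it terminates; return final state and poll count."""
    polls = 0
    while True:
        state = client.get_state(databricks_run_id)
        polls += 1
        if has_terminated(state):
            return state, polls
```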
@sd2k sd2k force-pushed the dagster-databricks branch from f9e5b29 to 4b0540d Compare June 8, 2020 08:01
@sd2k sd2k force-pushed the dagster-databricks branch from 4b0540d to 83c4d3b Compare June 8, 2020 15:25
@sd2k sd2k force-pushed the dagster-databricks branch from 83c4d3b to 6c1bfc6 Compare June 8, 2020 16:12

sd2k commented Jun 8, 2020

Sorry for all the failing tests; various rebases onto master caused the snapshots to fail. I've fixed all the failures from the latest Buildkite run now 🤞

@sryza sryza merged commit 19146d4 into dagster-io:master Jun 9, 2020
@sd2k sd2k deleted the dagster-databricks branch June 9, 2020 16:30