
Add dagster_databricks package for Databricks integration #2468

Merged: 25 commits merged from the dagster-databricks branch into dagster-io:master on Jun 9, 2020

Conversation

@sd2k sd2k commented May 16, 2020

Adds a dagster_databricks package to integrate with Databricks. Works in a similar manner to dagster_aws.emr.

Closes #2458.
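To make the launcher pattern concrete, here is a minimal, self-contained sketch of the external step launcher flow this PR implements: serialize a step, upload it to a DBFS staging path, submit a Databricks run, and wait for it to finish. The client class, method names, and staging path below are invented stand-ins for illustration, not the package's actual API.

```python
# Hypothetical sketch of the external step launcher flow: pickle a step,
# stage it on DBFS, submit a Databricks run, poll until it terminates.
# StubDatabricksClient is an in-memory fake, not the real Databricks API.
import pickle


class StubDatabricksClient:
    """Stands in for a Databricks REST client; completes runs immediately."""

    def __init__(self):
        self.dbfs = {}  # fake DBFS: path -> bytes
        self._next_run_id = 0

    def put_file(self, path, data):
        self.dbfs[path] = data

    def submit_run(self, task):
        self._next_run_id += 1
        return self._next_run_id

    def get_run_state(self, databricks_run_id):
        return "TERMINATED"


def launch_step(client, run_id, step_key, step_payload):
    """Upload a pickled step to a DBFS path and submit it as a Databricks run."""
    path = "/dagster_staging/{}/{}/step.pkl".format(run_id, step_key)
    client.put_file(path, pickle.dumps(step_payload))
    databricks_run_id = client.submit_run({"python_file": path})
    while client.get_run_state(databricks_run_id) != "TERMINATED":
        pass  # a real launcher would sleep and log between polls
    return databricks_run_id
```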

@sd2k sd2k force-pushed the dagster-databricks branch from 994491d to 8e3f607 Compare May 18, 2020 20:16
@sd2k sd2k marked this pull request as ready for review May 18, 2020 20:17
sd2k commented May 18, 2020

I've tidied this up, finished it off (for now, pending a soon-to-be-PR'd dagster-azure package) and tested it a bunch of times on a Databricks cluster I have access to, both saving to S3 and Azure Data Lake Storage. It seems to work pretty well!

There are some slightly hairy parts at the minute around getting libraries installed remotely when using the external step launcher. Normally we could specify libraries in the libraries field of the run config, but because dagster-databricks hasn't been released yet it needs to be uploaded manually. I also ran into some conflicts with nightly versions of the other dagster packages, so I ended up doing most of this work based off 0.7.13 and just rebased onto master for this PR.
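The workaround described above might look roughly like this: released dependencies go into the run's libraries list as PyPI entries (the format the Databricks runs API accepts), while the unreleased dagster-databricks package is referenced as a manually uploaded wheel on DBFS. The helper function and wheel path are illustrative, not part of this PR.

```python
# Illustrative helper: build the `libraries` list for a Databricks run,
# mixing released PyPI packages with a manually uploaded wheel for a
# package (like dagster-databricks here) that has no release yet.
# The DBFS wheel path is hypothetical.

def build_libraries(pypi_packages, uploaded_wheels):
    libraries = [{"pypi": {"package": pkg}} for pkg in pypi_packages]
    libraries += [{"whl": path} for path in uploaded_wheels]
    return libraries


libs = build_libraries(
    pypi_packages=["dagster==0.7.13", "dagster-pyspark==0.7.13"],
    uploaded_wheels=["dbfs:/FileStore/wheels/dagster_databricks-0.0.0-py3-none-any.whl"],
)
```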

There are also some (generally commented out) references to dagster-azure, which I intend to sort out as soon as possible, but I'll need to add a bunch of tests and send another PR first!

@sryza sryza left a comment

This is a really impressive and comprehensive piece of work. Thanks a ton for contributing.

I left a few comments inside, mostly on stylistic stuff. I think the biggest open question I have is on how the storage piece fits in. Have you looked into whether it would make sense to expose the s3/dbfs referenced in the PR as an intermediate store?

prod_mode = ModeDefinition(
    name='prod',
prod_emr_mode = ModeDefinition(
    name='prod_emr',

👍

infile, self._dbfs_path(run_id, step_key, self._main_file_name())
)

if True:

With EMR, we've basically punted this to the user, though I suppose that's more difficult with Databricks because the job is often launching the cluster itself instead of relying on a user to launch and configure it separately.

@natekupp - I'm curious whether you have thoughts on the right thing to do here. I don't love this, but also don't have a better option in mind.
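One possible shape for the configuration question raised here (purely a sketch, not what the PR settled on) is to guard the default library installation behind a boolean in the launcher's config instead of a hard-coded `if True:`. The config key name below is invented for illustration.

```python
# Sketch of one way to avoid a hard-coded `if True:` guard: let the user
# opt out of installing default libraries via launcher config. The
# `install_default_libraries` key is invented for illustration.

DEFAULT_LIBRARIES = ["dagster", "dagster-pyspark"]


def libraries_to_install(config):
    """Merge user-requested libraries with defaults, unless opted out."""
    libraries = list(config.get("libraries", []))
    if config.get("install_default_libraries", True):
        libraries += [lib for lib in DEFAULT_LIBRARIES if lib not in libraries]
    return libraries
```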


sd2k commented May 19, 2020

This is a really impressive and comprehensive piece of work. Thanks a ton for contributing.

Most of this was more or less copied from your work on the EMR subpackage so thanks for that! 🙂

I left a few comments inside, mostly on stylistic stuff. I think the biggest open question I have is on how the storage piece fits in. Have you looked into whether it would make sense to expose the s3/dbfs referenced in the PR as an intermediate store?

I've pushed fixes for the stylistic stuff, thanks for the comments. Let me know if you'd like the changes squashed.

Storage is definitely a concern. I'm not convinced it's worth adding another storage system for DBFS, since Databricks strongly recommend using either S3 or Azure anyway; plus, those systems need to be accessed in different ways depending on whether they're accessed from inside or outside the cluster. I've mentioned the use of an intermediate store in a separate comment above. Note that the launcher does require and use a storage system (the simple_pyspark example uses S3); it just also requires additional credentials in Databricks so the storage can be mounted using secrets.
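The inside/outside distinction mentioned above can be illustrated with a tiny helper: the same S3 object is addressed via a DBFS mount from inside the cluster but via an s3:// URI from outside. The mount point below is an assumption for illustration, not something this PR defines.

```python
# Illustrative only: the same S3 object has two addresses, depending on
# whether code runs inside a Databricks cluster (DBFS mount) or outside
# (plain s3:// URI). The mount point is a hypothetical example.

def storage_uri(bucket, key, inside_cluster, mount_point="/mnt/dagster"):
    if inside_cluster:
        return "dbfs:{}/{}".format(mount_point, key)
    return "s3://{}/{}".format(bucket, key)
```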

@sd2k sd2k force-pushed the dagster-databricks branch from 8f33cb0 to e7aa970 Compare June 4, 2020 13:17

sryza commented Jun 6, 2020

@sd2k you should be able to address the latest build errors by updating snapshots for the failing tests - i.e. with pytest --snapshot-update examples/dagster_examples_tests/graphql_tests/test_examples_presets_graphql.py

sd2k added 17 commits June 8, 2020 09:00
This package is closely modeled off the dagster_aws.emr subpackage and
provides the databricks_pyspark_step_launcher resource and the
DatabricksRunJobSolidDefinition solid for running Databricks jobs.
Specifically:

- triple single quotes instead of triple double quotes for docstrings
- single quotes instead of double quotes everywhere else
- oneline docstrings where possible; start on same line everywhere else
- rename 'is_terminal' to 'has_terminated'
- use 'databricks_run_id' instead of 'run_id' for clarity
- make DatabricksJobRunner.client a property
- remove unnecessary blank lines
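The has_terminated rename in the commits above suggests a polling loop shaped roughly like the sketch below. The fake client and helper names are simplified stand-ins for the real DatabricksJobRunner; the terminal states mirror the Databricks API's TERMINATED, SKIPPED, and INTERNAL_ERROR lifecycle values.

```python
# Simplified stand-in for the polling behaviour implied by the
# `has_terminated` rename: loop on the run's lifecycle state until the
# Databricks run reaches a terminal state.

TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


class FakeRunStates:
    """Yields a fixed sequence of lifecycle states for one run."""

    def __init__(self, states):
        self._states = iter(states)

    def get_state(self, databricks_run_id):
        return next(self._states)


def has_terminated(state):
    return state in TERMINAL_STATES


def wait_for_run(client, databricks_run_id):
    """Poll the run state until it terminates; return final state and poll count."""
    polls = 0
    while True:
        state = client.get_state(databricks_run_id)
        polls += 1
        if has_terminated(state):
            return state, polls
```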
@sd2k sd2k force-pushed the dagster-databricks branch from f9e5b29 to 4b0540d Compare June 8, 2020 08:01
@sd2k sd2k force-pushed the dagster-databricks branch from 4b0540d to 83c4d3b Compare June 8, 2020 15:25
@sd2k sd2k force-pushed the dagster-databricks branch from 83c4d3b to 6c1bfc6 Compare June 8, 2020 16:12

sd2k commented Jun 8, 2020

Sorry for all the failing tests; various rebases onto master caused the snapshots to fail. I've fixed all the failures from the latest Buildkite run now 🤞

@sryza sryza merged commit 19146d4 into dagster-io:master Jun 9, 2020
@sd2k sd2k deleted the dagster-databricks branch June 9, 2020 16:30