Databricks integration #2458
Comments
yes please! that would be fantastic
OK cool. I've made a start on this and have an external step launcher which is working, but I think I may need to add a new system storage for DBFS (the Databricks filesystem) at the very least - I'm not using AWS, so I can't rely on the S3 storage unfortunately. I'll also have to add an Azure storage system later, but that can be a separate PR :) I've noticed something weird happening when trying to unpickle the events pickled by my step launcher.
I've worked around it by using …
Wow, that's really strange. I've never seen that behavior before. Have you been able to get a repro in a more controlled environment? Or does it only happen when you pickle in the remote execution environment, which differs from your local environment?
Works as expected.
Ah, it looks like it happens as soon as PySpark is imported; if you add an `import pyspark` to the repro, the unpickling breaks.
A bit of digging implicates this: https://jira.apache.org/jira/browse/SPARK-22674. I can't tell how the remote EMR executor works, since that must also import pyspark, but that seems to be the issue.
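Here's a minimal sketch of the behaviour, assuming a PySpark version affected by SPARK-22674; the class names are made up for illustration and aren't Dagster's actual event classes:

```python
# Minimal repro sketch of the namedtuple pickling issue (SPARK-22674).
# Assumes an affected PySpark version is installed; class names are illustrative.
import collections
import pickle

Event = collections.namedtuple("Event", ["kind", "payload"])

class MyEvent(Event):
    """A namedtuple subclass, similar in shape to event classes built on namedtuple."""

print(type(pickle.loads(pickle.dumps(MyEvent("start", {})))))  # <class '__main__.MyEvent'>

# Importing pyspark monkeypatches collections.namedtuple (and namedtuples already
# defined in __main__) so that instances pickle via pyspark.serializers._restore.
import pyspark  # noqa: F401

Event2 = collections.namedtuple("Event2", ["kind", "payload"])

class MyEvent2(Event2):
    pass

restored = pickle.loads(pickle.dumps(MyEvent2("start", {})))
# With the hijacked __reduce__, the subclass is lost on the round-trip: `restored` is
# a plain Event2-style namedtuple, and unpickling in a process that doesn't have
# pyspark installed fails outright because the pickle references pyspark.serializers.
print(type(restored))
```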
That pyspark bug is wild. Can you just use dagster.serdes instead?
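Something along these lines, going from memory of the serdes API at this point, so double-check the exact function names against the dagster source:

```python
# Hedged sketch: serializing events to JSON with dagster.serdes instead of pickle,
# so the pyspark namedtuple hijack never comes into play. The function names here
# (serialize_dagster_namedtuple / deserialize_json_to_dagster_namedtuple) are an
# assumption based on the serdes module of this era -- verify against the source.
from dagster.serdes import (
    deserialize_json_to_dagster_namedtuple,
    serialize_dagster_namedtuple,
)

def write_events(events, path):
    # One JSON document per line; events are serdes-whitelisted namedtuples.
    with open(path, "w") as f:
        for event in events:
            f.write(serialize_dagster_namedtuple(event) + "\n")

def read_events(path):
    with open(path) as f:
        return [deserialize_json_to_dagster_namedtuple(line) for line in f if line.strip()]
```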
hey @sd2k - let me know when you start looking at implementing the system storage for DBFS, happy to help. You can check out this diff for an example of what's needed to add a new system storage: https://dagster.phacility.com/D2259
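For context, the low-level piece such a storage would presumably wrap is the documented DBFS REST API; a minimal sketch (host and token values are placeholders, and this isn't the Dagster-side plumbing from the diff above):

```python
# Minimal sketch of reading/writing files via the Databricks DBFS REST API.
# Host and token are placeholders. Note /dbfs/put accepts up to ~1MB of contents;
# larger files need the create/add-block/close streaming endpoints instead.
import base64
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def dbfs_put(path: str, data: bytes) -> None:
    resp = requests.post(
        f"{HOST}/api/2.0/dbfs/put",
        headers=HEADERS,
        json={"path": path, "contents": base64.b64encode(data).decode(), "overwrite": True},
    )
    resp.raise_for_status()

def dbfs_read(path: str, length: int = 1024 * 1024) -> bytes:
    resp = requests.get(
        f"{HOST}/api/2.0/dbfs/read",
        headers=HEADERS,
        params={"path": path, "offset": 0, "length": length},
    )
    resp.raise_for_status()
    return base64.b64decode(resp.json()["data"])
```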
Yep I can do, works fine.
Cheers! I'll take a look today and let you know if I have any questions.
Hmm, it does look like a much bigger undertaking than I expected 😄 and I'm a bit unsure whether this is the best way to proceed! The workflow I think Databricks recommends is to not use the DBFS root, preferring instead to either mount an object storage account or access the object store directly (e.g. S3 or Azure). Either way, some kind of config needs to run prior to the pipeline run to mount the object store or set credentials for API calls. Mounting the object store lets you access it via DBFS using both Spark APIs and local filesystem APIs, which is convenient for interactive/notebook use but not much different when running jobs. I get the feeling that a dedicated DBFS system storage might not add much over using the object store directly, then. If that sounds reasonable then I need to work on an Azure storage system!
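For reference, the kind of one-off mount config I mean looks roughly like this; it runs inside a Databricks notebook or job (where `dbutils` is available), and the account, container, and secret scope names are placeholders:

```python
# One-off setup run on the Databricks side: mount an Azure Blob Storage container so
# it's reachable at /mnt/... via DBFS. Account/container/scope names are placeholders.
storage_account = "<storage-account>"
container = "<container>"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="<secret-scope>", key="storage-account-key"),
    },
)

# After mounting, both Spark and local filesystem APIs can see the data:
# spark.read.parquet(f"/mnt/{container}/some/table")
# open(f"/dbfs/mnt/{container}/some/file.txt")
```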
Another question is how to handle 'Delta Lake' storage. For my purposes the basic idea is that Spark DataFrames would be saved using the Delta format. I also need to look into the …
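Concretely, the Delta write/read I have in mind is just the standard Delta Lake DataFrame API (the path below is a placeholder mount path, and this assumes a cluster with the Delta runtime, e.g. Databricks):

```python
# Sketch of writing/reading an intermediate DataFrame as a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/mnt/<container>/intermediates/my_solid_output"  # placeholder path

df = spark.range(10).withColumnRenamed("id", "value")

# Write the DataFrame in Delta format, overwriting any existing version.
df.write.format("delta").mode("overwrite").save(path)

# Read it back as a DataFrame.
restored = spark.read.format("delta").load(path)
restored.show()
```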
I've been following Dagster for a month or so as we're looking to revamp our data pipelines at $company. We'll be using Spark for the majority of our ETLs, but using Databricks to manage our infrastructure rather than manually managing Spark clusters.
I noticed that there was an EMR launcher for launching PySpark solids added a week or so ago. I haven't used EMR much, but I think Databricks has a similar workflow where jobs are submitted using their Jobs API (which I've used in the past through Airflow).
Does this sound like a good candidate for a dagster-databricks integration library? And if so, are there any plans to support Databricks already, or would you be accepting contributions there? Thanks!
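For reference, the Jobs API call I have in mind is the one-off runs/submit endpoint; a rough sketch, where the host, token, runtime/node types, and script path are all placeholders:

```python
# Sketch of submitting a one-off run via the Databricks Jobs API
# (POST /api/2.0/jobs/runs/submit). All identifiers below are placeholders/examples.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

payload = {
    "run_name": "example step",
    "new_cluster": {
        "spark_version": "6.4.x-scala2.11",  # example runtime version
        "node_type_id": "Standard_DS3_v2",   # example node type
        "num_workers": 2,
    },
    "spark_python_task": {
        "python_file": "dbfs:/scripts/run_step.py",  # placeholder entry point
        "parameters": ["--my-arg", "value"],         # placeholder arguments
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
run_id = resp.json()["run_id"]

# Poll /api/2.0/jobs/runs/get?run_id=... to track the run's life-cycle state.
print("submitted run", run_id)
```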