-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for loading dbt project from cloud store using Airflow Object Store #1148
Conversation
✅ Deploy Preview for sunny-pastelito-5ecb04 canceled.
|
✅ Deploy Preview for sunny-pastelito-5ecb04 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
"dbt_project.yml": Path(project_yml_path) if project_yml_path else None, | ||
"models directory ": Path(self.models_path) if self.models_path else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly speaking I could not understand this even after reading the comment. both the self.dbt_project_path and self.models_path are already Path as specified in the init. I just reverted to its previous state
[skip ci] simplify code
@@ -36,12 +36,8 @@ def test_init_with_manifest_path_and_project_path_succeeds(): | |||
project_name in this case should be based on dbt_project_path | |||
""" | |||
project_config = ProjectConfig(dbt_project_path="/tmp/some-path", manifest_path="target/manifest.json") | |||
if AIRFLOW_IO_AVAILABLE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is actually wrong since AIRFLOW_IO_AVAILABLE just means we are using airflow >=2.8.0, and not that we are using object storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@corsettigyg wasn't object storage introduced in Airflow 2.8?
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/objectstorage.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tatiana indeed, although using airflow 2.8 does not exactly mean we are using object storage, which was the assumption of this test, so I just fixed it.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1148 +/- ##
=======================================
Coverage 96.53% 96.53%
=======================================
Files 64 64
Lines 3374 3376 +2
=======================================
+ Hits 3257 3259 +2
Misses 117 117 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CorsettiS This is a very promising change, thank you!
Some questions:
- How big are the dbt projects you've tested this feature with?
- I'm slightly concerned about performance, since we'd be running this during DAG processing/parsing time very regularly usually in the scheduler - depending on the customer deployment. This may not scale for bigger dbt projects or for situations where users have many Cosmos
DbtDag
s. Did you run any performance tests? Could you share some numbers / how you validated this feature? - Following (2), by syncing the whole dbt project from object store, we would potentially not benefit from the caching introduced in Cosmos 1.5 in PR Speed up
LoadMode.DBT_LS
by caching dbt ls output in Airflow Variable #1014. Do you have any proposals of how we could avoid performing this action every time? - Would using
LoadMethod.MANIFEST
+ExecutionMode.KUBERNETES
be an option for your company? This should allow the decoupling you are aiming for. Or is there any particular reason why you'd needLoadMethod.DBT_LS
andLoadMethod.LOCAL
?
It seems unlikely we'll be able to release this in 1.6, given the number of things we're still trying to address before the release - but let's see!
in the first scenario, I loaded both the project and the manifest using a local path and rendered the DbtDag with LoadMode.DBT_MANIFEST avg time: 2605 miliseconds In the second scenario, both the project and the manifest are fetched from S3 and the project parsed with LoadMode.DBT_MANIFEST avg time: 3975 miliseconds So it took roughly 52% longer to parse a dbt dag that is fully stored in a cloud provider. Of course that performance would degrade for bigger repos or show better results in case of a better internet connection. Personally speaking, I do not find it a huge time increase considering we can benefit from all features developed for a local dbt project.
|
I have decided to close the PR for now because I found a severe flaw when I was extending the functionality of this PR to the task execution, which is that each singular task will need to list and fetch all artifacts from the repo in s3, which is going to balloon costs in an unexpected way. I will work with it on the side and once I have a good alternative for the problem I will re-create the PR |
Thank you very much, @CorsettiS , for the additional information and context! We're looking forward to seeing your follow up PRs |
Description
This PR is based on #1109 and intends to allow users to load entire dbt projects that are stored at a cloud provider. That way, it is totally possible to decouple dbt and airflow with cosmos.
As of now, the company I currently work at does not have dbt & airflow in the same repo, but by allowing cosmos to fetch entire dbt projects from a cloud provider we can work around that easily. dbt projects are not usually very heavy, so performance should not be drastically impacted (it was not in the local tests I have done).
Would be good if this could be included in the release 1.6.0 #1080
Breaking Change?
No
Checklist