[Feature] DbtCompileOperator + allow partial_parse.msgpack to be read from S3/GCS/Azure Blob Storage. #870

Open
dwreeves opened this issue Mar 1, 2024 · 4 comments
Labels
  • area:parsing: Related to parsing DAG/DBT improvement, issues, or fixes
  • dbt:compile: Primarily related to dbt compile command or functionality
  • do-not-stale: Related to stale job and dosubot
  • execution:local: Related to Local execution environment
  • parsing:custom: Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc.
  • triage-needed: Items need to be reviewed / assigned to milestone
Comments

dwreeves (Collaborator) commented Mar 1, 2024

Our CI/CD uses a different config than Cosmos does, and the difference in the profile is close to unavoidable, which means partial parsing doesn't actually work:

Unable to do partial parsing because config vars, config profile, or config target have changed
Unable to do partial parsing because profile has changed
Unable to do partial parsing because env vars used in profiles.yml have changed

I think the only way to reasonably get this to work is to have a DbtCompileOperator, and then read the partial_parse.msgpack from S3.

The output directory should ideally be templated, so that you can use ti.xcom_pull() or otherwise avoid any "clashing" across multiple simultaneous DAG runs. (Then, at the end of each DAG run, the user can run e.g. S3DeleteObjectsOperator() to clean things up at their leisure.) A rough sketch of the flow follows.
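For illustration only, here is what that flow might look like using stock operators, since the proposed DbtCompileOperator does not exist yet. The project path, bucket name, and key layout are made-up assumptions; the Amazon provider operators are real.

```python
# A sketch of the proposed flow. The proposed DbtCompileOperator would
# encapsulate the compile + upload steps shown here with stock operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.s3 import S3DeleteObjectsOperator
from airflow.providers.amazon.aws.transfers.local_to_s3 import LocalFilesystemToS3Operator

with DAG("dbt_compile_sketch", start_date=datetime(2024, 3, 1), schedule=None) as dag:
    # Compile once so dbt writes target/partial_parse.msgpack.
    compile_project = BashOperator(
        task_id="dbt_compile",
        bash_command="cd /opt/dbt/my_project && dbt compile",
    )

    # Upload the artifact under a run-scoped key so simultaneous DAG runs don't clash.
    upload_partial_parse = LocalFilesystemToS3Operator(
        task_id="upload_partial_parse",
        filename="/opt/dbt/my_project/target/partial_parse.msgpack",
        dest_key="dbt/{{ run_id }}/partial_parse.msgpack",
        dest_bucket="my-bucket",
        replace=True,
    )

    # ... Cosmos tasks would download and reuse the artifact here ...

    # Clean up the run-scoped artifacts at the end of the run.
    cleanup = S3DeleteObjectsOperator(
        task_id="cleanup_artifacts",
        bucket="my-bucket",
        prefix="dbt/{{ run_id }}/",
    )

    compile_project >> upload_partial_parse >> cleanup
```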


dosubot bot commented Mar 1, 2024

It seems like you've got this under control; if you want help or have specific questions, let me know what I can do for you!


dosubot added the area:parsing, dbt:compile, execution:local, and parsing:custom labels on Mar 1, 2024
yusufshalaby commented Apr 7, 2024

This would be a game changer. It's similar to Dagster's implementation, where models are split into different data assets but a full dbt run only requires a single dbt parse at the beginning. This would unlock that functionality in Airflow, where we could run each model as a separate Airflow task that reads from the same cached manifest, rather than forcing each task to perform a dbt parse and incur a high CPU cost. Previous solutions involved building the manifest during CI; however, this prevents users from injecting any vars at runtime.

I've given some thought to writing an Airflow task that parses the dbt project and passes the manifest to each Cosmos dbt task, but it's tough when you're using the Celery or Kubernetes executor with workers spread across multiple machines. (A sketch of the idea follows.)
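A minimal sketch of that parse-once idea, assuming artifacts are shared through S3 rather than plain XComs (which won't carry files between workers on different machines). The project directory, bucket, and key layout below are illustrative assumptions, not anything Cosmos provides.

```python
# Parse once, upload the manifest, and let downstream tasks pull it from S3
# instead of each re-parsing the project.
import subprocess

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

PROJECT_DIR = "/opt/dbt/my_project"  # assumed location on the worker

@task
def parse_once(run_id: str = "") -> str:
    """Run `dbt parse` a single time and upload the manifest for reuse."""
    subprocess.run(["dbt", "parse", "--project-dir", PROJECT_DIR], check=True)
    key = f"dbt/{run_id}/manifest.json"
    S3Hook().load_file(
        filename=f"{PROJECT_DIR}/target/manifest.json",
        key=key,
        bucket_name="my-bucket",
        replace=True,
    )
    # Each downstream Cosmos task downloads this key instead of re-parsing.
    return key
```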

tatiana added this to the 1.5.0 milestone May 17, 2024
tatiana added the triage-needed label May 17, 2024
tatiana modified the milestones: 1.5.0, 1.6.0 May 17, 2024
dwreeves (Collaborator, Author) commented May 17, 2024

I've somewhat caught up on how partial parsing is implemented in 1.4.0, and I'm thinking this really should be two issues:

  1. Support both reading dbt artifacts from and writing them to S3/GCS/Azure Blob Storage post-execution.
  2. Support DagRun-level parameterization of the target path, to more elegantly support e.g. dbt_vars={"logical_date": "{{ logical_date }}"}. (A sketch of both ideas follows this list.)
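To make the two sub-issues concrete, a hedged sketch: DbtDag, ProjectConfig, and ProfileConfig are real Cosmos APIs, but the commented-out remote_target_path setting is the proposal here, not an existing feature, and all paths, profile names, and bucket names are illustrative.

```python
# Sketch of the two proposed capabilities on top of Cosmos's existing API.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dag = DbtDag(
    dag_id="dbt_run_scoped_target",
    start_date=datetime(2024, 5, 17),
    schedule=None,
    project_config=ProjectConfig(
        dbt_project_path="/opt/dbt/my_project",
        # (2) DagRun-level parameterization: the proposal is that vars like
        # this would render per DAG run.
        dbt_vars={"logical_date": "{{ logical_date }}"},
    ),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="prod",
        profiles_yml_filepath="/opt/dbt/my_project/profiles.yml",
    ),
    # (1) Proposed, not implemented: a run-scoped remote target path so that
    # artifacts from simultaneous runs never clash and can be read back from
    # S3/GCS/Azure Blob Storage.
    # remote_target_path="s3://my-bucket/dbt/{{ run_id }}/target/",
)
```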

Will have to think a little more about this. I know there are use cases for dbt artifacts post-execution beyond just reusing partial_parse.msgpack for subsequent runs, and I want to make sure we support that, or at least design in a direction that will support it, since the caching-directory approach taken in 1.4.0 works well enough as-is for many simpler use cases.

There is a bit of overlap between the writing part and what the dbt docs operator already does, which is both a blessing and a curse: a blessing because we already have an API (we don't need to reinvent the wheel), but a curse because we already have an API (we don't want to create two different pathways for interacting with S3/Azure/GCS). Unfortunately, exactly mimicking the pattern used by the dbt docs operators, with something like DbtLocalRunS3Operator, would explode the operator space 4x or 5x, and that's untenable.

dwreeves mentioned this issue May 17, 2024
tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.7.0 Jul 5, 2024
tatiana modified the milestones: Cosmos 1.7.0, Triage Sep 20, 2024
dosubot bot commented Dec 20, 2024

Hi, @dwreeves. I'm Dosu, and I'm helping the Cosmos team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You proposed adding a DbtCompileOperator to handle partial_parse.msgpack files from cloud storage, addressing CI/CD and Cosmos configuration mismatches.
  • @yusufshalaby supported the idea, noting its potential to optimize Airflow tasks by reducing CPU costs.
  • You suggested splitting the issue into two: one for cloud storage interactions and another for parameterizing the target path at the DagRun level.
  • The discussion included concerns about overlapping with existing dbt docs operators and avoiding redundant cloud interaction pathways.

Next Steps:

  • Please let us know if this issue is still relevant to the latest version of the Cosmos repository. If so, you can keep the discussion open by commenting on the issue.
  • Otherwise, the issue will be automatically closed in 2400 days.

Thank you for your understanding and contribution!

dosubot added the stale label Dec 20, 2024
pankajastro added the do-not-stale label and removed the stale label Dec 23, 2024