DBT_MANIFEST mode does not use manifest for task execution #791

seeholza · 2024-01-09T16:12:34Z

I am using DBT_MANIFEST parsing mode, and seeing quite a bit of time being spent in each task on parsing the whole project, spending around 1/3 of the task total time on a full parse:

[2024-01-09, 16:51:27 CET] {subprocess.py:94} INFO - �[0m15:51:27  Unable to do partial parsing because saved manifest not found. Starting full parse.
[2024-01-09, 16:51:30 CET] {subprocess.py:94} INFO - �[0m15:51:30  Found 38 models, 6 tests, 20 sources, 0 exposures, 0 metrics, 532 macros, 0 groups, 0 semantic models

I see the command being executed for each task is along the lines of

[2024-01-09, 16:51:23 CET] {subprocess.py:73} INFO - Running command: ['/opt/***/dbt_venv/bin/dbt', 'run', '--models', 'mart_facts_samples', '--profiles-dir', '/tmp/tmpnhr57up0', '--profile', 'default', '--target', 'prod']

I believe that providing the target-dir with the manifest.json in the command execution would probably allow skipping the re-parsing of the project for each task.

Related to this:

Evaluate alternatives to rendering dbt projects other than during DAG parsing time #520: this is at DAG parsing time, but not at task execution time
Option to copy partial_parse.msgpack or target/ dir into tmpdir. #785: is probably a more generic API for this -- i think for DBT_MANIFEST mode, this could be solved in an easier way by just using the existing manifest?

I am fairly new to cosmos, so I would be grateful for some pointers in the right direction.

The text was updated successfully, but these errors were encountered:

dosubot · 2024-01-09T16:12:52Z

🤖

It's great to see your interest in improving Cosmos! Your observation about using the target-dir with the manifest.json in the command execution to potentially skip the re-parsing of the project for each task sounds like a promising approach.

I would encourage you to go ahead and create a pull request with your proposed solution. The maintainers will review it and provide feedback. Your contribution will not only benefit the community but also help enhance the functionality of Cosmos for everyone. Good luck, and thank you for considering to contribute!

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dwreeves · 2024-01-10T14:53:52Z

I think this is essentially the same as #785; I think the runtime only uses the manifest if it detects no code changes, which it can only do with the msgpack.

Let me know if you'd like to coordinate adding this change as I was going to get to it this weekend, but we could work together on it. The solution is if an existing target dir is found that the whole target dir should be moved into the tmpdir for the runtime.

seeholza · 2024-01-15T10:34:56Z

Hi @dwreeves ,

Next to the job I don't have much time to work on this, but i am happy to help if i manage.

I would probably start by adding support for supplying the msgpack in dbt-manifest mode, with a more generic API change in mind after. But we would need agree on the final design of how msgpack/manifest is copied to the tmp directories.

Do you have a branch already? I will close this issue in favor of #785 and we continue discussing there then.

dwreeves · 2024-01-15T19:28:09Z

Hi @Flinz,

I opened a PR, yes: #800.

The way I implemented it is entirely agnostic to LoadMode.DBT_MANIFEST. Basically, by default, it will always check for the existence of the partial parse file before parsing when running a command locally. So any user who is running dbt compile in their CICD, whether they're using manifest or LS, will be fine. (a small subset of DBT_LS users may mysteriously see their performance improve in the next release after this is implemented!).

Also, for what it is worth, I did test that dbt never uses manifest.json for input purposes, only outputs. So when we say "use manifest for task execution" it's never manifest.json, only ever partial_parse.msgpack.

You raise a good point that #785 is actually two issues in one (one is the msgpack thing, the other is about execution of projects using artifacts from cloud storage). I made a note of that in #800 and I do believe that these should be separated.

## Description dbt uses `partial_parse.msgpack` to make rendering things a lot faster. This PR adds support for `partial_parse.msgpack` in the following places: - `ExecutionMode.LOCAL` - `ExecutionMode.VIRTUALENV` - `LoadMode.DBT_LS` This PR also allows users to explicitly _turn off_ partial parsing. If this is done, then `--no-partial-parse` will be passed as an explicit flag in all dbt command invocations (i.e. all `ExecutionMode`s and `LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.) This should address some performance complaints that users have, e.g. this message from Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829 Albeit, this user will also need to provide a `partial_parse.msgpack`. My experimentation and looking at dbt-core source code confirms that dbt does not use `manifest.json` when partial parsing. It appears that these are just output artifacts, but not input artifacts. Only `partial_parse.msgpack` is used. (There are a couple ways to confirm this other than just checking source code Also, I added a minor refactor of a feature I added a year ago: I added `send_sigint()` to the custom subprocess hook, since this custom subprocess hook did not exist back when I added it (if you want me to split this refactor into a different PR then let me know). ### API change I decided the best way to go about this would be to just add a `partial_parse: bool` to both the execution config and render config. For example: ```python dbt_group = DbtTaskGroup( ..., execution_config=ExecutionConfig( ..., partial_parse=True ), render_config=RenderConfig( ..., partial_parse=False ) ) ``` That said, in all honesty users will not need to set this at all, except unless they want to suppress the little warning message about not being able to find the `partial_parse.msgpack`. This is because by default this addition searches for a msgpack if it exists, which is already the existing behavior in a sense, except right now the msgpack file never exists (dbt does look for it though). When inserting into the `AbstractDbtBaseOperator`, I did not use `global_boolean_flags`. See the subsection below for why. ### Other execution performance improvements The main reason I am adding this feature is that it should dramatically improve performance for users. However, it is not the only way to improve It's possible that we should add a way to add the flag `--no-write-json` as an explicit kwarg to the dbt base operator. Right now users can support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users (e.g. those using Elementary, or those using the dbt local operator `callback` kwarg) will want to write the JSON, but I suspect the majority of users will not. Similarly, `--log-level-file` is not used at all, and at minimum dbt should work best the vast majority of time with `--no-cache-selected-only` set. It's also possible there should be some sort of "performant" mode that automatically sets all these defaults for optimal performance: - `--no-write-json` - `--log-level-file=none` - `--no-cache-selected-only` Perhaps a "performant" config would be too cumbersome to implement (I would agree with that). In which case the docs could also have a section on performance tips. ### A note on `global_boolean_flags` I did not add the partial parse support to `global_boolean_flags` because it doesn't quite fit into the existing paradigm for this. Right now the default for each of these `global_boolean_flags` is False, whereas the default for partial parse is actually True. This makes fitting it in awkward. I think it's possible that just having a `tuple[str]` is insufficient here, as some flags you may want to add (not just `--no-partial-parse` but also `--no-write-json` are by default _True_ and must be explicitly turned off. Meaning that the parsing Cosmos does with flags of `'--{flag.replace("_", "-")}'` is ineffective for flags like this. Right now, we have an example of putting _no_ in front with `no_version_check`. Meaning that the default behavior of version checking is True, but the flag itself starts as negated so the default is actually `False`. My proposal is, instead of `global_boolean_flags: tuple[str]`, this should instead be `global_boolean_flags: tuple[str | tuple[str, str | None, str | None]]`. In the case that a global flag is a `tuple[str, str | None, str | None]`, then the first arg should be the flag, the second should be "if true," and the third should be "if false." `None` indicates, when true/false (respectively), then do nothing. For example: ```python class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta): ... global_boolean_flags = ( ("no_version_check", "--no-version-check", None), ("cache_selected_only", "-cache-selected-only", None), ("partial_parse", None, "--no-partial-parse"), ) ``` And Cosmos want to support `str` parsing for backwards compatibility. It's pretty straightforward to convert the data type: ```python if isinstance(flag, str): flag = (flag, '--{flag.replace("_", "-")}', None) ``` ## Related Issue(s) - Resolves #791 - Partially resolves #785 - #785 should probably be split up into two different stages: (1) support for partial parsing (2) (a) dbt project dir / manifest / `partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt compile` is able to dump into cloud storage. ## Breaking Change? Should not break anything. This doesn't do anything when `partial_parse.msgpack` is missing, and the default behavior (`partial_parse=True`) does not alter the dbt cmd flags. ## Checklist - [x] I have made corresponding changes to the documentation (if required) - [x] I have added tests that prove my fix is effective or that my feature works --------- Co-authored-by: Tatiana Al-Chueyr <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Julian LaNeve <[email protected]> Co-authored-by: Justin Bandoro <[email protected]>

## Description dbt uses `partial_parse.msgpack` to make rendering things a lot faster. This PR adds support for `partial_parse.msgpack` in the following places: - `ExecutionMode.LOCAL` - `ExecutionMode.VIRTUALENV` - `LoadMode.DBT_LS` This PR also allows users to explicitly _turn off_ partial parsing. If this is done, then `--no-partial-parse` will be passed as an explicit flag in all dbt command invocations (i.e. all `ExecutionMode`s and `LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.) This should address some performance complaints that users have, e.g. this message from Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829 Albeit, this user will also need to provide a `partial_parse.msgpack`. My experimentation and looking at dbt-core source code confirms that dbt does not use `manifest.json` when partial parsing. It appears that these are just output artifacts, but not input artifacts. Only `partial_parse.msgpack` is used. (There are a couple ways to confirm this other than just checking source code Also, I added a minor refactor of a feature I added a year ago: I added `send_sigint()` to the custom subprocess hook, since this custom subprocess hook did not exist back when I added it (if you want me to split this refactor into a different PR then let me know). ### API change I decided the best way to go about this would be to just add a `partial_parse: bool` to both the execution config and render config. For example: ```python dbt_group = DbtTaskGroup( ..., execution_config=ExecutionConfig( ..., partial_parse=True ), render_config=RenderConfig( ..., partial_parse=False ) ) ``` That said, in all honesty users will not need to set this at all, except unless they want to suppress the little warning message about not being able to find the `partial_parse.msgpack`. This is because by default this addition searches for a msgpack if it exists, which is already the existing behavior in a sense, except right now the msgpack file never exists (dbt does look for it though). When inserting into the `AbstractDbtBaseOperator`, I did not use `global_boolean_flags`. See the subsection below for why. ### Other execution performance improvements The main reason I am adding this feature is that it should dramatically improve performance for users. However, it is not the only way to improve It's possible that we should add a way to add the flag `--no-write-json` as an explicit kwarg to the dbt base operator. Right now users can support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users (e.g. those using Elementary, or those using the dbt local operator `callback` kwarg) will want to write the JSON, but I suspect the majority of users will not. Similarly, `--log-level-file` is not used at all, and at minimum dbt should work best the vast majority of time with `--no-cache-selected-only` set. It's also possible there should be some sort of "performant" mode that automatically sets all these defaults for optimal performance: - `--no-write-json` - `--log-level-file=none` - `--no-cache-selected-only` Perhaps a "performant" config would be too cumbersome to implement (I would agree with that). In which case the docs could also have a section on performance tips. ### A note on `global_boolean_flags` I did not add the partial parse support to `global_boolean_flags` because it doesn't quite fit into the existing paradigm for this. Right now the default for each of these `global_boolean_flags` is False, whereas the default for partial parse is actually True. This makes fitting it in awkward. I think it's possible that just having a `tuple[str]` is insufficient here, as some flags you may want to add (not just `--no-partial-parse` but also `--no-write-json` are by default _True_ and must be explicitly turned off. Meaning that the parsing Cosmos does with flags of `'--{flag.replace("_", "-")}'` is ineffective for flags like this. Right now, we have an example of putting _no_ in front with `no_version_check`. Meaning that the default behavior of version checking is True, but the flag itself starts as negated so the default is actually `False`. My proposal is, instead of `global_boolean_flags: tuple[str]`, this should instead be `global_boolean_flags: tuple[str | tuple[str, str | None, str | None]]`. In the case that a global flag is a `tuple[str, str | None, str | None]`, then the first arg should be the flag, the second should be "if true," and the third should be "if false." `None` indicates, when true/false (respectively), then do nothing. For example: ```python class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta): ... global_boolean_flags = ( ("no_version_check", "--no-version-check", None), ("cache_selected_only", "-cache-selected-only", None), ("partial_parse", None, "--no-partial-parse"), ) ``` And Cosmos want to support `str` parsing for backwards compatibility. It's pretty straightforward to convert the data type: ```python if isinstance(flag, str): flag = (flag, '--{flag.replace("_", "-")}', None) ``` ## Related Issue(s) - Resolves astronomer#791 - Partially resolves astronomer#785 - astronomer#785 should probably be split up into two different stages: (1) support for partial parsing (2) (a) dbt project dir / manifest / `partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt compile` is able to dump into cloud storage. ## Breaking Change? Should not break anything. This doesn't do anything when `partial_parse.msgpack` is missing, and the default behavior (`partial_parse=True`) does not alter the dbt cmd flags. ## Checklist - [x] I have made corresponding changes to the documentation (if required) - [x] I have added tests that prove my fix is effective or that my feature works --------- Co-authored-by: Tatiana Al-Chueyr <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Julian LaNeve <[email protected]> Co-authored-by: Justin Bandoro <[email protected]>

seeholza closed this as completed Jan 15, 2024

dwreeves mentioned this issue Jan 15, 2024

Support partial parsing #800

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DBT_MANIFEST mode does not use manifest for task execution #791

DBT_MANIFEST mode does not use manifest for task execution #791

seeholza commented Jan 9, 2024

dosubot bot commented Jan 9, 2024 •

edited

Loading

dwreeves commented Jan 10, 2024

seeholza commented Jan 15, 2024

dwreeves commented Jan 15, 2024

DBT_MANIFEST mode does not use manifest for task execution #791

DBT_MANIFEST mode does not use manifest for task execution #791

Comments

seeholza commented Jan 9, 2024

dosubot bot commented Jan 9, 2024 • edited Loading

dwreeves commented Jan 10, 2024

seeholza commented Jan 15, 2024

dwreeves commented Jan 15, 2024

dosubot bot commented Jan 9, 2024 •

edited

Loading