-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DBT_MANIFEST mode does not use manifest for task execution #791
Comments
🤖 It's great to see your interest in improving Cosmos! Your observation about using the I would encourage you to go ahead and create a pull request with your proposed solution. The maintainers will review it and provide feedback. Your contribution will not only benefit the community but also help enhance the functionality of Cosmos for everyone. Good luck, and thank you for considering to contribute! This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
I think this is essentially the same as #785; I think the runtime only uses the manifest if it detects no code changes, which it can only do with the Let me know if you'd like to coordinate adding this change as I was going to get to it this weekend, but we could work together on it. The solution is if an existing target dir is found that the whole target dir should be moved into the tmpdir for the runtime. |
Hi @dwreeves , Next to the job I don't have much time to work on this, but i am happy to help if i manage. I would probably start by adding support for supplying the msgpack in Do you have a branch already? I will close this issue in favor of #785 and we continue discussing there then. |
Hi @Flinz, I opened a PR, yes: #800. The way I implemented it is entirely agnostic to Also, for what it is worth, I did test that dbt never uses You raise a good point that #785 is actually two issues in one (one is the msgpack thing, the other is about execution of projects using artifacts from cloud storage). I made a note of that in #800 and I do believe that these should be separated. |
## Description dbt uses `partial_parse.msgpack` to make rendering things a lot faster. This PR adds support for `partial_parse.msgpack` in the following places: - `ExecutionMode.LOCAL` - `ExecutionMode.VIRTUALENV` - `LoadMode.DBT_LS` This PR also allows users to explicitly _turn off_ partial parsing. If this is done, then `--no-partial-parse` will be passed as an explicit flag in all dbt command invocations (i.e. all `ExecutionMode`s and `LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.) This should address some performance complaints that users have, e.g. this message from Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829 Albeit, this user will also need to provide a `partial_parse.msgpack`. My experimentation and looking at dbt-core source code confirms that dbt does not use `manifest.json` when partial parsing. It appears that these are just output artifacts, but not input artifacts. Only `partial_parse.msgpack` is used. (There are a couple ways to confirm this other than just checking source code Also, I added a minor refactor of a feature I added a year ago: I added `send_sigint()` to the custom subprocess hook, since this custom subprocess hook did not exist back when I added it (if you want me to split this refactor into a different PR then let me know). ### API change I decided the best way to go about this would be to just add a `partial_parse: bool` to both the execution config and render config. For example: ```python dbt_group = DbtTaskGroup( ..., execution_config=ExecutionConfig( ..., partial_parse=True ), render_config=RenderConfig( ..., partial_parse=False ) ) ``` That said, in all honesty users will not need to set this at all, except unless they want to suppress the little warning message about not being able to find the `partial_parse.msgpack`. This is because by default this addition searches for a msgpack if it exists, which is already the existing behavior in a sense, except right now the msgpack file never exists (dbt does look for it though). When inserting into the `AbstractDbtBaseOperator`, I did not use `global_boolean_flags`. See the subsection below for why. ### Other execution performance improvements The main reason I am adding this feature is that it should dramatically improve performance for users. However, it is not the only way to improve It's possible that we should add a way to add the flag `--no-write-json` as an explicit kwarg to the dbt base operator. Right now users can support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users (e.g. those using Elementary, or those using the dbt local operator `callback` kwarg) will want to write the JSON, but I suspect the majority of users will not. Similarly, `--log-level-file` is not used at all, and at minimum dbt should work best the vast majority of time with `--no-cache-selected-only` set. It's also possible there should be some sort of "performant" mode that automatically sets all these defaults for optimal performance: - `--no-write-json` - `--log-level-file=none` - `--no-cache-selected-only` Perhaps a "performant" config would be too cumbersome to implement (I would agree with that). In which case the docs could also have a section on performance tips. ### A note on `global_boolean_flags` I did not add the partial parse support to `global_boolean_flags` because it doesn't quite fit into the existing paradigm for this. Right now the default for each of these `global_boolean_flags` is False, whereas the default for partial parse is actually True. This makes fitting it in awkward. I think it's possible that just having a `tuple[str]` is insufficient here, as some flags you may want to add (not just `--no-partial-parse` but also `--no-write-json` are by default _True_ and must be explicitly turned off. Meaning that the parsing Cosmos does with flags of `'--{flag.replace("_", "-")}'` is ineffective for flags like this. Right now, we have an example of putting _no_ in front with `no_version_check`. Meaning that the default behavior of version checking is True, but the flag itself starts as negated so the default is actually `False`. My proposal is, instead of `global_boolean_flags: tuple[str]`, this should instead be `global_boolean_flags: tuple[str | tuple[str, str | None, str | None]]`. In the case that a global flag is a `tuple[str, str | None, str | None]`, then the first arg should be the flag, the second should be "if true," and the third should be "if false." `None` indicates, when true/false (respectively), then do nothing. For example: ```python class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta): ... global_boolean_flags = ( ("no_version_check", "--no-version-check", None), ("cache_selected_only", "-cache-selected-only", None), ("partial_parse", None, "--no-partial-parse"), ) ``` And Cosmos want to support `str` parsing for backwards compatibility. It's pretty straightforward to convert the data type: ```python if isinstance(flag, str): flag = (flag, '--{flag.replace("_", "-")}', None) ``` ## Related Issue(s) - Resolves #791 - Partially resolves #785 - #785 should probably be split up into two different stages: (1) support for partial parsing (2) (a) dbt project dir / manifest / `partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt compile` is able to dump into cloud storage. ## Breaking Change? Should not break anything. This doesn't do anything when `partial_parse.msgpack` is missing, and the default behavior (`partial_parse=True`) does not alter the dbt cmd flags. ## Checklist - [x] I have made corresponding changes to the documentation (if required) - [x] I have added tests that prove my fix is effective or that my feature works --------- Co-authored-by: Tatiana Al-Chueyr <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Julian LaNeve <[email protected]> Co-authored-by: Justin Bandoro <[email protected]>
## Description dbt uses `partial_parse.msgpack` to make rendering things a lot faster. This PR adds support for `partial_parse.msgpack` in the following places: - `ExecutionMode.LOCAL` - `ExecutionMode.VIRTUALENV` - `LoadMode.DBT_LS` This PR also allows users to explicitly _turn off_ partial parsing. If this is done, then `--no-partial-parse` will be passed as an explicit flag in all dbt command invocations (i.e. all `ExecutionMode`s and `LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.) This should address some performance complaints that users have, e.g. this message from Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829 Albeit, this user will also need to provide a `partial_parse.msgpack`. My experimentation and looking at dbt-core source code confirms that dbt does not use `manifest.json` when partial parsing. It appears that these are just output artifacts, but not input artifacts. Only `partial_parse.msgpack` is used. (There are a couple ways to confirm this other than just checking source code Also, I added a minor refactor of a feature I added a year ago: I added `send_sigint()` to the custom subprocess hook, since this custom subprocess hook did not exist back when I added it (if you want me to split this refactor into a different PR then let me know). ### API change I decided the best way to go about this would be to just add a `partial_parse: bool` to both the execution config and render config. For example: ```python dbt_group = DbtTaskGroup( ..., execution_config=ExecutionConfig( ..., partial_parse=True ), render_config=RenderConfig( ..., partial_parse=False ) ) ``` That said, in all honesty users will not need to set this at all, except unless they want to suppress the little warning message about not being able to find the `partial_parse.msgpack`. This is because by default this addition searches for a msgpack if it exists, which is already the existing behavior in a sense, except right now the msgpack file never exists (dbt does look for it though). When inserting into the `AbstractDbtBaseOperator`, I did not use `global_boolean_flags`. See the subsection below for why. ### Other execution performance improvements The main reason I am adding this feature is that it should dramatically improve performance for users. However, it is not the only way to improve It's possible that we should add a way to add the flag `--no-write-json` as an explicit kwarg to the dbt base operator. Right now users can support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users (e.g. those using Elementary, or those using the dbt local operator `callback` kwarg) will want to write the JSON, but I suspect the majority of users will not. Similarly, `--log-level-file` is not used at all, and at minimum dbt should work best the vast majority of time with `--no-cache-selected-only` set. It's also possible there should be some sort of "performant" mode that automatically sets all these defaults for optimal performance: - `--no-write-json` - `--log-level-file=none` - `--no-cache-selected-only` Perhaps a "performant" config would be too cumbersome to implement (I would agree with that). In which case the docs could also have a section on performance tips. ### A note on `global_boolean_flags` I did not add the partial parse support to `global_boolean_flags` because it doesn't quite fit into the existing paradigm for this. Right now the default for each of these `global_boolean_flags` is False, whereas the default for partial parse is actually True. This makes fitting it in awkward. I think it's possible that just having a `tuple[str]` is insufficient here, as some flags you may want to add (not just `--no-partial-parse` but also `--no-write-json` are by default _True_ and must be explicitly turned off. Meaning that the parsing Cosmos does with flags of `'--{flag.replace("_", "-")}'` is ineffective for flags like this. Right now, we have an example of putting _no_ in front with `no_version_check`. Meaning that the default behavior of version checking is True, but the flag itself starts as negated so the default is actually `False`. My proposal is, instead of `global_boolean_flags: tuple[str]`, this should instead be `global_boolean_flags: tuple[str | tuple[str, str | None, str | None]]`. In the case that a global flag is a `tuple[str, str | None, str | None]`, then the first arg should be the flag, the second should be "if true," and the third should be "if false." `None` indicates, when true/false (respectively), then do nothing. For example: ```python class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta): ... global_boolean_flags = ( ("no_version_check", "--no-version-check", None), ("cache_selected_only", "-cache-selected-only", None), ("partial_parse", None, "--no-partial-parse"), ) ``` And Cosmos want to support `str` parsing for backwards compatibility. It's pretty straightforward to convert the data type: ```python if isinstance(flag, str): flag = (flag, '--{flag.replace("_", "-")}', None) ``` ## Related Issue(s) - Resolves astronomer#791 - Partially resolves astronomer#785 - astronomer#785 should probably be split up into two different stages: (1) support for partial parsing (2) (a) dbt project dir / manifest / `partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt compile` is able to dump into cloud storage. ## Breaking Change? Should not break anything. This doesn't do anything when `partial_parse.msgpack` is missing, and the default behavior (`partial_parse=True`) does not alter the dbt cmd flags. ## Checklist - [x] I have made corresponding changes to the documentation (if required) - [x] I have added tests that prove my fix is effective or that my feature works --------- Co-authored-by: Tatiana Al-Chueyr <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Julian LaNeve <[email protected]> Co-authored-by: Justin Bandoro <[email protected]>
I am using DBT_MANIFEST parsing mode, and seeing quite a bit of time being spent in each task on parsing the whole project, spending around 1/3 of the task total time on a full parse:
I see the command being executed for each task is along the lines of
I believe that providing the
target-dir
with themanifest.json
in the command execution would probably allow skipping the re-parsing of the project for each task.Related to this:
partial_parse.msgpack
ortarget/
dir into tmpdir. #785: is probably a more generic API for this -- i think for DBT_MANIFEST mode, this could be solved in an easier way by just using the existing manifest?I am fairly new to cosmos, so I would be grateful for some pointers in the right direction.
The text was updated successfully, but these errors were encountered: