Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBT_MANIFEST mode does not use manifest for task execution #791

Closed
seeholza opened this issue Jan 9, 2024 · 4 comments · Fixed by #800
Closed

DBT_MANIFEST mode does not use manifest for task execution #791

seeholza opened this issue Jan 9, 2024 · 4 comments · Fixed by #800
Labels
area:parsing Related to parsing DAG/DBT improvement, issues, or fixes dbt:parse Primarily related to dbt parse command or functionality execution:virtualenv Related to Virtualenv execution environment parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing

Comments

@seeholza
Copy link
Contributor

seeholza commented Jan 9, 2024

I am using DBT_MANIFEST parsing mode, and seeing quite a bit of time being spent in each task on parsing the whole project, spending around 1/3 of the task total time on a full parse:

[2024-01-09, 16:51:27 CET] {subprocess.py:94} INFO - �[0m15:51:27  Unable to do partial parsing because saved manifest not found. Starting full parse.
[2024-01-09, 16:51:30 CET] {subprocess.py:94} INFO - �[0m15:51:30  Found 38 models, 6 tests, 20 sources, 0 exposures, 0 metrics, 532 macros, 0 groups, 0 semantic models

I see the command being executed for each task is along the lines of

[2024-01-09, 16:51:23 CET] {subprocess.py:73} INFO - Running command: ['/opt/***/dbt_venv/bin/dbt', 'run', '--models', 'mart_facts_samples', '--profiles-dir', '/tmp/tmpnhr57up0', '--profile', 'default', '--target', 'prod']

I believe that providing the target-dir with the manifest.json in the command execution would probably allow skipping the re-parsing of the project for each task.

Related to this:

I am fairly new to cosmos, so I would be grateful for some pointers in the right direction.

Copy link

dosubot bot commented Jan 9, 2024

🤖

It's great to see your interest in improving Cosmos! Your observation about using the target-dir with the manifest.json in the command execution to potentially skip the re-parsing of the project for each task sounds like a promising approach.

I would encourage you to go ahead and create a pull request with your proposed solution. The maintainers will review it and provide feedback. Your contribution will not only benefit the community but also help enhance the functionality of Cosmos for everyone. Good luck, and thank you for considering to contribute!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosubot dosubot bot added area:parsing Related to parsing DAG/DBT improvement, issues, or fixes dbt:parse Primarily related to dbt parse command or functionality execution:virtualenv Related to Virtualenv execution environment parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing labels Jan 9, 2024
@dwreeves
Copy link
Collaborator

I think this is essentially the same as #785; I think the runtime only uses the manifest if it detects no code changes, which it can only do with the msgpack.

Let me know if you'd like to coordinate adding this change as I was going to get to it this weekend, but we could work together on it. The solution is if an existing target dir is found that the whole target dir should be moved into the tmpdir for the runtime.

@seeholza
Copy link
Contributor Author

Hi @dwreeves ,

Next to the job I don't have much time to work on this, but i am happy to help if i manage.

I would probably start by adding support for supplying the msgpack in dbt-manifest mode, with a more generic API change in mind after. But we would need agree on the final design of how msgpack/manifest is copied to the tmp directories.

Do you have a branch already? I will close this issue in favor of #785 and we continue discussing there then.

@dwreeves dwreeves mentioned this issue Jan 15, 2024
2 tasks
@dwreeves
Copy link
Collaborator

Hi @Flinz,

I opened a PR, yes: #800.

The way I implemented it is entirely agnostic to LoadMode.DBT_MANIFEST. Basically, by default, it will always check for the existence of the partial parse file before parsing when running a command locally. So any user who is running dbt compile in their CICD, whether they're using manifest or LS, will be fine. (a small subset of DBT_LS users may mysteriously see their performance improve in the next release after this is implemented!).

Also, for what it is worth, I did test that dbt never uses manifest.json for input purposes, only outputs. So when we say "use manifest for task execution" it's never manifest.json, only ever partial_parse.msgpack.

You raise a good point that #785 is actually two issues in one (one is the msgpack thing, the other is about execution of projects using artifacts from cloud storage). I made a note of that in #800 and I do believe that these should be separated.

jbandoro added a commit that referenced this issue Feb 19, 2024
## Description

dbt uses `partial_parse.msgpack` to make rendering things a lot faster.
This PR adds support for `partial_parse.msgpack` in the following
places:

- `ExecutionMode.LOCAL`
- `ExecutionMode.VIRTUALENV`
- `LoadMode.DBT_LS`

This PR also allows users to explicitly _turn off_ partial parsing. If
this is done, then `--no-partial-parse` will be passed as an explicit
flag in all dbt command invocations (i.e. all `ExecutionMode`s and
`LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.)

This should address some performance complaints that users have, e.g.
this message from Slack:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829
Albeit, this user will also need to provide a `partial_parse.msgpack`.

My experimentation and looking at dbt-core source code confirms that dbt
does not use `manifest.json` when partial parsing. It appears that these
are just output artifacts, but not input artifacts. Only
`partial_parse.msgpack` is used. (There are a couple ways to confirm
this other than just checking source code

Also, I added a minor refactor of a feature I added a year ago: I added
`send_sigint()` to the custom subprocess hook, since this custom
subprocess hook did not exist back when I added it (if you want me to
split this refactor into a different PR then let me know).

### API change

I decided the best way to go about this would be to just add a
`partial_parse: bool` to both the execution config and render config.
For example:

```python
dbt_group = DbtTaskGroup(
    ...,
    execution_config=ExecutionConfig(
        ...,
        partial_parse=True
    ),
    render_config=RenderConfig(
        ...,
        partial_parse=False
    )
)
```

That said, in all honesty users will not need to set this at all, except
unless they want to suppress the little warning message about not being
able to find the `partial_parse.msgpack`. This is because by default
this addition searches for a msgpack if it exists, which is already the
existing behavior in a sense, except right now the msgpack file never
exists (dbt does look for it though).

When inserting into the `AbstractDbtBaseOperator`, I did not use
`global_boolean_flags`. See the subsection below for why.

### Other execution performance improvements

The main reason I am adding this feature is that it should dramatically
improve performance for users. However, it is not the only way to
improve

It's possible that we should add a way to add the flag `--no-write-json`
as an explicit kwarg to the dbt base operator. Right now users can
support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users
(e.g. those using Elementary, or those using the dbt local operator
`callback` kwarg) will want to write the JSON, but I suspect the
majority of users will not. Similarly, `--log-level-file` is not used at
all, and at minimum dbt should work best the vast majority of time with
`--no-cache-selected-only` set.

It's also possible there should be some sort of "performant" mode that
automatically sets all these defaults for optimal performance:

- `--no-write-json`
- `--log-level-file=none`
- `--no-cache-selected-only`

Perhaps a "performant" config would be too cumbersome to implement (I
would agree with that). In which case the docs could also have a section
on performance tips.

### A note on `global_boolean_flags`

I did not add the partial parse support to `global_boolean_flags`
because it doesn't quite fit into the existing paradigm for this. Right
now the default for each of these `global_boolean_flags` is False,
whereas the default for partial parse is actually True. This makes
fitting it in awkward.

I think it's possible that just having a `tuple[str]` is insufficient
here, as some flags you may want to add (not just `--no-partial-parse`
but also `--no-write-json` are by default _True_ and must be explicitly
turned off. Meaning that the parsing Cosmos does with flags of
`'--{flag.replace("_", "-")}'` is ineffective for flags like this.

Right now, we have an example of putting _no_ in front with
`no_version_check`. Meaning that the default behavior of version
checking is True, but the flag itself starts as negated so the default
is actually `False`.

My proposal is, instead of `global_boolean_flags: tuple[str]`, this
should instead be `global_boolean_flags: tuple[str | tuple[str, str |
None, str | None]]`. In the case that a global flag is a `tuple[str, str
| None, str | None]`, then the first arg should be the flag, the second
should be "if true," and the third should be "if false." `None`
indicates, when true/false (respectively), then do nothing.

For example:

```python
class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta):
    ...
    global_boolean_flags = (
        ("no_version_check", "--no-version-check", None),
        ("cache_selected_only", "-cache-selected-only", None),
        ("partial_parse", None, "--no-partial-parse"),
    )

```

And Cosmos want to support `str` parsing for backwards compatibility.
It's pretty straightforward to convert the data type:

```python
if isinstance(flag, str):
    flag = (flag, '--{flag.replace("_", "-")}', None)
```

## Related Issue(s)

- Resolves #791
- Partially resolves #785
- #785 should probably be split up into two different stages: (1)
support for partial parsing (2) (a) dbt project dir / manifest /
`partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt
compile` is able to dump into cloud storage.

## Breaking Change?

Should not break anything. This doesn't do anything when
`partial_parse.msgpack` is missing, and the default behavior
(`partial_parse=True`) does not alter the dbt cmd flags.

## Checklist

- [x] I have made corresponding changes to the documentation (if
required)
- [x] I have added tests that prove my fix is effective or that my
feature works

---------

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Julian LaNeve <[email protected]>
Co-authored-by: Justin Bandoro <[email protected]>
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
## Description

dbt uses `partial_parse.msgpack` to make rendering things a lot faster.
This PR adds support for `partial_parse.msgpack` in the following
places:

- `ExecutionMode.LOCAL`
- `ExecutionMode.VIRTUALENV`
- `LoadMode.DBT_LS`

This PR also allows users to explicitly _turn off_ partial parsing. If
this is done, then `--no-partial-parse` will be passed as an explicit
flag in all dbt command invocations (i.e. all `ExecutionMode`s and
`LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.)

This should address some performance complaints that users have, e.g.
this message from Slack:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829
Albeit, this user will also need to provide a `partial_parse.msgpack`.

My experimentation and looking at dbt-core source code confirms that dbt
does not use `manifest.json` when partial parsing. It appears that these
are just output artifacts, but not input artifacts. Only
`partial_parse.msgpack` is used. (There are a couple ways to confirm
this other than just checking source code

Also, I added a minor refactor of a feature I added a year ago: I added
`send_sigint()` to the custom subprocess hook, since this custom
subprocess hook did not exist back when I added it (if you want me to
split this refactor into a different PR then let me know).

### API change

I decided the best way to go about this would be to just add a
`partial_parse: bool` to both the execution config and render config.
For example:

```python
dbt_group = DbtTaskGroup(
    ...,
    execution_config=ExecutionConfig(
        ...,
        partial_parse=True
    ),
    render_config=RenderConfig(
        ...,
        partial_parse=False
    )
)
```

That said, in all honesty users will not need to set this at all, except
unless they want to suppress the little warning message about not being
able to find the `partial_parse.msgpack`. This is because by default
this addition searches for a msgpack if it exists, which is already the
existing behavior in a sense, except right now the msgpack file never
exists (dbt does look for it though).

When inserting into the `AbstractDbtBaseOperator`, I did not use
`global_boolean_flags`. See the subsection below for why.

### Other execution performance improvements

The main reason I am adding this feature is that it should dramatically
improve performance for users. However, it is not the only way to
improve

It's possible that we should add a way to add the flag `--no-write-json`
as an explicit kwarg to the dbt base operator. Right now users can
support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users
(e.g. those using Elementary, or those using the dbt local operator
`callback` kwarg) will want to write the JSON, but I suspect the
majority of users will not. Similarly, `--log-level-file` is not used at
all, and at minimum dbt should work best the vast majority of time with
`--no-cache-selected-only` set.

It's also possible there should be some sort of "performant" mode that
automatically sets all these defaults for optimal performance:

- `--no-write-json`
- `--log-level-file=none`
- `--no-cache-selected-only`

Perhaps a "performant" config would be too cumbersome to implement (I
would agree with that). In which case the docs could also have a section
on performance tips.

### A note on `global_boolean_flags`

I did not add the partial parse support to `global_boolean_flags`
because it doesn't quite fit into the existing paradigm for this. Right
now the default for each of these `global_boolean_flags` is False,
whereas the default for partial parse is actually True. This makes
fitting it in awkward.

I think it's possible that just having a `tuple[str]` is insufficient
here, as some flags you may want to add (not just `--no-partial-parse`
but also `--no-write-json` are by default _True_ and must be explicitly
turned off. Meaning that the parsing Cosmos does with flags of
`'--{flag.replace("_", "-")}'` is ineffective for flags like this.

Right now, we have an example of putting _no_ in front with
`no_version_check`. Meaning that the default behavior of version
checking is True, but the flag itself starts as negated so the default
is actually `False`.

My proposal is, instead of `global_boolean_flags: tuple[str]`, this
should instead be `global_boolean_flags: tuple[str | tuple[str, str |
None, str | None]]`. In the case that a global flag is a `tuple[str, str
| None, str | None]`, then the first arg should be the flag, the second
should be "if true," and the third should be "if false." `None`
indicates, when true/false (respectively), then do nothing.

For example:

```python
class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta):
    ...
    global_boolean_flags = (
        ("no_version_check", "--no-version-check", None),
        ("cache_selected_only", "-cache-selected-only", None),
        ("partial_parse", None, "--no-partial-parse"),
    )

```

And Cosmos want to support `str` parsing for backwards compatibility.
It's pretty straightforward to convert the data type:

```python
if isinstance(flag, str):
    flag = (flag, '--{flag.replace("_", "-")}', None)
```

## Related Issue(s)

- Resolves astronomer#791
- Partially resolves astronomer#785
- astronomer#785 should probably be split up into two different stages: (1)
support for partial parsing (2) (a) dbt project dir / manifest /
`partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt
compile` is able to dump into cloud storage.

## Breaking Change?

Should not break anything. This doesn't do anything when
`partial_parse.msgpack` is missing, and the default behavior
(`partial_parse=True`) does not alter the dbt cmd flags.

## Checklist

- [x] I have made corresponding changes to the documentation (if
required)
- [x] I have added tests that prove my fix is effective or that my
feature works

---------

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Julian LaNeve <[email protected]>
Co-authored-by: Justin Bandoro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:parsing Related to parsing DAG/DBT improvement, issues, or fixes dbt:parse Primarily related to dbt parse command or functionality execution:virtualenv Related to Virtualenv execution environment parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants