fix(dbt): cache the file reads of dbt manifest #17996
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary & Motivation
Following up on #17980, one of the issues with memory usage in
dagster-dbt
is that if themanifest
is specified as a path, we potentially hold multiple copies of the dictionary manifest in memory when creating our assets. This is because we read from the path to create the dictionary manifest, and then we hold a reference to the manifest dictionary in our asset metadata. When the same path is read multiple times, multiple copies ensure.We can optimize this by implementing a cache when reading from the path, so that the same dictionary manifest is returned even when a manifest path is specified.
To further optimize this to hold no reference to the manifest dictionary, we could instead hold a reference to the manifest path in our asset metadata. However,
How I Tested These Changes
Use
tracemalloc
.jaffle_dagster
project usingdagster-dbt project scaffold
onjaffle_shop
.@dbt_assets
definitions.PYTHONTRACEMALLOC=1 DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev
Before Change
After Change
Baseline (one
@dbt_assets
definition)