fix(dbt): cache the file reads of dbt manifest #17996

rexledesma · 2023-11-14T14:53:13Z

Summary & Motivation

Following up on #17980, one of the issues with memory usage in dagster-dbt is that if the manifest is specified as a path, we potentially hold multiple copies of the dictionary manifest in memory when creating our assets. This is because we read from the path to create the dictionary manifest, and then we hold a reference to the manifest dictionary in our asset metadata. When the same path is read multiple times, multiple copies ensue.

We can optimize this by implementing a cache when reading from the path, so that the same dictionary manifest is returned even when a manifest path is specified.
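
A minimal sketch of the caching idea, assuming a memoized reader keyed on the manifest path. The function name `read_manifest_path` is an assumption, and the standard-library `json` module stands in for `orjson` so the example is self-contained. It is not necessarily the exact code in this PR.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Any, Mapping


@lru_cache(maxsize=None)
def read_manifest_path(manifest_path: Path) -> Mapping[str, Any]:
    """Parse the manifest once per distinct path.

    Repeated calls with an equal path return the same dictionary object,
    so several asset definitions share one copy instead of one each.
    """
    # json.loads accepts bytes directly; the PR itself uses orjson.loads.
    return json.loads(manifest_path.read_bytes())
```

Two properties of this approach are worth keeping in mind: the cache is keyed on the path, so a manifest that changes on disk is not re-read within the same process, and the returned dictionary is shared, so callers should treat it as read-only.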

To further optimize this to hold no reference to the manifest dictionary, we could instead hold a reference to the manifest path in our asset metadata. However:

1. This only works if a manifest path is passed. If a manifest is already specified as a dictionary, then we still have to hold it in memory. This is the worst case.
2. (1) can be solved by preventing a manifest from being specified as a dictionary, but that would be a breaking change.

How I Tested These Changes

Use tracemalloc.

1. Create a jaffle_dagster project using dagster-dbt project scaffold on jaffle_shop.
2. Update the code to read the manifest from a path in three separate @dbt_assets definitions.
3. Run PYTHONTRACEMALLOC=1 DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev
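
Setting PYTHONTRACEMALLOC=1 makes Python start tracing allocations at interpreter startup. The PR does not show how the reports below were printed; the following is a sketch of one way to produce a report in the same shape from a tracemalloc snapshot. The helper name `format_report` is hypothetical.

```python
import tracemalloc


def format_report(snapshot: tracemalloc.Snapshot, limit: int = 10) -> str:
    """Summarize the largest allocation sites in a snapshot, grouped by line."""
    stats = snapshot.statistics("lineno")  # sorted largest first
    lines = [f"Top {limit} lines"]
    for rank, stat in enumerate(stats[:limit], start=1):
        frame = stat.traceback[0]
        size_kib = stat.size / 1024
        lines.append(f"#{rank}: {frame.filename}:{frame.lineno}: {size_kib:.1f} KiB")
    other = stats[limit:]
    if other:
        other_kib = sum(s.size for s in other) / 1024
        lines.append(f"{len(other)} other: {other_kib:.1f} KiB")
    total_kib = sum(s.size for s in stats) / 1024
    lines.append(f"Total allocated size: {total_kib:.1f} KiB")
    return "\n".join(lines)


if __name__ == "__main__":
    # Start tracing here only if PYTHONTRACEMALLOC did not already do so.
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    payload = [bytes(1000) for _ in range(1000)]  # something to measure
    print(format_report(tracemalloc.take_snapshot()))
```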

Before Change

Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.4 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
- #3: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:17: 3262.9 KiB
-    manifest = cast(Mapping[str, Any], orjson.loads(manifest.read_bytes()))
#4: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 891.8 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53203 other: 38030.8 KiB
Total allocated size: 88177.9 KiB

After Change

Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.9 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
#3: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
+ #4: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:13: 1057.4 KiB
+    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 893.6 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53195 other: 38013.7 KiB
Total allocated size: 85957.5 KiB

Baseline (one @dbt_assets definition)

Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.2 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
#3: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
+ #4: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:13: 1056.9 KiB
+    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 893.7 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53175 other: 37976.1 KiB
Total allocated size: 85918.9 KiB
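
As a reading aid, the three reports can be compared with a few lines of arithmetic. The figures are copied from the reports above; the variable names are only illustrative.

```python
# Memory attributed to the manifest read in dbt_manifest.py (KiB).
manifest_before = 3262.9    # three @dbt_assets definitions, no cache
manifest_after = 1057.4     # three @dbt_assets definitions, with cache
manifest_baseline = 1056.9  # one @dbt_assets definition

# Total allocated size from each report (KiB).
total_before = 88177.9
total_after = 85957.5
total_baseline = 85918.9

# How many baseline-sized manifests were held before the change.
copies_before = manifest_before / manifest_baseline
# Overall saving from the cache, and what still separates it from the baseline.
total_saved = total_before - total_after
gap_to_baseline = total_after - total_baseline

print(f"manifest copies before: {copies_before:.2f}")
print(f"total saved: {total_saved:.1f} KiB")
print(f"remaining gap to baseline: {gap_to_baseline:.1f} KiB")
```

Before the change, the manifest read accounts for roughly three times the baseline figure, which matches one copy per @dbt_assets definition. After the change it is close to the single-definition baseline.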


gibsondan

thanks!

rexledesma requested review from alangenfeld and gibsondan November 14, 2023 14:54

rexledesma self-assigned this Nov 14, 2023

gibsondan approved these changes Nov 14, 2023

fix(dbt): cache the file reads of dbt manifest

75f0341

rexledesma force-pushed the rl/cache-manifest-reads branch from 83a6e7c to 75f0341 Compare November 14, 2023 16:57

rexledesma merged commit 78558e0 into master Nov 14, 2023
1 check passed

rexledesma deleted the rl/cache-manifest-reads branch November 14, 2023 17:27

rexledesma mentioned this pull request Feb 5, 2024

fix(dbt): do not serialize manifest as asset definition metadata #15447

Merged

rexledesma mentioned this pull request Mar 21, 2024

[embedded-elt][sling] passing translator and replication config from decorator using metadata #20564

Merged
