Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solving for large stage depths #370

Open
1 task done
wjhrdy opened this issue Oct 27, 2023 · 3 comments
Open
1 task done

Solving for large stage depths #370

wjhrdy opened this issue Oct 27, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@wjhrdy
Copy link

wjhrdy commented Oct 27, 2023

Describe the feature

Certain dbt models that are combinations of views that have not themselves been materialized cause a lot of strain on the query planner and result in many stages getting made. The stage depth in starburst is currently set to 150, which means a developer or user could quickly run into this problem if they are joining or selecting from a composite data product.

Some type of mechanism to identify nodes that create large explosions in the query plan would be helpful for developers as a heuristic on when to think about materialization.

How this could be implemented:

  • Traverse the DBT query plan, and take the rendered queries and run EXPLAIN (TYPE DISTRIBUTED, FORMAT JSON) {query}
  • Use this to build a graph of stages, find inflection points of stage accumulation or when near limit.

The reason I add this here is that this is an issue specific using the features of dbt with starburst. I don't know if it can be solved in this package however.

Describe alternatives you've considered

There is a simple way to estimate model stage complexity, but it is a very bad estimate because it assumes all models are equally dense.

here is an excerpt from my justfile.

# list number of model dependencies for each model in a folder
estimate-model-complexity directory:
#!/usr/bin/env bash
    declare -A lines
    for file in "{{directory}}"/*.sql; do
        filename=$(basename "$file" .sql)
        output=$({{dbt}} list -m +$filename)
        line_count=$(echo "$output" | wc -l)
        line_count=$((line_count - 4))
        lines["$filename"]=$line_count
    done
    for key in "${!lines[@]}"; do
        echo $key: ${lines[$key]}
    done | sort -rn -t: -k2

Who will benefit?

Anyone trying to leverage the modular nature of DBT in starburst views who is trying to figure out the best views to materialize.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@wjhrdy wjhrdy added the enhancement New feature or request label Oct 27, 2023
@hovaesco
Copy link
Contributor

The stage depth in starburst is currently set to 150

This could be changed using https://trino.io/docs/current/admin/properties-query-management.html#query-max-stage-count. What is a sample stage count you are getting?

Use this to build a graph of stages, find inflection points of stage accumulation or when near limit.

How to build something like that and how to use it?

@wjhrdy
Copy link
Author

wjhrdy commented Oct 30, 2023

We are getting around 250 for stage depth, prior to materialization.

@wseaton
Copy link

wseaton commented Oct 30, 2023

@hovaesco here's an example of a dev tool we've made internally: https://gist.github.com/wseaton/0e0cfecb7421bbba6792222e8fec7cf6.

There are some assumptions that it makes about the plan output that might not hold true in practice (the bit about "virtual" stages, for one), but it's been helpful for us in directionally identifying nodes in the dbt graph that contribute to sudden explosions in complexity.

If we could get the direct number of stages from the query planner, and/or more insight into how the plan is generated that would be helpful.

I'm also trying to think of ways to show this to users easier, not sure what makes the most sense. For now, a rendered graph seems to work. We are thinking about embedding this in CI tooling or potentially running this periodically to help as a diagnostic tool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants