[CT-2316] [Feature] Improve subset_graph selection by removing trivial nodes and improving removal order #7195
Comments
I think this can get even better. In addition to nodes of degree 1, any node with 0 parents OR 0 children can be safely removed, even if it has a higher degree. These are quickly removed in the existing algorithm since they add no edges, but removing them up front could help collapse especially big or complex DAGs more efficiently.
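A minimal sketch of why such nodes are free to remove, assuming removal links every parent of the node to every child (the function names here are hypothetical, not part of dbt-core):

```python
def edges_added_by_removal(n_parents: int, n_children: int) -> int:
    """Upper bound on new edges when a node is removed and each of its
    parents is linked to each of its children."""
    return n_parents * n_children


def is_free_to_remove(n_parents: int, n_children: int) -> bool:
    """A node with no parents or no children never needs bridging edges,
    regardless of its total degree."""
    return edges_added_by_removal(n_parents, n_children) == 0


# A root node with 50 children has degree 50 but is still free to remove.
root_is_free = is_free_to_remove(0, 50)
# A node with 2 parents and 3 children may need up to 6 new edges.
middle_cost = edges_added_by_removal(2, 3)
```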
I added an additional change to #7194 that sorts nodes by degree after the trimming and before searching, and it reduced my build from 26 seconds down to 10 (!).
I've been thinking about/refining this further. Here are some of the big open questions that I have to inform how to address this problem:
The current implementation in DBT 1.4 constructs the graph in a somewhat random order. The changes in #7194 (removing the trivial nodes, then sorting once by degree) improved my runtimes because of some of these properties:
I suspect that a huge proportion of the builds people run, and of the projects they run them in, share these properties. I suspect most common selectors are: All of which are connected. I also think this pattern is probably common: I use something like this in CI for changed nodes and their descendants, and also DBT project evaluator. These situations are not always simply connected. It seems like there is room for improvement by:
There are so many ways that this could be improved with knowledge of the selectors beforehand, the node types, etc. I could imagine that if a selector covers less than, say, 10% of nodes and is simply connected (or is a small number of simply connected subgraphs), it might be faster to "build up" rather than trim down the DAG. Rough draft PR in my own repo that implements some of these ideas: https://github.com/ttusing/dbt-core/pull/1/files
@ttusing Thank you for all the thought you've put into this already, and for putting up some code to support it! From a product perspective: I don't know if dbt could (or should!) gracefully scale to DAGs of 1 million nodes. My biggest priority right now is giving teams the ability to split up large monolithic projects into multi-project deployments. Beyond 5k nodes, let alone 500, I believe project complexity inhibits effective collaboration. That said, if there are opportunities to speed up From a user perspective: It's especially frustrating to wait a long time when only trying to select a single model ( I'll leave it to folks from the Core engineering team to take a closer look at the substance of the proposed changes :)
I've checked Tobie's version against our dbt project (11251 models and 34993 tests) and I do see a significant improvement. The time between starting a single-model build and that model actually being processed decreased from 7 minutes to 3 minutes.
Is this your first time submitting a feature request?
Describe the feature
This is a performance suggestion separate from but similar to ideas explored in #7191 and #6073.
I have a DBT project with ~300 models and ~4200 tests. When selecting a single model to build, dbt build takes ~3 minutes to start. This is a major hurdle for development iterations.
I have diagnosed that this is because of the performance of subset_graph using record_timing_info. Using dbt run for the same model builds almost instantly, as does dbt build for the entire DAG.
There appear to be a number of major efficiency gains possible for this function. I propose a "smart trimming" of the DAG before starting the more expensive calculation in the function.
The function currently (randomly?) chooses a node and, if it is not in the selection, connects all of its immediate predecessors to all of its immediate successors, then removes the node and continues iterating. This is presumably to preserve build order when selected nodes aren't directly connected, and perhaps for other implications I haven't considered.
For a variety of reasons described in related issues, this can be expensive and can create many new edges, most of which will be deleted later. I see memory usage spike above 1 GB on my local machine when building a single node, versus no more than 200 MB for the full DAG.
#7191 describes some of these situations. For an additional example, one can imagine two wide tables with lots of tests, where removing one of the tables causes all tests on model A to be linked to model B, creating tens or hundreds of thousands of new connections, only for all of those tests and models to be removed later because they were not in the selector.
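To make the cost concrete, here is a rough sketch of the bridging step as described above. It is an illustration with plain dictionaries, not the actual subset_graph code, and it assumes (as dbt build ordering suggests) that tests on model_a are parents of model_b. All names are hypothetical.

```python
def build_graph(edges):
    """Return (parents, children) adjacency maps for (src, dst) edges."""
    parents, children = {}, {}
    for src, dst in edges:
        for node in (src, dst):
            parents.setdefault(node, set())
            children.setdefault(node, set())
        children[src].add(dst)
        parents[dst].add(src)
    return parents, children


def remove_with_bridging(parents, children, node):
    """Remove `node`, linking each of its parents to each of its children.
    Returns the number of edges that had to be created."""
    ps, cs = parents.pop(node), children.pop(node)
    for p in ps:
        children[p].discard(node)
    for c in cs:
        parents[c].discard(node)
    added = 0
    for p in ps:
        for c in cs:
            if c not in children[p]:
                children[p].add(c)
                parents[c].add(p)
                added += 1
    return added


def bridging_cost(n_tests):
    """Edges created by removing model_b when model_a and model_b each
    carry `n_tests` tests (hypothetical layout)."""
    tests_a = [f"test_a{i}" for i in range(n_tests)]
    tests_b = [f"test_b{i}" for i in range(n_tests)]
    edges = [("model_a", "model_b")]
    edges += [("model_a", t) for t in tests_a]
    edges += [(t, "model_b") for t in tests_a]  # tests on A gate model_b
    edges += [("model_b", t) for t in tests_b]
    parents, children = build_graph(edges)
    return remove_with_bridging(parents, children, "model_b")
```

In this layout model_b has n + 1 parents and n children, so removing it creates (n + 1) * n edges: 12 for 3 tests per model, and 10100 for 100 tests per model.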
We can avoid these expensive edge-adding operations by first removing the nodes that do not require any new edges. These trivial nodes are nodes with degree 0 or 1 that are not part of the selector. If a node is not linked to at least 2 other nodes, removing it definitely does not require constructing new edges. By repeatedly removing trivial nodes, the graph collapses to a much simpler subgraph that the existing algorithm can prune much more efficiently.
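The trimming pass could look roughly like the sketch below. It uses plain dictionaries rather than dbt-core's graph classes, and the function and node names are hypothetical.

```python
def build_graph(edges):
    """Return (parents, children) adjacency maps for (src, dst) edges."""
    parents, children = {}, {}
    for src, dst in edges:
        for node in (src, dst):
            parents.setdefault(node, set())
            children.setdefault(node, set())
        children[src].add(dst)
        parents[dst].add(src)
    return parents, children


def trim_trivial_nodes(parents, children, selected):
    """Repeatedly drop unselected nodes of degree 0 or 1, in place.
    No edges are ever added. Returns the removed nodes in removal order."""
    def degree(n):
        return len(parents[n]) + len(children[n])

    queue = [n for n in parents if n not in selected and degree(n) <= 1]
    removed = []
    while queue:
        node = queue.pop()
        if node not in parents:  # queued twice and already removed
            continue
        neighbors = parents.pop(node) | children.pop(node)
        for nb in neighbors:
            parents[nb].discard(node)
            children[nb].discard(node)
            # A neighbor may have just become trivial itself.
            if nb not in selected and degree(nb) <= 1:
                queue.append(nb)
        removed.append(node)
    return removed


edges = [
    ("src", "stg"), ("stg", "model"), ("stg", "test_stg"),
    ("model", "test_model"), ("model", "downstream"),
    ("downstream", "test_down"),
]
parents, children = build_graph(edges)
removed = trim_trivial_nodes(parents, children, {"model", "test_model"})
```

For this small chain, selecting one model and its test leaves only those two nodes after trimming, so the more expensive bridging step has nothing left to do.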
Similarly, when moving on to remove non-selected nodes from the DAG, starting with the lowest degree nodes is likely to improve performance by reducing the number of edges added.
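As a sketch of why the removal order matters, the example below compares removing unselected nodes in insertion order against removing them after a single sort by degree. This is an illustration with hypothetical names, not the code in dbt-core or in #7194.

```python
def build_graph(edges):
    """Return (parents, children) adjacency maps for (src, dst) edges."""
    parents, children = {}, {}
    for src, dst in edges:
        for node in (src, dst):
            parents.setdefault(node, set())
            children.setdefault(node, set())
        children[src].add(dst)
        parents[dst].add(src)
    return parents, children


def remove_with_bridging(parents, children, node):
    """Remove `node`, linking its parents to its children; count new edges."""
    ps, cs = parents.pop(node), children.pop(node)
    for p in ps:
        children[p].discard(node)
    for c in cs:
        parents[c].discard(node)
    added = 0
    for p in ps:
        for c in cs:
            if c not in children[p]:
                children[p].add(c)
                parents[c].add(p)
                added += 1
    return added


def prune_unselected(edges, selected, by_degree):
    """Remove every unselected node; return (edges_added, children map)."""
    parents, children = build_graph(edges)
    candidates = [n for n in parents if n not in selected]
    if by_degree:
        # Single up-front sort by degree (an assumption of this sketch).
        candidates.sort(key=lambda n: len(parents[n]) + len(children[n]))
    added = 0
    for node in candidates:
        added += remove_with_bridging(parents, children, node)
    return added, children


# A hub with three parents and three children; only p1 and c1 are selected.
edges = [(f"p{i}", "hub") for i in (1, 2, 3)]
edges += [("hub", f"c{i}") for i in (1, 2, 3)]
selected = {"p1", "c1"}

added_unsorted, result_unsorted = prune_unselected(edges, selected, by_degree=False)
added_sorted, result_sorted = prune_unselected(edges, selected, by_degree=True)
```

Both orders end with the single edge p1 -> c1, but removing the hub first creates 9 bridging edges, while removing the degree-1 leaves first leaves only 1 edge to create.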
In the case of many common simple selectors, like single models with tests or simple paths, this algorithm gives the full intended subgraph without needing to invoke the more expensive algorithm.
Describe alternatives you've considered
There are other approaches to making this function more efficient, including an alternate trimming approach in #6073 (comment). It appears that suggestion was ultimately not accepted into dbt-core because it did not produce efficiency gains. My suggestion should be compatible with most other efficiency suggestions.
Who will this benefit?
dbt build during development
Are you interested in contributing this feature?
Yes!
Anything else?
No response