[CT-2315] [Feature] Ignore duplicate edges while building the subgraph for the job queue #7191
Awesome! I am experiencing this same issue and have been fiddling with how to address it. Similar concerns to #6073. I think your evaluation is correct that this is a special case for when there are a lot of tests, especially on wide models. It also seems especially bad when you are selecting a small subset of your DAG, since it has to run this relatively expensive function to change the edge associations. For me, this shows up with ~300 models and ~4k tests.

One efficiency idea I had (and am trying to implement, but I'm a bit stuck on setting up my dev environment): sort the nodes by degree and first remove any non-selected nodes that have degree 0 or 1 before doing the edge replacement. We can trivially remove any node with 0 edges, and any node with 1 edge can also be trivially removed without running the edge replacement, because it does not have any pairs of connecting nodes to join on removal. Iterating over this before doing the current method allows us to "trim" the graph down efficiently, assuming that sorting the graph by degree is fast. I can imagine removing a source model and then removing all of its tests this way is much faster than randomly trimming the tests out in the middle, reconnecting the graph, and then getting back to the source eventually. Snippet:
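A minimal sketch of the trimming idea described above, assuming a `networkx.DiGraph` (the function and variable names here are hypothetical, not dbt's actual code):

```python
import networkx as nx

def trim_low_degree_nodes(graph: nx.DiGraph, selected: set) -> None:
    """Remove non-selected nodes of degree 0 or 1, in place.

    A degree-0 node is isolated, and a degree-1 node has no
    (ancestor, descendant) pair to reconnect, so both can be dropped
    without the expensive edge-replacement step. Each removal can lower
    its neighbors' degrees, so we iterate until nothing changes.
    """
    changed = True
    while changed:
        changed = False
        for node in list(graph.nodes):
            if node not in selected and graph.degree(node) <= 1:
                graph.remove_node(node)
                changed = True

# Example: two leaf tests hang off model_b; neither is selected,
# so both are trimmed without any edge replacement.
g = nx.DiGraph([("source_a", "model_b"),
                ("model_b", "test_1"),
                ("model_b", "test_2")])
trim_low_degree_nodes(g, selected={"source_a", "model_b"})
```

The fixed-point loop matters: trimming one leaf can turn its parent into a new degree-1 node, so a single pass would miss cascading removals.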
Similarly, I wonder if this issue is still persistent, which might be doubling the number of edges needlessly: #6073 (comment). Thanks for the work! Keeping an eye on this, and I will try out your PR if I can get my dev setup working :).
Added #7195 and got my dev environment up and a PR open. Excited for multiple possible efficiency gains on this!
Thanks for linking the issues @ttusing, I had not seen them. I think the idea of trimming the graph is the way to go, as it is scalable; preventing duplicate edges alone doesn't keep the runtime from growing. I've tested your changes, though, and they look promising!

So if your change doesn't have any side effects, I think we could keep yours only!
I think your changes can still improve runtime for some selectors that still require adding edges after removing nodes. For instance, if you have two disconnected subgraphs that are connected in the main DAG, or a model that is re-joined multiple times downstream, I'd bet this could shave off a few seconds. Although to be fair, I am still unsure why the subgraph needs to calculate these distant ancestors in the first place 😅. This type of selector comes up rarely enough in my usage patterns that I'm not too concerned about picking it apart.
@MatthieuBlais I added an adjustment to my PR to sort the DAG by node degree before searching, and the build time went down to 10 seconds (from 26 in the last commit)! Hopefully this is also helpful for you.
Yes, you're right: after trimming the graph, the model selector matters. I can see a few additional seconds being shaved off. With your change, the waiting time (excluding partial parsing time) is now down to just a few seconds. That looks awesome!
@ttusing @MatthieuBlais Awesome conversation! Would it make sense to consolidate this issue with #7195, as well as the two associated PRs (#7192 + #7194)? Or would it still make sense to consider & review them independently?
Could merge the issues! Looks like #7192 is almost ready to merge to main? Would be awesome to get both of these PRs in to speed up everyone's workflow 🚀
@ttusing @MatthieuBlais Thanks so much for your high-quality analysis and coding on this. Our subject area expert still needs to weigh in on the PRs, but they look good to me. We've been aware that this code was bottlenecking some use cases, but to my knowledge the Core team hasn't had access to real-world dbt projects that exhibited the issue in a way that has enabled us to optimize it. With that in mind, if you are willing and able to share your projects with us, it would be enormously helpful as we monitor and improve performance in real-world scenarios. I can arrange a private channel for that if it's something you can do.
@peterallenwebb Yeah, sounds tricky. I think our legal function would want to look at this. We use dbt Cloud; is that potentially a path? It might also be helpful to establish a pool of testers to run their own benchmarks and report back on a variety of projects, connectors, machines, environments, etc. Does such a thing already exist?
@ttusing, in support of @peterallenwebb's request, I have a little script here that takes in your project's graph. Would you be willing to share the output of the script?
I think so, let me run it by our team. |
Is this your first time submitting a feature request?
Describe the feature
Context: I'm investigating an issue with the command `dbt build` that became very slow in our project (1,300+ models and 20,000+ tests). It impacts user productivity (waiting time) and time-sensitive scheduled jobs.

I suspect the function `Graph.get_subset_graph` can become slow when running `dbt build` in a project with wide tables and many tests. Let's consider a simple example:

The function loops through all the graph nodes and extracts the upstream & downstream nodes. It creates the product of these nodes and adds the edges to the subgraph.
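The step described above can be sketched like this, assuming a `networkx.DiGraph` (a simplification for illustration, not dbt's actual implementation of `Graph.get_subset_graph`):

```python
import itertools
import networkx as nx

def remove_node_preserving_paths(graph: nx.DiGraph, node) -> None:
    """Drop `node`, but keep every upstream node connected to every
    downstream node by adding the cartesian product of the two sets."""
    source_nodes = list(graph.predecessors(node))
    target_nodes = list(graph.successors(node))
    # With S predecessors and T successors this adds S * T edges,
    # which is where the blow-up described in this issue comes from.
    graph.add_edges_from(itertools.product(source_nodes, target_nodes))
    graph.remove_node(node)

# Example: a node with 3 predecessors and 4 successors turns
# 7 incident edges into 3 * 4 = 12 product edges on removal.
g = nx.DiGraph()
for i in range(3):
    g.add_edge(f"s{i}", "x")
for j in range(4):
    g.add_edge("x", f"t{j}")
remove_node_preserving_paths(g, "x")
```

At 30 predecessors and 500 successors, the same product yields the 15,000 edges per removed node discussed below.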
During `dbt build`, when evaluating the node `Test B` (and looking at the test nodes only), `source_nodes = [Test A]` and `target_nodes = [Test C]`. So, the edges to add to the subgraph are `[(Test A, Test C)]`.

-> Is `Test A` considered a source node due to the `add_test_edges` function in the build task?

Now, let's consider we have 30 tests in `Source A` and 500 tests in `Model C` --> we get 15,000 new edges to add to the graph for node `Test B`.

And now let's consider we also have 100 tests in `Model B`. For all of them, we have the same source nodes and target nodes, so we have the same 15,000 edges to add 100 times to the new graph (1,500,000 edges). The high number of edges to add makes the function `new_graph.add_edges_from` slow.

As the edges are 100 times the same, I think we could prevent duplicate work by ignoring edges that already exist in the new graph.
What do you think?
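A minimal sketch of the proposed de-duplication, assuming a `networkx.DiGraph` (the helper name is hypothetical; the real change would live inside dbt's `Graph.get_subset_graph`):

```python
import itertools
import networkx as nx

def add_missing_edges(new_graph: nx.DiGraph,
                      source_nodes, target_nodes) -> int:
    """Add only the (source, target) pairs the subgraph doesn't already
    have, and return how many edges were actually added."""
    fresh = [
        (s, t)
        for s, t in itertools.product(source_nodes, target_nodes)
        if not new_graph.has_edge(s, t)
    ]
    new_graph.add_edges_from(fresh)
    return len(fresh)

# Example: two tests sharing the same source/target sets. Only the
# first call adds edges; the second finds them all present already.
g = nx.DiGraph()
first = add_missing_edges(g, ["test_a1", "test_a2"], ["test_c1"])
second = add_missing_edges(g, ["test_a1", "test_a2"], ["test_c1"])
```

In the 100-tests scenario above, the first test would add its 15,000 edges and the remaining 99 would add nothing, avoiding roughly 1.485 million redundant insertions.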
Note that the issue doesn't seem to happen if we run `dbt run` or `dbt test` (because `source_nodes` is empty in these cases).

Describe alternatives you've considered
I don't observe the same behavior when running `dbt run` or `dbt test`. So, alternatively, we can use these two commands sequentially instead of `dbt build`, but it's not exactly the same.

Who will this benefit?
`dbt build ...` is a standard CLI command many dbt users execute.

Are you interested in contributing this feature?
Yes
Anything else?
No response