[CT-2315] [Feature] Ignore duplicate edges while building the subgraph for the job queue #7191
Awesome! I am experiencing this same issue and have been fiddling with how to address it. Similar concerns to #6073. I think your evaluation is correct that this is a special case for when there are a lot of tests, especially on wide models. It also seems especially bad when you are selecting a small subset of your DAG, since it has to run this relatively expensive function to change the edge associations. For me, this shows up with ~300 models and ~4k tests.

One efficiency idea I had (and am trying to implement, but I'm a bit stuck on setting up my dev environment): sort the nodes by degree and first remove any non-selected nodes that have degree 0 or 1 before doing the edge replacement. We can trivially remove any node with 0 edges, and any node with 1 edge can also be trivially removed without running the edge replacement, because it does not have any pairs of connecting nodes to join on removal. Iterating over this before doing the current method allows us to "trim" the graph down efficiently, assuming that sorting the graph by degree is fast. I can imagine removing a source model and then removing all of its tests this way is much faster than randomly trimming the tests out in the middle, reconnecting the graph, and then getting back to the source eventually. Snippet:
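A minimal sketch of the trimming idea described above, assuming a `networkx.DiGraph` (the function and variable names here are hypothetical, not dbt's actual code):

```python
import networkx as nx

def trim_low_degree_nodes(graph: nx.DiGraph, selected: set) -> None:
    """Remove non-selected nodes of degree 0 or 1, in place.

    A degree-0 node is isolated, and a degree-1 node has no
    (ancestor, descendant) pair to reconnect, so both can be dropped
    without the expensive edge-replacement step. Each removal can lower
    its neighbors' degrees, so we iterate until nothing changes.
    """
    changed = True
    while changed:
        changed = False
        for node in list(graph.nodes):
            if node not in selected and graph.degree(node) <= 1:
                graph.remove_node(node)
                changed = True

# Example: two leaf tests hang off model_b; neither is selected,
# so both are trimmed without any edge replacement.
g = nx.DiGraph([("source_a", "model_b"),
                ("model_b", "test_1"),
                ("model_b", "test_2")])
trim_low_degree_nodes(g, selected={"source_a", "model_b"})
```

The fixed-point loop matters: trimming one leaf can turn its parent into a new degree-1 node, so a single pass would miss cascading removals.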
Similarly, I wonder if this issue is still persistent, which might be doubling the number of edges needlessly: #6073 (comment). Thanks for the work! Keeping an eye on this, and I will try out your PR if I can get my dev setup working :).
Added #7195 and got my dev environment up and a PR open. Excited for multiple possible efficiency gains on this!
Thanks for linking the issues @ttusing, I had not seen them. I think the idea of trimming the graph is the way to go, as it is scalable; preventing duplicate edges alone doesn't keep the runtime from growing. I've tested your changes, though, and they look promising!

So if your change doesn't have any side effects, I think we could keep yours only!
I think your changes can still improve runtime for some selectors that still require adding edges after removing nodes. For instance, if you have two disconnected subgraphs that are connected in the main DAG, or a model that is re-joined multiple times downstream, I'd bet this could shave off a few seconds. Although to be fair, I am still unsure why the subgraph needs to calculate these distant ancestors in the first place 😅. This type of selector comes up rarely enough in my usage patterns that I'm not too concerned about picking it apart.
@MatthieuBlais I added an adjustment to my PR to sort the DAG by node degree before searching, and the build time went down to 10 seconds (from 26 in the last commit)! Hopefully this is also helpful for you.
Yes, you're right: after trimming the graph, the model selector matters. I can see a few additional seconds being shaved off. With your change, the waiting time (excluding partial parsing time) is now down to just a few seconds. That looks awesome!
@ttusing @MatthieuBlais Awesome conversation! Would it make sense to consolidate this issue with #7195, as well as the two associated PRs (#7192 + #7194)? Or would it still make sense to consider & review them independently?
Could merge the issues! Looks like #7192 is almost ready to merge to main? Would be awesome to get both of these PRs in to speed up everyone's workflow 🚀
@ttusing @MatthieuBlais Thanks so much for your high-quality analysis and coding on this. Our subject area expert still needs to weigh in on the PRs, but they look good to me. We've been aware that this code was bottlenecking some use cases, but to my knowledge the Core team hasn't had access to real-world dbt projects that exhibited the issue in a way that has enabled us to optimize it. With that in mind, if you are willing and able to share your projects with us, it would be enormously helpful as we monitor and improve performance in real-world scenarios. I can arrange a private channel for that if it's something you can do.
@peterallenwebb Yeah, sounds tricky. I think our legal function would want to look at this. We use dbt Cloud; is that potentially a path? It might also be helpful to establish a pool of testers to run their own benchmarks and report back on a variety of projects, connectors, machines, environments, etc. Does such a thing already exist?
@ttusing, in support of @peterallenwebb's request, I have a little script here that takes in your project's graph. Would you be willing to share the output of the script?
I think so, let me run it by our team. |
Is this your first time submitting a feature request?
Describe the feature
Context: I'm investigating an issue with the command `dbt build` that became very slow in our project (1,300+ models and 20,000+ tests). It impacts user productivity (waiting time) and time-sensitive scheduled jobs.

I suspect the function `Graph.get_subset_graph` can become slow when running `dbt build` in a project with wide tables and many tests. Let's consider a simple example:

The function loops through all the graph nodes and extracts the upstream & downstream nodes. It creates the product of these nodes and adds the edges to the subgraph.
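The step described above can be sketched like this, assuming a `networkx.DiGraph` (a simplification for illustration, not dbt's actual implementation of `Graph.get_subset_graph`):

```python
import itertools
import networkx as nx

def remove_node_preserving_paths(graph: nx.DiGraph, node) -> None:
    """Drop `node`, but keep every upstream node connected to every
    downstream node by adding the cartesian product of the two sets."""
    source_nodes = list(graph.predecessors(node))
    target_nodes = list(graph.successors(node))
    # With S predecessors and T successors this adds S * T edges,
    # which is where the blow-up described in this issue comes from.
    graph.add_edges_from(itertools.product(source_nodes, target_nodes))
    graph.remove_node(node)

# Example: a node with 3 predecessors and 4 successors turns
# 7 incident edges into 3 * 4 = 12 product edges on removal.
g = nx.DiGraph()
for i in range(3):
    g.add_edge(f"s{i}", "x")
for j in range(4):
    g.add_edge("x", f"t{j}")
remove_node_preserving_paths(g, "x")
```

At 30 predecessors and 500 successors, the same product yields the 15,000 edges per removed node discussed below.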
During `dbt build`, when evaluating the node `Test B` (and looking at the test nodes only), `source_nodes = [Test A]` and `target_nodes = [Test C]`. So, the edges to add to the subgraph are `[(Test A, Test C)]`.

-> Is `Test A` considered a source node due to the `add_test_edges` function in the build task?

Now, let's consider we have 30 tests in `Source A` and 500 tests in `Model C` --> we get 15,000 new edges to add to the graph for node `Test B`.

And now let's consider we also have 100 tests in `Model B`. For all of them, we have the same source nodes and target nodes, so we have the same 15,000 edges to add 100 times to the new graph (1,500,000 edges). The high number of edges to add makes the function `new_graph.add_edges_from` slow.

As the edges are 100 times the same, I think we could prevent duplicate work by ignoring edges that already exist in the new graph.
What do you think?
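A minimal sketch of the proposed de-duplication, assuming a `networkx.DiGraph` (the helper name is hypothetical; the real change would live inside dbt's `Graph.get_subset_graph`):

```python
import itertools
import networkx as nx

def add_missing_edges(new_graph: nx.DiGraph,
                      source_nodes, target_nodes) -> int:
    """Add only the (source, target) pairs the subgraph doesn't already
    have, and return how many edges were actually added."""
    fresh = [
        (s, t)
        for s, t in itertools.product(source_nodes, target_nodes)
        if not new_graph.has_edge(s, t)
    ]
    new_graph.add_edges_from(fresh)
    return len(fresh)

# Example: two tests sharing the same source/target sets. Only the
# first call adds edges; the second finds them all present already.
g = nx.DiGraph()
first = add_missing_edges(g, ["test_a1", "test_a2"], ["test_c1"])
second = add_missing_edges(g, ["test_a1", "test_a2"], ["test_c1"])
```

In the 100-tests scenario above, the first test would add its 15,000 edges and the remaining 99 would add nothing, avoiding roughly 1.485 million redundant insertions.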
Note that the issue doesn't seem to happen if we run `dbt run` or `dbt test` (because `source_nodes` is empty in these cases).

Describe alternatives you've considered
I don't observe the same behavior when running `dbt run` or `dbt test`. So, alternatively, we can use these two commands sequentially instead of `dbt build`, but it's not exactly the same.

Who will this benefit?
`dbt build ...` is a standard CLI command many dbt users execute.

Are you interested in contributing this feature?
Yes
Anything else?
No response