This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Graph dumper #15285

Closed
wants to merge 23 commits

Conversation

@larroy (Contributor) commented Jun 20, 2019

Description

Utility to dump the computational graph, especially during backward, for human consumption.

Dumps the graph to a dot file that can be rendered.

This is intended for developers, although if there's demand it can easily be exposed from Python to get the graph from any variable when is_recording is true. It is done directly in C++; no Python is required.

Added an environment variable which triggers the logic.

It should have no performance impact when it's not enabled, as there's only one additional boolean check during backward; the environment is read only once.
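
A minimal sketch of the gating described above, assuming a standard getenv check (the variable name MXNET_DUMP_BACKWARD_GRAPH and the helper names below are illustrative, not the identifiers actually used in this PR):

#include <cstdlib>
#include <iostream>
#include <string>

// The environment is read only once (on first call); after that the check is
// a single boolean test, so the feature costs nothing when disabled.
static bool BackwardGraphDumpEnabled() {
  static const bool enabled = []() {
    const char* v = std::getenv("MXNET_DUMP_BACKWARD_GRAPH");  // hypothetical variable name
    return v != nullptr && std::string(v) != "0";
  }();
  return enabled;
}

// Called from Backward(): prints the dot text only when the flag is set.
void MaybeDumpGraph(const std::string& dot) {
  if (BackwardGraphDumpEnabled()) {
    std::cout << dot << std::endl;
  }
}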

@apeforest @anirudh2290

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@larroy requested a review from anirudh2290 as a code owner June 20, 2019 01:40
@larroy (Contributor Author) commented Jun 20, 2019

#15198

@apeforest (Contributor)

Hi @larroy, could you please paste a snapshot of what the dumped graph looks like? Thanks

@Roshrini added the pr-work-in-progress label Jun 23, 2019
@larroy (Contributor Author) commented Jun 24, 2019

@apeforest I will paste the output next time I run it; right now I'm working on something else. It outputs a dot graph as in the comment here: https://github.com/apache/incubator-mxnet/pull/15285/files#diff-884210dca53fb4c4ac20ca07aa8ecfdfR82

@larroy (Contributor Author) commented Jun 27, 2019

@apeforest @kshitij12345 sounds familiar?

[attached image: rendered Graphviz graph]

@larroy (Contributor Author) commented Jun 27, 2019

Named the head grads
[attached image: rendered Graphviz graph]

@larroy (Contributor Author) commented Jun 27, 2019

First gradient, and second gradient:

Forward graph: 
digraph G {
  "var0" -> "log10 node_0"
}
Backward graph: 
digraph G {
  "head grad #0" -> "_backward_log10 node_0_backward"
  "var0" -> "log10 node_0"
  "var0" -> "_backward_log10 node_0_backward"
}
Forward graph: 
digraph G {
  "head grad #0" -> "_backward_log10 node_1"
  "var0" -> "_backward_log10 node_1"
}
Backward graph: 
digraph G {
  "head grad #0" -> "elemwise_mul node_1_backward_grad_grad_inp"
  "head grad #0" -> "_backward_log10 node_1"
  "var0" -> "_backward_log10 node_1"
  "var0" -> "reciprocal node_1_dlogx"
  "_backward_log10 node_1" -> "elemwise_mul node_1_d2ydx2_mid"
  "reciprocal node_1_dlogx" -> "elemwise_mul node_1_d2ydx2_mid"
  "elemwise_mul node_1_d2ydx2_mid" -> "negative node_1_d2ydx2"
  "negative node_1_d2ydx2" -> "elemwise_mul node_1_backward_grad_grad_inp"
}

@apeforest (Contributor)

@larroy Thanks for this contribution. The dumped graph does help developers understand more details about the computation graph in the forward and backward passes.

However, I also agree with @anirudh2290 that there is a lot of overlapping and orthogonal work in this PR compared with the existing MXNet/NNVM code base. Can you keep this utility and graph class in a different repo instead of merging it into MXNet? I am afraid having different utilities doing highly similar work may cause more confusion for users. Please let me know if this makes sense.

@larroy (Contributor Author) commented Jul 1, 2019

@apeforest How can we keep it in a different repo if we have to call it inside backward? You mean a private branch?

There's indeed some overlap, but it's simpler to have a C++-only component that dumps the graph for debugging instead of going through non-specific JSON and having to introduce Python. Also, that overlap is with NNVM, which is an external repo.

I would vote to have this merged, which I thought was the original intention, and @anirudh2290 is not blocking it.

The PR is ready except for a minor CI issue with amalgamation. If it's not going to be merged, I will abandon it.

@larroy changed the title from "[WIP] Graph dumper" to "Graph dumper" Jul 5, 2019
@ptrendx (Member) commented Jul 9, 2019

I also do not understand why you need a whole new Graph class for the purpose of serializing to dot. nnvm::Graph and nnvm::IndexedGraph should give you everything you need for that, no?

@larroy (Contributor Author) commented Jul 9, 2019

@ptrendx I didn't see code that dumps directly to dot, only to JSON, as discussed above. If you have such a problem with introducing a graph class, which has unit tests and all, then don't merge this, no problem. I implemented it like this and it is useful for me. 🤷🏻‍♂️

@ptrendx (Member) commented Jul 9, 2019

@larroy There is no need to get so defensive about this; I (and, I assume, others as well) just want to make your contribution better as a result of the review process.
Going back to this graph class - the problem I have with introducing it is that I do not see any value added by it, as (I think) all of the functionality needed is already there in nnvm::Graph and IndexedGraph, while it makes the codebase less readable and maintainable. Do you have an example of functionality that is required but is not covered by the existing classes?

@ptrendx (Member) commented Jul 9, 2019

Also, to clarify - I do not think that reusing JSON dump and other utilities is a good approach to this. My concern is only about the new graph class itself instead of using existing tools. Removing this class would make this PR 400 lines smaller.

@larroy (Contributor Author) commented Jul 9, 2019

@ptrendx I'm not getting defensive, that was not my intention; let's not read between the lines. Feedback should be concrete and actionable for efficient use of everybody's time. That's why I indicated that if the feedback is unspecific and not clearly positive about this change, I'm fine with closing the PR in the interest of everyone's time instead of letting it rot here or having a long discussion.

I think there are many ways to implement something. I spent some effort implementing this to help us introspect backward graphs. You could probably implement it in different ways, but this is the way I did it, and I think it is clean, concise, generic, and has C++ unit tests.

I think having a generic data structure like the one I introduced in this PR is better than using a non-generic structure in which implementation concerns are mixed with the data structure over and over in the implementation itself. It's similar to, but more concerning than, the choice between implementing a linked list by putting the next pointer inside the object versus using a generic class like std::list. So for me, generic data structures make the code more maintainable.

For feedback to be useful it also has to be concrete; in this case I don't think the feedback on my code change proposal is concrete enough for me to take action. Could you post an example of how you would propose to use the indexed graph to the same effect? Maybe that would help keep the conversation focused. It's also a debugging feature, so I don't feel especially passionate about defending my design decisions here.

Adding this functionality was requested by @apeforest, @kshitij12345, and me while adding higher-order gradients.
I spent some time adding unit tests and making it generic enough that it can be re-used for different purposes, so please understand my surprise at unspecific questions like why we need this or why a small single-header data structure is introduced.

Thanks for your comments.

@ptrendx (Member) commented Jul 9, 2019

For feedback to be useful it also has to be concrete; in this case I don't think the feedback on my code change proposal is concrete enough for me to take action. Could you post an example of how you would propose to use the indexed graph to the same effect? Maybe that would help keep the conversation focused. It's also a debugging feature, so I don't feel especially passionate about defending my design decisions here.

Sure, so I would do this serialization more or less like this pseudo-code:

// g - the nnvm::Graph that is being serialized
const auto& idx = g.indexed_graph();
for (uint32_t i = 0; i < idx.num_nodes(); ++i) {
  const auto& node = idx[i];
  for (const auto& input : node.inputs) {
    // idx[input.node_id].source is the src node
    // node.source is the dst node
    dump "src -> dst" to dot
  }
}
for (const auto& output : idx.outputs()) {
  // idx[output.node_id].source is the src node
  // "Output N" is the dst node
  dump "src -> dst" to dot
}
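
For reference, this is roughly what such a serializer could look like as compilable C++ (a sketch only, not code from this PR; GraphToDot is a hypothetical name, and it assumes the usual nnvm::IndexedGraph accessors):

#include <sstream>
#include <string>
#include <nnvm/graph.h>

// Hypothetical helper: serialize an nnvm::Graph to Graphviz dot text using
// only the IndexedGraph, along the lines of the pseudo-code above.
std::string GraphToDot(const nnvm::Graph& g) {
  const nnvm::IndexedGraph& idx = g.indexed_graph();
  std::ostringstream os;
  os << "digraph G {\n";
  // One edge per (input node -> consuming node) pair.
  for (uint32_t nid = 0; nid < idx.num_nodes(); ++nid) {
    const auto& node = idx[nid];
    const std::string& dst = node.source->attrs.name;  // may be empty or duplicated
    for (const auto& input : node.inputs) {
      const std::string& src = idx[input.node_id].source->attrs.name;
      os << "  \"" << src << "\" -> \"" << dst << "\"\n";
    }
  }
  // One edge per graph output, pointing at a synthetic "Output N" node.
  uint32_t out_num = 0;
  for (const auto& output : idx.outputs()) {
    const std::string& src = idx[output.node_id].source->attrs.name;
    os << "  \"" << src << "\" -> \"Output " << out_num++ << "\"\n";
  }
  os << "}\n";
  return os.str();
}

Node names in MXNet may be empty or duplicated, which is what the name-deduplication discussion further down addresses.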

@larroy (Contributor Author) commented Jul 10, 2019

@ptrendx Thanks, I will try to use your suggestion in the next iteration.

@szha (Member) left a comment


Additional data structures just for the purpose of serialization are over-engineering.

@larroy (Contributor Author) commented Jul 26, 2019

@anirudh2290 @ptrendx @szha please review again, I removed the over-engineered code.

@@ -353,6 +356,11 @@ std::vector<NDArray*> Imperative::Backward(
<< "There are no inputs in computation graph that require gradients.";
}

if (backward_graph_dump_enabled_) {
std::cout << "Forward graph: " << std::endl;
Contributor


Can we leave this to the developer? I think we can just have a function DebugGraph() and not put all the std::cout logic in a function block.

Contributor Author


ok

@apeforest (Contributor) left a comment

Can you print a sample output of this DumpGraph function? We already have one function to print the NNVM graph: https://github.com/apache/incubator-mxnet/blob/6a8d9eb5fd4f7133c094149dc80a3a236534f223/src/common/exec_utils.h#L285

If the new graph dump function is better, can we consolidate it with the existing one? I am only concerned that having two versions of debugging functions not only increases code size but also adds confusion for developers.

@larroy (Contributor Author) commented Jul 26, 2019

@apeforest I'm confused by your latest review; have you read the comments above? There is sample output and there are samples from the graph, please scroll up.

About LogMemoryPlan, I don't think the objective of that function was dumping the graph. I think it served a different purpose.

Also we spent many days debugging second order gradients and backward graphs. I'm surprised that you are against this functionality.

src/nnvm/graph_dump.h (outdated review thread)

namespace {

std::string unnamed(const std::string &s) {
Member


unnamed does not really fit the purpose of this function as a name. Also, the logic here makes your graph potentially incorrect if a person used the same name (or the same lack of name) for multiple nodes (which is unfortunately allowed by MXNet).
What I would prefer is a function named something like DeduplicateName, where you would check whether the same name was already used for a different Node and add #number to the name (adding unnamed, or better _unnamed_, would be fine there too).

Contributor Author


Thanks, makes sense.

Contributor Author


Should I add a static atomic counter for that for example?

Contributor Author


From what I tested, most of the nodes have names from operators or variables, and I added names for the head gradients, which makes the graph much better.

Member


I don't think you need an atomic counter for this; just a map that maps the std::string to a list of Node* would be enough to tell whether your name is unique and what number you need to add to it.
There is (unfortunately) nothing in MXNet preventing you from giving nodes or variables the same name in the same graph. I agree that head grads should have a name; I actually made a similar change in my pointwise fusion PR.

@larroy (Contributor Author) Jul 26, 2019


Doesn't what I proposed have the same effect with less memory usage, given that it's just for the purpose of adding a number? I guess the counter doesn't need to be atomic anyway.

Member


I guess I'm not sure how a counter would help you solve the issue of separate nodes that have the same name.

Contributor Author


OK, I think I now understand your intention. I will make the suggested changes.
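
A minimal sketch of the map-based name deduplication discussed in this thread (hypothetical class and method names, not the PR's actual code). It keeps a count per name rather than the list of Node* mentioned above, which is enough when each node is visited once, and appends a "#<n>" suffix so distinct nodes that share a name (or have no name) still get distinct labels in the dot output:

#include <string>
#include <unordered_map>

class NameDeduplicator {
 public:
  // Returns a unique label for the given node name. The first occurrence keeps
  // the original name; later occurrences get a "#<n>" suffix.
  std::string Deduplicate(const std::string& name) {
    const std::string base = name.empty() ? "_unnamed_" : name;
    int& count = seen_[base];
    std::string result = (count == 0) ? base : base + "#" + std::to_string(count);
    ++count;
    return result;
  }

 private:
  std::unordered_map<std::string, int> seen_;  // name -> number of times seen
};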

@larroy (Contributor Author) commented Aug 16, 2019

Question: is there interest in getting this merged if I address @ptrendx's comments? The only action point I have in mind is deduplicating node names. I have limited bandwidth right now to keep iterating on this PR. Could we summarize which changes are required to get this in, or should I close the PR? Thanks.

@larroy (Contributor Author) commented Sep 8, 2019

There doesn't seem to be interest in getting this in. I will keep it as a private tool. Thanks for the reviews.

@larroy closed this Sep 8, 2019