Self timings vs. hierarchical timings #92

Open
alecazam opened this issue Feb 20, 2024 · 8 comments
@alecazam

alecazam commented Feb 20, 2024

Noticing that the timings don't add up, or are spread across multiple cores. One issue appears to be that the top offenders are just the hierarchical timing of the calls, without the children subtracted.

Would be nice to have self timings sorted out, so I can see which headers are specifically at fault for all the build slowdowns. Maybe there's a mode for this, and I'm just too much of a newbie with the tool. But I do love it. So thanks for making this open-source.

Here's an example from our traces. So hash_table is a part of unordered_map. I doubt those really take up 137s and 123s each. More likely hash_table is 123/137 of unordered_map. Self timings would more accurately portray the offender. But there is a huge number of instantiations of both. It's just that when I look at -ftime-trace flame graphs, other functions of ours are not self times and have a lot of overlap.

**** Template sets that took longest to instantiate:
137531 ms: std::unordered_map<$> (16285 times, avg 8 ms)
123470 ms: std::__hash_table<$> (18299 times, avg 6 ms)
 65691 ms: std::function<$>::function<$> (1805 times, avg 36 ms)
 65352 ms: std::__function::__value_func<$>::__value_func<$> (1805 times, avg 36 ms)
 54030 ms: std::__function::__func<$>::__func (1805 times, avg 29 ms)
 50907 ms: std::forward_as_tuple<$> (8135 times, avg 6 ms)
 48883 ms: std::allocator_traits<$> (43333 times, avg 1 ms)
 47541 ms: std::__function::__alloc_func<$>::__alloc_func (5415 times, avg 8 ms)
@aras-p
Owner

aras-p commented Feb 21, 2024

Yeah I have not implemented self-timings report (yet?). Not sure when/if I'll get around to that.

@alecazam
Author

What I'm trying to figure out is who is instantiating that hash_map 16k times. I assume the template set is collecting the instantiation of any function out of that template. But it's pretty terrible to always see these system STL templates at the top of build reports. Some description of the categories would be helpful in the readme. But I'll see if the source provides more detail.

@ben-craig

Hi alecazam. I authored the template set reporting. The std::unordered_map entries are for instantiating the class, and not any of the methods. You can see how std::function::function also shows up, so that specific function / ctor is tracked separately from the class.

The template set functionality shines most when finding expensive templates that are instantiated with a moderate number of unique template parameters. That's usually a good hint to optimize the template itself, often by extracting parts that don't depend on the template parameter.

If you are still concerned about unordered_map uses, then I would suggest grepping your source code for unordered_map usages in headers, and then keep an eye out for places where you would get a large quantity of different types as key / value pairs. In the past, I've seen that when using enums as keys or values. You can then try to decide what to do about these cases. I've done things where I've type erased the enums into their underlying types, and that let me cut down on unique types and instantiations by quite a bit.

@alecazam
Author

Seems like the Variant union is still a good idea in this day and age. I just wish I had some idea of which file was generating all the templates. I have a good idea in this case, but I'm wondering if that could be automated. With the STL, there's always a level above that is the culprit.

@alecazam
Author

alecazam commented Feb 22, 2024

I'm building a profiler shell to help with build, memory, and perf optimization. It's a flamegraph tool using Perfetto right now. And I'd like the CBA iteration to be a part of it. At first probably just calling the CBA cli, but would like to eventually tie into the C++ calls themselves. Not officially released yet, but still pretty usable. Still a few little issues to fix.

https://github.com/alecazam/kram/tree/main/kram-preview

I just added support in this tool to swap out the clang name "Source" fields with the file details, and I found all the unordered_map instantiations. Thanks for pointing me to the problem!

@alecazam
Author

alecazam commented Feb 23, 2024

Actually I cut 30s worth of unordered_map, but still have to find the rest. This involves searching through the clang json output on a per file basis. So anything to accelerate this would help.

Something like a report of the top headers contributing to that count — so just one level up that isn't the same STL class. The problem is we use wrappers like TGCMap.h and TGCVector.h around our unordered_map and vector usage, so we really need one level up from even that.
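One way to approximate that "one level up" attribution today is to post-process the per-file -ftime-trace JSON yourself. Here's a rough Python sketch, not part of CBA; it assumes Chrome-trace "X" complete events with microsecond ts/dur, "Source"/"InstantiateClass"/"InstantiateFunction" event names, and an args.detail payload — which matches what current clang emits, but treat those details as assumptions:

```python
import json
from collections import Counter

def headers_blamed_for(trace_path, needle="unordered_map"):
    # Load a clang -ftime-trace JSON (Chrome trace event format).
    with open(trace_path) as f:
        events = json.load(f)["traceEvents"]
    # Keep complete ("X") events; sort so parents come before children.
    # Assumes one thread's events nest by ts/dur, as in per-TU traces.
    xs = sorted((e for e in events if e.get("ph") == "X"),
                key=lambda e: (e["ts"], -e.get("dur", 0)))
    counts = Counter()
    stack = []  # currently-open enclosing events, innermost last
    for e in xs:
        # Pop events that ended before this one starts.
        while stack and e["ts"] >= stack[-1]["ts"] + stack[-1].get("dur", 0):
            stack.pop()
        detail = e.get("args", {}).get("detail", "")
        if e["name"] in ("InstantiateClass", "InstantiateFunction") and needle in detail:
            # Blame the innermost enclosing "Source" (#include) event.
            src = next((p for p in reversed(stack) if p["name"] == "Source"), None)
            counts[src["args"]["detail"] if src else "<top level>"] += 1
        stack.append(e)
    return counts
```

Running this per TU and merging the Counters gives a crude "which header pulled in the instantiations" tally; walking further up the Source stack would give the wrapper-of-the-wrapper level.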

@13steinj

Coincidentally I'm running into the same/similar issue/request — I'm finding that in really long templated-function call stacks (which I have no choice about being so long), what matters more to me is the self-time of a given template. For example, this output:

**** Templates that took longest to instantiate:
126389 ms: org::apps::appname::visitorPatternFunction<(lambda at header.hpp:16:9)> (6 times, avg 21064 ms)

is incredibly misleading. Looking at the call stack using -ftime-trace (thanks btw!), there's a ~1 minute 3.2 second call that calls a ~1 minute 3.1 second call which calls ... everything else, followed by 4 cheap calls elsewhere in the trace. Initial inspection without checking would make one think "holy shit, this function takes an average of 21s to instantiate!?", but the reality is it mostly just calls other functions — a giant subtree of children accounts for the vast majority of 2 of those calls. Adding stddev and range might be nice too.
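The stddev/range idea is cheap to add once per-instantiation durations are collected. A hypothetical helper sketching that summary (the durations fed in below are invented to match the report line above, not real data):

```python
import statistics

def template_set_stats(durations_ms):
    # durations_ms: per-instantiation times (ms) for one template set.
    # Adds the spread (stddev, min/max) the current report line lacks.
    return {
        "total": sum(durations_ms),
        "count": len(durations_ms),
        "avg": sum(durations_ms) / len(durations_ms),
        "stddev": statistics.pstdev(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
    }
```

For the visitorPatternFunction line, two ~63s instantiations plus four cheap ones would show a stddev larger than the mean itself and a range of roughly 14 ms to 63 s — immediately flagging the skew that the bare average hides.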

If the self-time of a template is significant and fairly consistent, that template is worth reducing or looking into removing.

But if the number includes child time, it's not that helpful.

@alecazam
Author

alecazam commented Mar 16, 2024

Seems like CBA has already built up the parent/child lists for the build events. Self time is just summing all of the children's durations into another field on the parent (I call this durSub), then subtracting that from dur to get the self time. This better pinpoints slow headers, since child timings wouldn't count toward the report.
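That scheme is easy to prototype outside CBA as well. A minimal Python sketch of the same durSub idea (assuming Chrome-trace style complete events from a single thread, with microsecond ts/dur as clang emits):

```python
def add_self_times(events):
    # events: dicts with "ts" (start) and "dur", nesting by time range.
    # Sort so each parent precedes its children (ties: longer first).
    evs = sorted(events, key=lambda e: (e["ts"], -e["dur"]))
    stack = []
    for e in evs:
        e["durSub"] = 0  # will hold sum of direct children's durations
        # Pop events that ended before this one starts.
        while stack and e["ts"] >= stack[-1]["ts"] + stack[-1]["dur"]:
            stack.pop()
        if stack:
            stack[-1]["durSub"] += e["dur"]  # e is a direct child
        stack.append(e)
    for e in evs:
        e["self"] = e["dur"] - e["durSub"]
    return evs
```

Only direct children are summed, so grandchildren aren't double-counted; sorting reports by "self" instead of "dur" then surfaces the headers doing actual work.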

I added the self timings for the parser to the Build Report of kram-profile. Don't have the other stats yet, but the same logic applies to all the stats.
