Self timings vs. hierarchical timings #92

Open
alecazam opened this issue Feb 20, 2024 · 8 comments
@alecazam

alecazam commented Feb 20, 2024

Noticing that the timings don't add up, or are spread across multiple cores. One issue appears to be that the top offenders are just the hierarchical timing of the calls, without the children subtracted.

Would be nice to have self timings sorted out, so I can see which headers are specifically at fault for all the build slowdowns. Maybe there's a mode for this, and I'm just too much of a newbie with the tool. But I do love it. So thanks for making this open-source.

Here's an example from our traces. So hash_table is a part of unordered_map. I doubt those really take up 137s and 123s each. More likely hash_table is 123/137 of unordered_map. Self timings would more accurately portray the offender. But there is a huge number of instantiations of both. It's just that when I look at -ftime-trace flame graphs, other functions of ours are not self times and have a lot of overlap.

**** Template sets that took longest to instantiate:
137531 ms: std::unordered_map<$> (16285 times, avg 8 ms)
123470 ms: std::__hash_table<$> (18299 times, avg 6 ms)
 65691 ms: std::function<$>::function<$> (1805 times, avg 36 ms)
 65352 ms: std::__function::__value_func<$>::__value_func<$> (1805 times, avg 36 ms)
 54030 ms: std::__function::__func<$>::__func (1805 times, avg 29 ms)
 50907 ms: std::forward_as_tuple<$> (8135 times, avg 6 ms)
 48883 ms: std::allocator_traits<$> (43333 times, avg 1 ms)
 47541 ms: std::__function::__alloc_func<$>::__alloc_func (5415 times, avg 8 ms)
@aras-p
Owner

aras-p commented Feb 21, 2024

Yeah I have not implemented self-timings report (yet?). Not sure when/if I'll get around to that.

@alecazam
Author

What I'm trying to figure out is who is instantiating that hash_map 16k times. I assume the template set is collecting the instantiation of any function out of that template. But it's pretty terrible to always see these system STL templates at the top of build reports. Some description of the categories would be helpful in the readme. But I'll see if the source provides more detail.

@ben-craig

Hi alecazam. I authored the template set reporting. The std::unordered_map entries are for instantiating the class, and not any of the methods. You can see how std::function::function also shows up, so that specific function / ctor is tracked separately from the class.

The template set functionality shines most when finding expensive templates that are instantiated with a moderate number of unique template parameters. That's usually a good hint to optimize the template itself, often by extracting parts that don't depend on the template parameter.

If you are still concerned about unordered_map uses, then I would suggest grepping your source code for unordered_map usages in headers, and then keep an eye out for places where you would get a large quantity of different types as key / value pairs. In the past, I've seen that when using enums as keys or values. You can then try to decide what to do about these cases. I've done things where I've type erased the enums into their underlying types, and that let me cut down on unique types and instantiations by quite a bit.

@alecazam
Author

Seems like the Variant union is still a good idea in this day and age. I just wish I had some idea of which file was generating all the templates. I have a good idea in this case, but I'm wondering if that could be automated. With the STL, there's always a level above that is the culprit.

@alecazam
Author

alecazam commented Feb 22, 2024

I'm building a profiler shell to help with build, memory, and perf optimization. It's a flamegraph tool using Perfetto right now. And I'd like the CBA iteration to be a part of it. At first probably just calling the CBA cli, but would like to eventually tie into the C++ calls themselves. Not officially released yet, but still pretty usable. Still a few little issues to fix.

https://github.com/alecazam/kram/tree/main/kram-preview

I just added support in this tool to swap out the clang name "Source" fields with the file details, and I found all the unordered_map instantiations. Thanks for pointing me to the problem!

@alecazam
Author

alecazam commented Feb 23, 2024

Actually I cut 30s worth of unordered_map, but still have to find the rest. This involves searching through the clang json output on a per file basis. So anything to accelerate this would help.

Something like a report of the top headers contributing to that count — so just one level up that isn't the same STL class. The problem is we use wrappers like TGCMap.h and TGCVector.h around our unordered_map and vector usage, so we really need one level up from even that.
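One way to approximate that "one level up" attribution today is to post-process the per-file -ftime-trace JSON yourself. Here's a rough Python sketch, not part of CBA; it assumes Chrome-trace "X" complete events with microsecond ts/dur, "Source"/"InstantiateClass"/"InstantiateFunction" event names, and an args.detail payload — which matches what current clang emits, but treat those details as assumptions:

```python
import json
from collections import Counter

def headers_blamed_for(trace_path, needle="unordered_map"):
    # Load a clang -ftime-trace JSON (Chrome trace event format).
    with open(trace_path) as f:
        events = json.load(f)["traceEvents"]
    # Keep complete ("X") events; sort so parents come before children.
    # Assumes one thread's events nest by ts/dur, as in per-TU traces.
    xs = sorted((e for e in events if e.get("ph") == "X"),
                key=lambda e: (e["ts"], -e.get("dur", 0)))
    counts = Counter()
    stack = []  # currently-open enclosing events, innermost last
    for e in xs:
        # Pop events that ended before this one starts.
        while stack and e["ts"] >= stack[-1]["ts"] + stack[-1].get("dur", 0):
            stack.pop()
        detail = e.get("args", {}).get("detail", "")
        if e["name"] in ("InstantiateClass", "InstantiateFunction") and needle in detail:
            # Blame the innermost enclosing "Source" (#include) event.
            src = next((p for p in reversed(stack) if p["name"] == "Source"), None)
            counts[src["args"]["detail"] if src else "<top level>"] += 1
        stack.append(e)
    return counts
```

Running this per TU and merging the Counters gives a crude "which header pulled in the instantiations" tally; walking further up the Source stack would give the wrapper-of-the-wrapper level.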

@13steinj

Coincidentally I'm running into the same/similar issue/request — I'm finding that in really long templated-function call stacks (which I have no choice about being so long), what matters more to me is the self-time of a given template. For example, this output:

**** Templates that took longest to instantiate:
126389 ms: org::apps::appname::visitorPatternFunction<(lambda at header.hpp:16:9)> (6 times, avg 21064 ms)

is incredibly misleading. Looking at the call stack using -ftime-trace (thanks btw!), there's a ~1 minute 3.2 second call that calls a ~1 minute 3.1 second call which calls ... everything else, followed by 4 cheap calls elsewhere in the trace. Initial inspection without checking would make one think "holy shit, this function takes an average of 21s to instantiate!?", but the reality is it mostly just calls other functions — a giant subtree of children accounts for the vast majority of 2 of those calls. Adding stddev and range might be nice too.
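The stddev/range idea is cheap to add once per-instantiation durations are collected. A hypothetical helper sketching that summary (the durations fed in below are invented to match the report line above, not real data):

```python
import statistics

def template_set_stats(durations_ms):
    # durations_ms: per-instantiation times (ms) for one template set.
    # Adds the spread (stddev, min/max) the current report line lacks.
    return {
        "total": sum(durations_ms),
        "count": len(durations_ms),
        "avg": sum(durations_ms) / len(durations_ms),
        "stddev": statistics.pstdev(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
    }
```

For the visitorPatternFunction line, two ~63s instantiations plus four cheap ones would show a stddev larger than the mean itself and a range of roughly 14 ms to 63 s — immediately flagging the skew that the bare average hides.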

If the self-time of a template is significant and fairly consistent, that template is worth reducing or looking into removing.

But if the number includes child time, it's not that helpful.

@alecazam
Author

alecazam commented Mar 16, 2024

Seems like CBA has already built up the parent/child lists for the build events. Self time is just summing all of the children's durations into another field on the parent (I call this durSub), then subtracting that from dur to get the self time. This better pinpoints slow headers, since child timings wouldn't count toward the report.
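That scheme is easy to prototype outside CBA as well. A minimal Python sketch of the same durSub idea (assuming Chrome-trace style complete events from a single thread, with microsecond ts/dur as clang emits):

```python
def add_self_times(events):
    # events: dicts with "ts" (start) and "dur", nesting by time range.
    # Sort so each parent precedes its children (ties: longer first).
    evs = sorted(events, key=lambda e: (e["ts"], -e["dur"]))
    stack = []
    for e in evs:
        e["durSub"] = 0  # will hold sum of direct children's durations
        # Pop events that ended before this one starts.
        while stack and e["ts"] >= stack[-1]["ts"] + stack[-1]["dur"]:
            stack.pop()
        if stack:
            stack[-1]["durSub"] += e["dur"]  # e is a direct child
        stack.append(e)
    for e in evs:
        e["self"] = e["dur"] - e["durSub"]
    return evs
```

Only direct children are summed, so grandchildren aren't double-counted; sorting reports by "self" instead of "dur" then surfaces the headers doing actual work.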

I added the self timings for the parser to the Build Report of kram-profile. Don't have the other stats yet, but the same logic applies to all the stats.
