
cache last callstack for further callstack walking performance improvement #1187

Open
derekbruening opened this issue Nov 28, 2014 · 1 comment


@derekbruening
Contributor

From [email protected] on April 16, 2013 19:53:10

Callstack walking for optimized code is complex, and I could cite quite a few issue #s here on tweaks for both functionality and performance. This issue covers going a step beyond the work in issue #1186 and caching the full last callstack, with fp's, to try to avoid find_next_fp() even more (a sketch of the idea follows the profile below). find_next_fp() still shows up on perlbench diffmail:

1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
ITIMER distribution (182407):
0.0% of time in APPLICATION (1)
3.2% of time in INTERPRETER (5785)
0.3% of time in DISPATCH (628)
0.1% of time in SYSCALL HANDLER (103)
3.0% of time in INDIRECT BRANCH LOOKUP (5421)
43.8% of time in FRAGMENT CACHE (79891)
49.7% of time in UNKNOWN (90578)
RES-pcsamples.0.7276.html
897 get_shadow_table
1011 add_to_delay_list
1109 bitmapx2_set
1262 find_free_list_entry
1307 packed_callstack_hash
1438 safe_read
1662 module_lookup
2334 shadow_set_range
2458 address_to_frame
2999 rb_in_node
3925 print_callstack
5895 find_next_fp
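
To make the proposed cache concrete, here is a minimal sketch of keeping the last walk's (fp, return address) pairs per thread so the next walk can try to reuse a matching tail. All names and the layout (cached_frame_t, cstack_cache_t, cstack_cache_update) are hypothetical, not the actual Dr. Memory data structures:

```c
/* Minimal sketch of the proposed cache (all names hypothetical): remember the
 * (frame pointer, return address) pairs from the last callstack walk for this
 * thread so a later walk can reuse a matching tail instead of scanning with
 * find_next_fp().
 */
#include <stddef.h>
#include <string.h>

#define CSTACK_CACHE_MAX_FRAMES 64

typedef struct {
    void *fp;      /* frame pointer recorded for this frame */
    void *retaddr; /* return address found just above fp on the stack */
} cached_frame_t;

typedef struct {
    cached_frame_t frames[CSTACK_CACHE_MAX_FRAMES]; /* innermost frame first */
    size_t num_frames; /* valid entries from the most recent walk */
} cstack_cache_t;

/* After a successful full walk, store what we found for the next walk. */
static void
cstack_cache_update(cstack_cache_t *cache, const cached_frame_t *walked, size_t count)
{
    if (count > CSTACK_CACHE_MAX_FRAMES)
        count = CSTACK_CACHE_MAX_FRAMES;
    memcpy(cache->frames, walked, count * sizeof(*walked));
    cache->num_frames = count;
}
```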

Original issue: http://code.google.com/p/drmemory/issues/detail?id=1187

@derekbruening
Contributor Author

From [email protected] on April 17, 2013 08:21:24

**** TODO initial implementation: stats look good, but no perf win on cfrac

https://codereview.appspot.com/8789046/

cfrac full mode:

grep ^callstack `ls -1td logs/D*|head -1`/g*
callstack fp scans: 1, fp cache hits: 5445061
callstack walks: 10890123, cache hits: 7033166
callstack is_retaddr: 10890140, backdecode: 10890127, unreadable: 0
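
I.e., roughly 7033166 / 10890123 ≈ 65% of callstack walks appear to hit the new cache, which is why the stats look good even though the runtimes below show no win.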

w/o issue #1186 find_next_fp cache or issue #1187 cstack cache:
51.26user 0.04system 0:51.42elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (5120):
2.6% of time in INTERPRETER (134)
1.1% of time in INDIRECT BRANCH LOOKUP (57)
20.7% of time in FRAGMENT CACHE (1061)
75.5% of time in UNKNOWN (3868)
RES-pcsamples.0.891.html
95 address_to_frame
98 shadow_set_range
107 get_shadow_table
114 bitmapx2_byte
152 find_next_fp
160 print_callstack
175 rb_in_node

w/ issue #1186 find_next_fp cache:
38.75user 0.04system 0:38.88elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (3867):
0.4% of time in INTERPRETER (17)
1.2% of time in INDIRECT BRANCH LOOKUP (47)
26.0% of time in FRAGMENT CACHE (1005)
72.4% of time in UNKNOWN (2798)
RES-pcsamples.0.6680.html
58 module_lookup
65 memset
65 rb_in_node
76 address_to_frame
86 shadow_set_range
183 print_callstack

w/ issue #1187 cstack cache:
45.29user 0.04system 0:45.43elapsed 99%CPU (0avgtext+0avgdata 11840maxresident)k
ITIMER distribution (4523):
0.0% of time in APPLICATION (1)
0.6% of time in INTERPRETER (26)
1.5% of time in INDIRECT BRANCH LOOKUP (70)
21.6% of time in FRAGMENT CACHE (979)
76.2% of time in UNKNOWN (3447)
RES-pcsamples.0.11656.html
74 address_to_frame
88 shadow_set_range
95 find_next_fp
101 rb_in_node
148 fpra_cache_check
174 print_callstack

w/ both:
39.23user 0.04system 0:39.36elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (3917):
1.5% of time in INTERPRETER (58)
1.8% of time in INDIRECT BRANCH LOOKUP (69)
26.7% of time in FRAGMENT CACHE (1044)
70.1% of time in UNKNOWN (2746)
RES-pcsamples.0.17198.html
47 safe_read
50 rb_in_node
57 address_to_frame
58 __x86.get_pc_thunk.bx
59 memset
62 packed_callstack_hash
113 shadow_set_range
166 print_callstack
168 fpra_cache_check

**** TODO slight speedup on perlbench

this is the copy from goobuntu, diffmail run:

/usr/bin/time /work/drmemory/git/build_x86_rel/bin/drmemory.pl -replace_malloc -dr_ops "-msgbox_mask 12 -prof_pcs" -pause_at_assert -dr /work/dr/git/exports -- /extsw/spec2006/v1.2-perlbench/perlbench_base.gcc43-32bit -I./lib diffmail.pl 4 800 10 17 19 300 > diffmail.4.800.10.17.19.300.out

1719.92user 5.65system 28:51.26elapsed 99%CPU (0avgtext+0avgdata 429740maxresident)k
ITIMER distribution (171083):
0.0% of time in APPLICATION (2)
2.7% of time in INTERPRETER (4553)
0.5% of time in DISPATCH (926)
0.2% of time in SYSCALL HANDLER (345)
3.1% of time in INDIRECT BRANCH LOOKUP (5318)
45.7% of time in FRAGMENT CACHE (78125)
47.8% of time in UNKNOWN (81814)
RES-pcsamples.0.28924.html
1110 module_lookup
1139 add_to_delay_list
1320 packed_callstack_hash
1335 bitmapx2_set
1345 safe_read
1350 find_free_list_entry
1730 address_to_frame
1797 rb_in_node
2263 shadow_set_range
2575 fpra_cache_check
3286 find_next_fp
4000 print_callstack

this is vs the final issue #1186 run, pasting from below:
1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k

debug run:
app mallocs: 278012393, frees: 277917410, large mallocs: 10026
unique malloc stacks: 11512032
callstack fp scans: 21,553,498, fp cache hits: 851,322,232
callstack walks: 278,016,823, cache hits: 207,739,067
callstack is_retaddr: 611533459, backdecode: 443968947, unreadable: 0

again, fpra_cache_check shows up: can we speed that up?

**** TODO correctness: seems to have some problems? => bailing

The performance numbers show that the callstack cache helps in the absence of
the issue #1186 find_next_fp single-frame cache, but it is not as effective as
that cache on its own, so we would want both. When combined with the
single-frame cache, it actually causes a slight slowdown on cfrac and a slight
speedup on perlbench diffmail.

fpra_cache_check() shows up in profiling. For every frame on a new
callstack walk, we have to iterate the whole cache to look for a tail
match, as we can't just blindly use the whole tail. Any ideas on how to
improve this?
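
A rough sketch of that per-frame check, building on the hypothetical cstack_cache_t above (again, hypothetical names, not the actual fpra_cache_check() code):

```c
/* Hypothetical sketch of the per-frame tail-match check: scan the cached
 * frames for one whose fp matches the current frame and whose stored return
 * address still matches what the stack holds there.  This linear scan per
 * frame is the cost that shows up as fpra_cache_check() in the profile.
 * Returns the index where the reusable tail starts, or -1 for no match.
 */
static int
cstack_cache_find_tail(const cstack_cache_t *cache, void *cur_fp, void *cur_retaddr)
{
    size_t i;
    for (i = 0; i < cache->num_frames; i++) {
        if (cache->frames[i].fp == cur_fp && cache->frames[i].retaddr == cur_retaddr) {
            /* Frames i..num_frames-1 are a candidate tail to reuse. */
            return (int)i;
        }
    }
    return -1; /* no match: fall back to walking via find_next_fp() */
}
```

If the scan itself is the bottleneck, one possible direction would be indexing the cached frames by fp in a small hash table so the per-frame lookup is O(1); that is only a guess at an answer to the question above, not something measured here.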

Running drheapstat on cfrac w/ and w/o this feature and comparing
callstack.log gives 195 unique callstacks w/o vs. 216 w/, so there seems to be
a correctness problem. Unless you can see something I'm missing here, I may
abandon this, as it will take a bunch more work to ensure correctness and
there's still no guarantee of a perf win (though in theory it seems like we
should be able to get one).

perf summary:
cfrac full mode:
base (w/o issue #1186 find_next_fp cache or issue #1187 cstack cache):
51.26user 0.04system 0:51.42elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
w/ issue #1186 find_next_fp cache:
38.75user 0.04system 0:38.88elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
w/ issue #1187 cstack cache:
45.29user 0.04system 0:45.43elapsed 99%CPU (0avgtext+0avgdata 11840maxresident)k
w/ both:
39.23user 0.04system 0:39.36elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k

perlbench diffmail:
base (w/o either):
3359.52user 5.47system 56:14.44elapsed 99%CPU (0avgtext+0avgdata 425808maxresident)k
w/ issue #1186 find_next_fp cache:
1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
w/ both:
1719.92user 5.65system 28:51.26elapsed 99%CPU (0avgtext+0avgdata 429740maxresident)k

I implemented it quickly, so it's possible it would be worthwhile if done
right, and it makes sense for repeated mallocs w/ similar base callstacks.
But I can't seem to get much of a win.

Bailing for now. Leaving this issue open for either trying something like this cache again, or some other idea to improve callstack walking.
