
cache last callstack for further callstack walking performance improvement #1187

Open
derekbruening opened this issue Nov 28, 2014 · 1 comment


@derekbruening
Contributor

From [email protected] on April 16, 2013 19:53:10

Callstack walking for optimized code is complex, and I could cite quite a few issue #s here on tweaks for both functionality and performance. This issue covers going a step beyond the work in issue #1186 and caching the full last callstack, with fp's, to try to avoid find_next_fp() even more (a sketch of the idea follows the profile below). find_next_fp() still shows up on perlbench diffmail:

1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
ITIMER distribution (182407):
0.0% of time in APPLICATION (1)
3.2% of time in INTERPRETER (5785)
0.3% of time in DISPATCH (628)
0.1% of time in SYSCALL HANDLER (103)
3.0% of time in INDIRECT BRANCH LOOKUP (5421)
43.8% of time in FRAGMENT CACHE (79891)
49.7% of time in UNKNOWN (90578)
RES-pcsamples.0.7276.html
897 get_shadow_table
1011 add_to_delay_list
1109 bitmapx2_set
1262 find_free_list_entry
1307 packed_callstack_hash
1438 safe_read
1662 module_lookup
2334 shadow_set_range
2458 address_to_frame
2999 rb_in_node
3925 print_callstack
5895 find_next_fp
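
To make the proposed cache concrete, here is a minimal sketch of keeping the last walk's (fp, return address) pairs per thread so the next walk can try to reuse a matching tail. All names and the layout (cached_frame_t, cstack_cache_t, cstack_cache_update) are hypothetical, not the actual Dr. Memory data structures:

```c
/* Minimal sketch of the proposed cache (all names hypothetical): remember the
 * (frame pointer, return address) pairs from the last callstack walk for this
 * thread so a later walk can reuse a matching tail instead of scanning with
 * find_next_fp().
 */
#include <stddef.h>
#include <string.h>

#define CSTACK_CACHE_MAX_FRAMES 64

typedef struct {
    void *fp;      /* frame pointer recorded for this frame */
    void *retaddr; /* return address found just above fp on the stack */
} cached_frame_t;

typedef struct {
    cached_frame_t frames[CSTACK_CACHE_MAX_FRAMES]; /* innermost frame first */
    size_t num_frames; /* valid entries from the most recent walk */
} cstack_cache_t;

/* After a successful full walk, store what we found for the next walk. */
static void
cstack_cache_update(cstack_cache_t *cache, const cached_frame_t *walked, size_t count)
{
    if (count > CSTACK_CACHE_MAX_FRAMES)
        count = CSTACK_CACHE_MAX_FRAMES;
    memcpy(cache->frames, walked, count * sizeof(*walked));
    cache->num_frames = count;
}
```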

Original issue: http://code.google.com/p/drmemory/issues/detail?id=1187

@derekbruening
Contributor Author

From [email protected] on April 17, 2013 08:21:24

**** TODO initial implementation: stats look good, but no perf win on cfrac

https://codereview.appspot.com/8789046/

cfrac full mode:

grep ^callstack `ls -1td logs/D*|head -1`/g*
callstack fp scans: 1, fp cache hits: 5445061
callstack walks: 10890123, cache hits: 7033166
callstack is_retaddr: 10890140, backdecode: 10890127, unreadable: 0
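
I.e., roughly 7033166 / 10890123 ≈ 65% of callstack walks appear to hit the new cache, which is why the stats look good even though the runtimes below show no win.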

w/o issue #1186 find_next_fp cache or issue #1187 cstack cache:
51.26user 0.04system 0:51.42elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (5120):
2.6% of time in INTERPRETER (134)
1.1% of time in INDIRECT BRANCH LOOKUP (57)
20.7% of time in FRAGMENT CACHE (1061)
75.5% of time in UNKNOWN (3868)
RES-pcsamples.0.891.html
95 address_to_frame
98 shadow_set_range
107 get_shadow_table
114 bitmapx2_byte
152 find_next_fp
160 print_callstack
175 rb_in_node

w/ issue #1186 find_next_fp cache:
38.75user 0.04system 0:38.88elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (3867):
0.4% of time in INTERPRETER (17)
1.2% of time in INDIRECT BRANCH LOOKUP (47)
26.0% of time in FRAGMENT CACHE (1005)
72.4% of time in UNKNOWN (2798)
RES-pcsamples.0.6680.html
58 module_lookup
65 memset
65 rb_in_node
76 address_to_frame
86 shadow_set_range
183 print_callstack

w/ issue #1187 cstack cache:
45.29user 0.04system 0:45.43elapsed 99%CPU (0avgtext+0avgdata 11840maxresident)k
ITIMER distribution (4523):
0.0% of time in APPLICATION (1)
0.6% of time in INTERPRETER (26)
1.5% of time in INDIRECT BRANCH LOOKUP (70)
21.6% of time in FRAGMENT CACHE (979)
76.2% of time in UNKNOWN (3447)
RES-pcsamples.0.11656.html
74 address_to_frame
88 shadow_set_range
95 find_next_fp
101 rb_in_node
148 fpra_cache_check
174 print_callstack

w/ both:
39.23user 0.04system 0:39.36elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (3917):
1.5% of time in INTERPRETER (58)
1.8% of time in INDIRECT BRANCH LOOKUP (69)
26.7% of time in FRAGMENT CACHE (1044)
70.1% of time in UNKNOWN (2746)
RES-pcsamples.0.17198.html
47 safe_read
50 rb_in_node
57 address_to_frame
58 __x86.get_pc_thunk.bx
59 memset
62 packed_callstack_hash
113 shadow_set_range
166 print_callstack
168 fpra_cache_check

**** TODO slight speedup on perlbench

this is the copy from goobuntu, diffmail run:

/usr/bin/time /work/drmemory/git/build_x86_rel/bin/drmemory.pl -replace_malloc -dr_ops "-msgbox_mask 12 -prof_pcs" -pause_at_assert -dr /work/dr/git/exports -- /extsw/spec2006/v1.2-perlbench/perlbench_base.gcc43-32bit -I./lib diffmail.pl 4 800 10 17 19 300 > diffmail.4.800.10.17.19.300.out

1719.92user 5.65system 28:51.26elapsed 99%CPU (0avgtext+0avgdata 429740maxresident)k
ITIMER distribution (171083):
0.0% of time in APPLICATION (2)
2.7% of time in INTERPRETER (4553)
0.5% of time in DISPATCH (926)
0.2% of time in SYSCALL HANDLER (345)
3.1% of time in INDIRECT BRANCH LOOKUP (5318)
45.7% of time in FRAGMENT CACHE (78125)
47.8% of time in UNKNOWN (81814)
RES-pcsamples.0.28924.html
1110 module_lookup
1139 add_to_delay_list
1320 packed_callstack_hash
1335 bitmapx2_set
1345 safe_read
1350 find_free_list_entry
1730 address_to_frame
1797 rb_in_node
2263 shadow_set_range
2575 fpra_cache_check
3286 find_next_fp
4000 print_callstack

this is vs the final issue #1186 run, pasting from below:
1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k

debug run:
app mallocs: 278012393, frees: 277917410, large mallocs: 10026
unique malloc stacks: 11512032
callstack fp scans: 21,553,498, fp cache hits: 851,322,232
callstack walks: 278,016,823, cache hits: 207,739,067
callstack is_retaddr: 611533459, backdecode: 443968947, unreadable: 0

again, fpra_cache_check shows up: can we speed that up?

**** TODO correctness: seems to have some problems? => bailing

The performance numbers show that the callstack cache helps in the absence of
the issue #1186 find_next_fp single-frame cache, but it is not as effective as
that cache on its own, so we would want both. When combined with the
single-frame cache, it actually causes a slight slowdown on cfrac and a slight
speedup on perlbench diffmail.

fpra_cache_check() shows up in profiling. For every frame on a new
callstack walk, we have to iterate the whole cache to look for a tail
match, as we can't just blindly use the whole tail. Any ideas on how to
improve this?
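
A rough sketch of that per-frame check, building on the hypothetical cstack_cache_t above (again, hypothetical names, not the actual fpra_cache_check() code):

```c
/* Hypothetical sketch of the per-frame tail-match check: scan the cached
 * frames for one whose fp matches the current frame and whose stored return
 * address still matches what the stack holds there.  This linear scan per
 * frame is the cost that shows up as fpra_cache_check() in the profile.
 * Returns the index where the reusable tail starts, or -1 for no match.
 */
static int
cstack_cache_find_tail(const cstack_cache_t *cache, void *cur_fp, void *cur_retaddr)
{
    size_t i;
    for (i = 0; i < cache->num_frames; i++) {
        if (cache->frames[i].fp == cur_fp && cache->frames[i].retaddr == cur_retaddr) {
            /* Frames i..num_frames-1 are a candidate tail to reuse. */
            return (int)i;
        }
    }
    return -1; /* no match: fall back to walking via find_next_fp() */
}
```

If the scan itself is the bottleneck, one possible direction would be indexing the cached frames by fp in a small hash table so the per-frame lookup is O(1); that is only a guess at an answer to the question above, not something measured here.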

Running drheapstat on cfrac w/ and w/o this feature and comparing
callstack.log gives 195 unique callstacks w/o vs. 216 w/, so there seems to be
a correctness problem. Unless you can see something I'm missing here, I may
abandon this, as it will take a bunch more work to ensure correctness and
there's still no guarantee of a perf win (though in theory it seems like we
should be able to get one).

perf summary:
cfrac full mode:
base (w/o issue #1186 find_next_fp cache or issue #1187 cstack cache):
51.26user 0.04system 0:51.42elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
w/ issue #1186 find_next_fp cache:
38.75user 0.04system 0:38.88elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
w/ issue #1187 cstack cache:
45.29user 0.04system 0:45.43elapsed 99%CPU (0avgtext+0avgdata 11840maxresident)k
w/ both:
39.23user 0.04system 0:39.36elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k

perlbench diffmail:
base (w/o either):
3359.52user 5.47system 56:14.44elapsed 99%CPU (0avgtext+0avgdata 425808maxresident)k
w/ issue #1186 find_next_fp cache:
1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
w/ both:
1719.92user 5.65system 28:51.26elapsed 99%CPU (0avgtext+0avgdata 429740maxresident)k

I implemented it quickly, so it's possible it would be worthwhile if done
right, and it makes sense for repeated mallocs w/ similar base callstacks.
But I can't seem to get much of a win.

Bailing for now. Leaving this issue open for either trying something like this cache again, or some other idea to improve callstack walking.
