Share callstack b/w Caller and Stack #808
Conversation
This adds benchmarks to measure the cost of logging with a stack trace, and of logging with both the caller and the stack trace.
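For context, a benchmark of this shape can be written against zap's public API. The following is a sketch with assumed names and setup, not necessarily the benchmark added in this commit:

```go
package zapbench

import (
	"io"
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// BenchmarkAddCallerAndStack measures logging with both the caller
// annotation and a stack trace enabled.
func BenchmarkAddCallerAndStack(b *testing.B) {
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(io.Discard), // discard output so I/O doesn't dominate
		zap.DebugLevel,
	)
	logger := zap.New(core, zap.AddCaller(), zap.AddStacktrace(zap.DebugLevel))

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		logger.Info("benchmark message")
	}
}
```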
Force-pushed from 7cae0d4 to 84a97f7.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #808      +/-   ##
==========================================
+ Coverage   98.31%   98.45%   +0.14%
==========================================
  Files          43       43
  Lines        2311     2464     +153
==========================================
+ Hits         2272     2426     +154
+ Misses         32       31       -1
  Partials        7        7
```
Continue to review full report at Codecov.
A log entry that includes both the caller and the stack trace requests stack information twice: once to determine the caller and once to build the stack trace. This change optimizes that by requesting call stack information once and sharing it between the two.

---

I ran the benchmarks with `-count 3`.

**Before**

```
BenchmarkAddCallerHook-4        1265024    912 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4        1389937    949 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4        1327134    862 ns/op   248 B/op   3 allocs/op
BenchmarkAddStack-4              565006   1911 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              644364   2180 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              547028   1896 ns/op   304 B/op   2 allocs/op
BenchmarkAddCallerAndStack-4     405864   2471 ns/op   552 B/op   5 allocs/op
BenchmarkAddCallerAndStack-4     492368   2940 ns/op   552 B/op   5 allocs/op
BenchmarkAddCallerAndStack-4     414658   2588 ns/op   552 B/op   5 allocs/op
```

**After**

```
BenchmarkAddCallerHook-4        1000000   1054 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4        1245952   1009 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4        1102441    974 ns/op   240 B/op   2 allocs/op
BenchmarkAddStack-4              908425   1269 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              774799   1345 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              836653   1575 ns/op   304 B/op   2 allocs/op
BenchmarkAddCallerAndStack-4     716456   1623 ns/op   544 B/op   4 allocs/op
BenchmarkAddCallerAndStack-4     562585   1826 ns/op   544 B/op   4 allocs/op
BenchmarkAddCallerAndStack-4     705979   1830 ns/op   544 B/op   4 allocs/op
```

On average,

1. Adding the caller alone is ~104 nanoseconds *slower*.
2. Adding the stack alone is ~600 nanoseconds *faster*.
3. Adding both the caller and the stack is ~906 nanoseconds *faster*.

The second result rightfully raises eyebrows because, at best, no change was expected there. A cursory investigation doesn't reveal any obvious reason for the speedup, unless there's something incredibly wrong with the change.
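To make the idea concrete, here is a minimal, self-contained sketch (not zap's actual code) of capturing the program counters once and deriving both the caller and the formatted stack trace from the same capture:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// captureOnce walks the stack a single time and reuses the captured
// program counters for both the caller and the stack trace.
func captureOnce(skip int) (caller runtime.Frame, trace string) {
	pcs := make([]uintptr, 64)
	// +2 skips runtime.Callers and captureOnce itself.
	n := runtime.Callers(skip+2, pcs)
	frames := runtime.CallersFrames(pcs[:n])

	var sb strings.Builder
	first := true
	for {
		frame, more := frames.Next()
		if first {
			caller = frame // the first frame doubles as the caller
			first = false
		}
		fmt.Fprintf(&sb, "%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
	return caller, sb.String()
}

func main() {
	caller, trace := captureOnce(0)
	fmt.Printf("caller: %s:%d\n", caller.File, caller.Line)
	fmt.Println(trace)
}
```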
The change previously caused a regression on the caller-only path because it captured more of the stack than necessary. This restores that path to its original performance by limiting the depth of the captured stack.

master:

```
BenchmarkAddCallerHook-4   1042035   1036 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4    916015   1096 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4   1178632   1193 ns/op   248 B/op   3 allocs/op
```

With this change:

```
BenchmarkAddCallerHook-4   1090138   1056 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4   1135695   1068 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4    896173   1122 ns/op   240 B/op   2 allocs/op
```
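A sketch of the caller-only optimization described in this comment, independent of zap's internals: when only the caller is needed, limit the capture to a single program counter instead of the whole stack.

```go
package main

import (
	"fmt"
	"runtime"
)

// callerOnly captures exactly one program counter, which is all that is
// needed to report the caller, avoiding the cost of a deep stack capture.
func callerOnly(skip int) (runtime.Frame, bool) {
	var pc [1]uintptr
	// +2 skips runtime.Callers and callerOnly itself.
	if runtime.Callers(skip+2, pc[:]) == 0 {
		return runtime.Frame{}, false
	}
	frame, _ := runtime.CallersFrames(pc[:]).Next()
	return frame, true
}

func main() {
	if frame, ok := callerOnly(0); ok {
		fmt.Printf("%s:%d\n", frame.File, frame.Line)
	}
}
```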
Force-pushed from c861963 to 3520ae7.
This tests the behavior of Callers when the limit is larger than 64 (the value used by the pool by default). To test this reliably, I had to add an option to disable skipping zap frames in makeStacktrace.
Force-pushed from d568a1e to 954fa36.
This verifies that when there is no limit on the stack depth, deep stacks are resized as expected.
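A helper like the `withStackDepth` used in the test snippet below can guarantee a deep stack by recursing before invoking a callback; its exact implementation here is an assumption for illustration.

```go
// withStackDepth calls fn with at least depth additional frames on the
// stack, so a capture is forced to grow past its initial 64-entry buffer.
func withStackDepth(depth int, fn func()) {
	if depth <= 0 {
		fn()
		return
	}
	withStackDepth(depth-1, fn)
}
```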
```go
for {
	// Skip the call to runtime.Callers and programCounter.Callers
	// so that the program counters start at our caller.
	numFrames = runtime.Callers(skip+2, pc)
```
We can do `numFrames :=` here, since we don't use the value outside of the for loop and always replace it.
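A sketch of the snippet above with this suggestion applied; the surrounding growth logic and the function wrapper are assumed for illustration:

```go
// callers captures program counters starting at our caller, growing the
// buffer until the whole stack fits. numFrames is declared inside the loop
// with := because its value is never used outside of it.
func callers(skip int, pc []uintptr) []uintptr {
	for {
		// Skip the call to runtime.Callers and callers itself so that
		// the program counters start at our caller.
		numFrames := runtime.Callers(skip+2, pc)
		if numFrames < len(pc) {
			return pc[:numFrames]
		}
		// The buffer filled up; the stack may be deeper, so grow and retry.
		pc = make([]uintptr, len(pc)*2)
	}
}
```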
```go
var trace string
withStackDepth(tt.stackDepth, func() {
	pcs := newProgramCounters().Callers(1, tt.limit)
	trace = makeStacktrace(pcs, false /* don't skip zap */)
```
I'd like to avoid this boolean arg (it's a test-only code path, and boolean args are generally not great).

Presumably you need a fair bit of stack depth; is it possible to use something outside of zap that takes a callback, e.g.,
https://golang.org/pkg/bytes/#Map
(hacky PoC: https://play.golang.org/p/bKrUH8kEreE)
and keep calling that function recursively until we hit the limit we want? That should add non-zap stack frames.
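A minimal sketch of that suggestion (the helper name is hypothetical): recurse through `bytes.Map`, which invokes its mapping callback from inside the standard library, so each level adds non-zap frames to the stack.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
)

// withNonZapFrames calls fn with roughly depth extra frames on the stack,
// routed through bytes.Map so that stdlib (non-zap) frames appear in between.
func withNonZapFrames(depth int, fn func()) {
	if depth <= 0 {
		fn()
		return
	}
	// bytes.Map calls the mapping function once per rune; with a single-rune
	// input the closure runs exactly once, one stack level deeper.
	bytes.Map(func(r rune) rune {
		withNonZapFrames(depth-1, fn)
		return r
	}, []byte("x"))
}

func main() {
	withNonZapFrames(5, func() {
		pc := make([]uintptr, 32)
		n := runtime.Callers(0, pc)
		fmt.Println("captured", n, "frames")
	})
}
```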
```diff
 	ce.Entry.Caller = zapcore.NewEntryCaller(0, "", 0, false)
 } else {
-	frame, _ := runtime.CallersFrames(pcs[callerIdx:]).Next()
+	frame, _ := runtime.CallersFrames(pcs).Next()
```
If we are adding both caller and stack, then the `stackSkip` passed in is 0 (i.e., no caller skip), so the caller value used here would be incorrect, right? We wouldn't be respecting `log.callerSkip`.
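A sketch of the adjustment this comment implies, using illustrative names (`pcs`, `callerSkip`) rather than zap's actual fields: when the stack was captured without applying the logger's caller skip, the caller has to be read that many frames into the captured program counters.

```go
// callerFrame returns the frame callerSkip levels into the captured
// program counters instead of blindly taking the first one.
func callerFrame(pcs []uintptr, callerSkip int) (runtime.Frame, bool) {
	if callerSkip < 0 || callerSkip >= len(pcs) {
		return runtime.Frame{}, false
	}
	frame, _ := runtime.CallersFrames(pcs[callerSkip:]).Next()
	return frame, true
}
```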
```go
if limit < len(pc) {
	pc = pc[:limit]
} else {
	pc = make([]uintptr, limit)
```
This case never gets hit in prod, right? Is there any way we can refactor the change to avoid adding a case that isn't needed in production?
Closing until I have a chance to look at this again in the future.
(Supersedes #808.)

If we use the `AddCaller(true)` option for a logger with `AddStacktrace(level)`, which captures a stack trace for the specified level or higher, Zap currently inspects the stack twice:

- `getCallerFrame` retrieves information about the immediate caller of the log entry
- `StackSkip` calls `takeStacktrace` to capture the entire stack trace and build a string from it

For the case where both caller and stack trace are requested, we can avoid redundant work by sharing information about the call stack between the logic for the two.

To accomplish this, the following high-level pieces were added:

```
type stacktrace
captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
func (*stacktrace) Next() (runtime.Frame, bool)

type stackFormatter
func newStackFormatter(*buffer.Buffer)
func (*stackFormatter) FormatStack(*stacktrace)
func (*stackFormatter) FormatFrame(runtime.Frame)
```

`captureStacktrace` captures stack traces (either just one frame or the entire stack) that can then be inspected manually (to extract caller information) and formatted into a string representation either one frame at a time or wholesale with `stackFormatter`.

This allows us to reorganize the logic that applies the AddCaller and AddStacktrace options so that it can:

- capture as much of the stack as it needs
- extract caller information from the first frame if we need it
- format that frame and the remaining frames into a stack trace

---

Benchmarking: I ran the new benchmark included in this PR, as well as the existing stack trace/caller benchmarks, 10 times for 10 seconds each.

```
% go test -run '^$' -bench 'Caller|Stack' -benchmem -benchtime 10s -count 10
[...]
name                       old time/op    new time/op    delta
StackField-96                 834ns ± 8%     819ns ±14%     ~     (p=0.971 n=10+10)
AddCallerHook-96              466ns ± 7%     480ns ± 1%     ~     (p=0.118 n=10+6)
AddCallerAndStacktrace-96    1.16µs ± 4%    0.78µs ± 8%  -32.60%  (p=0.000 n=9+10)
TakeStacktrace-96            2.66µs ± 2%    2.74µs ± 2%   +3.01%  (p=0.000 n=9+10)

name                       old alloc/op   new alloc/op   delta
StackField-96                  542B ± 0%      544B ± 0%   +0.33%  (p=0.000 n=10+10)
AddCallerHook-96               254B ± 0%      248B ± 0%   -2.36%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      652B ± 0%      429B ± 0%  -34.20%  (p=0.000 n=9+6)
TakeStacktrace-96              420B ± 0%      421B ± 0%   +0.24%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
StackField-96                  3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-96               3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
TakeStacktrace-96              2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Summary:

- AddCaller+AddStacktrace takes 2 fewer allocations and is ~30% faster
- AddCaller costs one allocation less
- takeStacktrace costs 3% more CPU

I think that the overall improvement is worth the slightly more expensive stack traces.

Refs GO-1181
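The following standalone sketch mirrors the shape of the pieces listed above so the flow is easier to follow; the types and bodies are illustrative simplifications, not zap's internal implementation (e.g. `depth` is a plain int and output goes to a `strings.Builder` instead of zap's `buffer.Buffer`).

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// stacktrace wraps a single capture of the call stack so it can be walked
// frame by frame.
type stacktrace struct{ frames *runtime.Frames }

// captureStacktrace walks the stack once; depth limits how many program
// counters are captured (0 means "as many as the buffer holds").
func captureStacktrace(skip, depth int) *stacktrace {
	pcs := make([]uintptr, 64)
	if depth > 0 && depth < len(pcs) {
		pcs = pcs[:depth]
	}
	n := runtime.Callers(skip+2, pcs) // +2 skips runtime.Callers and this func
	return &stacktrace{frames: runtime.CallersFrames(pcs[:n])}
}

func (st *stacktrace) Next() (runtime.Frame, bool) { return st.frames.Next() }

// stackFormatter renders frames either one at a time or wholesale.
type stackFormatter struct{ b *strings.Builder }

func newStackFormatter(b *strings.Builder) stackFormatter { return stackFormatter{b: b} }

func (sf *stackFormatter) FormatFrame(f runtime.Frame) {
	fmt.Fprintf(sf.b, "%s\n\t%s:%d\n", f.Function, f.File, f.Line)
}

func (sf *stackFormatter) FormatStack(st *stacktrace) {
	for {
		frame, more := st.Next()
		sf.FormatFrame(frame)
		if !more {
			return
		}
	}
}

func main() {
	st := captureStacktrace(0, 0) // capture as much of the stack as needed
	caller, more := st.Next()     // the first frame doubles as the caller
	fmt.Printf("caller: %s:%d\n", caller.File, caller.Line)

	var b strings.Builder
	sf := newStackFormatter(&b)
	sf.FormatFrame(caller) // the caller frame is part of the trace too
	if more {
		sf.FormatStack(st) // format the remaining frames
	}
	fmt.Print(b.String())
}
```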