Share callstack b/w Caller and Stack #808
Conversation
This adds benchmarks to measure the cost of logging with a stack trace, and of logging with both the caller and the stack trace.
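For context, a benchmark of this shape can be written against zap's public API. The following is a sketch with assumed names and setup, not necessarily the benchmark added in this commit:

```go
package zapbench

import (
	"io"
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// BenchmarkAddCallerAndStack measures logging with both the caller
// annotation and a stack trace enabled.
func BenchmarkAddCallerAndStack(b *testing.B) {
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(io.Discard), // discard output so I/O doesn't dominate
		zap.DebugLevel,
	)
	logger := zap.New(core, zap.AddCaller(), zap.AddStacktrace(zap.DebugLevel))

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		logger.Info("benchmark message")
	}
}
```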
Force-pushed from 7cae0d4 to 84a97f7.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #808      +/-   ##
==========================================
+ Coverage   98.31%   98.45%   +0.14%
==========================================
  Files          43       43
  Lines        2311     2464     +153
==========================================
+ Hits         2272     2426     +154
+ Misses         32       31       -1
  Partials        7        7
```
Continue to review full report at Codecov.
A log entry that includes both the caller and the stack trace requests stack information twice: once to determine the caller and once to build the stack trace. This change optimizes that by requesting call stack information once and sharing it between the two.

---

I ran the benchmarks with `-count 3`.

**Before**

```
BenchmarkAddCallerHook-4        1265024    912 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4        1389937    949 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4        1327134    862 ns/op   248 B/op   3 allocs/op
BenchmarkAddStack-4              565006   1911 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              644364   2180 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              547028   1896 ns/op   304 B/op   2 allocs/op
BenchmarkAddCallerAndStack-4     405864   2471 ns/op   552 B/op   5 allocs/op
BenchmarkAddCallerAndStack-4     492368   2940 ns/op   552 B/op   5 allocs/op
BenchmarkAddCallerAndStack-4     414658   2588 ns/op   552 B/op   5 allocs/op
```

**After**

```
BenchmarkAddCallerHook-4        1000000   1054 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4        1245952   1009 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4        1102441    974 ns/op   240 B/op   2 allocs/op
BenchmarkAddStack-4              908425   1269 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              774799   1345 ns/op   304 B/op   2 allocs/op
BenchmarkAddStack-4              836653   1575 ns/op   304 B/op   2 allocs/op
BenchmarkAddCallerAndStack-4     716456   1623 ns/op   544 B/op   4 allocs/op
BenchmarkAddCallerAndStack-4     562585   1826 ns/op   544 B/op   4 allocs/op
BenchmarkAddCallerAndStack-4     705979   1830 ns/op   544 B/op   4 allocs/op
```

On average,

1. Adding the caller alone is ~104 nanoseconds *slower*.
2. Adding the stack alone is ~600 nanoseconds *faster*.
3. Adding both the caller and the stack is ~906 nanoseconds *faster*.

The second result rightfully raises eyebrows because, at best, no change was expected there. A cursory investigation doesn't reveal any obvious reason for the speedup, unless there's something incredibly wrong with the change.
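To make the idea concrete, here is a minimal, self-contained sketch (not zap's actual code) of capturing the program counters once and deriving both the caller and the formatted stack trace from the same capture:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// captureOnce walks the stack a single time and reuses the captured
// program counters for both the caller and the stack trace.
func captureOnce(skip int) (caller runtime.Frame, trace string) {
	pcs := make([]uintptr, 64)
	// +2 skips runtime.Callers and captureOnce itself.
	n := runtime.Callers(skip+2, pcs)
	frames := runtime.CallersFrames(pcs[:n])

	var sb strings.Builder
	first := true
	for {
		frame, more := frames.Next()
		if first {
			caller = frame // the first frame doubles as the caller
			first = false
		}
		fmt.Fprintf(&sb, "%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
	return caller, sb.String()
}

func main() {
	caller, trace := captureOnce(0)
	fmt.Printf("caller: %s:%d\n", caller.File, caller.Line)
	fmt.Println(trace)
}
```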
The change previously caused a regression on the caller-only path because it captured more of the stack than necessary. This restores that path to its original performance by limiting the depth of the captured stack.

master:

```
BenchmarkAddCallerHook-4   1042035   1036 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4    916015   1096 ns/op   248 B/op   3 allocs/op
BenchmarkAddCallerHook-4   1178632   1193 ns/op   248 B/op   3 allocs/op
```

With this change:

```
BenchmarkAddCallerHook-4   1090138   1056 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4   1135695   1068 ns/op   240 B/op   2 allocs/op
BenchmarkAddCallerHook-4    896173   1122 ns/op   240 B/op   2 allocs/op
```
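A sketch of the caller-only optimization described in this comment, independent of zap's internals: when only the caller is needed, limit the capture to a single program counter instead of the whole stack.

```go
package main

import (
	"fmt"
	"runtime"
)

// callerOnly captures exactly one program counter, which is all that is
// needed to report the caller, avoiding the cost of a deep stack capture.
func callerOnly(skip int) (runtime.Frame, bool) {
	var pc [1]uintptr
	// +2 skips runtime.Callers and callerOnly itself.
	if runtime.Callers(skip+2, pc[:]) == 0 {
		return runtime.Frame{}, false
	}
	frame, _ := runtime.CallersFrames(pc[:]).Next()
	return frame, true
}

func main() {
	if frame, ok := callerOnly(0); ok {
		fmt.Printf("%s:%d\n", frame.File, frame.Line)
	}
}
```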
Force-pushed from c861963 to 3520ae7.
This tests the behavior of Callers when the limit is larger than 64 (the value used by the pool by default). To test this reliably, I had to add an option to disable skipping zap frames in makeStacktrace.
Force-pushed from d568a1e to 954fa36.
This verifies that when there is no limit on the stack depth, deep stacks are resized as expected.
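A helper like the `withStackDepth` used in the test snippet below can guarantee a deep stack by recursing before invoking a callback; its exact implementation here is an assumption for illustration.

```go
// withStackDepth calls fn with at least depth additional frames on the
// stack, so a capture is forced to grow past its initial 64-entry buffer.
func withStackDepth(depth int, fn func()) {
	if depth <= 0 {
		fn()
		return
	}
	withStackDepth(depth-1, fn)
}
```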
```go
for {
	// Skip the call to runtime.Callers and programCounter.Callers
	// so that the program counters start at our caller.
	numFrames = runtime.Callers(skip+2, pc)
```
We can do `numFrames :=` here, since we don't use the value outside of the for loop and always replace it.
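A sketch of the snippet above with this suggestion applied; the surrounding growth logic and the function wrapper are assumed for illustration:

```go
// callers captures program counters starting at our caller, growing the
// buffer until the whole stack fits. numFrames is declared inside the loop
// with := because its value is never used outside of it.
func callers(skip int, pc []uintptr) []uintptr {
	for {
		// Skip the call to runtime.Callers and callers itself so that
		// the program counters start at our caller.
		numFrames := runtime.Callers(skip+2, pc)
		if numFrames < len(pc) {
			return pc[:numFrames]
		}
		// The buffer filled up; the stack may be deeper, so grow and retry.
		pc = make([]uintptr, len(pc)*2)
	}
}
```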
```go
var trace string
withStackDepth(tt.stackDepth, func() {
	pcs := newProgramCounters().Callers(1, tt.limit)
	trace = makeStacktrace(pcs, false /* don't skip zap */)
```
I'd like to avoid this boolean arg (it's a test-only code path, and boolean args are generally not great).

Presumably you need a fair bit of stack depth; is it possible to use something outside of zap that takes a callback, e.g.,
https://golang.org/pkg/bytes/#Map
(hacky PoC: https://play.golang.org/p/bKrUH8kEreE)
and keep calling that function recursively until we hit the limit we want? That should add non-zap stack frames.
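A minimal sketch of that suggestion (the helper name is hypothetical): recurse through `bytes.Map`, which invokes its mapping callback from inside the standard library, so each level adds non-zap frames to the stack.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
)

// withNonZapFrames calls fn with roughly depth extra frames on the stack,
// routed through bytes.Map so that stdlib (non-zap) frames appear in between.
func withNonZapFrames(depth int, fn func()) {
	if depth <= 0 {
		fn()
		return
	}
	// bytes.Map calls the mapping function once per rune; with a single-rune
	// input the closure runs exactly once, one stack level deeper.
	bytes.Map(func(r rune) rune {
		withNonZapFrames(depth-1, fn)
		return r
	}, []byte("x"))
}

func main() {
	withNonZapFrames(5, func() {
		pc := make([]uintptr, 32)
		n := runtime.Callers(0, pc)
		fmt.Println("captured", n, "frames")
	})
}
```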
```diff
 	ce.Entry.Caller = zapcore.NewEntryCaller(0, "", 0, false)
 } else {
-	frame, _ := runtime.CallersFrames(pcs[callerIdx:]).Next()
+	frame, _ := runtime.CallersFrames(pcs).Next()
```
If we are adding both caller and stack, then the `stackSkip` passed in is 0 (i.e., no caller skip), so the caller value used here would be incorrect, right? We wouldn't be respecting `log.callerSkip`.
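A sketch of the adjustment this comment implies, using illustrative names (`pcs`, `callerSkip`) rather than zap's actual fields: when the stack was captured without applying the logger's caller skip, the caller has to be read that many frames into the captured program counters.

```go
// callerFrame returns the frame callerSkip levels into the captured
// program counters instead of blindly taking the first one.
func callerFrame(pcs []uintptr, callerSkip int) (runtime.Frame, bool) {
	if callerSkip < 0 || callerSkip >= len(pcs) {
		return runtime.Frame{}, false
	}
	frame, _ := runtime.CallersFrames(pcs[callerSkip:]).Next()
	return frame, true
}
```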
```go
if limit < len(pc) {
	pc = pc[:limit]
} else {
	pc = make([]uintptr, limit)
```
This case never gets hit in prod, right? Is there any way we can refactor the change to avoid adding a case that isn't needed in production?
Closing until I have a chance to look at this again in the future.
(Supersedes #808.)

If we use the `AddCaller(true)` option for a logger with `AddStacktrace(level)`, which captures a stack trace for the specified level or higher, Zap currently inspects the stack twice:

- `getCallerFrame` retrieves information about the immediate caller of the log entry
- `StackSkip` calls `takeStacktrace` to capture the entire stack trace and build a string from it

For the case where both caller and stack trace are requested, we can avoid redundant work by sharing information about the call stack between the logic for the two.

To accomplish this, the following high-level pieces were added:

```
type stacktrace
captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
func (*stacktrace) Next() (runtime.Frame, bool)

type stackFormatter
func newStackFormatter(*buffer.Buffer)
func (*stackFormatter) FormatStack(*stacktrace)
func (*stackFormatter) FormatFrame(runtime.Frame)
```

`captureStacktrace` captures stack traces (either just one frame or the entire stack) that can then be inspected manually (to extract caller information) and formatted into a string representation either one frame at a time or wholesale with `stackFormatter`.

This allows us to reorganize the logic that applies the AddCaller and AddStacktrace options so that it can:

- capture as much of the stack as it needs
- extract caller information from the first frame if we need it
- format that frame and the remaining frames into a stack trace

---

Benchmarking: I ran the new benchmark included in this PR, as well as the existing stack trace/caller benchmarks, 10 times for 10 seconds each.

```
% go test -run '^$' -bench 'Caller|Stack' -benchmem -benchtime 10s -count 10
[...]
name                       old time/op    new time/op    delta
StackField-96                 834ns ± 8%     819ns ±14%     ~     (p=0.971 n=10+10)
AddCallerHook-96              466ns ± 7%     480ns ± 1%     ~     (p=0.118 n=10+6)
AddCallerAndStacktrace-96    1.16µs ± 4%    0.78µs ± 8%  -32.60%  (p=0.000 n=9+10)
TakeStacktrace-96            2.66µs ± 2%    2.74µs ± 2%   +3.01%  (p=0.000 n=9+10)

name                       old alloc/op   new alloc/op   delta
StackField-96                  542B ± 0%      544B ± 0%   +0.33%  (p=0.000 n=10+10)
AddCallerHook-96               254B ± 0%      248B ± 0%   -2.36%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      652B ± 0%      429B ± 0%  -34.20%  (p=0.000 n=9+6)
TakeStacktrace-96              420B ± 0%      421B ± 0%   +0.24%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
StackField-96                  3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-96               3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
TakeStacktrace-96              2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Summary:

- AddCaller+AddStacktrace takes 2 fewer allocations and is ~30% faster
- AddCaller costs one allocation less
- takeStacktrace costs 3% more CPU

I think that the overall improvement is worth the slightly more expensive stack traces.

Refs GO-1181
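The following standalone sketch mirrors the shape of the pieces listed above so the flow is easier to follow; the types and bodies are illustrative simplifications, not zap's internal implementation (e.g. `depth` is a plain int and output goes to a `strings.Builder` instead of zap's `buffer.Buffer`).

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// stacktrace wraps a single capture of the call stack so it can be walked
// frame by frame.
type stacktrace struct{ frames *runtime.Frames }

// captureStacktrace walks the stack once; depth limits how many program
// counters are captured (0 means "as many as the buffer holds").
func captureStacktrace(skip, depth int) *stacktrace {
	pcs := make([]uintptr, 64)
	if depth > 0 && depth < len(pcs) {
		pcs = pcs[:depth]
	}
	n := runtime.Callers(skip+2, pcs) // +2 skips runtime.Callers and this func
	return &stacktrace{frames: runtime.CallersFrames(pcs[:n])}
}

func (st *stacktrace) Next() (runtime.Frame, bool) { return st.frames.Next() }

// stackFormatter renders frames either one at a time or wholesale.
type stackFormatter struct{ b *strings.Builder }

func newStackFormatter(b *strings.Builder) stackFormatter { return stackFormatter{b: b} }

func (sf *stackFormatter) FormatFrame(f runtime.Frame) {
	fmt.Fprintf(sf.b, "%s\n\t%s:%d\n", f.Function, f.File, f.Line)
}

func (sf *stackFormatter) FormatStack(st *stacktrace) {
	for {
		frame, more := st.Next()
		sf.FormatFrame(frame)
		if !more {
			return
		}
	}
}

func main() {
	st := captureStacktrace(0, 0) // capture as much of the stack as needed
	caller, more := st.Next()     // the first frame doubles as the caller
	fmt.Printf("caller: %s:%d\n", caller.File, caller.Line)

	var b strings.Builder
	sf := newStackFormatter(&b)
	sf.FormatFrame(caller) // the caller frame is part of the trace too
	if more {
		sf.FormatStack(st) // format the remaining frames
	}
	fmt.Print(b.String())
}
```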