util/log: Speed up vmodule support and warn about it in help text #29017
Conversation
This bit me pretty badly when doing performance testing. I turned on vmodule for a file which hardly logs at all, but all the calls to log.VDepth slowed the system down to a crawl. Alternatively, we could improve log.VDepth to make this less of a concern. It's already been handcrafted to avoid allocations, but holding the logging lock while calling runtime.Callers is not good.

Release note: None
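For context, here is a minimal sketch of the pattern the description criticizes, with hypothetical stand-in names (the actual clog.go internals may differ): every verbosity check takes the global logging mutex before walking the stack, so concurrent call sites serialize on one lock even when vmodule ultimately filters the message out.

```go
package main

import (
	"runtime"
	"sync"
)

var loggingMu sync.Mutex // hypothetical stand-in for logging.mu
var pcs [1]uintptr       // shared PC buffer; the lock protects it

func vDepthSlow(depth int) uintptr {
	loggingMu.Lock() // every call site serializes here...
	defer loggingMu.Unlock()
	// ...and the stack walk runs while the lock is held.
	if runtime.Callers(depth+2, pcs[:]) == 0 {
		return 0
	}
	return pcs[0]
}

func main() {
	_ = vDepthSlow(0)
}
```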
Pro-tip: run with TESTFLAGS="-count 10" and then compare the output of the two runs with benchstat.
pkg/util/log/clog.go, line 1401 at r2 (raw file):

```
// Here is another cheap but safe test to see if vmodule is enabled.
if atomic.LoadInt32(&logging.filterLength) > 0 {
	pcs := logging.pcsPool.Get().([1]uintptr)
```

I don't see a call to `logging.pcsPool.Put`, which makes me suspect that you're allocating on every call here. You can also run with TESTFLAGS="-benchmem" to see the number of allocations per iteration.
pkg/util/log/clog.go, line 1410 at r2 (raw file):

```
	v = logging.setV(pcs[0])
}
logging.mu.Unlock()
```

I think there should be a call to `logging.pcsPool.Put(pcs)` here.
Benchstat output with `-count 10`:

```
name                   old time/op    new time/op    delta
VDepthWithVModule-8     790ns ± 1%     254ns ± 6%   -67.83%  (p=0.000 n=9+10)

name                   old alloc/op   new alloc/op   delta
VDepthWithVModule-8     0.00B          0.00B           ~     (all equal)

name                   old allocs/op  new allocs/op  delta
VDepthWithVModule-8     0.00           0.00            ~     (all equal)
```
pkg/util/log/clog.go, line 1401 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

> I don't see a call to `logging.pcsPool.Put`, which makes me suspect that you're allocating on every call here. You can also run with TESTFLAGS="-benchmem" to see the number of allocations per iteration.
🤦 Thanks, fixed.

Oddly enough, `-benchmem` actually doesn't show any allocations without the `Put` calls. Seems like an accounting bug of some sort, although we could just be misunderstanding something. Including the `Put`s is obviously what we want, though, and it performs 10-20% better than without them, so they clearly have some effect.
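To make the `-benchmem` discussion concrete, here is a small benchmark sketch (a hypothetical stand-in, not the repo's actual BenchmarkVDepthWithVModule) showing how allocations per iteration get reported for a pool round-trip; `b.ReportAllocs()` has the same effect as passing `-benchmem` on the command line.

```go
package log_test

import (
	"sync"
	"testing"
)

var pool = sync.Pool{
	// Pool a pointer to the array: boxing a plain [1]uintptr into the
	// interface{} handled by Get/Put would itself allocate.
	New: func() interface{} { return new([1]uintptr) },
}

func BenchmarkPoolRoundTrip(b *testing.B) {
	b.ReportAllocs() // report allocs/op, like running with -benchmem
	for i := 0; i < b.N; i++ {
		buf := pool.Get().(*[1]uintptr)
		pool.Put(buf)
	}
}
```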
pkg/util/log/clog.go, line 1410 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

> I think there should be a call to `logging.pcsPool.Put(pcs)` here.

Done. And in the early-return case.
By using a sync.Pool to manage the PC buffers for us. This avoids locking while calling runtime.Callers without needing to allocate a new buffer every time VDepth is called. Take the benchmarks with a grain of salt since they're just from my laptop.

Before the change:

```
$ make bench PKG=./pkg/util/log/ BENCHES=BenchmarkVDepthWithVModule
BenchmarkVDepthWithVModule-8    2000000    825 ns/op
```

After the change:

```
$ make bench PKG=./pkg/util/log/ BENCHES=BenchmarkVDepthWithVModule
BenchmarkVDepthWithVModule-8    5000000    284 ns/op
```

Release note: None
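A self-contained sketch of the approach the commit message describes, with illustrative names (pcsPool, callerPC) rather than the exact clog.go code: a sync.Pool hands out small PC buffers, so runtime.Callers runs without holding any logging lock and without a fresh allocation per call.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

var pcsPool = sync.Pool{
	New: func() interface{} { return new([1]uintptr) },
}

// callerPC fetches the caller's program counter using a pooled buffer.
// No lock is held during the stack walk, and steady-state calls do not
// allocate because buffers are recycled through the pool.
func callerPC(depth int) uintptr {
	buf := pcsPool.Get().(*[1]uintptr)
	if runtime.Callers(depth+2, buf[:]) == 0 {
		pcsPool.Put(buf) // return the buffer on the early-exit path too
		return 0
	}
	pc := buf[0]
	pcsPool.Put(buf)
	return pc
}

func main() {
	fmt.Printf("caller pc: %#x\n", callerPC(0))
}
```

Note the explicit Put on both exit paths, matching the review thread above; a defer would also work but adds a small constant cost on this hot path.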
bors r+
29017: util/log: Speed up vmodule support and warn about it in help text r=a-robinson a=a-robinson

It's possible that with the second commit the help text change is no longer needed, but I'm curious what you think about that. I haven't tried re-running tpc-c 10k with this, so while I'm pretty sure this is a lot better, I don't know how much of a performance hit there still is.

Co-authored-by: Alex Robinson <[email protected]>
Build succeeded
29049: backport-2.1: util/log: Speed up vmodule support and warn about it in help text r=a-robinson a=a-robinson

Backport 2/2 commits from #29017.

/cc @cockroachdb/release

---

It's possible that with the second commit the help text change is no longer needed, but I'm curious what you think about that. I haven't tried re-running tpc-c 10k with this, so while I'm pretty sure this is a lot better, I don't know how much of a performance hit there still is.

Co-authored-by: Alex Robinson <[email protected]>
Fixes cockroachdb#17770.

This commit fixes the last user of `runtime.Callers` that misused the stdlib function by translating the PC values it returned directly into symbolic information (see https://golang.org/pkg/runtime/#Callers). Go's documentation warns that this is a recipe for disaster when mixed with mid-stack inlining.

The other concern in cockroachdb#17770 was this comment: cockroachdb#17770 (comment). This was discussed in golang/go#29582 and addressed in golang/go@956879d.

An alternative would be to use `runtime.Caller` here, but that would force an allocation that was hard-earned in cockroachdb#29017. Instead, this commit avoids any performance hit.

```
name                   old time/op    new time/op    delta
Header-4                267ns ± 1%     268ns ± 0%      ~     (p=0.584 n=10+10)
VDepthWithVModule-4     260ns ± 3%     255ns ± 1%    -1.87%  (p=0.018 n=10+9)

name                   old alloc/op   new alloc/op   delta
Header-4                0.00B          0.00B           ~     (all equal)
VDepthWithVModule-4     0.00B          0.00B           ~     (all equal)

name                   old allocs/op  new allocs/op  delta
Header-4                0.00           0.00            ~     (all equal)
VDepthWithVModule-4     0.00           0.00            ~     (all equal)
```

Release note: None
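A hedged sketch of what the fix described above looks like in general (not the actual clog.go diff): PCs returned by runtime.Callers are resolved through runtime.CallersFrames, which understands mid-stack inlining, instead of being fed directly to runtime.FuncForPC.

```go
package main

import (
	"fmt"
	"runtime"
)

// callerFileLine resolves the caller's file and line via CallersFrames,
// which correctly accounts for inlined frames.
func callerFileLine() (string, int) {
	var pcs [1]uintptr
	n := runtime.Callers(2, pcs[:])
	if n == 0 {
		return "unknown", 0
	}
	frames := runtime.CallersFrames(pcs[:n])
	frame, _ := frames.Next()
	return frame.File, frame.Line
}

func main() {
	file, line := callerFileLine()
	fmt.Printf("%s:%d\n", file, line)
}
```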
35014: kv: introduce new "max outstanding size" setting to txnPipeliner r=nvanbenschoten a=nvanbenschoten

Fixes #32522.

This change creates a new cluster setting called:

```
kv.transaction.write_pipelining_max_outstanding_size
```

It limits the size in bytes that can be dedicated to tracking in-flight writes that have been pipelined. Once this limit is hit, no more writes will be pipelined by a transaction until some of the writes are proven and removed from the outstanding write set.

This change once again illustrates the need for periodic proving of outstanding writes. We touch upon that in the type definition's comment and in #35009.

Release note: None

35199: log: fix remaining misuse of runtime.Callers/runtime.FuncForPC r=nvanbenschoten a=nvanbenschoten

Fixes #17770.

This commit fixes the last user of `runtime.Callers` that misused the stdlib function by translating the PC values it returned directly into symbolic information (see https://golang.org/pkg/runtime/#Callers) [1]. Go's documentation warns that this is a recipe for disaster when mixed with mid-stack inlining.

The other concern in #17770 was this comment: #17770 (comment). This was discussed in golang/go#29582 and addressed in golang/go@956879d.

An alternative would be to use `runtime.Caller` here, but that would force an allocation that was hard-earned in #29017. Instead, this commit avoids any performance hit.

```
name                   old time/op    new time/op    delta
Header-4                267ns ± 1%     268ns ± 0%      ~     (p=0.584 n=10+10)
VDepthWithVModule-4     260ns ± 3%     255ns ± 1%    -1.87%  (p=0.018 n=10+9)

name                   old alloc/op   new alloc/op   delta
Header-4                0.00B          0.00B           ~     (all equal)
VDepthWithVModule-4     0.00B          0.00B           ~     (all equal)

name                   old allocs/op  new allocs/op  delta
Header-4                0.00           0.00            ~     (all equal)
VDepthWithVModule-4     0.00           0.00            ~     (all equal)
```

[1] I went through and verified that this was still correct.

Release note: None

35203: closedts: react faster to config changes r=danhhz a=tbg

This approximately halves the duration of `make test PKG=./pkg/ccl/changefeedccl TESTS=/sinkless`, from ~60s to ~30s.

Touches #34455.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
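To make the 35014 setting above concrete, a hypothetical sketch of the budget check it describes (names and structure are illustrative, not the actual txnPipeliner code): once the bytes tracked for in-flight pipelined writes exceed the configured limit, further writes are not pipelined until earlier writes are proven and removed.

```go
package main

import "fmt"

// pipeliner tracks the total size of in-flight pipelined writes against a
// configurable budget, standing in for the
// kv.transaction.write_pipelining_max_outstanding_size cluster setting.
type pipeliner struct {
	outstandingBytes int64
	maxOutstanding   int64
}

// canPipeline reports whether a write of the given size fits in the budget.
func (p *pipeliner) canPipeline(size int64) bool {
	return p.outstandingBytes+size <= p.maxOutstanding
}

func (p *pipeliner) record(size int64) { p.outstandingBytes += size }
func (p *pipeliner) prove(size int64)  { p.outstandingBytes -= size }

func main() {
	p := &pipeliner{maxOutstanding: 1 << 18} // hypothetical 256 KiB budget
	fmt.Println(p.canPipeline(4096))         // true: fits within the budget
}
```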