-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: add cpu profiler #95623
server: add cpu profiler #95623
Conversation
457da7a
to
41da6e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting! Haven't reviewed in detail yet but left a general comment about finding a better set of signals to trigger the collection on.
pkg/server/env_sampler.go
Outdated
@@ -188,6 +197,9 @@ func startSampleEnvironment( | |||
if queryProfiler != nil { | |||
queryProfiler.MaybeDumpQueries(ctx, cfg.sessionRegistry, cfg.st) | |||
} | |||
if cpuProfiler != nil { | |||
cpuProfiler.MaybeTakeProfile(ctx, cfg.runtime.HostDiskReadCount.Value(), cfg.runtime.HostDiskWriteCount.Value(), cfg.runtime.Uptime.Value()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disk read and write counts don't seem like the appropriate metrics to trigger a CPU profile here. We usually care about CPU profiles when the CPU is busy! The combined normalized CPU usage in [0,1]
is cfg.runtime.CPUCombinedPercentNorm.Value()
. I think (but you should verify) that we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8
makes sense.
But taking a step back, we have admission control, too. If admission control limits CPU usage, we still want to know what consumes CPU, but admission control might limit CPU effectively enough to stay below threshold. I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization. I think this will need a little bit of plumbing and maybe some advice by the likes of @irfansharif and @sumeerbhola on how we should go about it.
41da6e9
to
260b8d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @irfansharif, @koorosh, @smg260, @sumeerbhola, and @zachlite)
pkg/server/env_sampler.go
line 201 at r1 (raw file):
cfg.runtime.CPUCombinedPercentNorm.Value()
Thanks for clarifying which metric we should use, have switched to using it.
we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense
From the docs it says crdb should not persistently use >80% CPU so maybe 0.8 is indeed the ideal default.
I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization
I'm not too familiar with admission control, would we simply want to know if admission control is enabled (and if it is we would override the cpu usage threshold setting default to some lower value) or use an admission control metric for another comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)
pkg/server/env_sampler.go
line 201 at r1 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
cfg.runtime.CPUCombinedPercentNorm.Value()
Thanks for clarifying which metric we should use, have switched to using it.
we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense
From the docs it says crdb should not persistently use >80% CPU so maybe 0.8 is indeed the ideal default.
I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization
I'm not too familiar with admission control, would we simply want to know if admission control is enabled (and if it is we would override the cpu usage threshold setting default to some lower value) or use an admission control metric for another comparison?
AC, for "regular" work (i.e. not backups, changefeeds, etc.) will try to keep CPU use as high as possible. If the incident work on a given node is all high-priority and consumes >90% of CPU, it'll keep it at that threshold. If AC is intentionally reducing CPU consumption of "elastic" work (backups, changefeeds, etc.) it's still trying to keep it as high as possible. The amount of "regular work" being done WRT CPU% on those nodes is unaffected, just the elastic CPU % is lowered. So I think we can ignore AC for this PR.
Only briefly looking at MaybeTakeProfile
, we're taking CPU profiles once-per-minute when CPU % > 80%? Have you also considered what would take to capture a CPU % when there's a noticeable "spike"? Feel free to defer this to a future PR, perhaps filing a dedicated issue for it, but in at least some of the incidents we've tied to #75799 we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)
pkg/server/env_sampler.go
line 201 at r1 (raw file):
we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.
Hmm what if the CpuProfiler
stored the last CPUCombinedPercentNorm.Value()
regardless of whether a profile is taken and then we calculate a delta between the current and previous. We then check this delta against a new setting (e.g. server.cpu_profile.cpu_usage_delta
) and if it exceeds the setting we take a profile? What do you think, would this approach handle interesting cpu spikes? If so what do you think the default delta should be for the setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)
pkg/server/env_sampler.go
line 201 at r1 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.
Hmm what if the
CpuProfiler
stored the lastCPUCombinedPercentNorm.Value()
regardless of whether a profile is taken and then we calculate a delta between the current and previous. We then check this delta against a new setting (e.g.server.cpu_profile.cpu_usage_delta
) and if it exceeds the setting we take a profile? What do you think, would this approach handle interesting cpu spikes? If so what do you think the default delta should be for the setting?
This "spike" detection worth separating out into its own PR to not stall this, very useful one. But for your question, it's hard to answer without some trial and error. I found https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 which conveniently comes with a very simple go implementation that we could perhaps consider instead. It takes a moving mean of the last N measurements of CPU %, and triggers if a new sample is some standard deviations away from the moving mean -- which sounds closer to what I think we want. It would work for various baseline CPU %s, even when nodes are running hot (~65%) or cold (~10%). What do you think? Want to file an issue and track separately? Quick experiments would tell us how effective it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)
pkg/server/env_sampler.go
line 201 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This "spike" detection worth separating out into its own PR to not stall this, very useful one. But for your question, it's hard to answer without some trial and error. I found https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 which conveniently comes with a very simple go implementation that we could perhaps consider instead. It takes a moving mean of the last N measurements of CPU %, and triggers if a new sample is some standard deviations away from the moving mean -- which sounds closer to what I think we want. It would work for various baseline CPU %s, even when nodes are running hot (~65%) or cold (~10%). What do you think? Want to file an issue and track separately? Quick experiments would tell us how effective it is.
The Go implementation: https://github.com/MicahParks/peakdetect/blob/master/peakdetect.go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)
pkg/server/env_sampler.go
line 201 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
The Go implementation: https://github.com/MicahParks/peakdetect/blob/master/peakdetect.go.
CPU utilization is fine as one of the signals, but it is coarse grained (wrt time) and misses behavior at tiny time scales that is observed by admission control to do throttling. We have seen production situations where CPU utilization is low, but CPU admission control is active (sometimes coupled with a spiky runnable goroutine count, and sometimes not).
If we need to make a decision periodically, I'd suggest also calling into admission control to see if it recommends doing anything based on things it observed since the last such call. We can evolve the admission control implementation for this method (that would return a bool) over time -- for starters it could be whether the KV slots were exhausted for more than N milliseconds (say 100 milliseconds) during that 1min (if that was the interval).
This is orthogonal to the spike detection discussion above, in my mind. If there was a one time short spike that isn't caught by this periodic sampling, we may not care about it. If it is periodic, say a 5s spike every 10min and we keep missing it because of 1min sampling, that would be very unfortunate. By spike I don't mean a delta change that is large, but something that returns true from the decision function on whether to take a profile or not (we can of course ignore this true if we have recently taken a profile). We could just poll more frequently to capture these.
And of course there is the spike that is a delta change where two consecutive polls are returning true to take a profile, and the latter has experienced a delta change, and we've already taken a cpu profile recently, so we are going to miss something by not taking the profile now. I don't know how important this is.
Yes makes sense, I've created the issue. Thanks for the links as well. |
Filed #97699 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try initing kv with --splits 1000 and run the workload for a while to let it balance out. Then increase the concurrency until the system is at 85%.
I'd like to see this as well if you can make it happen!
Nonetheless, thanks for running these thorough tests @Santamaura - much appreciated. Given the latency increase I agree that there's a non-negligible amount of overhead here, but with the cluster setting default set to false
I think it's okay to proceed. We can continue optimizing as we go, but as it stands this is a solid platform to work with!
bad allocations that come with the profiler labels
I'll be curious to see the difference when we disable labels, but we can dive into that with #97699 (thanks for filing @tbg)
Otherwise, I'm satisfied with these tests and I'm happy with the server.cpu_profile.enabled=false
setting by default.
As one final thing, can we add a panic handler to MaybeTakeProfile
?
otherwise! Of course, let's be sure to also address Tobi's final comments.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @Santamaura, @smg260, @tbg, and @zachlite)
pkg/server/profiler/cpuprofiler.go
line 42 at r76 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Default this to MaxInt64 while we figure out the perf impact.
I'm fine with that, but let's make sure to document that if you enable with server.cpu_profile.enabled=true
, you will also need to configure server.cpu_profile.cpu_usage_combined_threshold
to a realistic value as well.
pkg/server/profiler/cpuprofiler.go
line 110 at r76 (raw file):
// MaybeTakeProfile takes a cpu profile if cpu usage is high enough. func (cp *CPUProfiler) MaybeTakeProfile(ctx context.Context, currentCpuUsage int64) { if !cpuProfileEnabled.Get(&cp.st.SV) {
nit: can we add a panic handler here just in case at the top of the function? IMO these profilers should never be the cause of a process crashing (We did the same for activequeryprofiler)
defer func() {
if p := recover(); p != nil {
logcrash.ReportPanic(ctx, &cp.st.SV, p, 1)
}
}()
This might just be me being extra paranoid, but I figure it can't hurt 😅
471ec03
to
fd10a15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @abarganier, @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)
pkg/cli/start.go
line 1438 at r24 (raw file):
Previously, tbg (Tobias Grieger) wrote…
My intention was that we'd remove this - if we want periodic CPU profiles taken every 7s, then we can use the cluster setting to set the CPU usage threshold to 1%, set the reset interval to 7s, and then we have it.
The single loop driving the CPU profiler is here:
cockroach/pkg/server/env_sampler.go
Lines 199 to 201 in 2c65d19
if cpuProfiler != nil { cpuProfiler.MaybeTakeProfile(ctx, int64(cfg.runtime.CPUCombinedPercentNorm.Value()*100)) }
Removed.
pkg/server/profiler/cpuprofiler.go
line 42 at r76 (raw file):
Previously, abarganier (Alex Barganier) wrote…
I'm fine with that, but let's make sure to document that if you enable with
server.cpu_profile.enabled=true
, you will also need to configureserver.cpu_profile.cpu_usage_combined_threshold
to a realistic value as well.
Done. Expanded in the server.cpu_profile.enabled
description about this.
pkg/server/profiler/cpuprofiler.go
line 110 at r76 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: can we add a panic handler here just in case at the top of the function? IMO these profilers should never be the cause of a process crashing (We did the same for activequeryprofiler)
defer func() { if p := recover(); p != nil { logcrash.ReportPanic(ctx, &cp.st.SV, p, 1) } }()
This might just be me being extra paranoid, but I figure it can't hurt 😅
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once CI passes - nice work @Santamaura!
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)
This PR adds a cpu profiler to the server package. The following cluster settings have been added to configure the cpu profiler: - server.cpu_profile.cpu_usage_combined_threshold is the baseline value for when cpu profiles should be taken - server.cpu_profile.interval is when the high water mark resets to the cpu_usage_combined_threshold value - server.cpu_profile.duration is how long a cpu profile is taken - server.cpu_profile.enabled is whether the on/off switch of the cpu profiler Fixes: cockroachdb#75799 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try initing kv with --splits 1000 and run the workload for a while to let it balance out. Then increase the concurrency until the system is at 85%.
I tried this a couple days ago but after bumping concurrency a number of times I was only able to get cpu to ~75% and then at some point the admin ui became non responsive. Regardless, I set the cpu profiler to take profiles at 75% and the numbers increase on the workload and graphs seemed roughly about the same as the previous test.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @abarganier, @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)
TFTRs! |
Build failed (retrying...): |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from cf97b4f to blathers/backport-release-22.2-95623: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Late to the show, this is awesome! How was this not release note worthy I wonder, are we concerned about its usage? |
@rmloveland looks like we dropped the ball on the release note here, would you be able to add this to the right backlog somewhere so that we mention it for 23.2 at least? |
Tobias Grieger ***@***.***> writes:
@rmloveland looks like we dropped the ball on the release note here,
would you be able to add this to the right backlog somewhere so that
we mention it for 23.2 at least?
ya no problem! filed
https://cockroachlabs.atlassian.net/browse/DOC-8130
however the fix version of that docs issue is v23.2, i think if this is
backported to v23.1 it will need a "Release Note: blah blah" added to
the backport PR message to get picked up by our release notes automation
for a v23.1 point release
|
This PR adds a cpu profiler to the server package.
The following cluster settings have been added to configure
the cpu profiler:
value for when cpu profiles should be taken
the cpu_usage_combined_threshold value
cpu profiler
Fixes: #75799
Release note: None