Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: add cpu profiler #95623

Merged
merged 1 commit into from
Mar 2, 2023
Merged

server: add cpu profiler #95623

merged 1 commit into from
Mar 2, 2023

Conversation

Santamaura
Copy link
Contributor

@Santamaura Santamaura commented Jan 20, 2023

This PR adds a cpu profiler to the server package.
The following cluster settings have been added to configure
the cpu profiler:

  • server.cpu_profile.cpu_usage_combined_threshold is the baseline
    value for when cpu profiles should be taken
  • server.cpu_profile.interval is when the high water mark resets to
    the cpu_usage_combined_threshold value
  • server.cpu_profile.duration is how long a cpu profile is taken
  • server.cpu_profile.enabled is whether the on/off switch of the
    cpu profiler

Fixes: #75799

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@Santamaura Santamaura force-pushed the cpu-profiler branch 7 times, most recently from 457da7a to 41da6e9 Compare January 23, 2023 22:23
@Santamaura Santamaura marked this pull request as ready for review January 23, 2023 23:13
@Santamaura Santamaura requested review from a team as code owners January 23, 2023 23:13
@Santamaura Santamaura requested a review from a team January 23, 2023 23:13
@Santamaura Santamaura requested review from a team as code owners January 23, 2023 23:13
@Santamaura Santamaura requested review from herkolategan, smg260, tbg, koorosh and zachlite and removed request for a team January 23, 2023 23:13
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting! Haven't reviewed in detail yet but left a general comment about finding a better set of signals to trigger the collection on.

@@ -188,6 +197,9 @@ func startSampleEnvironment(
if queryProfiler != nil {
queryProfiler.MaybeDumpQueries(ctx, cfg.sessionRegistry, cfg.st)
}
if cpuProfiler != nil {
cpuProfiler.MaybeTakeProfile(ctx, cfg.runtime.HostDiskReadCount.Value(), cfg.runtime.HostDiskWriteCount.Value(), cfg.runtime.Uptime.Value())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disk read and write counts don't seem like the appropriate metrics to trigger a CPU profile here. We usually care about CPU profiles when the CPU is busy! The combined normalized CPU usage in [0,1] is cfg.runtime.CPUCombinedPercentNorm.Value(). I think (but you should verify) that we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense.

But taking a step back, we have admission control, too. If admission control limits CPU usage, we still want to know what consumes CPU, but admission control might limit CPU effectively enough to stay below threshold. I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization. I think this will need a little bit of plumbing and maybe some advice by the likes of @irfansharif and @sumeerbhola on how we should go about it.

Copy link
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @irfansharif, @koorosh, @smg260, @sumeerbhola, and @zachlite)


pkg/server/env_sampler.go line 201 at r1 (raw file):

cfg.runtime.CPUCombinedPercentNorm.Value()

Thanks for clarifying which metric we should use, have switched to using it.

we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense

From the docs it says crdb should not persistently use >80% CPU so maybe 0.8 is indeed the ideal default.

I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization

I'm not too familiar with admission control, would we simply want to know if admission control is enabled (and if it is we would override the cpu usage threshold setting default to some lower value) or use an admission control metric for another comparison?

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

cfg.runtime.CPUCombinedPercentNorm.Value()

Thanks for clarifying which metric we should use, have switched to using it.

we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense

From the docs it says crdb should not persistently use >80% CPU so maybe 0.8 is indeed the ideal default.

I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization

I'm not too familiar with admission control, would we simply want to know if admission control is enabled (and if it is we would override the cpu usage threshold setting default to some lower value) or use an admission control metric for another comparison?

AC, for "regular" work (i.e. not backups, changefeeds, etc.) will try to keep CPU use as high as possible. If the incident work on a given node is all high-priority and consumes >90% of CPU, it'll keep it at that threshold. If AC is intentionally reducing CPU consumption of "elastic" work (backups, changefeeds, etc.) it's still trying to keep it as high as possible. The amount of "regular work" being done WRT CPU% on those nodes is unaffected, just the elastic CPU % is lowered. So I think we can ignore AC for this PR.

Only briefly looking at MaybeTakeProfile, we're taking CPU profiles once-per-minute when CPU % > 80%? Have you also considered what would take to capture a CPU % when there's a noticeable "spike"? Feel free to defer this to a future PR, perhaps filing a dedicated issue for it, but in at least some of the incidents we've tied to #75799 we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.

Copy link
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)


pkg/server/env_sampler.go line 201 at r1 (raw file):

we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.

Hmm what if the CpuProfiler stored the last CPUCombinedPercentNorm.Value() regardless of whether a profile is taken and then we calculate a delta between the current and previous. We then check this delta against a new setting (e.g. server.cpu_profile.cpu_usage_delta) and if it exceeds the setting we take a profile? What do you think, would this approach handle interesting cpu spikes? If so what do you think the default delta should be for the setting?

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.

Hmm what if the CpuProfiler stored the last CPUCombinedPercentNorm.Value() regardless of whether a profile is taken and then we calculate a delta between the current and previous. We then check this delta against a new setting (e.g. server.cpu_profile.cpu_usage_delta) and if it exceeds the setting we take a profile? What do you think, would this approach handle interesting cpu spikes? If so what do you think the default delta should be for the setting?

This "spike" detection worth separating out into its own PR to not stall this, very useful one. But for your question, it's hard to answer without some trial and error. I found https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 which conveniently comes with a very simple go implementation that we could perhaps consider instead. It takes a moving mean of the last N measurements of CPU %, and triggers if a new sample is some standard deviations away from the moving mean -- which sounds closer to what I think we want. It would work for various baseline CPU %s, even when nodes are running hot (~65%) or cold (~10%). What do you think? Want to file an issue and track separately? Quick experiments would tell us how effective it is.

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @sumeerbhola, and @zachlite)


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This "spike" detection worth separating out into its own PR to not stall this, very useful one. But for your question, it's hard to answer without some trial and error. I found https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 which conveniently comes with a very simple go implementation that we could perhaps consider instead. It takes a moving mean of the last N measurements of CPU %, and triggers if a new sample is some standard deviations away from the moving mean -- which sounds closer to what I think we want. It would work for various baseline CPU %s, even when nodes are running hot (~65%) or cold (~10%). What do you think? Want to file an issue and track separately? Quick experiments would tell us how effective it is.

The Go implementation: https://github.com/MicahParks/peakdetect/blob/master/peakdetect.go.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

The Go implementation: https://github.com/MicahParks/peakdetect/blob/master/peakdetect.go.

CPU utilization is fine as one of the signals, but it is coarse grained (wrt time) and misses behavior at tiny time scales that is observed by admission control to do throttling. We have seen production situations where CPU utilization is low, but CPU admission control is active (sometimes coupled with a spiky runnable goroutine count, and sometimes not).
If we need to make a decision periodically, I'd suggest also calling into admission control to see if it recommends doing anything based on things it observed since the last such call. We can evolve the admission control implementation for this method (that would return a bool) over time -- for starters it could be whether the KV slots were exhausted for more than N milliseconds (say 100 milliseconds) during that 1min (if that was the interval).

This is orthogonal to the spike detection discussion above, in my mind. If there was a one time short spike that isn't caught by this periodic sampling, we may not care about it. If it is periodic, say a 5s spike every 10min and we keep missing it because of 1min sampling, that would be very unfortunate. By spike I don't mean a delta change that is large, but something that returns true from the decision function on whether to take a profile or not (we can of course ignore this true if we have recently taken a profile). We could just poll more frequently to capture these.

And of course there is the spike that is a delta change where two consecutive polls are returning true to take a profile, and the latter has experienced a delta change, and we've already taken a cpu profile recently, so we are going to miss something by not taking the profile now. I don't know how important this is.

@Santamaura
Copy link
Contributor Author

What do you think? Want to file an issue and track separately?

Yes makes sense, I've created the issue. Thanks for the links as well.

@tbg
Copy link
Member

tbg commented Feb 27, 2023

Filed #97699

Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try initing kv with --splits 1000 and run the workload for a while to let it balance out. Then increase the concurrency until the system is at 85%.

I'd like to see this as well if you can make it happen!

Nonetheless, thanks for running these thorough tests @Santamaura - much appreciated. Given the latency increase I agree that there's a non-negligible amount of overhead here, but with the cluster setting default set to false I think it's okay to proceed. We can continue optimizing as we go, but as it stands this is a solid platform to work with!

bad allocations that come with the profiler labels

I'll be curious to see the difference when we disable labels, but we can dive into that with #97699 (thanks for filing @tbg)

Otherwise, I'm satisfied with these tests and I'm happy with the server.cpu_profile.enabled=false setting by default.

As one final thing, can we add a panic handler to MaybeTakeProfile?

:lgtm: otherwise! Of course, let's be sure to also address Tobi's final comments.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @herkolategan, @koorosh, @Santamaura, @smg260, @tbg, and @zachlite)


pkg/server/profiler/cpuprofiler.go line 42 at r76 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Default this to MaxInt64 while we figure out the perf impact.

I'm fine with that, but let's make sure to document that if you enable with server.cpu_profile.enabled=true, you will also need to configure server.cpu_profile.cpu_usage_combined_threshold to a realistic value as well.


pkg/server/profiler/cpuprofiler.go line 110 at r76 (raw file):

// MaybeTakeProfile takes a cpu profile if cpu usage is high enough.
func (cp *CPUProfiler) MaybeTakeProfile(ctx context.Context, currentCpuUsage int64) {
	if !cpuProfileEnabled.Get(&cp.st.SV) {

nit: can we add a panic handler here just in case at the top of the function? IMO these profilers should never be the cause of a process crashing (We did the same for activequeryprofiler)

	defer func() {
		if p := recover(); p != nil {
			logcrash.ReportPanic(ctx, &cp.st.SV, p, 1)
		}
	}()

This might just be me being extra paranoid, but I figure it can't hurt 😅

Copy link
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @abarganier, @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)


pkg/cli/start.go line 1438 at r24 (raw file):

Previously, tbg (Tobias Grieger) wrote…

My intention was that we'd remove this - if we want periodic CPU profiles taken every 7s, then we can use the cluster setting to set the CPU usage threshold to 1%, set the reset interval to 7s, and then we have it.

The single loop driving the CPU profiler is here:

if cpuProfiler != nil {
cpuProfiler.MaybeTakeProfile(ctx, int64(cfg.runtime.CPUCombinedPercentNorm.Value()*100))
}

Removed.


pkg/server/profiler/cpuprofiler.go line 42 at r76 (raw file):

Previously, abarganier (Alex Barganier) wrote…

I'm fine with that, but let's make sure to document that if you enable with server.cpu_profile.enabled=true, you will also need to configure server.cpu_profile.cpu_usage_combined_threshold to a realistic value as well.

Done. Expanded in the server.cpu_profile.enabled description about this.


pkg/server/profiler/cpuprofiler.go line 110 at r76 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: can we add a panic handler here just in case at the top of the function? IMO these profilers should never be the cause of a process crashing (We did the same for activequeryprofiler)

	defer func() {
		if p := recover(); p != nil {
			logcrash.ReportPanic(ctx, &cp.st.SV, p, 1)
		}
	}()

This might just be me being extra paranoid, but I figure it can't hurt 😅

Added.

@blathers-crl blathers-crl bot requested a review from irfansharif March 1, 2023 15:31
@Santamaura Santamaura removed the request for review from irfansharif March 1, 2023 15:31
Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: once CI passes - nice work @Santamaura!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)

This PR adds a cpu profiler to the server package.
The following cluster settings have been added to configure
the cpu profiler:
- server.cpu_profile.cpu_usage_combined_threshold is the baseline
value for when cpu profiles should be taken
- server.cpu_profile.interval is when the high water mark resets to
the cpu_usage_combined_threshold value
- server.cpu_profile.duration is how long a cpu profile is taken
- server.cpu_profile.enabled is whether the on/off switch of the
cpu profiler

Fixes: cockroachdb#75799

Release note: None
@blathers-crl blathers-crl bot requested a review from irfansharif March 2, 2023 15:02
@Santamaura Santamaura removed the request for review from irfansharif March 2, 2023 15:03
Copy link
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try initing kv with --splits 1000 and run the workload for a while to let it balance out. Then increase the concurrency until the system is at 85%.

I tried this a couple days ago but after bumping concurrency a number of times I was only able to get cpu to ~75% and then at some point the admin ui became non responsive. Regardless, I set the cpu profiler to take profiles at 75% and the numbers increase on the workload and graphs seemed roughly about the same as the previous test.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @abarganier, @herkolategan, @koorosh, @smg260, @tbg, and @zachlite)

@Santamaura
Copy link
Contributor Author

TFTRs!
bors r+

@craig
Copy link
Contributor

craig bot commented Mar 2, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 2, 2023

Build succeeded:

@craig craig bot merged commit a44c3c2 into cockroachdb:master Mar 2, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 2, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from cf97b4f to blathers/backport-release-22.2-95623: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cucaroach
Copy link
Contributor

Late to the show, this is awesome! How was this not release note worthy I wonder, are we concerned about its usage?

@tbg
Copy link
Member

tbg commented Jun 16, 2023

@rmloveland looks like we dropped the ball on the release note here, would you be able to add this to the right backlog somewhere so that we mention it for 23.2 at least?

@rmloveland
Copy link
Collaborator

rmloveland commented Jun 16, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: add cpuprofile_dumper
8 participants