-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tenant: add endpoint with instant metrics #70750
Conversation
fd7403d
to
e1dbe12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)
-- commits, line 12 at r1:
NIT: Update the commit/PR message to be /_status/load
.
pkg/server/tenant.go, line 321 at r1 (raw file):
return func(w http.ResponseWriter, r *http.Request) { if err := cpuTime.Get(os.Getpid()); err != nil {
Should this CPU calculation code be an exported function on status.RuntimeStatSampler
? Something like:
func (rsr *RuntimeStatSampler) SampleCPU() {
...
}
I believe this code should be thread-safe. You could call that code both from here and from the SampleEnvironment
method. It would get the updated CPU time and then update the CPUUserNS
and CPUSysNS
gauges.
pkg/server/tenant.go, line 323 at r1 (raw file):
if err := cpuTime.Get(os.Getpid()); err != nil { log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err) http.Error(w, err.Error(), http.StatusInternalServerError)
Don't we need
return` here in the error code path?
pkg/server/tenant.go, line 334 at r1 (raw file):
if err := exporter.PrintAsText(w); err != nil { log.Errorf(r.Context(), "%v", err) http.Error(w, err.Error(), http.StatusInternalServerError)
And a return
here as well.
e1dbe12
to
0f8f621
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)
Previously, andy-kimball (Andy Kimball) wrote…
NIT: Update the commit/PR message to be
/_status/load
.
done
pkg/server/tenant.go, line 321 at r1 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Should this CPU calculation code be an exported function on
status.RuntimeStatSampler
? Something like:func (rsr *RuntimeStatSampler) SampleCPU() { ... }
I believe this code should be thread-safe. You could call that code both from here and from the
SampleEnvironment
method. It would get the updated CPU time and then update theCPUUserNS
andCPUSysNS
gauges.
changed
pkg/server/tenant.go, line 323 at r1 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Don't we need
return` here in the error code path?
after the SampleCPU
change - not needed.
pkg/server/tenant.go, line 334 at r1 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
And a
return
here as well.
after the SampleCPU
change - not needed.
@knz, feel free to add any other person(s) who might want to review this. We're not sure who the right people/team would be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/server/tenant.go, line 334 at r1 (raw file):
Previously, darinpp wrote…
after the
SampleCPU
change - not needed.
How come it's not still needed? If we added code after this block, we don't want to execute it in this error case.
0f8f621
to
57cf942
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/server/tenant.go, line 334 at r1 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
How come it's not still needed? If we added code after this block, we don't want to execute it in this error case.
OK. Added a return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, but make sure to get a review from someone on the Server team before merging.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to update the PR description with the new contents of the commit message.
Reviewed 3 of 5 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)
@dhartunian you may want to have a look too |
There's one potential problem I wanted to call out: because we're updating the server's instance of Perhaps we should use a different instance of |
57cf942
to
07205b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. There is actually nothing that requires the two sets to share metrics. I changed the new endpoint to use separate gauges.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @darinpp and @knz)
pkg/server/tenant.go, line 321 at r4 (raw file):
return func(w http.ResponseWriter, r *http.Request) { if err := cpuTime.Get(os.Getpid()); err != nil {
We're back to not sharing this code, which I don't think is great. Can't you encapsulate this code in a little function, like we do with GetUserCPUSeconds
, and then use it both here and from the sampler? Something like:
func GetCPUSeconds(ctx context.Context) (userTimeNs, sysTimeNs int64, err error)
I actually think we should move GetUserCPUSeconds
to tenant.go
and make it use the new GetCPUSeconds
function. GetUserCPUSeconds
is a very specialized function that shouldn't be in a shared file, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like Andy's suggestions re: code organization.
Reviewed 2 of 5 files at r1, 1 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @darinpp, @dhartunian, and @knz)
pkg/ccl/serverccl/tenant_vars_test.go, line 62 at r4 (raw file):
"invalid non-200 status code %v from tenant", resp.StatusCode) prometheusMetricStringPattern := `^(?P<metric>\w+)(?:\{` +
Minor: We import the prometheus client library as a dependency so you might be able to use it to parse the metrics directly instead of manually via regex.
07205b8
to
90a2f80
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @dhartunian, and @knz)
pkg/ccl/serverccl/tenant_vars_test.go, line 62 at r4 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Minor: We import the prometheus client library as a dependency so you might be able to use it to parse the metrics directly instead of manually via regex.
switched to use Prometheus client
pkg/server/tenant.go, line 321 at r4 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
We're back to not sharing this code, which I don't think is great. Can't you encapsulate this code in a little function, like we do with
GetUserCPUSeconds
, and then use it both here and from the sampler? Something like:func GetCPUSeconds(ctx context.Context) (userTimeNs, sysTimeNs int64, err error)
I actually think we should move
GetUserCPUSeconds
totenant.go
and make it use the newGetCPUSeconds
function.GetUserCPUSeconds
is a very specialized function that shouldn't be in a shared file, IMO.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @darinpp, @dhartunian, and @knz)
pkg/server/tenant.go, line 292 at r5 (raw file):
getUserCPUSec := func(ctx context.Context) float64 { userTimeMS, _, err := status.GetCPUTime(ctx) if err != nil {
Don't need to log this error, as it's already logged in GetCPUTime
.
pkg/server/status/runtime.go, line 450 at r5 (raw file):
userTimeMS, sysTimeMS, err := GetCPUTime(ctx) if err != nil { log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
This error has already been logged, you don't need to log again.
pkg/server/status/runtime.go, line 694 at r5 (raw file):
// GetCPUTime returns the cumulative user/system time (in ms) since the process start. func GetCPUTime(ctx context.Context) (userTimeMS, sysTimeMS int64, err error) {
NIT: I think this should be userTimeMs
rather than capitalizing as MS
.
90a2f80
to
08e6e0f
Compare
08e6e0f
to
538a2f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @dhartunian, and @knz)
pkg/server/tenant.go, line 292 at r5 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Don't need to log this error, as it's already logged in
GetCPUTime
.
I removed the logging in GetCPUTime
. Leaving the log here as it is higher up in the call stack and will log the caller.
pkg/server/status/runtime.go, line 450 at r5 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
This error has already been logged, you don't need to log again.
I removed the logging in GetCPUTime
. Leaving the log here as it is higher up in the call stack and will log the caller.
pkg/server/status/runtime.go, line 694 at r5 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
NIT: I think this should be
userTimeMs
rather than capitalizing asMS
.
Changed the abbreviations to mixed case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian and @knz)
Previously the tenant process was serving various metrics on `/_status/vars`. This endpoint has all the available metrics and these are updated every 10 sec. Many of the metrics show a rate that is calculated over the 10 sec interval. Some of the metrics are used by the cockroach operator to monitor the CPU workload of the tenant process and use that workload for automatic scaling. The 10 sec interval however is too long and causes a slow scaling up. The reporting of high CPU utilization can take up to 20 sec (to compute a delta). To resolve this, the PR adds a new endpoint `/_status/load` that provides an instant reading of a very small subset of the normal metrics - user and system CPU time for now. By having these be instant, the client can retrieve in quick succession, consecutive snapshots and compute a precise CPU utulization. It also allows the client to control the interval between the two pulls (as opposed to having it hard coded to 10 sec). Release note: None
538a2f2
to
5b0bdb0
Compare
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 5b0bdb0 to blathers/backport-release-21.2-70750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Previously the tenant process was serving various metrics on
/_status/vars
. This endpoint has all the available metrics and these areupdated every 10 sec. Many of the metrics show a rate that is calculated
over the 10 sec interval. Some of the metrics are used by the cockroach
operator to monitor the CPU workload of the tenant process and use that
workload for automatic scaling. The 10 sec interval however is too long
and causes a slow scaling up. The reporting of high CPU utilization can
take up to 20 sec (to compute a delta). To resolve this, the PR adds a
new endpoint
/_status/load
that provides an instant reading of avery small subset of the normal metrics - user and system CPU time for
now. By having these be instant, the client can retrieve in quick
succession, consecutive snapshots and compute a precise CPU utulization.
It also allows the client to control the interval between the two pulls
(as opposed to having it hard coded to 10 sec).
Release note: None