Skip to content

Commit

Permalink
Merge #69826 #70750
Browse files Browse the repository at this point in the history
69826: clusterversion: mint 21.2 cluster version r=celiala a=celiala

Shortly after [Creating a release branch](https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/187859111/Creating+a+release+branch) for release-21.2, we'll want to merge PRs as instructed by the [New major version checklist](https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/1522270228/New+major+version+checklist).

This PR is for Step 1a of the [New major version checklist](https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/1522270228/New+major+version+checklist), where we add a cluster version StartX for the corresponding .0 release.

Notes: 
- **This should NOT be merged until we've released a beta** (adding `do-not-merge` until this happens)
- These are all PRs created as part of the steps of the [New major version checklist](https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/1522270228/New+major+version+checklist):
	- Step 1a (you are here): #69826
	- Step 1b: #69828
	- Step 2a + Step 2b: #69827
	- Step 2c: #69829
	- Step 3: TODO

Release justification: Non-production code change.
Release note: None

70750: tenant: add endpoint with instant metrics r=darinpp a=darinpp

Previously the tenant process was serving various metrics on
`/_status/vars`. This endpoint has all the available metrics and these are
updated every 10 sec. Many of the metrics show a rate that is calculated
over the 10 sec interval. Some of the metrics are used by the cockroach
operator to monitor the CPU workload of the tenant process and use that
workload for automatic scaling. The 10 sec interval however is too long
and causes a slow scaling up. The reporting of high CPU utilization can
take up to 20 sec (to compute a delta). To resolve this, the PR adds a
new endpoint `/_status/load` that provides an instant reading of a
very small subset of the normal metrics - user and system CPU time for
now. By having these be instant, the client can retrieve in quick
succession, consecutive snapshots and compute a precise CPU utulization.
It also allows the client to control the interval between the two pulls
(as opposed to having it hard coded to 10 sec).

Release note: None

Co-authored-by: Celia La <[email protected]>
Co-authored-by: Darin Peshev <[email protected]>
  • Loading branch information
3 people committed Oct 1, 2021
3 parents 1057ea6 + ea24703 + 5b0bdb0 commit 5f53feb
Show file tree
Hide file tree
Showing 11 changed files with 181 additions and 19 deletions.
2 changes: 1 addition & 1 deletion docs/generated/settings/settings-for-tenants.txt
Original file line number Diff line number Diff line change
Expand Up @@ -165,4 +165,4 @@ trace.debug.enable boolean false if set, traces for recent requests can be seen
trace.jaeger.agent string the address of a Jaeger agent to receive traces using the Jaeger UDP Thrift protocol, as <host>:<port>. If no port is specified, 6381 will be used.
trace.opentelemetry.collector string address of an OpenTelemetry trace collector to receive traces using the otel gRPC protocol, as <host>:<port>. If no port is specified, 4317 will be used.
trace.zipkin.collector string the address of a Zipkin instance to receive traces, as <host>:<port>. If no port is specified, 9411 will be used.
version version 21.1-1168 set the active cluster version in the format '<major>.<minor>'
version version 21.2 set the active cluster version in the format '<major>.<minor>'
2 changes: 1 addition & 1 deletion docs/generated/settings/settings.html
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,6 @@
<tr><td><code>trace.jaeger.agent</code></td><td>string</td><td><code></code></td><td>the address of a Jaeger agent to receive traces using the Jaeger UDP Thrift protocol, as <host>:<port>. If no port is specified, 6381 will be used.</td></tr>
<tr><td><code>trace.opentelemetry.collector</code></td><td>string</td><td><code></code></td><td>address of an OpenTelemetry trace collector to receive traces using the otel gRPC protocol, as <host>:<port>. If no port is specified, 4317 will be used.</td></tr>
<tr><td><code>trace.zipkin.collector</code></td><td>string</td><td><code></code></td><td>the address of a Zipkin instance to receive traces, as <host>:<port>. If no port is specified, 9411 will be used.</td></tr>
<tr><td><code>version</code></td><td>version</td><td><code>21.1-1168</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
<tr><td><code>version</code></td><td>version</td><td><code>21.2</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
</tbody>
</table>
4 changes: 4 additions & 0 deletions pkg/ccl/serverccl/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ go_test(
"server_sql_test.go",
"tenant_grpc_test.go",
"tenant_status_test.go",
"tenant_vars_test.go",
],
embed = [":serverccl"],
deps = [
Expand Down Expand Up @@ -63,7 +64,10 @@ go_test(
"//pkg/util/log",
"//pkg/util/randutil",
"//pkg/util/timeutil",
"@com_github_elastic_gosigar//:gosigar",
"@com_github_lib_pq//:pq",
"@com_github_prometheus_client_model//go",
"@com_github_prometheus_common//expfmt",
"@com_github_stretchr_testify//require",
"@org_golang_x_crypto//bcrypt",
],
Expand Down
110 changes: 110 additions & 0 deletions pkg/ccl/serverccl/tenant_vars_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
// Copyright 2021 The Cockroach Authors.
//
// Licensed as a CockroachDB Enterprise file under the Cockroach Community
// License (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// https://github.com/cockroachdb/cockroach/blob/master/licenses/CCL.txt

package serverccl

import (
"context"
"crypto/tls"
"net/http"
"os"
"testing"

"github.com/cockroachdb/cockroach/pkg/base"
_ "github.com/cockroachdb/cockroach/pkg/ccl/kvccl"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/sql/tests"
"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
"github.com/cockroachdb/cockroach/pkg/util/leaktest"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/elastic/gosigar"
io_prometheus_client "github.com/prometheus/client_model/go"
"github.com/prometheus/common/expfmt"
"github.com/stretchr/testify/require"
)

func TestTenantVars(t *testing.T) {
defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)

ctx := context.Background()

serverParams, _ := tests.CreateTestServerParams()
testCluster := serverutils.StartNewTestCluster(t, 1 /* numNodes */, base.TestClusterArgs{
ServerArgs: serverParams,
})
defer testCluster.Stopper().Stop(ctx)

server := testCluster.Server(0 /* idx */)

tenant, _ := serverutils.StartTenant(t, server, base.TestTenantArgs{
TenantID: roachpb.MakeTenantID(10 /* id */),
})

url := "https://" + tenant.HTTPAddr() + "/_status/load"
client := http.Client{
Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
}
resp, err := client.Get(url)
require.NoError(t, err)
defer resp.Body.Close()
require.Equal(t, 200, resp.StatusCode,
"invalid non-200 status code %v from tenant", resp.StatusCode)

var parser expfmt.TextParser
metrics, err := parser.TextToMetricFamilies(resp.Body)
require.NoError(t, err)

userCPU, found := metrics["sys_cpu_user_ns"]
require.True(t, found)
require.Len(t, userCPU.GetMetric(), 1)
require.Equal(t, io_prometheus_client.MetricType_GAUGE, userCPU.GetType())
cpuUserNanos := userCPU.Metric[0].GetGauge().GetValue()

sysCPU, found := metrics["sys_cpu_sys_ns"]
require.True(t, found)
require.True(t, found)
require.Len(t, sysCPU.GetMetric(), 1)
require.Equal(t, io_prometheus_client.MetricType_GAUGE, sysCPU.GetType())
cpuSysNanos := sysCPU.Metric[0].GetGauge().GetValue()

// The values are between zero and whatever User/Sys time is observed after the get.
require.Positive(t, cpuUserNanos)
require.Positive(t, cpuSysNanos)
cpuTime := gosigar.ProcTime{}
require.NoError(t, cpuTime.Get(os.Getpid()))
require.LessOrEqual(t, cpuUserNanos, float64(cpuTime.User)*1e6)
require.LessOrEqual(t, cpuSysNanos, float64(cpuTime.Sys)*1e6)

resp, err = client.Get(url)
require.NoError(t, err)
defer resp.Body.Close()
require.Equal(t, 200, resp.StatusCode,
"invalid non-200 status code %v from tenant", resp.StatusCode)

metrics, err = parser.TextToMetricFamilies(resp.Body)
require.NoError(t, err)

userCPU, found = metrics["sys_cpu_user_ns"]
require.True(t, found)
require.Len(t, userCPU.GetMetric(), 1)
require.Equal(t, io_prometheus_client.MetricType_GAUGE, userCPU.GetType())
cpuUserNanos2 := userCPU.Metric[0].GetGauge().GetValue()

sysCPU, found = metrics["sys_cpu_sys_ns"]
require.True(t, found)
require.True(t, found)
require.Len(t, sysCPU.GetMetric(), 1)
require.Equal(t, io_prometheus_client.MetricType_GAUGE, sysCPU.GetType())
cpuSysNanos2 := sysCPU.Metric[0].GetGauge().GetValue()

require.LessOrEqual(t, float64(cpuTime.User)*1e6, cpuUserNanos2)
require.LessOrEqual(t, float64(cpuTime.Sys)*1e6, cpuSysNanos2)
}
7 changes: 7 additions & 0 deletions pkg/clusterversion/cockroach_versions.go
Original file line number Diff line number Diff line change
Expand Up @@ -289,6 +289,8 @@ const (
// TenantUsageSingleConsumptionColumn changes the tenant_usage system table to
// use a single consumption column (encoding a proto).
TenantUsageSingleConsumptionColumn
// V21_2 is CockroachDB v21.2. It's used for all v21.2.x patch releases.
V21_2

// *************************************************
// Step (1): Add new versions here.
Expand Down Expand Up @@ -487,6 +489,11 @@ var versionsSingleton = keyedVersions{
Key: TenantUsageSingleConsumptionColumn,
Version: roachpb.Version{Major: 21, Minor: 1, Internal: 1168},
},
{
// V21_2 is CockroachDB v21.2. It's used for all v21.2.x patch releases.
Key: V21_2,
Version: roachpb.Version{Major: 21, Minor: 2},
},
// *************************************************
// Step (2): Add new versions here.
// Do not add new versions to a patch release.
Expand Down
5 changes: 3 additions & 2 deletions pkg/clusterversion/key_string.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions pkg/server/status.go
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,9 @@ const (
// statusVars exposes prometheus metrics for monitoring consumption.
statusVars = statusPrefix + "vars"

// loadStatusVars exposes prometheus metrics for instant monitoring of CPU load.
loadStatusVars = statusPrefix + "load"

// raftStateDormant is used when there is no known raft state.
raftStateDormant = "StateDormant"

Expand Down
18 changes: 8 additions & 10 deletions pkg/server/status/runtime.go
Original file line number Diff line number Diff line change
Expand Up @@ -445,8 +445,8 @@ func (rsr *RuntimeStatSampler) SampleEnvironment(
if err := mem.Get(pid); err != nil {
log.Ops.Errorf(ctx, "unable to get mem usage: %v", err)
}
cpuTime := gosigar.ProcTime{}
if err := cpuTime.Get(pid); err != nil {
userTimeMillis, sysTimeMillis, err := GetCPUTime(ctx)
if err != nil {
log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
}
cgroupCPU, _ := cgroups.GetCgroupCPU()
Expand Down Expand Up @@ -507,8 +507,8 @@ func (rsr *RuntimeStatSampler) SampleEnvironment(
now := rsr.clock.PhysicalNow()
dur := float64(now - rsr.last.now)
// cpuTime.{User,Sys} are in milliseconds, convert to nanoseconds.
utime := int64(cpuTime.User) * 1e6
stime := int64(cpuTime.Sys) * 1e6
utime := userTimeMillis * 1e6
stime := sysTimeMillis * 1e6
urate := float64(utime-rsr.last.utime) / dur
srate := float64(stime-rsr.last.stime) / dur
combinedNormalizedPerc := (srate + urate) / cpuShare
Expand Down Expand Up @@ -690,14 +690,12 @@ func subtractNetworkCounters(from *net.IOCountersStat, sub net.IOCountersStat) {
from.PacketsSent -= sub.PacketsSent
}

// GetUserCPUSeconds returns the cumulative User CPU time for this process, in
// seconds.
func GetUserCPUSeconds(ctx context.Context) float64 {
// GetCPUTime returns the cumulative user/system time (in ms) since the process start.
func GetCPUTime(ctx context.Context) (userTimeMillis, sysTimeMillis int64, err error) {
pid := os.Getpid()
cpuTime := gosigar.ProcTime{}
if err := cpuTime.Get(pid); err != nil {
log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
return 0, 0, err
}
// cpuTime.User is in milliseconds; convert to seconds.
return float64(cpuTime.User) * 1e-3
return int64(cpuTime.User), int64(cpuTime.Sys), nil
}
41 changes: 40 additions & 1 deletion pkg/server/tenant.go
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,8 @@ func StartTenant(
})
f := varsHandler{metricSource: args.recorder, st: args.Settings}.handleVars
mux.Handle(statusVars, http.HandlerFunc(f))
ff := loadVarsHandler(ctx, args.runtime)
mux.Handle(loadStatusVars, http.HandlerFunc(ff))

tlsConnManager := netutil.MakeServer(
args.stopper,
Expand Down Expand Up @@ -277,8 +279,12 @@ func StartTenant(
log.SetTenantIDs(args.TenantID.String(), int32(s.SQLInstanceID()))

externalUsageFn := func(ctx context.Context) multitenant.ExternalUsage {
userTimeMillis, _, err := status.GetCPUTime(ctx)
if err != nil {
log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
}
return multitenant.ExternalUsage{
CPUSecs: status.GetUserCPUSeconds(ctx),
CPUSecs: float64(userTimeMillis) * 1e-3,
PGWireBytes: s.pgServer.BytesInAndOut(),
}
}
Expand Down Expand Up @@ -308,6 +314,39 @@ func StartTenant(
return s, pgLAddr, httpLAddr, nil
}

// Construct a handler responsible for serving the instant values of selected
// load metrics. These include user and system CPU time currently.
func loadVarsHandler(
ctx context.Context, rsr *status.RuntimeStatSampler,
) func(http.ResponseWriter, *http.Request) {
cpuUserNanos := metric.NewGauge(rsr.CPUUserNS.GetMetadata())
cpuSysNanos := metric.NewGauge(rsr.CPUSysNS.GetMetadata())
registry := metric.NewRegistry()
registry.AddMetric(cpuUserNanos)
registry.AddMetric(cpuSysNanos)

return func(w http.ResponseWriter, r *http.Request) {
userTimeMillis, sysTimeMillis, err := status.GetCPUTime(ctx)
if err != nil {
// Just log but don't return an error to match the _status/vars metrics handler.
log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
}
// cpuTime.{User,Sys} are in milliseconds, convert to nanoseconds.
utime := userTimeMillis * 1e6
stime := sysTimeMillis * 1e6
cpuUserNanos.Update(utime)
cpuSysNanos.Update(stime)

exporter := metric.MakePrometheusExporter()
exporter.ScrapeRegistry(registry, true)
if err := exporter.PrintAsText(w); err != nil {
log.Errorf(r.Context(), "%v", err)
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
}
}

func makeTenantSQLServerArgs(
stopper *stop.Stopper, kvClusterName string, baseCfg BaseConfig, sqlCfg SQLConfig,
) (sqlServerArgs, error) {
Expand Down
4 changes: 2 additions & 2 deletions pkg/sql/logictest/testdata/logic_test/crdb_internal
Original file line number Diff line number Diff line change
Expand Up @@ -379,7 +379,7 @@ select crdb_internal.get_vmodule()
query T
select regexp_replace(crdb_internal.node_executable_version()::string, '(-\d+)?$', '');
----
21.1
21.2

query ITTT colnames
select node_id, component, field, regexp_replace(regexp_replace(value, '^\d+$', '<port>'), e':\\d+', ':<port>') as value from crdb_internal.node_runtime_info
Expand Down Expand Up @@ -538,7 +538,7 @@ select * from crdb_internal.node_inflight_trace_spans
query T
select regexp_replace(crdb_internal.node_executable_version()::string, '(-\d+)?$', '');
----
21.1
21.2

user root

Expand Down
4 changes: 2 additions & 2 deletions pkg/sql/logictest/testdata/logic_test/crdb_internal_tenant
Original file line number Diff line number Diff line change
Expand Up @@ -349,7 +349,7 @@ select crdb_internal.get_vmodule()
query T
select regexp_replace(crdb_internal.node_executable_version()::string, '(-\d+)?$', '');
----
21.1
21.2

query ITTT colnames
select node_id, component, field, regexp_replace(regexp_replace(value, '^\d+$', '<port>'), e':\\d+', ':<port>') as value from crdb_internal.node_runtime_info
Expand Down Expand Up @@ -442,7 +442,7 @@ select * from crdb_internal.gossip_alerts
query T
select regexp_replace(crdb_internal.node_executable_version()::string, '(-\d+)?$', '');
----
21.1
21.2

user root

Expand Down

0 comments on commit 5f53feb

Please sign in to comment.