stmtdiagnostics: support probabilistic bundle collection

We enable the following ability (the first half of cockroachdb#82896): Pick a stmt fingerprint, declare a sampling probability which controls when verbose tracing is enabled for it, and a latency threshold for which a trace is persisted. With a given stmt rate (say 1000/s) and a given percentile we're trying to capture (say p99.9), we have 0.001R stmt/s in the 99.9th percentile (1/s in our example). We should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling rate lets you control how the overhead you’re introducing for those statements in aggregate, which if dialed up higher lets you lower S. You might want to do such a thing for infrequently executed statements. We do all this using the existing statement diagnostics machinery. It looks roughly as follows > SELECT crdb_internal.request_statement_bundle( 'INSERT INTO ...', -- fingerprint 0.01, -- sampling probability '120ms'::INTERVAL, -- latency target '15m'::INTERVAL -- request expiration ); $ cockroach statement-diag list --insecure No statement diagnostics bundles available. Outstanding activation requests: ID Statement Sampling probability Min latency 770367610894417921 INSERT INTO ... 0.0100 90ms # wait for an eventual capture.. $ cockroach statement-diag list --insecure Statement diagnostics bundles: ID Collection time Statement 770367073624621057 2022-06-14 00:49:33 UTC INSERT INTO ... $ cockroach statement-diag download 770367073624621057 --insecure Bundle saved to "stmt-bundle-770367073624621057.zip" $ unzip stmt-bundle-770367073624621057.zip stmt $ cat stmt/trace.txt ... 0.846ms 0.017ms event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r76/1:/Table/10{7/1-8}] waiting to acquire read latch /Table/107/1/41/7/0@0,0, held by write latch /Table/107/1/41/7/[email protected],0 98.776ms 97.930ms event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r76/1:/Table/10{7/1-8}] scanning lock table for conflicting locks We leave wiring this up to the UI and continuous capture (second half of \cockroachdb#82896) to future PRs. Release note: None
irfansharif · Jun 15, 2022 · 1954135 · 1954135
1 parent 35aae48
commit 1954135
Show file tree

Hide file tree

Showing 29 changed files with 561 additions and 112 deletions.
diff --git a/docs/generated/http/full.md b/docs/generated/http/full.md
@@ -4229,6 +4229,7 @@ Support status: [reserved](#support-status)
 | statement_fingerprint | [string](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-string) |  |  | [reserved](#support-status) |
 | min_execution_latency | [google.protobuf.Duration](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-google.protobuf.Duration) |  | MinExecutionLatency, when non-zero, indicates the minimum execution latency of a query for which to collect the diagnostics report. In other words, if a query executes faster than this threshold, then the diagnostics report is not collected on it, and we will try to get a bundle the next time we see the query fingerprint.<br><br>NB: if MinExecutionLatency is non-zero, then all queries that match the fingerprint will be traced until a slow enough query comes along. This tracing might have some performance overhead. | [reserved](#support-status) |
 | expires_after | [google.protobuf.Duration](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-google.protobuf.Duration) |  | ExpiresAfter, when non-zero, sets the expiration interval of this request. | [reserved](#support-status) |
+| sampling_probability | [double](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-double) |  | SamplingProbability controls how likely we are to try and collect a diagnostics report for a given execution. The semantics with MinExecutionLatency are worth noting (and perhaps simplifying?): - If SamplingProbability is zero, we're always sampling. This is for   compatibility with pre-22.2 versions where this parameter was not   available. - If SamplingProbability is non-zero, MinExecutionLatency must be non-zero.   We'll sample stmt executions with the given probability until:   (a) we capture one that exceeds MinExecutionLatency, or   (b) we hit the ExpiresAfter point.<br><br>SamplingProbability lets users control at a per-stmt granularity how much collection overhead is acceptable to try an capture an outlier execution for further analysis (are high p99.9s due to latch waits? racing with split transfers?). A high sampling rate can capture a trace sooner, but the added overhead may also cause the trace to be non-representative if the tracing overhead across all requests is causing resource saturation (network, memory) and resulting in slowdown.<br><br>TODO(irfansharif): Wire this up to the UI code. When selecting the latency threshold, we should want to force specifying a sampling probability.<br><br>TODO(irfansharif): We could do better than a hard-coded default value for probability (100% could be too high-overhead so probably not the right one). Strawman: could consider the recent request rate for the fingerprint (say averaged over the last 10m? 30m?), consider what %-ile the latency target we're looking to capture is under, and suggest a sampling probability that gets you at least one trace in the next T seconds with 95% likelihood? Or provide a hint for how long T is for the currently chosen sampling probability. | [reserved](#support-status) |
 
 
 

diff --git a/docs/generated/settings/settings-for-tenants.txt b/docs/generated/settings/settings-for-tenants.txt
@@ -281,4 +281,4 @@ trace.jaeger.agent	string		the address of a Jaeger agent to receive traces using
 trace.opentelemetry.collector	string		address of an OpenTelemetry trace collector to receive traces using the otel gRPC protocol, as <host>:<port>. If no port is specified, 4317 will be used.
 trace.span_registry.enabled	boolean	true	if set, ongoing traces can be seen at https://<ui>/#/debug/tracez
 trace.zipkin.collector	string		the address of a Zipkin instance to receive traces, as <host>:<port>. If no port is specified, 9411 will be used.
-version	version	22.1-16	set the active cluster version in the format '<major>.<minor>'
+version	version	22.1-18	set the active cluster version in the format '<major>.<minor>'
diff --git a/docs/generated/settings/settings.html b/docs/generated/settings/settings.html
@@ -212,6 +212,6 @@
 <tr><td><code>trace.opentelemetry.collector</code></td><td>string</td><td><code></code></td><td>address of an OpenTelemetry trace collector to receive traces using the otel gRPC protocol, as <host>:<port>. If no port is specified, 4317 will be used.</td></tr>
 <tr><td><code>trace.span_registry.enabled</code></td><td>boolean</td><td><code>true</code></td><td>if set, ongoing traces can be seen at https://<ui>/#/debug/tracez</td></tr>
 <tr><td><code>trace.zipkin.collector</code></td><td>string</td><td><code></code></td><td>the address of a Zipkin instance to receive traces, as <host>:<port>. If no port is specified, 9411 will be used.</td></tr>
-<tr><td><code>version</code></td><td>version</td><td><code>22.1-16</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
+<tr><td><code>version</code></td><td>version</td><td><code>22.1-18</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
 </tbody>
 </table>
diff --git a/docs/generated/sql/functions.md b/docs/generated/sql/functions.md
@@ -3094,7 +3094,7 @@ active for the current transaction.</p>
 </span></td></tr>
 <tr><td><a name="crdb_internal.repair_ttl_table_scheduled_job"></a><code>crdb_internal.repair_ttl_table_scheduled_job(oid: oid) &rarr; void</code></td><td><span class="funcdesc"><p>Repairs the scheduled job for a TTL table if it is missing.</p>
 </span></td></tr>
-<tr><td><a name="crdb_internal.request_statement_bundle"></a><code>crdb_internal.request_statement_bundle(stmtFingerprint: <a href="string.html">string</a>, minExecutionLatency: <a href="interval.html">interval</a>, expiresAfter: <a href="interval.html">interval</a>) &rarr; <a href="bool.html">bool</a></code></td><td><span class="funcdesc"><p>Used to request statement bundle for a given statement fingerprint
+<tr><td><a name="crdb_internal.request_statement_bundle"></a><code>crdb_internal.request_statement_bundle(stmtFingerprint: <a href="string.html">string</a>, samplingProbability: <a href="float.html">float</a>, minExecutionLatency: <a href="interval.html">interval</a>, expiresAfter: <a href="interval.html">interval</a>) &rarr; <a href="bool.html">bool</a></code></td><td><span class="funcdesc"><p>Used to request statement bundle for a given statement fingerprint
 that has execution latency greater than the ‘minExecutionLatency’. If the
 ‘expiresAfter’ argument is empty, then the statement bundle request never
 expires until the statement bundle is collected</p>

diff --git a/pkg/cli/clisqlclient/statement_diag.go b/pkg/cli/clisqlclient/statement_diag.go
@@ -13,6 +13,7 @@ package clisqlclient
 import (
 	"context"
 	"database/sql/driver"
+	"fmt"
 	"io"
 	"os"
 	"time"
@@ -79,6 +80,9 @@ type StmtDiagActivationRequest struct {
 	// Statement is the SQL statement fingerprint.
 	Statement   string
 	RequestedAt time.Time
+	// Zero value indicates that there is no sampling probability set on the
+	// request.
+	SamplingProbability float64
 	// Zero value indicates that there is no minimum latency set on the request.
 	MinExecutionLatency time.Duration
 	// Zero value indicates that the request never expires.
@@ -99,9 +103,39 @@ func StmtDiagListOutstandingRequests(
 	return result, nil
 }
 
+// TODO(irfansharif): Remove this in 23.1.
+func isAtLeast22dot2ClusterVersion(ctx context.Context, conn Conn) (bool, error) {
+	// Check whether the upgrade to add the sampling_probability column to the
+	// statement_diagnostics_requests system table has already been run.
+	row, err := conn.QueryRow(ctx, `
+ SELECT
+   count(*)
+ FROM
+   [SHOW COLUMNS FROM system.statement_diagnostics_requests]
+ WHERE
+   column_name = 'sampling_probability';`)
+	if err != nil {
+		return false, err
+	}
+	c, ok := row[0].(int64)
+	if !ok {
+		return false, nil
+	}
+	return c == 1, nil
+}
+
 func stmtDiagListOutstandingRequestsInternal(
 	ctx context.Context, conn Conn,
 ) ([]StmtDiagActivationRequest, error) {
+	var extraColumns string
+	atLeast22dot2, err := isAtLeast22dot2ClusterVersion(ctx, conn)
+	if err != nil {
+		return nil, err
+	}
+	if atLeast22dot2 {
+		extraColumns = ", sampling_probability"
+	}
+
 	// Converting an INTERVAL to a number of milliseconds within that interval
 	// is a pain - we extract the number of seconds and multiply it by 1000,
 	// then we extract the number of milliseconds and add that up to the
@@ -111,16 +145,16 @@ func stmtDiagListOutstandingRequestsInternal(
                         EXTRACT(millisecond FROM min_execution_latency)::INT8 -
                         EXTRACT(second FROM min_execution_latency)::INT8 * 1000`
 	rows, err := conn.Query(ctx,
-		"SELECT id, statement_fingerprint, requested_at, "+getMilliseconds+`, expires_at
+		fmt.Sprintf("SELECT id, statement_fingerprint, requested_at, "+getMilliseconds+`, expires_at%s
 			FROM system.statement_diagnostics_requests
 			WHERE NOT completed
-			ORDER BY requested_at DESC`,
+			ORDER BY requested_at DESC`, extraColumns),
 	)
 	if err != nil {
 		return nil, err
 	}
 	var result []StmtDiagActivationRequest
-	vals := make([]driver.Value, 5)
+	vals := make([]driver.Value, 6)
 	for {
 		if err := rows.Next(vals); err == io.EOF {
 			break
@@ -129,16 +163,24 @@ func stmtDiagListOutstandingRequestsInternal(
 		}
 		var minExecutionLatency time.Duration
 		var expiresAt time.Time
+		var samplingProbability float64
+
 		if ms, ok := vals[3].(int64); ok {
 			minExecutionLatency = time.Millisecond * time.Duration(ms)
 		}
 		if e, ok := vals[4].(time.Time); ok {
 			expiresAt = e
 		}
+		if atLeast22dot2 {
+			if sp, ok := vals[5].(float64); ok {
+				samplingProbability = sp
+			}
+		}
 		info := StmtDiagActivationRequest{
 			ID:                  vals[0].(int64),
 			Statement:           vals[1].(string),
 			RequestedAt:         vals[2].(time.Time),
+			SamplingProbability: samplingProbability,
 			MinExecutionLatency: minExecutionLatency,
 			ExpiresAt:           expiresAt,
 		}

diff --git a/pkg/cli/statement_diag.go b/pkg/cli/statement_diag.go
@@ -85,7 +85,7 @@ func runStmtDiagList(cmd *cobra.Command, args []string) (resErr error) {
 	} else {
 		fmt.Printf("Outstanding activation requests:\n")
 		w := tabwriter.NewWriter(&buf, 4, 0, 2, ' ', 0)
-		fmt.Fprint(w, "  ID\tActivation time\tStatement\tMin execution latency\tExpires at\n")
+		fmt.Fprint(w, "  ID\tActivation time\tStatement\tSampling probability\tMin execution latency\tExpires at\n")
 		for _, r := range reqs {
 			minExecLatency := "N/A"
 			if r.MinExecutionLatency != 0 {
@@ -95,9 +95,15 @@ func runStmtDiagList(cmd *cobra.Command, args []string) (resErr error) {
 			if !r.ExpiresAt.IsZero() {
 				expiresAt = r.ExpiresAt.String()
 			}
+			var samplingProbability string
+			if r.SamplingProbability == 0 {
+				samplingProbability = fmt.Sprintf("%0.4f", 1.0)
+			} else {
+				samplingProbability = fmt.Sprintf("%0.4f", r.SamplingProbability)
+			}
 			fmt.Fprintf(
-				w, "  %d\t%s\t%s\t%s\t%s\n",
-				r.ID, r.RequestedAt.UTC().Format(timeFmt), r.Statement, minExecLatency, expiresAt,
+				w, "  %d\t%s\t%s\t%s\t%s\t%s\n",
+				r.ID, r.RequestedAt.UTC().Format(timeFmt), r.Statement, samplingProbability, minExecLatency, expiresAt,
 			)
 		}
 		_ = w.Flush()

diff --git a/pkg/cli/statement_diag_test.go b/pkg/cli/statement_diag_test.go
@@ -32,13 +32,13 @@ func Example_statement_diag() {
 		        (20, 'SELECT _ FROM _ WHERE _ > _', 'SELECT a FROM t WHERE b > 1', '2010-01-02 03:04:06', ARRAY[1001,1002,1003]),
 		        (30, 'SELECT _ FROM _ WHERE _ > _', 'SELECT a FROM t WHERE b > 1', '2010-01-02 03:04:07', ARRAY[1001])`,
 
-		`INSERT INTO system.statement_diagnostics_requests(id, completed, statement_fingerprint, statement_diagnostics_id, requested_at, min_execution_latency, expires_at)
-		 VALUES (1, TRUE, 'SELECT _ FROM _', 10, '2010-01-02 03:04:00', NULL, NULL),
-		        (2, TRUE, 'SELECT _ FROM _ WHERE _ > _', 20, '2010-01-02 03:04:02', NULL, NULL),
-		        (3, TRUE, 'SELECT _ FROM _ WHERE _ > _', 30, '2010-01-02 03:04:05', NULL, NULL),
-		        (4, FALSE, 'SELECT _ + _', NULL, '2010-01-02 03:04:10', '1d 2h 3m 4s 5ms 6us', NULL),
-		        (5, FALSE, 'SELECT _ - _', NULL, '2010-01-02 03:04:11', NULL, '2030-01-02 03:04:12'),
-		        (6, FALSE, 'SELECT _ / _', NULL, '2010-01-02 03:04:12', '0s', NULL)`,
+		`INSERT INTO system.statement_diagnostics_requests(id, completed, statement_fingerprint, statement_diagnostics_id, requested_at, sampling_probability, min_execution_latency, expires_at)
+		 VALUES (1, TRUE, 'SELECT _ FROM _', 10, '2010-01-02 03:04:00', NULL, NULL, NULL),
+		        (2, TRUE, 'SELECT _ FROM _ WHERE _ > _', 20, '2010-01-02 03:04:02', 0.5, NULL, NULL),
+		        (3, TRUE, 'SELECT _ FROM _ WHERE _ > _', 30, '2010-01-02 03:04:05', 1.0, NULL, NULL),
+		        (4, FALSE, 'SELECT _ + _', NULL, '2010-01-02 03:04:10', 0.8, '1d 2h 3m 4s 5ms 6us', NULL),
+		        (5, FALSE, 'SELECT _ - _', NULL, '2010-01-02 03:04:11', 1.0, NULL, '2030-01-02 03:04:12'),
+		        (6, FALSE, 'SELECT _ / _', NULL, '2010-01-02 03:04:12', NULL, '0s', NULL)`,
 	}
 
 	for _, cmd := range commands {
@@ -95,10 +95,10 @@ func Example_statement_diag() {
 	//   10  2010-01-02 03:04:05 UTC  SELECT _ FROM _
 	//
 	// Outstanding activation requests:
-	//   ID  Activation time          Statement     Min execution latency  Expires at
-	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  N/A                    never
-	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  N/A                    2030-01-02 03:04:12 +0000 UTC
-	//   4   2010-01-02 03:04:10 UTC  SELECT _ + _  26h3m4.005s            never
+	//   ID  Activation time          Statement     Sampling probability  Min execution latency  Expires at
+	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  1.0000                N/A                    never
+	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  1.0000                N/A                    2030-01-02 03:04:12 +0000 UTC
+	//   4   2010-01-02 03:04:10 UTC  SELECT _ + _  0.8000                26h3m4.005s            never
 	// statement-diag download 13
 	// ERROR: failed to download statement diagnostics bundle 13 to 'stmt-bundle-13.zip': no statement diagnostics bundle with ID 13
 	// statement-diag download 20 tempfile.zip
@@ -121,18 +121,18 @@ func Example_statement_diag() {
 	//   20  2010-01-02 03:04:06 UTC  SELECT _ FROM _ WHERE _ > _
 	//
 	// Outstanding activation requests:
-	//   ID  Activation time          Statement     Min execution latency  Expires at
-	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  N/A                    never
-	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  N/A                    2030-01-02 03:04:12 +0000 UTC
-	//   4   2010-01-02 03:04:10 UTC  SELECT _ + _  26h3m4.005s            never
+	//   ID  Activation time          Statement     Sampling probability  Min execution latency  Expires at
+	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  1.0000                N/A                    never
+	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  1.0000                N/A                    2030-01-02 03:04:12 +0000 UTC
+	//   4   2010-01-02 03:04:10 UTC  SELECT _ + _  0.8000                26h3m4.005s            never
 	// statement-diag delete --all
 	// statement-diag list
 	// No statement diagnostics bundles available.
 	// Outstanding activation requests:
-	//   ID  Activation time          Statement     Min execution latency  Expires at
-	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  N/A                    never
-	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  N/A                    2030-01-02 03:04:12 +0000 UTC
-	//   4   2010-01-02 03:04:10 UTC  SELECT _ + _  26h3m4.005s            never
+	//   ID  Activation time          Statement     Sampling probability  Min execution latency  Expires at
+	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  1.0000                N/A                    never
+	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  1.0000                N/A                    2030-01-02 03:04:12 +0000 UTC
+	//   4   2010-01-02 03:04:10 UTC  SELECT _ + _  0.8000                26h3m4.005s            never
 	// statement-diag cancel xx
 	// ERROR: invalid ID
 	// statement-diag cancel 5 6
@@ -143,9 +143,9 @@ func Example_statement_diag() {
 	// statement-diag list
 	// No statement diagnostics bundles available.
 	// Outstanding activation requests:
-	//   ID  Activation time          Statement     Min execution latency  Expires at
-	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  N/A                    never
-	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  N/A                    2030-01-02 03:04:12 +0000 UTC
+	//   ID  Activation time          Statement     Sampling probability  Min execution latency  Expires at
+	//   6   2010-01-02 03:04:12 UTC  SELECT _ / _  1.0000                N/A                    never
+	//   5   2010-01-02 03:04:11 UTC  SELECT _ - _  1.0000                N/A                    2030-01-02 03:04:12 +0000 UTC
 	// statement-diag cancel 123
 	// ERROR: no outstanding activation request with ID 123
 	// statement-diag cancel --all

diff --git a/pkg/clusterversion/cockroach_versions.go b/pkg/clusterversion/cockroach_versions.go
@@ -373,6 +373,10 @@ const (
 	// ON UPDATE) expression can be defined to be 'nextval('s')'; we want to be
 	// able to refer to sequence 's' by its ID, since 's' might be later renamed.
 	UpgradeSequenceToBeReferencedByID
+	// SampledStmtDiagReqs enables installing statement diagnostic requests that
+	// probabilistically collects stmt bundles, controlled by the user provided
+	// sampling rate.
+	SampledStmtDiagReqs
 
 	// *************************************************
 	// Step (1): Add new versions here.
@@ -658,6 +662,10 @@ var versionsSingleton = keyedVersions{
 		Key:     UpgradeSequenceToBeReferencedByID,
 		Version: roachpb.Version{Major: 22, Minor: 1, Internal: 16},
 	},
+	{
+		Key:     SampledStmtDiagReqs,
+		Version: roachpb.Version{Major: 22, Minor: 1, Internal: 18},
+	},
 
 	// *************************************************
 	// Step (2): Add new versions here.