server: allow an arbitrary number of spans in a span stats request #105638

zachlite · 2023-06-27T17:08:24Z

Is your feature request related to a problem? Please describe.

As of #98490, A single SpanStats request can request stats for multiple spans via its spans []roachpb.Span field. However, the maximum number of spans allowed per request is controlled by the cluster setting server.span_stats.span_batch_limit, which has a default value of 500. The original purpose of server.span_stats.span_batch_limit was to ensure that the caller received a speedy response.

#103128 made SpanStats a dependency of SHOW RANGES WITH DETAILS, which immediately caused roachtest failures where the cluster under test had more ranges than the default 500.

The cluster setting was bumped as a workaround in #105500, but the following realities still exist:

We can expect customers to be faced with this same error, and deal with the friction of needing to bump the cluster setting themselves. Friction is not good.
The roachtest failures are preventing sql: extend SHOW RANGES to include object size estimates #103128 from being backported in its current form.
Even if a customer bumps the setting, there's no batching. This could lead to increased rates of failed requests due to timeouts, failed transactions, etc.

Describe the solution you'd like

A SpanStats request should be able to service an arbitrary number of spans in a reasonable amount of time.
In the wild, the largest clusters can have 1e6 ranges or more, and we must be able to service these cases.

A possible implementation is proposed here:

...The solution will likely involve using bounded batches to fetch the span stats and streaming the results to the client.

Challenges:

Determining the appropriate batch size and concurrency levels.
Span stats relies on RPC node fan-outs, which are slow.
Span stats should be considered authoritative, and to achieve this, it relies on Meta2 scans to look up range descriptors. These transactions add overhead. Perhaps there's profiling that would reveal low-hanging fruit to optimize.
Furthermore, if SpanStats is authoritative, does that rule out the option to pre-compute and cache?
Callers of SpanStats like SHOW RANGES WITH DETAILS do not stream results back to the caller, so a streaming solution for SpanStats that lowered the "time to first result" would not get results back to the user any faster, because SHOW RANGES would still block.

Jira issue: CRDB-29134
Epic: CRDB-30635

The text was updated successfully, but these errors were encountered:

knz · 2023-07-19T08:55:31Z

Callers of SpanStats like SHOW RANGES WITH DETAILS do not stream results back to the caller, so a streaming solution for SpanStats that lowered the "time to first result" would not get results back to the user any faster, because SHOW RANGES would still block.

The point of using a stream interface for the low-level API is to not run afoul of the batch size limit. It's OK if SHOW RANGES blocks. That was by design. (That's also why the expensive computation is only performed under WITH DETAILS)

blathers-crl · 2023-07-19T14:24:01Z

Hi @zachlite, please add branch-* labels to identify which branch(es) this release-blocker affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

Previously, span stats requests could only request a maximum of 500 spans at a time, otherwise the request would return an error. This commit adds transparent batching at both the system tenant's handler and the secondary tenant's handler. The cluster setting `server.span_stats.span_batch_limit` is used to control the size of the batches. The default value has been raised to 1000. Resolves cockroachdb#105638 Epic: CRDB-30635 Release note (ops change): Requests for database details or table details from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject to errors if the number of requested spans is too large.

109464: server: batch large span stats requests r=zachlite a=zachlite Previously, span stats requests could only request a maximum of 500 spans at a time, otherwise the request would return an error. This commit adds transparent batching at both the system tenant's handler and the secondary tenant's handler. The cluster setting `server.span_stats.span_batch_limit` is used to control the size of the batches. The default value has been raised to 1000. Resolves #105638 Epic: CRDB-30635 Release note (ops change): Requests for database details or table details from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject to errors if the number of requested spans is too large. 109612: serverutils: implement the testserver interfaces indirectly r=stevendanna a=knz First commit from #109591. Epic: CRDB-18499 Fixes #107058. TLDR: unit tests using TestServer/TestCluster now produce warning and notice messages upon certain suspicious API usage. For example: ``` $ dev test //pkg/server/storage_api -f TestStatusGetFiles === RUN TestStatusGetFiles ... pkg/server/storage_api/storage_api_test_test/pkg/server/storage_api/files_test.go:46: (TestStatusGetFiles) NOTICE: .GetStatusClient() called via implicit interface ApplicationLayerInterface; HINT: consider using .ApplicationLayer().GetStatusClient() instead. conditional_wrap.go:193: TIP: consider replacing the test server initialization from: ts, db, kvDB := serverutils.StartServer(t, ...) // or: tc := serverutils.StartCluster(t, ...) ts := tc.Server(0) To: srv, db, kvDB := serverutils.StartServer(t, ...) defer srv.Stop(...) ts := srv.ApplicationLayer() // or: tc := serverutils.StartCluster(t, ...) ts := tc.Server(0).ApplicationLayer() ``` The following warnings/notices are implemented. **ApplicationLayerInterface** The problem here is that the implicit implementation of `ApplicationLayerInterface` inside `TestServerInterface` is likely misused. If/when the test server automatically starts a secondary tenant, the *implicit* `ApplicationLayerInterface` refers to the system tenant, whereas only `.ApplicationLayer()` refers to the secondary tenant. It's likely that the test code should use the latter. For this, a warning is printed when a method from the *implicit* `ApplicationLayerInterface` is called. The warning is not printed if the test has indicated (via `DefaultTestTenant: base.TestIsForStorageLayerAndNeedsASystemTenant`) that it is ever only interested in the storage layer. **StorageLayerInterface** We want to promote the use of the explicit interface accessor `.StorageLayer()` to access the methods of `StorageLayerInterface`. For this, a notice is printed when a method from the *implicit* `StorageLayerInterface` is called. The notice is not printed if the test has indicated (via `DefaultTestTenant: base.TestIsForStorageLayerAndNeedsASystemTenant`) that it is ever only interested in the storage layer. **TenantControlInterface** We want to promote the use of the explicit interface accessor `.TenantController()` to access the methods of `TenantControlInterface`. For this, a notice is printed when a method from the *implicit* `TenantControlInterface` is called. 109863: roachtest: add npgsql tests to ignorelist; not blocklist r=rafiss a=rafiss The previous commit 1928be2 accidentally put them in the wrong list. The blocklist is for tests that are expected to always fail, but a flaky test might pass sometimes. fixes #109806 fixes #109738 Release note: None 109872: sql: coherent handling of hidden session vars in info_schema r=rafiss a=knz Fixes #109870. Epic: CRDB-28893 Release note (bug fix): Certain SQL session variables are meant to be hidden from introspection but were showing up in `information_schema.session_variables`, which was incoherent with the handling in `pg_catalog.pg_settings`. This bug has now been fixed. Co-authored-by: Zach Lite <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>

Previously, span stats requests could only request a maximum of 500 spans at a time, otherwise the request would return an error. This commit adds transparent batching at both the system tenant's handler and the secondary tenant's handler. The cluster setting `server.span_stats.span_batch_limit` is used to control the size of the batches. The default value has been raised to 1000. Resolves #105638 Epic: CRDB-30635 Release note (ops change): Requests for database details or table details from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject to errors if the number of requested spans is too large.

zachlite added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cluster-observability labels Jun 27, 2023

zachlite self-assigned this Jun 27, 2023

This was referenced Jun 27, 2023

roachtest: failover/partial/lease-liveness failed #105285

Closed

roachtest: failover/chaos/read-write/lease=expiration failed [span stats limit] #105274

Closed

knz mentioned this issue Jul 19, 2023

release-23.1: sql: extend SHOW RANGES to include object size estimates #105306

Closed

zachlite added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023

zachlite added the branch-master Failures and bugs on the master branch. label Jul 19, 2023

zachlite mentioned this issue Aug 24, 2023

server: batch large span stats requests #109464

Merged

craig bot closed this as completed in 4e14279 Sep 1, 2023

blathers-crl bot mentioned this issue Sep 1, 2023

release-23.1: server: batch large span stats requests #109902

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: allow an arbitrary number of spans in a span stats request #105638

server: allow an arbitrary number of spans in a span stats request #105638

zachlite commented Jun 27, 2023 •

edited by knz

Loading

knz commented Jul 19, 2023

blathers-crl bot commented Jul 19, 2023

server: allow an arbitrary number of spans in a span stats request #105638

server: allow an arbitrary number of spans in a span stats request #105638

Comments

zachlite commented Jun 27, 2023 • edited by knz Loading

knz commented Jul 19, 2023

blathers-crl bot commented Jul 19, 2023

zachlite commented Jun 27, 2023 •

edited by knz

Loading