Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: allow an arbitrary number of spans in a span stats request #105638

Closed
zachlite opened this issue Jun 27, 2023 · 2 comments · Fixed by #109464
Closed

server: allow an arbitrary number of spans in a span stats request #105638

zachlite opened this issue Jun 27, 2023 · 2 comments · Fixed by #109464
Assignees
Labels
branch-master Failures and bugs on the master branch. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@zachlite
Copy link
Contributor

zachlite commented Jun 27, 2023

Is your feature request related to a problem? Please describe.

As of #98490, A single SpanStats request can request stats for multiple spans via its spans []roachpb.Span field. However, the maximum number of spans allowed per request is controlled by the cluster setting server.span_stats.span_batch_limit, which has a default value of 500. The original purpose of server.span_stats.span_batch_limit was to ensure that the caller received a speedy response.

#103128 made SpanStats a dependency of SHOW RANGES WITH DETAILS, which immediately caused roachtest failures where the cluster under test had more ranges than the default 500.

The cluster setting was bumped as a workaround in #105500, but the following realities still exist:

  1. We can expect customers to be faced with this same error, and deal with the friction of needing to bump the cluster setting themselves. Friction is not good.
  2. The roachtest failures are preventing sql: extend SHOW RANGES to include object size estimates #103128 from being backported in its current form.
  3. Even if a customer bumps the setting, there's no batching. This could lead to increased rates of failed requests due to timeouts, failed transactions, etc.

Describe the solution you'd like

A SpanStats request should be able to service an arbitrary number of spans in a reasonable amount of time.
In the wild, the largest clusters can have 1e6 ranges or more, and we must be able to service these cases.

A possible implementation is proposed here:

...The solution will likely involve using bounded batches to fetch the span stats and streaming the results to the client.

Challenges:

  • Determining the appropriate batch size and concurrency levels.
  • Span stats relies on RPC node fan-outs, which are slow.
  • Span stats should be considered authoritative, and to achieve this, it relies on Meta2 scans to look up range descriptors. These transactions add overhead. Perhaps there's profiling that would reveal low-hanging fruit to optimize.
  • Furthermore, if SpanStats is authoritative, does that rule out the option to pre-compute and cache?
  • Callers of SpanStats like SHOW RANGES WITH DETAILS do not stream results back to the caller, so a streaming solution for SpanStats that lowered the "time to first result" would not get results back to the user any faster, because SHOW RANGES would still block.

Jira issue: CRDB-29134
Epic: CRDB-30635

@zachlite zachlite added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cluster-observability labels Jun 27, 2023
@zachlite zachlite self-assigned this Jun 27, 2023
@knz
Copy link
Contributor

knz commented Jul 19, 2023

Callers of SpanStats like SHOW RANGES WITH DETAILS do not stream results back to the caller, so a streaming solution for SpanStats that lowered the "time to first result" would not get results back to the user any faster, because SHOW RANGES would still block.

The point of using a stream interface for the low-level API is to not run afoul of the batch size limit. It's OK if SHOW RANGES blocks. That was by design. (That's also why the expensive computation is only performed under WITH DETAILS)

@zachlite zachlite added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 19, 2023

Hi @zachlite, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@zachlite zachlite added the branch-master Failures and bugs on the master branch. label Jul 19, 2023
zachlite pushed a commit to zachlite/cockroach that referenced this issue Aug 24, 2023
Previously, span stats requests could only request a maximum
of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler
and the secondary tenant's handler. The cluster setting
`server.span_stats.span_batch_limit` is used to control the
size of the batches. The default value has been raised to 1000.

Resolves cockroachdb#105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details
from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject
to errors if the number of requested spans is too large.
zachlite pushed a commit to zachlite/cockroach that referenced this issue Aug 25, 2023
Previously, span stats requests could only request a maximum
of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler
and the secondary tenant's handler. The cluster setting
`server.span_stats.span_batch_limit` is used to control the
size of the batches. The default value has been raised to 1000.

Resolves cockroachdb#105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details
from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject
to errors if the number of requested spans is too large.
zachlite pushed a commit to zachlite/cockroach that referenced this issue Aug 31, 2023
Previously, span stats requests could only request a maximum
of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler
and the secondary tenant's handler. The cluster setting
`server.span_stats.span_batch_limit` is used to control the
size of the batches. The default value has been raised to 1000.

Resolves cockroachdb#105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details
from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject
to errors if the number of requested spans is too large.
craig bot pushed a commit that referenced this issue Sep 1, 2023
109464: server: batch large span stats requests r=zachlite a=zachlite

Previously, span stats requests could only request a maximum of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler and the secondary tenant's handler. The cluster setting `server.span_stats.span_batch_limit` is used to control the size of the batches. The default value has been raised to 1000.

Resolves #105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject to errors if the number of requested spans is too large.

109612: serverutils: implement the testserver interfaces indirectly r=stevendanna a=knz

First commit from #109591.
Epic: CRDB-18499
Fixes #107058.

TLDR: unit tests using TestServer/TestCluster now produce warning and
notice messages upon certain suspicious API usage.

For example:
```
$ dev test //pkg/server/storage_api -f TestStatusGetFiles
=== RUN   TestStatusGetFiles
    ...
        pkg/server/storage_api/storage_api_test_test/pkg/server/storage_api/files_test.go:46: (TestStatusGetFiles)
                NOTICE: .GetStatusClient() called via implicit interface ApplicationLayerInterface;
        HINT: consider using .ApplicationLayer().GetStatusClient() instead.
    conditional_wrap.go:193: TIP: consider replacing the test server initialization from:

            ts, db, kvDB := serverutils.StartServer(t, ...)
            // or:
            tc := serverutils.StartCluster(t, ...)
            ts := tc.Server(0)

        To:

            srv, db, kvDB := serverutils.StartServer(t, ...)
            defer srv.Stop(...)
            ts := srv.ApplicationLayer()
            // or:
            tc := serverutils.StartCluster(t, ...)
            ts := tc.Server(0).ApplicationLayer()
```

The following warnings/notices are implemented.

**ApplicationLayerInterface**

The problem here is that the implicit implementation of
`ApplicationLayerInterface` inside `TestServerInterface` is likely
misused. If/when the test server automatically starts a secondary
tenant, the *implicit* `ApplicationLayerInterface` refers to the
system tenant, whereas only `.ApplicationLayer()` refers to the
secondary tenant. It's likely that the test code should use the
latter.

For this, a warning is printed when a method from the *implicit*
`ApplicationLayerInterface` is called.

The warning is not printed if the test has indicated (via
`DefaultTestTenant: base.TestIsForStorageLayerAndNeedsASystemTenant`)
that it is ever only interested in the storage layer.

**StorageLayerInterface**

We want to promote the use of the explicit interface accessor
`.StorageLayer()` to access the methods of `StorageLayerInterface`.

For this, a notice is printed when a method from the *implicit*
`StorageLayerInterface` is called.

The notice is not printed if the test has indicated (via
`DefaultTestTenant: base.TestIsForStorageLayerAndNeedsASystemTenant`)
that it is ever only interested in the storage layer.

**TenantControlInterface**

We want to promote the use of the explicit interface accessor
`.TenantController()` to access the methods of
`TenantControlInterface`.

For this, a notice is printed when a method from the *implicit*
`TenantControlInterface` is called.



109863: roachtest: add npgsql tests to ignorelist; not blocklist r=rafiss a=rafiss

The previous commit 1928be2 accidentally put them in the wrong list. The blocklist is for tests that are expected to always fail, but a flaky test might pass sometimes.

fixes #109806
fixes #109738
Release note: None

109872: sql: coherent handling of hidden session vars in info_schema r=rafiss a=knz

Fixes #109870.
Epic: CRDB-28893

Release note (bug fix): Certain SQL session variables are meant
to be hidden from introspection but were showing up in
`information_schema.session_variables`, which was incoherent with the
handling in `pg_catalog.pg_settings`. This bug has now been fixed.

Co-authored-by: Zach Lite <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in 4e14279 Sep 1, 2023
blathers-crl bot pushed a commit that referenced this issue Sep 1, 2023
Previously, span stats requests could only request a maximum
of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler
and the secondary tenant's handler. The cluster setting
`server.span_stats.span_batch_limit` is used to control the
size of the batches. The default value has been raised to 1000.

Resolves #105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details
from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject
to errors if the number of requested spans is too large.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants