
statusccl: temporarily serve /_status/nodes to tenants #93268

Merged 1 commit into cockroachdb:master from the tenant-nodes branch on Dec 9, 2022

Conversation

matthewtodd
Contributor

Fixes #92065
Part of #89949

Our current UI architecture has many of the SQL Activity pages mapping node or SQL instance IDs to regions at page load time, using the response from this `/_status/nodes` endpoint to get the information they need.

This approach is problematic for ephemeral serverless tenants, and fixing it properly will need a broader reimagining: we will probably end up storing locality information directly in many of the sqlstats tables, which will likely require something like sending instance localities along in the DistSQL tracing. An initial exploration in this direction happened in #92259.

In the meantime, as a stop-gap measure, we implement a reduced version of this `/_status/nodes` endpoint for tenants, keeping the SQL Activity pages somewhat working for multiregion tenants.

Release note: None
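
For illustration only, here is a minimal Go sketch of the shape of this stop-gap, with hypothetical local types standing in for the real protobuf messages: each live SQL instance is mapped to a minimal node entry carrying just the ID and locality that the SQL Activity pages use for their region lookup.

```go
// Package sketch is an illustration of the reduced tenant "nodes"
// response; these types are hypothetical, not the actual serverpb ones.
package sketch

type Locality struct {
	Region string
}

type SQLInstance struct {
	InstanceID int32
	Locality   Locality
}

type NodeEntry struct {
	NodeID   int32
	Locality Locality
}

type NodesResponse struct {
	Nodes []NodeEntry
}

// reducedNodesResponse maps each SQL instance to a minimal node entry,
// so the existing node-ID-to-region lookup in the UI keeps working.
func reducedNodesResponse(instances []SQLInstance) NodesResponse {
	var resp NodesResponse
	for _, in := range instances {
		resp.Nodes = append(resp.Nodes, NodeEntry{
			NodeID:   in.InstanceID,
			Locality: in.Locality,
		})
	}
	return resp
}
```

The actual change presumably populates the real `/_status/nodes` response message; the point of the sketch is only that identity and locality are the fields the UI's region mapping depends on.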

@cockroach-teamcity
Member

This change is Reviewable

@matthewtodd matthewtodd marked this pull request as ready for review December 8, 2022 18:34
@matthewtodd matthewtodd requested a review from a team December 8, 2022 18:34
@matthewtodd matthewtodd requested review from a team as code owners December 8, 2022 18:34
@matthewtodd matthewtodd requested a review from a team December 8, 2022 18:34
@matthewtodd matthewtodd requested a review from a team as a code owner December 8, 2022 18:34
@knz knz requested a review from dhartunian December 8, 2022 18:38
@knz knz added the T-multitenant (Issues owned by the multi-tenant virtual team) and A-multitenancy (Related to multi-tenancy) labels, then removed the T-multitenant label, Dec 8, 2022
Collaborator

@dhartunian dhartunian left a comment


:lgtm: Only one idea to consider: just inlining the code.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @matthewtodd)


pkg/server/fanout_clients.go line 106 at r1 (raw file):

		})
	}
	return &resp, err

I don't feel too strongly, but since there's no matching impl for the kvFanoutClient, I'd just inline this in the Nodes handler.
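
To make the suggestion concrete, a hypothetical sketch (reusing the illustrative types from the earlier sketch, not the real signatures) of the instance-to-node loop living inline in the tenant's Nodes handler rather than behind a fanout-client helper:

```go
// Hypothetical handler shape: since the kv-backed fanout client has no
// equivalent method, the mapping loop sits directly in the handler.
package sketch

import "context"

type tenantStatusServer struct {
	// listSQLInstances is a hypothetical accessor for the tenant's live
	// SQL instances.
	listSQLInstances func(context.Context) ([]SQLInstance, error)
}

func (s *tenantStatusServer) Nodes(ctx context.Context) (NodesResponse, error) {
	instances, err := s.listSQLInstances(ctx)
	if err != nil {
		return NodesResponse{}, err
	}
	var resp NodesResponse
	for _, in := range instances {
		resp.Nodes = append(resp.Nodes, NodeEntry{
			NodeID:   in.InstanceID,
			Locality: in.Locality,
		})
	}
	return resp, nil
}
```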


pkg/server/status.go line 1516 at r1 (raw file):

	}

	return s.serverIterator.nodes(ctx)

FYI: you probably checked already, but this could create a bad failure state in DB Console. It's used to getting an unimplemented 5xx error from this endpoint for tenants, and this change will lead it to get a 200, which could trigger additional runtime exceptions elsewhere due to missing data.

Contributor Author

@matthewtodd matthewtodd left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)


pkg/server/status.go line 1516 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

FYI: you probably checked already, but this could create a bad failure state in DB Console. It's used to getting an unimplemented 5xx error from this endpoint for tenants, and this change will lead it to get a 200, which could trigger additional runtime exceptions elsewhere due to missing data.

Thanks for calling this out. Looking through the selectors that depend on this response, I think we're good, but I'll double-check with fresh eyes in the morning before merging.


pkg/server/fanout_clients.go line 106 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I don't feel too strongly, but since there's no matching impl for the kvFanoutClient, I'd just inline this in the Nodes handler.

Good idea, done. (Should make your getAllNodes / nodesList work easier to reason about, too!)

Contributor Author

@matthewtodd matthewtodd left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)


pkg/server/status.go line 1516 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Thanks for calling this out. Looking through the selectors that depend on this response, I think we're good, but I'll double-check with fresh eyes in the morning before merging.

Found one, tested and fixed. Will merge on green. Thanks again, David!

@matthewtodd
Contributor Author

bors r+

@craig
Contributor

craig bot commented Dec 9, 2022

Build succeeded.

@craig craig bot merged commit 4a641ac into cockroachdb:master Dec 9, 2022
@matthewtodd matthewtodd deleted the tenant-nodes branch December 9, 2022 19:16
craig bot pushed a commit that referenced this pull request Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.


Release note (backward-incompatible change): We've stopped
supporting/populating the crdb_internal.gossip_network table. It was an
internal table with no API guarantees (so perhaps not meriting a release note?).

Release note (performance improvement): Significantly reduced CPU usage
of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done)

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time.

`@irfansharif` `@kvoli` 

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region.

This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience.

This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None
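
As a tiny, hypothetical illustration of the "say nothing instead of undefined" behavior (the real change lives in the TypeScript UI; this Go helper is only a sketch):

```go
// Package sketch: regionLabel returns an empty label when an instance's
// region is no longer known (e.g. the instance has spun down), instead
// of surfacing a literal "undefined" to the user.
package sketch

func regionLabel(regionByInstance map[int32]string, instanceID int32) string {
	if region, ok := regionByInstance[instanceID]; ok {
		return region
	}
	return "" // omit the region entirely rather than showing "undefined"
}
```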

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d).

A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots.

This mode of failure is deemed to be a false positive. The intention is that the logging goroutines live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic.

Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.
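
A minimal, self-contained sketch of the snapshot-and-compare pattern described above, with an ignore list standing in for the logging-goroutine exclusion; this illustrates the technique and is not the actual `leaktest` implementation:

```go
// Package leaksketch illustrates the before/after goroutine snapshot pattern.
package leaksketch

import (
	"runtime"
	"strings"
)

// interestingGoroutines returns the stacks of all current goroutines,
// skipping itself and anything matching an ignore pattern (standing in
// for the long-lived pkg/util/log goroutines excluded by the fix).
func interestingGoroutines(ignore []string) map[string]bool {
	buf := make([]byte, 1<<20)
	buf = buf[:runtime.Stack(buf, true /* all goroutines */)]
	stacks := make(map[string]bool)
outer:
	for _, g := range strings.Split(string(buf), "\n\n") {
		g = strings.TrimSpace(g)
		if g == "" || strings.Contains(g, "interestingGoroutines") {
			continue
		}
		for _, pattern := range ignore {
			if strings.Contains(g, pattern) {
				continue outer // excluded: may start at any time
			}
		}
		stacks[g] = true
	}
	return stacks
}

// StartLeakCheck snapshots the current goroutines and returns a closure
// that reports goroutines present at the end of the test but not at the
// start, minus anything on the ignore list.
func StartLeakCheck(ignore []string) func() []string {
	before := interestingGoroutines(ignore)
	return func() []string {
		var leaked []string
		for g := range interestingGoroutines(ignore) {
			if !before[g] {
				leaked = append(leaked, g)
			}
		}
		return leaked
	}
}
```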

Closes #93849.

Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
adityamaru pushed a commit to adityamaru/cockroach that referenced this pull request Dec 22, 2022
craig bot pushed a commit that referenced this pull request Mar 6, 2023
95449: sqlstats: include regions in statement_statistics r=matthewtodd a=matthewtodd

Part of #89949.

This completes the work abandoned in #92259, properly storing the regions in which a statement was executed, rather than waiting until UI render time to try to map (dangerously ephemeral) node IDs to their respective regions.

With this work in place, we will be able to remove both the UI mapping code (usages of this [selector][]) and the stop-gap work of #93268.

[selector]: https://github.com/cockroachdb/cockroach/blob/0f6333cbd6c78264a59dc435324c0c33d75ea148/pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.selectors.ts#L52-L64

Release note (sql change): A regions field was added to the statistics column of crdb_internal.statement_statistics, reporting the regions of the nodes on which the statement was executed.
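
A hedged example of how one might read the new field from a cluster, assuming the regions value lives under the `statistics` object of the JSON `statistics` column (the exact JSON path is an assumption based on the release note, not verified against the schema):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

func main() {
	// Connection string is illustrative; adjust for your cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed JSON path: statistics -> 'statistics' -> 'regions'.
	rows, err := db.Query(`
		SELECT metadata->>'query', statistics->'statistics'->'regions'
		FROM crdb_internal.statement_statistics
		LIMIT 10`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var query, regions sql.NullString
		if err := rows.Scan(&query, &regions); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> %s\n", query.String, regions.String)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```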

Co-authored-by: Matthew Todd <[email protected]>
craig bot pushed a commit that referenced this pull request Mar 14, 2023
96991: roachtest: update mixed-version backup to use new framework r=srosenberg a=renatolabs


This updates the `backup/mixed-version` roachtest to use the recently
introduced mixed-version roachtest framework (`mixedversion` package).

The main behavior exercised remains the same: backups are taken in
mixed-binary state, and those backups are restored and verified at the
end of the test. However, this commit also improves the coverage of
mixed-version backup testing in a few ways:

* **Randomization**. By virtue of using the new framework, most runs
will be different from one another since the order of actions taken by
the test will be different. Previously, backups would always be taken
with 2 nodes in the old version and 2 nodes in the new version. Now,
backups can be taken when an arbitrary number of nodes is running the
new version. As a consequence, it's also possible that some executions
will attempt backups when all nodes are running a new binary version,
but the cluster version itself has not been updated. Other points of
new randomization include the choice of the node's external dir where
backups are stored, which node to connect to when running certain
statements, and how much to wait between backups.

* **Backup Options**. Backups will randomly be created with `revision_history` enabled, or with an `encryption_passphrase` (a rough sketch of this randomization follows the list).

* **Downgrades**. The cluster is also downgraded in mixed-version
tests. No downgrades happened in that test before this commit.

* **Workload**. Instead of using a fixed call to `generate_series` to
generate data between backups, the test now runs the `bank` workload
continuously during the test. A random wait between backups allows the
workload to make changes to the underlying table during the test and
for the backups to be taken while writes are taking place.

* **Finalization**: the test _may_ attempt to create a backup as the
upgrade is finalizing (i.e., migrations are running and cluster
version is advancing).
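
As mentioned in the Backup Options bullet, options are chosen at random; a rough, self-contained sketch of that kind of randomization follows (helper names are illustrative and not part of the `mixedversion` framework):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomBackupStmt builds a BACKUP statement with randomly chosen
// options: sometimes revision_history, sometimes an encryption
// passphrase, sometimes neither.
func randomBackupStmt(rng *rand.Rand, dest string) string {
	stmt := fmt.Sprintf("BACKUP INTO '%s'", dest)
	switch rng.Intn(3) {
	case 0:
		stmt += " WITH revision_history"
	case 1:
		stmt += " WITH encryption_passphrase = 'correct-horse-battery-staple'"
	}
	return stmt
}

func main() {
	rng := rand.New(rand.NewSource(42)) // seeded for reproducibility
	for i := 0; i < 3; i++ {
		fmt.Println(randomBackupStmt(rng, "nodelocal://1/backups"))
	}
}
```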

In addition, this test will also see improved coverage as we make more
improvements to test plans generated by the `mixedversion` package.
These changes will create more backup scenarios in the future without
requiring any code changes to this test.

This test has already helped us uncover one backup bug (#97953).

Epic: CRDB-19321

Release note: None

98398: statusccl: stop serving /_status/nodes to tenants r=matthewtodd a=matthewtodd

Fixes #98057.

This reverts the work of #93268, which is no longer necessary now that we are eagerly capturing region information at execution time in #95449.

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Labels
A-multitenancy Related to multi-tenancy
Development

Successfully merging this pull request may close these issues.

kv,server: tenant-scope nodes endpoint