Merge #76743 #77325
76743: pkg/server: Add `TenantRanges` to status server, hook into debug.zip r=rimadeodhar a=abarganier

While a system tenant or host cluster has full access to the debug
information provided by the `Ranges` endpoint in the KV status server,
tenants currently have no way to fetch metadata about their own ranges.

We'd like to expose debug information in the form of range metadata to
tenants, so that they can use this information in `debug.zip`, which
is currently in the process of being exposed (albeit a subset of the
full functionality) to tenants.

To provide this, we implement `TenantRanges` and make it accessible
within the tenant status server. The endpoint will reach across the
Tenant/KV boundary via the tenant Connector interface to call the
associated KV status server implementation. From here, we can lean
on the existing `Ranges` endpoint. We can fan out requests to `Ranges`
for all nodes containing replicas of ranges within the tenant's
keyspace. The caller can then transform the metadata into a more
tenant-appropriate format (e.g. avoiding concepts that break the
'tenant boundary', such as node IDs, replication information, etc).

The results from all nodes are then combined and returned to the
tenant caller.

This PR does not include pagination; that functionality will come
in a follow-up PR, where we offset based on range start keys.

This PR also modifies the cluster-wide debug zip config to attempt
a `TenantRanges` request, and generate a file from the response.
This file will provide range metadata for all the leaseholder replicas
available in the tenant's keyspace, which can be used by tenants
for debug purposes, such as identifying hot ranges.
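
As a rough illustration of the "hot ranges" use case, the sketch below picks the busiest leaseholder replicas by queries per second. The `replicaStats` type and `hottestRanges` function are hypothetical; only the field meanings (`is_leaseholder`, `queries_per_second`) are taken from the generated docs in this commit, which note that non-leaseholder replicas report a QPS of 0.

```go
package main

import (
	"fmt"
	"sort"
)

// replicaStats is a hypothetical subset of the per-replica metadata.
type replicaStats struct {
	RangeID          int64
	IsLeaseholder    bool
	QueriesPerSecond float64
}

// hottestRanges returns up to k leaseholder replicas, ordered by descending
// QPS. Non-leaseholders are skipped because their QPS is reported as 0.
func hottestRanges(replicas []replicaStats, k int) []replicaStats {
	leaseholders := make([]replicaStats, 0, len(replicas))
	for _, r := range replicas {
		if r.IsLeaseholder {
			leaseholders = append(leaseholders, r)
		}
	}
	sort.SliceStable(leaseholders, func(i, j int) bool {
		return leaseholders[i].QueriesPerSecond > leaseholders[j].QueriesPerSecond
	})
	if k < 0 {
		k = 0
	}
	if k > len(leaseholders) {
		k = len(leaseholders)
	}
	return leaseholders[:k]
}

func main() {
	replicas := []replicaStats{
		{RangeID: 41, IsLeaseholder: true, QueriesPerSecond: 12.5},
		{RangeID: 41, IsLeaseholder: false, QueriesPerSecond: 0},
		{RangeID: 42, IsLeaseholder: true, QueriesPerSecond: 310},
	}
	for _, r := range hottestRanges(replicas, 2) {
		fmt.Printf("r%d: %.1f qps\n", r.RangeID, r.QueriesPerSecond)
	}
}
```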

Release note (api change): The status API now exposes a new
`_status/tenant_ranges` endpoint to tenants, although it is not
currently used except by debug.zip (see following commit).

77325: kvserver: downgrade & augment "slow raft ready" message r=erikgrinaker a=tbg

It doesn't rise to the level of a `Warning`; rather, it is
informational. While I was here, I also added a note to the message
that seeing it could indicate that the node (or storage) is
overloaded.

Triggered by an internal question[^1] about this message.

[^1]: https://cockroachlabs.slack.com/archives/CHKQGKYEM/p1646245983917929

Release justification: low-risk logging improvement.
Release note: None


Co-authored-by: Alex Barganier
Co-authored-by: Tobias Grieger
3 people committed Mar 3, 2022
3 parents fd15d3a + ca3badb + 726f83d commit bbdfe48
Showing 26 changed files with 604 additions and 13 deletions.
172 changes: 172 additions & 0 deletions docs/generated/http/full.md
@@ -1623,6 +1623,178 @@ Tier represents one level of the locality hierarchy.



## TenantRanges

`GET /_status/tenant_ranges`

TenantRanges requests internal details about all range replicas within
the tenant's keyspace.

Support status: [reserved](#support-status)

#### Request Parameters













#### Response Parameters







| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| ranges_by_locality | [TenantRangesResponse.RangesByLocalityEntry](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangesResponse.RangesByLocalityEntry) | repeated | ranges_by_locality maps each range replica to its specified availability zone, as defined within the replica's locality metadata (default key `az`). Replicas without the default available zone key set will fall under the `locality-unset` key. | [reserved](#support-status) |
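
The bucketing rule in the `ranges_by_locality` description can be sketched as below. This is a hedged illustration, not the server's implementation: `tier` and `localityBucket` are hypothetical names, and the sketch assumes the map key is the value of the `az` tier, with `locality-unset` as the fallback.

```go
package main

import "fmt"

// tier is a hypothetical stand-in for one key/value level of a Locality.
type tier struct {
	Key   string
	Value string
}

const (
	defaultLocalityKey = "az"             // default key named in the field description
	localityUnset      = "locality-unset" // fallback bucket named in the field description
)

// localityBucket returns the map key under which a replica would be grouped:
// the value of the "az" tier if present, otherwise the fallback bucket.
func localityBucket(tiers []tier) string {
	for _, t := range tiers {
		if t.Key == defaultLocalityKey {
			return t.Value
		}
	}
	return localityUnset
}

func main() {
	withAZ := []tier{{Key: "region", Value: "us-east1"}, {Key: "az", Value: "us-east1-b"}}
	withoutAZ := []tier{{Key: "region", Value: "us-east1"}}
	fmt.Println(localityBucket(withAZ))
	fmt.Println(localityBucket(withoutAZ))
}
```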






<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangesResponse.RangesByLocalityEntry"></a>
#### TenantRangesResponse.RangesByLocalityEntry



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| key | [string](#cockroach.server.serverpb.TenantRangesResponse-string) | | | |
| value | [TenantRangesResponse.TenantRangeList](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangesResponse.TenantRangeList) | | | |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangesResponse.TenantRangeList"></a>
#### TenantRangesResponse.TenantRangeList



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| ranges | [TenantRangeInfo](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangeInfo) | repeated | | [reserved](#support-status) |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangeInfo"></a>
#### TenantRangeInfo

TenantRangeInfo provides metadata about a specific range replica,
where concepts not considered to be relevant within the tenant
abstraction (e.g. NodeIDs) are omitted. Instead, Locality information
is used to distinguish replicas.

| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| range_id | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The ID of the Range. | [reserved](#support-status) |
| span | [PrettySpan](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.PrettySpan) | | The pretty-printed key span of the range. | [reserved](#support-status) |
| locality | [Locality](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.Locality) | | Any locality information associated with this specific replica. | [reserved](#support-status) |
| is_leaseholder | [bool](#cockroach.server.serverpb.TenantRangesResponse-bool) | | Whether the range's specific replica is a leaseholder. | [reserved](#support-status) |
| lease_valid | [bool](#cockroach.server.serverpb.TenantRangesResponse-bool) | | Whether the range's specific replica holds a valid lease. | [reserved](#support-status) |
| range_stats | [RangeStatistics](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.RangeStatistics) | | Statistics about the range replica, e.g. QPS, WPS. | [reserved](#support-status) |
| mvcc_stats | [cockroach.storage.enginepb.MVCCStats](#cockroach.server.serverpb.TenantRangesResponse-cockroach.storage.enginepb.MVCCStats) | | MVCC stats about the range replica, e.g. live_bytes. | [reserved](#support-status) |
| read_latches | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | Read count as reported by the range replica's spanlatch.Manager. | [reserved](#support-status) |
| write_latches | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | Write count as reported by the range replica's spanlatch.Manager. | [reserved](#support-status) |
| locks | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The number of locks as reported by the range replica's lockTable. | [reserved](#support-status) |
| locks_with_wait_queues | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The number of locks with non-empty wait-queues as reported by the range replica's lockTable | [reserved](#support-status) |
| lock_wait_queue_waiters | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The aggregate number of waiters in wait-queues across all locks as reported by the range replica's lockTable | [reserved](#support-status) |
| top_k_locks_by_wait_queue_waiters | [TenantRangeInfo.LockInfo](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangeInfo.LockInfo) | repeated | The top-k locks with the most waiters (readers + writers) in their wait-queue, ordered in descending order. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.PrettySpan"></a>
#### PrettySpan



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| start_key | [string](#cockroach.server.serverpb.TenantRangesResponse-string) | | | [reserved](#support-status) |
| end_key | [string](#cockroach.server.serverpb.TenantRangesResponse-string) | | | [reserved](#support-status) |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.Locality"></a>
#### Locality

Locality is an ordered set of key value Tiers that describe a node's
location. The tier keys should be the same across all nodes.

| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| tiers | [Tier](#cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.Tier) | repeated | | [reserved](#support-status) |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.Tier"></a>
#### Tier

Tier represents one level of the locality hierarchy.

| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| key | [string](#cockroach.server.serverpb.TenantRangesResponse-string) | | Key is the name of tier and should match all other nodes. | [reserved](#support-status) |
| value | [string](#cockroach.server.serverpb.TenantRangesResponse-string) | | Value is node specific value corresponding to the key. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.RangeStatistics"></a>
#### RangeStatistics

RangeStatistics describes statistics reported by a range. For internal use
only.

| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| queries_per_second | [double](#cockroach.server.serverpb.TenantRangesResponse-double) | | Queries per second served by this range.<br><br>Note that queries per second will only be known by the leaseholder. All other replicas will report it as 0. | [reserved](#support-status) |
| writes_per_second | [double](#cockroach.server.serverpb.TenantRangesResponse-double) | | Writes per second served by this range. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.TenantRangesResponse-cockroach.server.serverpb.TenantRangeInfo.LockInfo"></a>
#### TenantRangeInfo.LockInfo

LockInfo provides metadata about the state of a single lock
in the range replica's lockTable.

| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| pretty_key | [string](#cockroach.server.serverpb.TenantRangesResponse-string) | | The lock's key in pretty format. | [reserved](#support-status) |
| key | [bytes](#cockroach.server.serverpb.TenantRangesResponse-bytes) | | The lock's key. | [reserved](#support-status) |
| held | [bool](#cockroach.server.serverpb.TenantRangesResponse-bool) | | Is the lock actively held by a transaction, or just a reservation? | [reserved](#support-status) |
| waiters | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The number of waiters in the lock's wait queue. | [reserved](#support-status) |
| waiting_readers | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The number of waiting readers in the lock's wait queue. | [reserved](#support-status) |
| waiting_writers | [int64](#cockroach.server.serverpb.TenantRangesResponse-int64) | | The number of waiting writers in the lock's wait queue. | [reserved](#support-status) |






## Gossip

`GET /_status/gossip/{node_id}`
20 changes: 20 additions & 0 deletions pkg/ccl/kvccl/kvtenantccl/connector.go
@@ -120,6 +120,11 @@ var _ config.SystemConfigProvider = (*Connector)(nil)
// multi-region primitives.
var _ serverpb.RegionsServer = (*Connector)(nil)

// Connector is capable of finding debug information about the current
// tenant within the cluster. This is necessary for things such as
// debug zip and range reports.
var _ serverpb.TenantStatusServer = (*Connector)(nil)

// Connector is capable of accessing span configurations for secondary tenants.
var _ spanconfig.KVAccessor = (*Connector)(nil)

@@ -428,6 +433,21 @@ func (c *Connector) Regions(
	return resp, nil
}

// TenantRanges implements the serverpb.TenantStatusServer interface
func (c *Connector) TenantRanges(
	ctx context.Context, req *serverpb.TenantRangesRequest,
) (resp *serverpb.TenantRangesResponse, _ error) {
	if err := c.withClient(ctx, func(ctx context.Context, c *client) error {
		var err error
		resp, err = c.TenantRanges(ctx, req)
		return err
	}); err != nil {
		return nil, err
	}

	return resp, nil
}

// FirstRange implements the kvcoord.RangeDescriptorDB interface.
func (c *Connector) FirstRange() (*roachpb.RangeDescriptor, error) {
	return nil, status.Error(codes.Unauthenticated, "kvtenant.Proxy does not have access to FirstRange")
1 change: 1 addition & 0 deletions pkg/ccl/serverccl/statusccl/BUILD.bazel
@@ -36,6 +36,7 @@ go_test(
"//pkg/ccl",
"//pkg/ccl/kvccl",
"//pkg/ccl/utilccl",
"//pkg/keys",
"//pkg/roachpb",
"//pkg/rpc",
"//pkg/security",
62 changes: 62 additions & 0 deletions pkg/ccl/serverccl/statusccl/tenant_status_test.go
@@ -22,6 +22,7 @@ import (

	"github.com/cockroachdb/cockroach/pkg/base"
	_ "github.com/cockroachdb/cockroach/pkg/ccl/kvccl"
	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/security"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
@@ -96,6 +97,10 @@ func TestTenantStatusAPI(t *testing.T) {
	t.Run("txn_id_resolution", func(t *testing.T) {
		testTxnIDResolutionRPC(ctx, t, testHelper)
	})

	t.Run("tenant_ranges", func(t *testing.T) {
		testTenantRangesRPC(ctx, t, testHelper)
	})
}

func TestTenantCannotSeeNonTenantStats(t *testing.T) {
@@ -978,3 +983,60 @@ func testTxnIDResolutionRPC(ctx context.Context, t *testing.T, helper *tenantTes
		run(sqlConn, status, 1 /* coordinatorNodeID */)
	})
}

func testTenantRangesRPC(_ context.Context, t *testing.T, helper *tenantTestHelper) {
	tenantA := helper.testCluster().tenant(0).tenant.TenantStatusServer().(serverpb.TenantStatusServer)
	keyPrefixForA := keys.MakeTenantPrefix(helper.testCluster().tenant(0).tenant.RPCContext().TenantID)
	keyPrefixEndForA := keyPrefixForA.PrefixEnd()

	tenantB := helper.controlCluster().tenant(0).tenant.TenantStatusServer().(serverpb.TenantStatusServer)
	keyPrefixForB := keys.MakeTenantPrefix(helper.controlCluster().tenant(0).tenant.RPCContext().TenantID)
	keyPrefixEndForB := keyPrefixForB.PrefixEnd()

	resp, err := tenantA.TenantRanges(context.Background(), &serverpb.TenantRangesRequest{})
	require.NoError(t, err)
	require.NotEmpty(t, resp.RangesByLocality)
	for localityKey, rangeList := range resp.RangesByLocality {
		require.NotEmpty(t, localityKey)
		for _, r := range rangeList.Ranges {
			assertStartKeyInRange(t, r.Span.StartKey, keyPrefixForA)
			assertEndKeyInRange(t, r.Span.EndKey, keyPrefixForA, keyPrefixEndForA)
		}
	}

	resp, err = tenantB.TenantRanges(context.Background(), &serverpb.TenantRangesRequest{})
	require.NoError(t, err)
	require.NotEmpty(t, resp.RangesByLocality)
	for localityKey, rangeList := range resp.RangesByLocality {
		require.NotEmpty(t, localityKey)
		for _, r := range rangeList.Ranges {
			assertStartKeyInRange(t, r.Span.StartKey, keyPrefixForB)
			assertEndKeyInRange(t, r.Span.EndKey, keyPrefixForB, keyPrefixEndForB)
		}
	}
}

// assertStartKeyInRange compares the pretty printed startKey with the provided
// tenantPrefix key, ensuring that the startKey starts with the tenantPrefix.
func assertStartKeyInRange(t *testing.T, startKey string, tenantPrefix roachpb.Key) {
	require.Truef(t, strings.Index(startKey, tenantPrefix.String()) == 0,
		fmt.Sprintf("start key %s is outside of the tenant's keyspace (prefix: %v)",
			startKey, tenantPrefix.String()))
}

// assertEndKeyInRange compares the pretty printed endKey with the provided
// tenantPrefix and tenantPrefixEnd keys. Ensures that the key starts with
// either the tenantPrefix, or the tenantPrefixEnd (valid as end keys are
// exclusive).
func assertEndKeyInRange(
	t *testing.T, endKey string, tenantPrefix roachpb.Key, tenantPrefixEnd roachpb.Key,
) {
	require.Truef(t,
		strings.Index(endKey, tenantPrefix.String()) == 0 ||
			strings.Index(endKey, tenantPrefixEnd.String()) == 0 ||
			// Possible if the tenant's ranges fall at the end of the entire keyspace
			// range within the cluster.
			endKey == "/Max",
		fmt.Sprintf("end key %s is outside of the tenant's keyspace (prefix: %v, prefixEnd: %v)",
			endKey, tenantPrefix.String(), tenantPrefixEnd.String()))
}
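
The checks performed by the two assertion helpers above can be restated on plain strings, outside the test harness. This is a sketch under assumptions: `startKeyInTenant` and `endKeyInTenant` are hypothetical names, and the `/Tenant/10` style keys are illustrative pretty-printed prefixes rather than values taken from the test.

```go
package main

import (
	"fmt"
	"strings"
)

// startKeyInTenant reports whether a pretty-printed start key begins with the
// tenant's pretty-printed prefix.
func startKeyInTenant(startKey, tenantPrefix string) bool {
	return strings.HasPrefix(startKey, tenantPrefix)
}

// endKeyInTenant reports whether a pretty-printed end key is acceptable for
// the tenant. End keys are exclusive, so the key may also begin with the
// prefix end, or be "/Max" when the tenant sits at the end of the keyspace.
func endKeyInTenant(endKey, tenantPrefix, tenantPrefixEnd string) bool {
	return strings.HasPrefix(endKey, tenantPrefix) ||
		strings.HasPrefix(endKey, tenantPrefixEnd) ||
		endKey == "/Max"
}

func main() {
	// Illustrative prefixes only; real values come from keys.MakeTenantPrefix.
	prefix, prefixEnd := "/Tenant/10", "/Tenant/11"
	fmt.Println(startKeyInTenant("/Tenant/10/Table/5", prefix))
	fmt.Println(endKeyInTenant("/Tenant/11", prefix, prefixEnd))
	fmt.Println(endKeyInTenant("/Tenant/12/Table/1", prefix, prefixEnd))
}
```

One caveat of comparing pretty-printed strings in this sketch: a prefix such as `/Tenant/10` is also a string prefix of `/Tenant/100`, so comparing the raw key bytes would be stricter.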
3 changes: 3 additions & 0 deletions pkg/cli/testdata/zip/partial1
@@ -9,6 +9,9 @@ debug zip --concurrency=1 --cpu-profile-duration=0s /dev/null
[cluster] requesting data for debug/rangelog... received response... converting to JSON... writing binary output: debug/rangelog.json... done
[cluster] requesting data for debug/settings... received response... converting to JSON... writing binary output: debug/settings.json... done
[cluster] requesting data for debug/reports/problemranges... received response... converting to JSON... writing binary output: debug/reports/problemranges.json... done
[cluster] requesting data for debug/tenant_ranges... received response...
[cluster] requesting data for debug/tenant_ranges: last request failed: rpc error: ...
[cluster] requesting data for debug/tenant_ranges: creating error output: debug/tenant_ranges.json.err.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events... writing output: debug/crdb_internal.cluster_contention_events.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows... writing output: debug/crdb_internal.cluster_distsql_flows.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges... writing output: debug/crdb_internal.cluster_database_privileges.txt... done
3 changes: 3 additions & 0 deletions pkg/cli/testdata/zip/partial1_excluded
@@ -9,6 +9,9 @@ debug zip /dev/null --concurrency=1 --exclude-nodes=2 --cpu-profile-duration=0
[cluster] requesting data for debug/rangelog... received response... converting to JSON... writing binary output: debug/rangelog.json... done
[cluster] requesting data for debug/settings... received response... converting to JSON... writing binary output: debug/settings.json... done
[cluster] requesting data for debug/reports/problemranges... received response... converting to JSON... writing binary output: debug/reports/problemranges.json... done
[cluster] requesting data for debug/tenant_ranges... received response...
[cluster] requesting data for debug/tenant_ranges: last request failed: rpc error: ...
[cluster] requesting data for debug/tenant_ranges: creating error output: debug/tenant_ranges.json.err.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events... writing output: debug/crdb_internal.cluster_contention_events.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows... writing output: debug/crdb_internal.cluster_distsql_flows.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges... writing output: debug/crdb_internal.cluster_database_privileges.txt... done
3 changes: 3 additions & 0 deletions pkg/cli/testdata/zip/partial2
@@ -9,6 +9,9 @@ debug zip --concurrency=1 --cpu-profile-duration=0 /dev/null
[cluster] requesting data for debug/rangelog... received response... converting to JSON... writing binary output: debug/rangelog.json... done
[cluster] requesting data for debug/settings... received response... converting to JSON... writing binary output: debug/settings.json... done
[cluster] requesting data for debug/reports/problemranges... received response... converting to JSON... writing binary output: debug/reports/problemranges.json... done
[cluster] requesting data for debug/tenant_ranges... received response...
[cluster] requesting data for debug/tenant_ranges: last request failed: rpc error: ...
[cluster] requesting data for debug/tenant_ranges: creating error output: debug/tenant_ranges.json.err.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events... writing output: debug/crdb_internal.cluster_contention_events.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows... writing output: debug/crdb_internal.cluster_distsql_flows.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges... writing output: debug/crdb_internal.cluster_database_privileges.txt... done
3 changes: 3 additions & 0 deletions pkg/cli/testdata/zip/testzip
@@ -9,6 +9,9 @@ debug zip --concurrency=1 --cpu-profile-duration=1s /dev/null
[cluster] requesting data for debug/rangelog... received response... converting to JSON... writing binary output: debug/rangelog.json... done
[cluster] requesting data for debug/settings... received response... converting to JSON... writing binary output: debug/settings.json... done
[cluster] requesting data for debug/reports/problemranges... received response... converting to JSON... writing binary output: debug/reports/problemranges.json... done
[cluster] requesting data for debug/tenant_ranges... received response...
[cluster] requesting data for debug/tenant_ranges: last request failed: rpc error: ...
[cluster] requesting data for debug/tenant_ranges: creating error output: debug/tenant_ranges.json.err.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events... writing output: debug/crdb_internal.cluster_contention_events.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows... writing output: debug/crdb_internal.cluster_distsql_flows.txt... done
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges... writing output: debug/crdb_internal.cluster_database_privileges.txt... done
5 changes: 5 additions & 0 deletions pkg/cli/testdata/zip/testzip_concurrent
@@ -30,6 +30,11 @@ zip
[cluster] requesting data for debug/settings: done
[cluster] requesting data for debug/settings: received response...
[cluster] requesting data for debug/settings: writing binary output: debug/settings.json...
[cluster] requesting data for debug/tenant_ranges...
[cluster] requesting data for debug/tenant_ranges: creating error output: debug/tenant_ranges.json.err.txt...
[cluster] requesting data for debug/tenant_ranges: done
[cluster] requesting data for debug/tenant_ranges: last request failed: rpc error: ...
[cluster] requesting data for debug/tenant_ranges: received response...
[cluster] requesting liveness...
[cluster] requesting liveness: converting to JSON...
[cluster] requesting liveness: done