
Introduction of the key visualizer #88353

Merged

merged 1 commit into cockroachdb:master from the 220727.keyvis branch on Jan 24, 2023

Conversation

zachlite
Contributor

@zachlite zachlite commented Sep 21, 2022

This PR introduces the Key Visualizer 🔍 🔥 🪳

(Screenshots: Key Visualizer UI, Sep 20, 2022 — 5:27 PM and 5:52 PM.)

The Key Visualizer is designed for multi-tenancy, but is currently only implemented for use by the system tenant. Enabling usage by secondary tenants may be considered in the future. Read more about current limitations below.

Usage and configuration

The Key Visualizer is disabled by default. It can be enabled via:

SET CLUSTER SETTING keyvisualizer.enabled = true;

The default sample period is 1 minute, and the sample retention period is 2 weeks.
The sample period can be configured via:

SET CLUSTER SETTING keyvisualizer.sample_interval = <SECONDS>;

The default sample period was kept short to allow faster debugging cycles; the default we eventually settle on is subject to feedback. If the default sample period stays short, we'll need to lower the retention period.
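For a sense of scale, the current defaults amount to 14 × 24 × 60 = 20,160 one-minute samples retained over the 2-week window, so a shorter sample period multiplies that figure accordingly.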

The default number of buckets (think sample resolution) is 256.
It can be raised to a maximum value of 1024.

SET CLUSTER SETTING keyvisualizer.max_buckets = <BUCKETS>;

Known issues, and improvements for 23.1

  • Downsampling strategy. The current downsampling strategy is implemented here. It's not bad, but it can be better. There's currently no guarantee of boundary stability between samples, and there are other heuristics to explore to prioritize preserving resolution in "important" parts of the keyspace.

  • UI improvements

    • Zoom improvements
    • Prevent overlapping of X and Y axis labels

@zachlite zachlite requested review from a team as code owners September 21, 2022 15:10
@zachlite zachlite requested a review from a team September 21, 2022 15:10
@zachlite zachlite requested a review from a team as a code owner September 21, 2022 15:10
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keys/constants.go line 470 at r1 (raw file):

	SpanConfigurationsTableID           = 47
	RoleIDSequenceID                    = 48
	SpanStatsTenantBoundariesTableID    = 49

please do not use a constant ID here. This is our very last one, and I don't think this is a good use case for it. In the table, set the ID to be zero and let it get dynamically allocated in the relevant migration please.

Contributor

@andreimatei andreimatei left a comment


I did a superficial look to get some bearings. Will do more later.

Reviewed 3 of 65 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)


-- commits line 2 at r1:
you need more in here :P


-- commits line 2 at r1:

for internal use only

Unless this means something very specific, scratch it.


-- commits line 2 at r1:
Can we get table names instead of IDs on the web page?


-- commits line 2 at r1:

The Key Visualizer is disabled by default. It can be enabled via:

Why is it disabled? When will it be enabled?


pkg/jobs/jobspb/jobs.proto line 653 at r1 (raw file):

message KeyVisualizerDetails {

being empty, is this needed?


pkg/jobs/jobspb/jobs.proto line 658 at r1 (raw file):

message KeyVisualizerProgress {
  util.hlc.Timestamp last_sample_time = 1 [(gogoproto.nullable) = false];

please comment


pkg/jobs/jobspb/jobs.proto line 1135 at r1 (raw file):

  roachpb.Version creation_cluster_version = 36 [(gogoproto.nullable) = false];

  // NEXT ID: 39.

get rid of this


pkg/keys/constants.go line 470 at r1 (raw file):

	SpanConfigurationsTableID           = 47
	RoleIDSequenceID                    = 48
	SpanStatsTenantBoundariesTableID    = 49

I'm pretty sure we can have dynamic IDs now for system tables. Let's not take 49 if we don't need to.


pkg/keyvisualizer/key_visualizer.go line 11 at r1 (raw file):

// licenses/APL.txt.

package keyvisualizer

do you want to put a big comment here or in a doc.go or link to something explaining the architecture of the thing?


pkg/keyvisualizer/key_visualizer.go line 22 at r1 (raw file):

)

// KVAccessor provides a tenant with access to the keyvispb.KeyVisualizerServer

missing period at end of sentence.


pkg/keyvisualizer/key_visualizer.go line 33 at r1 (raw file):

	GetTenantRanges(
		ctx context.Context,
		tenantID roachpb.TenantID) (*keyvispb.GetTenantRangesResponse, error)

nit: you've formatted all the others with the return values on a new line


pkg/keyvisualizer/key_visualizer.go line 44 at r1 (raw file):

// SpanStatsCollector collects request statistics for tenants.
// SpanStatsCollector will only collect statistics for a tenant after the

This comment is pretty awkward. Consider expanding on the relationship between SpanStatsCollector and KVAccessor.


pkg/keyvisualizer/key_visualizer.go line 46 at r1 (raw file):

// SpanStatsCollector will only collect statistics for a tenant after the
// tenant has called an implementation of KVAccessor.UpdateBoundaries.
type SpanStatsCollector interface {

This seems to have a single implementation. Does it really need to be an interface?


pkg/keyvisualizer/key_visualizer.go line 61 at r1 (raw file):

	// wishes to have request statistics collected for.
	// An error is returned if the watcher is not started successfully.
	Start(ctx context.Context, stopper *stop.Stopper) error

It seems weird to me that watching for updates on whatever table is bundled with something called StatsCollector. It also makes this comment awkward, since SpanStatsCollector is supposed to be a generic interface. Have you considered pulling the watch logic out of this interface, and replacing it with an UpdateBoundaries()?
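
For concreteness, a minimal sketch of the slimmer interface being suggested could look like this (method shapes and import paths are inferred from the files discussed in this review, not taken from the PR):

package keyvisualizer

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanstats/spanstatspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// SpanStatsCollector collects per-span request statistics. Watching the
// boundaries table is left to the caller, which pushes new boundaries in
// via UpdateBoundaries rather than the collector starting its own watcher.
type SpanStatsCollector interface {
	// UpdateBoundaries replaces the set of spans that statistics are
	// collected for.
	UpdateBoundaries(tenantID roachpb.TenantID, boundaries []roachpb.Span) error
	// GetSamples returns the statistics collected since the previous call.
	GetSamples(tenantID roachpb.TenantID) ([]spanstatspb.Sample, error)
}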


pkg/keyvisualizer/key_visualizer.go line 65 at r1 (raw file):

what's up with all these spaces? Have you run the formatter on this file?


pkg/keyvisualizer/key_visualizer.go line 74 at r1 (raw file):

	// When it receives the response, it will perform downsampling,
	// normalization, and ultimately write the sample to the tenant's system tables.
	FetchStats(ctx context.Context) error

nit: drop the ctx name in all these function declarations


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 33 at r1 (raw file):

message GetSamplesRequest {
  roachpb.TenantID tenant = 1;
  util.hlc.Timestamp start = 2 [(gogoproto.nullable) = false];

comment pls


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 49 at r1 (raw file):

}

service KeyVisualizer {

comment pls
What are boundaries? What are samples?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

service KeyVisualizer {

  rpc GetTenantRanges(GetTenantRangesRequest) returns

comment

Are the "ranges" here the same as the "boundaries" in SaveBoundaries? If so, let's change one of the names.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 54 at r1 (raw file):

    (GetTenantRangesResponse) {}

  // SaveBoundaries is issued by the tenant. The responding node is

this comment doesn't really say what this RPC does


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 61 at r1 (raw file):

  // GetSamplesFromAllNodes is issued by the tenant.
  rpc GetSamplesFromAllNodes(GetSamplesRequest) returns (GetSamplesResponse) {}

Elsewhere the pattern is that there's a single RPC for getting data either from one node or from all. The request has a nodeID field, and not populating it means a fan-out. Consider that.
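
As a rough illustration of that pattern on the serving side (hypothetical field and helper names — fanOutSamples and localSamples are stand-ins, not code from this PR):

// GetSamples serves both cases: a zero NodeID means "collect from every
// node and merge", while a concrete NodeID means "answer from that node only".
func (s *Server) GetSamples(
	ctx context.Context, req *keyvispb.GetSamplesRequest,
) (*keyvispb.GetSamplesResponse, error) {
	if req.NodeID == 0 {
		return s.fanOutSamples(ctx, req) // fan out to all nodes
	}
	return s.localSamples(ctx, req) // answer from the local node
}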


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

)

type tenantStatsBucket struct {

say that this implements some btree interface, and add a variable declaration below also ensuring that.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

)

type tenantStatsBucket struct {

tenantStatsBucket doesn't know anything about a tenant, so consider taking "tenant" out of the name


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 40 at r1 (raw file):

type tenantStatsCollector struct {
	stashedBoundaries []*roachpb.Span

I don't know what to make of the name "stashed". Consider scratching it. Or maybe nextBoundaries?
This needs a comment about the relationship with the tree's elements, and how GetSamples() updates one from the other. I find that weird, btw. Maybe we can find another scheme.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 60 at r1 (raw file):

			counter: 0,
		}
		err := t.Insert(&bucket, false)

comment the false inline pls
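
i.e. the usual convention of naming the literal inline (parameter name assumed here):

	err := t.Insert(&bucket, false /* fast */)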


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 62 at r1 (raw file):

		err := t.Insert(&bucket, false)
		if err != nil {
			return nil, err

please comment on what it means for newTreeWithBoundaries to return an error. I don't think the error handling we do in GetSamples() makes sense. t.Insert() only returns errors if the type of the range is not good, right? Let's panic.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 71 at r1 (raw file):

func (t *tenantStatsCollector) getStats() []*spanstatspb.SpanStats {

	stats := make([]*spanstatspb.SpanStats, 0)

make(..., 0, t.tree.Len()) ?


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 72 at r1 (raw file):

	stats := make([]*spanstatspb.SpanStats, 0)
	// TODO: acquire mutex lock?

well, what's the decision to make here?


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 75 at r1 (raw file):

	it := t.tree.Iterator()
	for {
		i, next := it.Next()

s/next/ok


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 102 at r1 (raw file):

	rff        *rangefeed.Factory
}

nit: var _ keyvisualizer.SpanStatsCollector = (*SpanStatsCollector)(nil)


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 149 at r1 (raw file):

		if err := protoutil.Unmarshal(boundariesEncoded, &decoded); err != nil {
			log.Warningf(ctx, err.Error())

I think you want to print err, not err.Error(), so that redaction can do its thing.
And put more text in this log pls.
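
For example (message text illustrative):

	log.Warningf(ctx, "failed to decode span stats boundaries: %v", err)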


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 157 at r1 (raw file):

	}

	w := rangefeedcache.NewWatcher(

please comment all the literals inline with the parameter name


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 166 at r1 (raw file):

		translateEvent,
		func(ctx context.Context, update rangefeedcache.Update) {
			// noop

instead of this comment, comment the argument as /* onUpdate */


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 173 at r1 (raw file):

	// Kick off the rangefeedcache which will retry until the stopper stops.
	if err := rangefeedcache.Start(ctx, stopper, w, func(err error) {
		log.Errorf(ctx, "span stats collector rangefeed error: %s", err.Error())

is Errorf here warranted? If this can happen because another node goes down, consider Warningf.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 175 at r1 (raw file):

		log.Errorf(ctx, "span stats collector rangefeed error: %s", err.Error())
	}); err != nil {
		return err // we're shutting down

consider scratching this comment


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 190 at r1 (raw file):

		collector.increment(span)
	} else {
		return errors.New("could not increment collector for tenant")

say "tenant not found" or something, and add a comment hinting to what the programmer was supposed to have done


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 196 at r1 (raw file):

}

func (s *SpanStatsCollector) saveBoundaries(id roachpb.TenantID,

I think the name "save" is weird. updateBoundaries ?


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 212 at r1 (raw file):

// It implicitly starts a new collection period by clearing the old
// statistics. A sample period is therefore defined by the interval between a
// tenant requesting samples. TODO(zachlite): this will change with

either say more in the TODO, or scratch it. Also put it on a new line.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 214 at r1 (raw file):

// tenant requesting samples. TODO(zachlite): this will change with
// improved fault tolerance mechanisms.
func (s *SpanStatsCollector) GetSamples(

you definitely want to hook up your editor to crlfmt :P


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 217 at r1 (raw file):

	id roachpb.TenantID) ([]spanstatspb.Sample, error) {

	if collector, ok := s.collectors[id]; ok {

invert the if and short-circuit if the tenant isn't found, so you can unindent the body. And make that error message a bit better.
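
i.e. something along these lines (error text illustrative):

	collector, ok := s.collectors[id]
	if !ok {
		return nil, errors.Newf("no span stats collector found for tenant %s", id)
	}
	// ... the rest of the body operates on collector without extra nesting.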


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 230 at r1 (raw file):

		// TODO(zachlite): until the collector can stash tenant samples,
		// the collector will only return one sample at a time.
		// While this is the case,

bad wrapping


pkg/kv/kvserver/spanstats/span_stats.go line 13 at r1 (raw file):

package spanstats
//
//import (

what's going on here?


pkg/kv/kvserver/spanstats/spanstatspb/span_stats.proto line 12 at r1 (raw file):

syntax = "proto3";

is kvserver the right place for these protos? Should they be under or around spanstatscollector ?


pkg/kv/kvserver/spanstats/spanstatspb/span_stats.proto line 20 at r1 (raw file):

import "roachpb/data.proto";

message SpanStats {

comments all around please


pkg/sql/catalog/systemschema/system.go line 733 at r1 (raw file):

CREATE TABLE system.tenant_settings (
	tenant_id    INT8 NOT NULL,
	boundaries	 BYTES NOT NULL,

Does this really need to be a proto? It'd probably be much nicer if it were SQL readable (even if the keys would be bytes). I guess you need a couple of parallel arrays, since we don't have record fields. Maybe json is the ticket.

At the very least, say somewhere what proto goes in here.

Contributor

@ajwerner ajwerner left a comment


See #88477

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)

Contributor

@andreimatei andreimatei left a comment


One thing I was chatting with @abarganier about is whether we should have the stats data live in the Observability Service to begin with. By now the service is ready to ingest event data. We've also decided to embed the service into CRDB for the 23.1 release, to give people confidence that "it's launching", and I'll be working on that very soon. Maybe Alex can directly help to integrate this patch with the Obs Service.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)

a discussion (no related file):
I've gotten rid of the static table ID. As part of that effort, I realized I should not have installed a rangefeed watcher on every store. Now, the rangefeed is installed once per node in pkg/server, and communicates updates to each SpanStatsCollector via node.stores.VisitStores

I've still got some more commenting to do.


Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/jobs/jobspb/jobs.proto line 653 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

being empty, is this needed?

I think so. Details and Progress protos are required by the jobs.Record struct.


pkg/jobs/jobspb/jobs.proto line 658 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment

Done.


pkg/jobs/jobspb/jobs.proto line 1135 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

get rid of this

Done.


pkg/keyvisualizer/key_visualizer.go line 22 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

missing period at end of sentence.

Done.


pkg/keyvisualizer/key_visualizer.go line 46 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This seems to have a single implementation. Does it really need to be an interface?

Deleted.


pkg/keyvisualizer/key_visualizer.go line 61 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

It seems weird to me that watching for updates on whatever table is bundled with something called StatsCollector. It also makes this comment awkward, since SpanStatsCollector is supposed to be a generic interface. Have you considered pulling the watch logic out of this interface, and replacing it with an UpdateBoundaries()?

Done. Pulling out the watch logic was done as part of the refactor away from a static table ID.


pkg/keyvisualizer/key_visualizer.go line 65 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what's up with all these spaces? Have you run the formatter on this file?

Done.


pkg/keyvisualizer/key_visualizer.go line 74 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: drop the ctx name in all these function declarations

Hmm, what's the motivation for this nit? These methods need a ctx as a parameter.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 33 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment pls

These fields weren't used, so I removed them.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 49 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment pls
What are boundaries? What are samples?

Done.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment

Are the "ranges" here the same as the "boundaries" in SaveBoundaries? If so, let's change one of the names.

No, they're separate-ish. We ask KV for a tenant's ranges, and tell KV to collect statistics for arbitrary spans, which, for now, happen to be ranges.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 54 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this comment doesn't really say what this RPC does

Fixed.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 61 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Elsewhere the pattern is that there's a single RPC for getting data either from one node or from all. The request has a nodeID field, and not populating it means a fan-out. Consider that.

Nice! Done.

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say that this implements some btree interface, and add a variable declaration below also ensuring that.

Do you mean this?

var _ interval.Interface = &tenantStatsBucket{}

pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

tenantStatsBucket doesn't know anything about a tenant, so consider taking "tenant" out of the name

It doesn't have any tenant-specific data in it, but its purpose is to be a bucket for tenant statistics. I can take it out if you feel strongly about it.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 40 at r1 (raw file):

I find that weird, btw.
Happy to talk more about the lifecycle of a tenant's collector. In the current scheme, GetSamples terminates the current sample period for a tenant, and starts a new one with a fresh bTree. The fresh bTree is configured with the boundaries most recently sent from the tenant.

I'm now leaning towards "preferredBoundaries", or something that decouples boundaries from time. The collector should not care about the frequency of the tenant RPC'ing its preferred boundaries to KV.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 60 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment the false inline pls

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 62 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment on what it means for newTreeWithBoundaries to return an error. I don't think the error handling we do in GetSamples() makes sense. t.Insert() only returns errors if the type of the range is not good, right? Let's panic.

Right. Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 71 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

make(..., 0, t.tree.Len()) ?

Nice. Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 75 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/next/ok

In this instance, it.Next() only returns two values.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 102 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: var _ keyvisualizer.SpanStatsCollector = (*SpanStatsCollector)(nil)

I deleted the interface, since there's only one instance, but noted.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 149 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think you want to print err, not err.Error(), so that redaction can do its thing.
And put more text in this log pls.

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 157 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment all the literals inline with the parameter name

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 166 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this comment, comment the argument as /* onUpdate */

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 173 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if Errorf here warranted? If this can happen because another node goes down, consider Warningf.

Done. Also omitted .Error()


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 175 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider scratching this comment

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 190 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say "tenant not found" or something, and add a comment hinting to what the programmer was supposed to have done

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 196 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the name "save" is weird. updateBoundaries ?

This warrants discussion. I felt that updateBoundaries implies that boundaries are installed now, whereas saveBoundaries saves boundaries for later ¯\_(ツ)_/¯


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 212 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

either say more in the TODO, or scratch it. Also put it on a new line.

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 214 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you definitely want to hook up your editor to crlfmt :P

😅 TIL


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 217 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

invert the if and short-circuit if the tenant isn't found, so you can unindent the body. And make that error message a bit better.

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 230 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

bad wrapping

Done.


pkg/keys/constants.go line 470 at r1 (raw file):

Previously, ajwerner wrote…

please do not use a constant ID here. This is our very last one, and I don't think this is a good use case for it. In the table, set the ID to be zero and let it get dynamically allocated in the relevant migration please.

Done.


pkg/keys/constants.go line 470 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm pretty sure we can have dynamic IDs now for system tables. Let's not take 49 if we don't need to.

Done.


pkg/kv/kvserver/spanstats/span_stats.go line 13 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what's going on here?

Oops. Deleted.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/clusterversion/cockroach_versions.go line 273 at r4 (raw file):

	// RangefeedUseOneStreamPerNode changes rangefeed implementation to use 1 RPC stream per node.
	RangefeedUseOneStreamPerNode
	// NoNonMVCCAddSSTable adds a migration whizch waits for all

revert


pkg/jobs/jobspb/jobs.proto line 653 at r1 (raw file):

Previously, zachlite wrote…

I think so. Details and Progress protos are required by the jobs.Record struct.

Would it be a good idea to teach WrapPayloadDetails() to handle nils (i.e. turn it into also a nil)?
Alternatively, can we use gogo/protobuf/types.Empty ? And perhaps turn it to nil in WrapPayloadDetails().


pkg/keyvisualizer/key_visualizer.go line 74 at r1 (raw file):

Previously, zachlite wrote…

Hmm, what's the motivation for this nit? These methods need a ctx as a parameter.

I'm not saying drop the parameter; I'm only saying you don't need to give it a name.

FetchStats(context.Context) error

The standard name doesn't add anything to the signature.


pkg/keyvisualizer/keyvisjob/job.go line 32 at r4 (raw file):

// Initialize registers the resumer of the key visualizer job.
func Initialize() {

can this be an init()? Any reason why it needs to be called by the Manager?


pkg/keyvisualizer/keyvisjob/job.go line 41 at r4 (raw file):

}

// Resume implements the jobs.Resumer interface.

please comment on what the thing does


pkg/keyvisualizer/keyvisjob/job.go line 41 at r4 (raw file):

}

// Resume implements the jobs.Resumer interface.

please take the opportunity to comment in the Resumer interface what the contract for the return value is. What happens if nil is returned? Very confused.


pkg/keyvisualizer/keyvisjob/job.go line 55 at r4 (raw file):

	if !execCtx.ExecCfg().NodeInfo.LogicalClusterID().Equal(r.job.Payload().CreationClusterID) {
		// When restoring a cluster, we don't want to end up with two instances of
		// the singleton reconciliation job.

what's this "reconciliation" here and below? Is this copy-pasta?


pkg/keyvisualizer/keyvisjob/job.go line 74 at r4 (raw file):

	// there is a problem starting the span stats consumer, this job will
	// aggressively restart at the job system level with no backoff.
	if err := r.job.Update(ctx, nil, func(_ *kv.Txn, md jobs.JobMetadata, ju *jobs.JobUpdater) error {

nit: pls comment the nil inline


pkg/keyvisualizer/keyvisjob/job.go line 85 at r4 (raw file):

	// TODO(zachlite): use a retry
	if err := c.DecideBoundaries(ctx); err != nil {
		log.Warningf(ctx, "decide boundaries failed with...%v", err)

with...%v is weird, and also missing a space. Do failed: %v.
And also include that in the returned error:

err = errors.Wrapf(err, "decide boundaries failed")
log.Warning(ctx, err)
return err

Also, are you sure you want to both log and return? The error seems to be logged again here


pkg/keyvisualizer/keyvisjob/job.go line 102 at r4 (raw file):

// OnFailOrCancel implements the jobs.Resumer interface.
func (r *resumer) OnFailOrCancel(ctx context.Context, _ interface{}, _ error) error {
	if jobs.HasErrJobCanceled(errors.DecodeError(ctx, *r.job.Payload().FinalResumeError)) {

what is the difference between this FinalResumeError and the function error param?


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):

)

// Manager is the coordinator of the key visualizer subsystem.

I don't understand what the deal with the jobs is. What are we getting from the jobs infra?
IIUC, this manager spawns a new, short-lived job every few seconds. Each job collects the samples and terminates. Why do we need to create jobs? Besides the complex code involved, won't we spam the jobs table and the UI with all these jobs every few seconds?

If anything, I would understand if there was a long-running job that I could pause from the UI (assuming that's a thing), and perhaps see some info for free, like when the last samples were collected.


pkg/keyvisualizer/keyvismanager/manager.go line 66 at r4 (raw file):

// Start creates a background task that starts the key visualizer job.
func (m *Manager) Start(ctx context.Context) error {
	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager",

Suggestion:

	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager", m.run)

Code quote:

	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager",
		func(ctx context.Context) {
			m.run(ctx)
		})

pkg/keyvisualizer/keyvismanager/manager.go line 126 at r4 (raw file):

}

func (m *Manager) createAndStartJobIfNoneExists(ctx context.Context) (bool, error) {

please comment the bool retval


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

Previously, zachlite wrote…

No, they're separate-ish. We ask KV for a tenant's ranges, and tell KV to collect statistics for arbitrary spans, which for now, happen to be ranges.

oh... Hmm, I find this RPC pretty bizarre then. I think there's a plan to make crdb_internal.ranges_no_leases available to tenants. We can ask around if anyone's looking at that. If not, we can go a level lower and use a kvcoord.RangeIterator. That guy already works for tenants (it's how DistSender works, and also DistSQL planning). Although now that I look at it, it seems kinda slow for iterating over many ranges, since it calls into the DistSender for every Next(). I think in the DistSender we use the range cache, and I think we read more than one range descriptor from the meta range when the cache misses, but still.
@ajwerner I see you played with it here. Do you have thoughts?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 58 at r4 (raw file):

service KeyVisualizer {

  // GetTenantRanges gets the tenant's ranges from KV as a set of roachpb

bad wrap in the middle of the type name


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 65 at r4 (raw file):

  // SaveBoundaries tells KV about the key spans that the tenant wants
  // statistics for.
  rpc SaveBoundaries(SaveBoundariesRequest) returns (SaveBoundariesResponse) {}

consider UpdateBoundaries


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 68 at r4 (raw file):

  // GetSamples gets the tenant's collected statistics from KV. It
  // initiates a fan-out to all stores.

this comment lies, I think. Not all RPCs do a fan-out.


pkg/server/server.go line 142 at r4 (raw file):

	migrationServer *migrationServer

	// XXX: keyVisualizerServer

xxx


pkg/server/span_stats_server.go line 33 at r4 (raw file):

)

func getTenantRanges(

why extract this function out of its one caller?


pkg/server/span_stats_server.go line 65 at r4 (raw file):

		return nil
	})

nit: no new line here

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):
The motivation to use the jobs infra was for its consistency and coordination guarantees. Specifically, it's active or inactive while a tenant pod is active or inactive, respectively. Because it is set up as a singleton job, we can also guarantee that the key visualizer job will only fire once per tenant, regardless of the number of pods.

IIUC, this shouldn't be too spammy because the short 10-second sample period (job invocation period) that I chose was really just for development purposes. I imagine the period will be 10+ minutes in a production setting.

If anything, I would understand if there was a long-running job that I could pause from the UI (assuming that's a thing), and perhaps see some info for free, like when the last samples were collected.

Running with my assumption that a longer sample period will yield less job table spam, I felt that a forever-running job would just be more work to introduce, because I'd need to checkpoint the last sample time, etc.
What do you think?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

bizarre

How so? Do you mean the motivation for the SpanStatsConsumer to fetch tenant ranges from KV, or do you mean the implementation that fulfills the RPC?
The fulfillment is done with kvclient.ScanMetaKVs

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):

Because it is set up as a singleton job, we can also guarantee that the key visualizer job will only fire once per tenant, regardless of the number of pods.
...
I felt that a forever-running job would just be more work to introduce, because I'd need to checkpoint the last sample time, etc.

Don't these two contradict each other? If there are no long-running jobs, then what "singletons" are we talking about?
Here's what the code looks like to me; tell me what I'm missing:

  1. Each SQLServer gets a keyvismanager.Manager. So, every SQL pod has one.
  2. Each Manager periodically checks whether a running TypeKeyVisualizer job exists and, if it doesn't, creates one. The existence check is expected to fail every time, because jobs are short-lived.
  3. Once created, a job runs for a short time, and then terminates (because it returns from Resume()).

So, the more SQLServers a tenant has, the more jobs it runs. Is this right?

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):

Is this right?

You're right. I see my mistake. I had confused checking whether a job exists with actually running the job. As you pointed out, the existence check will run on all SQLServers and fail every time.

Ok. So, I will reimplement this as a forever running job.


pkg/keyvisualizer/keyvismanager/manager.go line 66 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager", m.run)

Done.


pkg/keyvisualizer/keyvismanager/manager.go line 126 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment the bool retval

Done.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 58 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

bad wrap in the middle of the type name

🙀 Done.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 68 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this comment lies, I think. Not all RPCs do a fan-out.

I had hoped that "initiates" conveys the distinction between a fan-out and a non-fanout. I'll make this more explicit.


pkg/server/server.go line 142 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

xxx

Done.


pkg/server/span_stats_server.go line 33 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why extract this function out of its one caller?

Fixed.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

Previously, zachlite wrote…

bizarre

How so? Do you mean the motivation for the SpanStatsConsumer to fetch tenant ranges from KV, or do you mean the implementation that fulfills the RPC?
The fulfillment is done with kvclient.ScanMetaKVs

I mean I doubt that the right thing to do is to add a method to this otherwise obscure service for getting info about a tenant's ranges. A tenant can already read the descriptors for its ranges with RangeLookup (that's what the RangeIterator uses under the hood). Introducing a similar RPC should not be done lightly, because of security concerns if nothing else. I think that if that API is good enough for crdb_internal.ranges, it should be good enough for our purposes. The responses to those requests are larger (full descriptors), but on the flip side the API is paginated (there's a limit). If size is a concern, we can consider optionally returning only some of the descriptor fields. But size does not seem to be a concern since, on the server side, the caller of ScanMetaKVs gets all the descriptors at once.

Also, pretty much the same RPC already exists on the status server. That guy checks for an admin user, because it's meant to be called by the CLI (debug zip) apparently. But I think we could figure out how to allow it for tenants. However, its response format seems a bit idiosyncratic, so I don't know if I'd encourage using it.

Depending on how motivated you're feeling, I think implementing crdb_internal.ranges_* would be a very beneficial thing to do; there are several issues open about it. I think I would do it with the RangeIterator, but I'd teach the RangeIterator to prefetch a lot of descriptors at once, and I'd teach it to not pollute the range cache with the results if there's memory pressure on the cache.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 46 at r5 (raw file):

message GetTenantRangesRequest {
  roachpb.TenantID tenant = 1;

I don't think we want fields like this, here or elsewhere. The tenant ID should come from the certificate. You need to do the certificate check anyway (and please do it in this PR), and then, having another TenantID field only makes me wonder if the input was properly sanitized at auth time.
I think we should hook up the auth stuff in this PR, even if other stuff around secondary tenants doesn't work.
But also, I want to check that nothing is too broken with this patch when there are secondary tenants - like for example the overlapping buckets that I was asking about before. Would it be a good idea to disable the jobs for the secondary tenants until they're ready?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 56 at r5 (raw file):

// boundary. A spanstatspb.Sample contains the statistics collected on
// behalf of a tenant during a collection period.
service KeyVisualizer {

I don't see this service whitelisted in the tenantAuthorizer. Do the RPCs work for tenants? Leave a TODO somewhere.


pkg/server/span_stats_server.go line 48 at r5 (raw file):

	tenantKeySpan := roachpb.Span{
		Key:    tenantPrefix,
		EndKey: tenantPrefix.PrefixEnd(),

This will be KeyMin,KeyMax for the system tenant, I think. Is that what you want?
If there are secondary tenants, will that result in stores having overlapping buckets?

@zachlite zachlite requested a review from a team January 6, 2023 17:04
@zachlite zachlite force-pushed the 220727.keyvis branch 3 times, most recently from 01d2c7c to 83ed00a Compare January 10, 2023 18:40
@zachlite zachlite requested a review from a team as a code owner January 10, 2023 21:12
@zachlite zachlite force-pushed the 220727.keyvis branch 2 times, most recently from 0c778ce to 3b29af8 Compare January 10, 2023 22:58
@zachlite
Contributor Author

bors r+

@zachlite zachlite changed the title introduction of the key visualizer, for internal use only. Introduction of the key visualizer Jan 11, 2023
@rail
Member

rail commented Jan 11, 2023

bors single on

@rail
Member

rail commented Jan 11, 2023

bors r-

@craig
Contributor

craig bot commented Jan 11, 2023

Canceled.

The Key Visualizer enables the visualization of historical KV traffic.
Its main objective is to enable CRDB engineers and operators to
quickly identify historical and current KV hotspots. When a hotspot is
identified, the user can immediately tell: a) Where in the keyspace the
hotspot exist(s|ed). b) The time when the hotspot appeared and
disappeared.

The Key Visualizer can be enabled via `SET CLUSTER SETTING
keyvisualizer.enabled = true;`

The visualization represents roachpb.RequestUnion.Requests destined for
a specific key-span on a log scale from black to red, representing
"cold" and "hot", respectively. The keyspace [Min,Max) is represented
on the Y-axis, and time is represented on the X-axis. By default, up to
2 weeks of historical data is persisted and visualized, with a sample
period of 1 minute.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-12182
Release note (ui change): The introduction of the Key Visualizer
makes it easy to identify historical hotspots.
Three new cluster settings are introduced:
keyvisualizer.enabled
keyvisualizer.sample_interval
keyvisualizer.max_buckets

keyvisualizer.enabled enables the key visualizer.
keyvisualizer.sample_interval controls the frequency at which
the key visualizer collects samples.
keyvisualizer.max_buckets controls the maximum number of buckets in
a sample.

passing jobs and backupccl tests

lint fixes

fix tests

test fixes

fix tests

fix tests and lints

fix tests

add key visualizer to advanced debug

tests and generated files

fix tests
@zachlite zachlite requested a review from a team as a code owner January 24, 2023 19:17
@zachlite
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jan 24, 2023

Build succeeded:

@craig craig bot merged commit 39a32c8 into cockroachdb:master Jan 24, 2023
abarganier added a commit to abarganier/cockroach that referenced this pull request May 1, 2023
This log statement was accidentally merged as part of
cockroachdb#88353.

This patch simply removes the log line.

Release note: none
craig bot pushed a commit that referenced this pull request May 1, 2023
102697: pkg/server/status: remove leftover debug log statement. r=ericharmeling a=abarganier

This log statement was accidentally merged as part of #88353.

This patch simply removes the log line.

Release note: none

Epic: CRDB-12182

Co-authored-by: Alex Barganier <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request May 1, 2023
This log statement was accidentally merged as part of
#88353.

This patch simply removes the log line.

Release note: none
blathers-crl bot pushed a commit that referenced this pull request May 1, 2023
This log statement was accidentally merged as part of
#88353.

This patch simply removes the log line.

Release note: none