
Introduction of the key visualizer #88353

Merged

merged 1 commit into cockroachdb:master from the 220727.keyvis branch on Jan 24, 2023

Conversation

zachlite
Contributor

@zachlite zachlite commented Sep 21, 2022

This PR introduces the Key Visualizer 🔍 🔥 🪳

(Screenshots: Key Visualizer UI, Sep 20, 2022 — 5:27 PM and 5:52 PM.)

The Key Visualizer is designed for multi-tenancy, but is currently only implemented for use by the system tenant. Enabling usage by secondary tenants may be considered in the future. Read more about current limitations below.

Usage and configuration

The Key Visualizer is disabled by default. It can be enabled via:

SET CLUSTER SETTING keyvisualizer.enabled = true;

The default sample period is 1 minute, and the sample retention period is 2 weeks.
The sample period can be configured via:

SET CLUSTER SETTING keyvisualizer.sample_interval = <SECONDS>;

The default sample period was kept short to allow faster debugging cycles; the default we eventually settle on is subject to feedback. If the default sample period stays short, we'll need to lower the retention period.
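For a sense of scale, the current defaults amount to 14 × 24 × 60 = 20,160 one-minute samples retained over the 2-week window, so a shorter sample period multiplies that figure accordingly.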

The default number of buckets (think sample resolution) is 256.
It can be raised to a maximum value of 1024.

SET CLUSTER SETTING keyvisualizer.max_buckets = <BUCKETS>;

Known issues, and improvements for 23.1

  • Downsampling strategy. The current downsampling strategy is implemented here. It's not bad, but it can be better. There's currently no guarantee of boundary stability between samples, and there are other heuristics to explore to prioritize preserving resolution in "important" parts of the keyspace.

  • UI improvements

    • Zoom improvements
    • Prevent overlapping of X and Y axis labels

@zachlite zachlite requested review from a team as code owners September 21, 2022 15:10
@zachlite zachlite requested a review from a team September 21, 2022 15:10
@zachlite zachlite requested a review from a team as a code owner September 21, 2022 15:10
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keys/constants.go line 470 at r1 (raw file):

	SpanConfigurationsTableID           = 47
	RoleIDSequenceID                    = 48
	SpanStatsTenantBoundariesTableID    = 49

please do not use a constant ID here. This is our very last one, and I don't think this is a good use case for it. In the table, set the ID to be zero and let it get dynamically allocated in the relevant migration please.

Contributor

@andreimatei andreimatei left a comment


I did a superficial look to get some bearings. Will do more later.

Reviewed 3 of 65 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)


-- commits line 2 at r1:
you need more in here :P


-- commits line 2 at r1:

for internal use only

Unless this means something very specific, scratch it.


-- commits line 2 at r1:
Can we get table names instead of IDs on the web page?


-- commits line 2 at r1:

The Key Visualizer is disabled by default. It can be enabled via:

Why is it disabled? When will it be enabled?


pkg/jobs/jobspb/jobs.proto line 653 at r1 (raw file):

message KeyVisualizerDetails {

being empty, is this needed?


pkg/jobs/jobspb/jobs.proto line 658 at r1 (raw file):

message KeyVisualizerProgress {
  util.hlc.Timestamp last_sample_time = 1 [(gogoproto.nullable) = false];

please comment


pkg/jobs/jobspb/jobs.proto line 1135 at r1 (raw file):

  roachpb.Version creation_cluster_version = 36 [(gogoproto.nullable) = false];

  // NEXT ID: 39.

get rid of this


pkg/keys/constants.go line 470 at r1 (raw file):

	SpanConfigurationsTableID           = 47
	RoleIDSequenceID                    = 48
	SpanStatsTenantBoundariesTableID    = 49

I'm pretty sure we can have dynamic IDs now for system tables. Let's not take 49 if we don't need to.


pkg/keyvisualizer/key_visualizer.go line 11 at r1 (raw file):

// licenses/APL.txt.

package keyvisualizer

do you want to put a big comment here or in a doc.go or link to something explaining the architecture of the thing?


pkg/keyvisualizer/key_visualizer.go line 22 at r1 (raw file):

)

// KVAccessor provides a tenant with access to the keyvispb.KeyVisualizerServer

missing period at end of sentence.


pkg/keyvisualizer/key_visualizer.go line 33 at r1 (raw file):

	GetTenantRanges(
		ctx context.Context,
		tenantID roachpb.TenantID) (*keyvispb.GetTenantRangesResponse, error)

nit: you've formatted all the others with the return values on a new line


pkg/keyvisualizer/key_visualizer.go line 44 at r1 (raw file):

// SpanStatsCollector collects request statistics for tenants.
// SpanStatsCollector will only collect statistics for a tenant after the

This comment is pretty awkward. Consider expanding on the relationship between SpanStatsCollector and KVAccessor.


pkg/keyvisualizer/key_visualizer.go line 46 at r1 (raw file):

// SpanStatsCollector will only collect statistics for a tenant after the
// tenant has called an implementation of KVAccessor.UpdateBoundaries.
type SpanStatsCollector interface {

This seems to have a single implementation. Does it really need to be an interface?


pkg/keyvisualizer/key_visualizer.go line 61 at r1 (raw file):

	// wishes to have request statistics collected for.
	// An error is returned if the watcher is not started successfully.
	Start(ctx context.Context, stopper *stop.Stopper) error

It seems weird to me that watching for updates on whatever table is bundled with something called StatsCollector. It also makes this comment awkward, since SpanStatsCollector is supposed to be a generic interface. Have you considered pulling the watch logic out of this interface, and replacing it with an UpdateBoundaries()?
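
For concreteness, a minimal sketch of the slimmer interface being suggested could look like this (method shapes and import paths are inferred from the files discussed in this review, not taken from the PR):

package keyvisualizer

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanstats/spanstatspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// SpanStatsCollector collects per-span request statistics. Watching the
// boundaries table is left to the caller, which pushes new boundaries in
// via UpdateBoundaries rather than the collector starting its own watcher.
type SpanStatsCollector interface {
	// UpdateBoundaries replaces the set of spans that statistics are
	// collected for.
	UpdateBoundaries(tenantID roachpb.TenantID, boundaries []roachpb.Span) error
	// GetSamples returns the statistics collected since the previous call.
	GetSamples(tenantID roachpb.TenantID) ([]spanstatspb.Sample, error)
}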


pkg/keyvisualizer/key_visualizer.go line 65 at r1 (raw file):

what's up with all these spaces? Have you run the formatter on this file?


pkg/keyvisualizer/key_visualizer.go line 74 at r1 (raw file):

	// When it receives the response, it will perform downsampling,
	// normalization, and ultimately write the sample to the tenant's system tables.
	FetchStats(ctx context.Context) error

nit: drop the ctx name in all these function declarations


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 33 at r1 (raw file):

message GetSamplesRequest {
  roachpb.TenantID tenant = 1;
  util.hlc.Timestamp start = 2 [(gogoproto.nullable) = false];

comment pls


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 49 at r1 (raw file):

}

service KeyVisualizer {

comment pls
What are boundaries? What are samples?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

service KeyVisualizer {

  rpc GetTenantRanges(GetTenantRangesRequest) returns

comment

Are the "ranges" here the same as the "boundaries" in SaveBoundaries? If so, let's change one of the names.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 54 at r1 (raw file):

    (GetTenantRangesResponse) {}

  // SaveBoundaries is issued by the tenant. The responding node is

this comment doesn't really say what this RPC does


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 61 at r1 (raw file):

  // GetSamplesFromAllNodes is issued by the tenant.
  rpc GetSamplesFromAllNodes(GetSamplesRequest) returns (GetSamplesResponse) {}

Elsewhere the pattern is that there's a single RPC for getting data either from one node or from all. The request has a nodeID field, and not populating it means a fan-out. Consider that.
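
As a rough illustration of that pattern on the serving side (hypothetical field and helper names — fanOutSamples and localSamples are stand-ins, not code from this PR):

// GetSamples serves both cases: a zero NodeID means "collect from every
// node and merge", while a concrete NodeID means "answer from that node only".
func (s *Server) GetSamples(
	ctx context.Context, req *keyvispb.GetSamplesRequest,
) (*keyvispb.GetSamplesResponse, error) {
	if req.NodeID == 0 {
		return s.fanOutSamples(ctx, req) // fan out to all nodes
	}
	return s.localSamples(ctx, req) // answer from the local node
}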


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

)

type tenantStatsBucket struct {

say that this implements some btree interface, and add a variable declaration below also ensuring that.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

)

type tenantStatsBucket struct {

tenantStatsBucket doesn't know anything about a tenant, so consider taking "tenant" out of the name


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 40 at r1 (raw file):

type tenantStatsCollector struct {
	stashedBoundaries []*roachpb.Span

I don't know what to make of the name "stashed". Consider scratching it. Or maybe nextBoundaries?
This needs a comment about the relationship with the tree's elements, and how GetSamples() updates one from the other. I find that weird, btw. Maybe we can find another scheme.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 60 at r1 (raw file):

			counter: 0,
		}
		err := t.Insert(&bucket, false)

comment the false inline pls
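
i.e. the usual convention of naming the literal inline (parameter name assumed here):

	err := t.Insert(&bucket, false /* fast */)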


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 62 at r1 (raw file):

		err := t.Insert(&bucket, false)
		if err != nil {
			return nil, err

please comment on what it means for newTreeWithBoundaries to return an error. I don't think the error handling we do in GetSamples() makes sense. t.Insert() only returns errors if the type of the range is not good, right? Let's panic.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 71 at r1 (raw file):

func (t *tenantStatsCollector) getStats() []*spanstatspb.SpanStats {

	stats := make([]*spanstatspb.SpanStats, 0)

make(..., 0, t.tree.Len()) ?


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 72 at r1 (raw file):

	stats := make([]*spanstatspb.SpanStats, 0)
	// TODO: acquire mutex lock?

well, what's the decision to make here?


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 75 at r1 (raw file):

	it := t.tree.Iterator()
	for {
		i, next := it.Next()

s/next/ok


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 102 at r1 (raw file):

	rff        *rangefeed.Factory
}

nit: var _ keyvisualizer.SpanStatsCollector = (*SpanStatsCollector)(nil)


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 149 at r1 (raw file):

		if err := protoutil.Unmarshal(boundariesEncoded, &decoded); err != nil {
			log.Warningf(ctx, err.Error())

I think you want to print err, not err.Error(), so that redaction can do its thing.
And put more text in this log pls.
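
For example (message text illustrative):

	log.Warningf(ctx, "failed to decode span stats boundaries: %v", err)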


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 157 at r1 (raw file):

	}

	w := rangefeedcache.NewWatcher(

please comment all the literals inline with the parameter name


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 166 at r1 (raw file):

		translateEvent,
		func(ctx context.Context, update rangefeedcache.Update) {
			// noop

instead of this comment, comment the argument as /* onUpdate */


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 173 at r1 (raw file):

	// Kick off the rangefeedcache which will retry until the stopper stops.
	if err := rangefeedcache.Start(ctx, stopper, w, func(err error) {
		log.Errorf(ctx, "span stats collector rangefeed error: %s", err.Error())

is Errorf here warranted? If this can happen because another node goes down, consider Warningf.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 175 at r1 (raw file):

		log.Errorf(ctx, "span stats collector rangefeed error: %s", err.Error())
	}); err != nil {
		return err // we're shutting down

consider scratching this comment


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 190 at r1 (raw file):

		collector.increment(span)
	} else {
		return errors.New("could not increment collector for tenant")

say "tenant not found" or something, and add a comment hinting to what the programmer was supposed to have done


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 196 at r1 (raw file):

}

func (s *SpanStatsCollector) saveBoundaries(id roachpb.TenantID,

I think the name "save" is weird. updateBoundaries ?


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 212 at r1 (raw file):

// It implicitly starts a new collection period by clearing the old
// statistics. A sample period is therefore defined by the interval between a
// tenant requesting samples. TODO(zachlite): this will change with

either say more in the TODO, or scratch it. Also put it on a new line.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 214 at r1 (raw file):

// tenant requesting samples. TODO(zachlite): this will change with
// improved fault tolerance mechanisms.
func (s *SpanStatsCollector) GetSamples(

you definitely want to hook up your editor to crlfmt :P


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 217 at r1 (raw file):

	id roachpb.TenantID) ([]spanstatspb.Sample, error) {

	if collector, ok := s.collectors[id]; ok {

invert the if and short-circuit if the tenant isn't found, so you can unindent the body. And make that error message a bit better.
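
i.e. something along these lines (error text illustrative):

	collector, ok := s.collectors[id]
	if !ok {
		return nil, errors.Newf("no span stats collector found for tenant %s", id)
	}
	// ... the rest of the body operates on collector without extra nesting.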


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 230 at r1 (raw file):

		// TODO(zachlite): until the collector can stash tenant samples,
		// the collector will only return one sample at a time.
		// While this is the case,

bad wrapping


pkg/kv/kvserver/spanstats/span_stats.go line 13 at r1 (raw file):

package spanstats
//
//import (

what's going on here?


pkg/kv/kvserver/spanstats/spanstatspb/span_stats.proto line 12 at r1 (raw file):

syntax = "proto3";

is kvserver the right place for these protos? Should they be under or around spanstatscollector ?


pkg/kv/kvserver/spanstats/spanstatspb/span_stats.proto line 20 at r1 (raw file):

import "roachpb/data.proto";

message SpanStats {

comments all around please


pkg/sql/catalog/systemschema/system.go line 733 at r1 (raw file):

CREATE TABLE system.tenant_settings (
	tenant_id    INT8 NOT NULL,
	boundaries	 BYTES NOT NULL,

Does this really need to be a proto? It'd probably be much nicer if it were SQL readable (even if the keys would be bytes). I guess you need a couple of parallel arrays, since we don't have record fields. Maybe json is the ticket.

At the very least, say somewhere what proto goes in here.

Contributor

@ajwerner ajwerner left a comment


See #88477

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)

Contributor

@andreimatei andreimatei left a comment


One thing I was chatting with @abarganier about is whether we should have the stats data live in the Observability Service to begin with. By now the service is ready to ingest event data. We've also decided to embed the service into CRDB for the 23.1 release, to give people confidence that "it's launching", and I'll be working on that very soon. Maybe Alex can directly help to integrate this patch with the Obs Service.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @koorosh, @Santamaura, and @zachlite)

a discussion (no related file):
I've gotten rid of the static table ID. As part of that effort, I realized I should not have installed a rangefeed watcher on every store. Now, the rangefeed is installed once per node in pkg/server, and communicates updates to each SpanStatsCollector via node.stores.VisitStores

I've still got some more commenting to do.


Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/jobs/jobspb/jobs.proto line 653 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

being empty, is this needed?

I think so. Details and Progress protos are required by the jobs.Record struct.


pkg/jobs/jobspb/jobs.proto line 658 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment

Done.


pkg/jobs/jobspb/jobs.proto line 1135 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

get rid of this

Done.


pkg/keyvisualizer/key_visualizer.go line 22 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

missing period at end of sentence.

Done.


pkg/keyvisualizer/key_visualizer.go line 46 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This seems to have a single implementation. Does it really need to be an interface?

Deleted.


pkg/keyvisualizer/key_visualizer.go line 61 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

It seems weird to me that watching for updates on whatever table is bundled with something called StatsCollector. It also makes this comment awkward, since SpanStatsCollector is supposed to be a generic interface. Have you considered pulling the watch logic out of this interface, and replacing it with an UpdateBoundaries()?

Done. Pulling out the watch logic was done as part of the refactor away from a static table ID.


pkg/keyvisualizer/key_visualizer.go line 65 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what's up with all these spaces? Have you run the formatter on this file?

Done.


pkg/keyvisualizer/key_visualizer.go line 74 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: drop the ctx name in all these function declarations

Hmm, what's the motivation for this nit? These methods need a ctx as a parameter.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 33 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment pls

These fields weren't used, so I removed them.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 49 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment pls
What are boundaries? What are samples?

Done.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment

Are the "ranges" here the same as the "boundaries" in SaveBoundaries? If so, let's change one of the names.

No, they're separate-ish. We ask KV for a tenant's ranges, and tell KV to collect statistics for arbitrary spans, which, for now, happen to be ranges.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 54 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this comment doesn't really say what this RPC does

Fixed.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 61 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Elsewhere the pattern is that there's a single RPC for getting data either from one node or from all. The request has a nodeID field, and not populating it means a fan-out. Consider that.

Nice! Done.

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say that this implements some btree interface, and add a variable declaration below also ensuring that.

Do you mean this?

var _ interval.Interface = &tenantStatsBucket{}

pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 23 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

tenantStatsBucket doesn't know anything about a tenant, so consider taking "tenant" out of the name

It doesn't have any tenant-specific data in it, but its purpose is to be a bucket for tenant statistics. I can take it out if you feel strongly about it.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 40 at r1 (raw file):

I find that weird, btw.
Happy to talk more about the lifecycle of a tenant's collector. In the current scheme, GetSamples terminates the current sample period for a tenant, and starts a new one with a fresh bTree. The fresh bTree is configured with the boundaries most recently sent from the tenant.

I'm now leaning towards "preferredBoundaries", or something that decouples boundaries from time. The collector should not care about the frequency of the tenant RPC'ing its preferred boundaries to KV.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 60 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment the false inline pls

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 62 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment on what it means for newTreeWithBoundaries to return an error. I don't think the error handling we do in GetSamples() makes sense. t.Insert() only returns errors if the type of the range is not good, right? Let's panic.

Right. Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 71 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

make(..., 0, t.tree.Len()) ?

Nice. Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 75 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/next/ok

In this instance, it.Next() only returns two values.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 102 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: var _ keyvisualizer.SpanStatsCollector = (*SpanStatsCollector)(nil)

I deleted the interface, since there's only one instance, but noted.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 149 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think you want to print err, not err.Error(), so that redaction can do its thing.
And put more text in this log pls.

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 157 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment all the literals inline with the parameter name

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 166 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this comment, comment the argument as /* onUpdate */

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 173 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if Errorf here warranted? If this can happen because another node goes down, consider Warningf.

Done. Also omitted .Error()


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 175 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider scratching this comment

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 190 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say "tenant not found" or something, and add a comment hinting to what the programmer was supposed to have done

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 196 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the name "save" is weird. updateBoundaries ?

This warrants discussion. I felt that updateBoundaries implies that boundaries are installed now, whereas saveBoundaries saves boundaries for later ¯\_(ツ)_/¯


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 212 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

either say more in the TODO, or scratch it. Also put it on a new line.

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 214 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you definitely want to hook up your editor to crlfmt :P

😅 TIL


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 217 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

invert the if and short-circuit if the tenant isn't found, so you can unindent the body. And make that error message a bit better.

Done.


pkg/keyvisualizer/spanstatscollector/span_stats_collector.go line 230 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

bad wrapping

Done.


pkg/keys/constants.go line 470 at r1 (raw file):

Previously, ajwerner wrote…

please do not use a constant ID here. This is our very last one, and I don't think this is a good use case for it. In the table, set the ID to be zero and let it get dynamically allocated in the relevant migration please.

Done.


pkg/keys/constants.go line 470 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm pretty sure we can have dynamic IDs now for system tables. Let's not take 49 if we don't need to.

Done.


pkg/kv/kvserver/spanstats/span_stats.go line 13 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what's going on here?

Oops. Deleted.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/clusterversion/cockroach_versions.go line 273 at r4 (raw file):

	// RangefeedUseOneStreamPerNode changes rangefeed implementation to use 1 RPC stream per node.
	RangefeedUseOneStreamPerNode
	// NoNonMVCCAddSSTable adds a migration whizch waits for all

revert


pkg/jobs/jobspb/jobs.proto line 653 at r1 (raw file):

Previously, zachlite wrote…

I think so. Details and Progress protos are required by the jobs.Record struct.

Would it be a good idea to teach WrapPayloadDetails() to handle nils (i.e. turn it into also a nil)?
Alternatively, can we use gogo/protobuf/types.Empty ? And perhaps turn it to nil in WrapPayloadDetails().


pkg/keyvisualizer/key_visualizer.go line 74 at r1 (raw file):

Previously, zachlite wrote…

Hmm, what's the motivation for this nit? These methods need a ctx as a parameter.

I'm not saying drop the parameter; I'm only saying you don't need to give it a name.

FetchStats(context.Context) error

The standard name doesn't add anything to the signature.


pkg/keyvisualizer/keyvisjob/job.go line 32 at r4 (raw file):

// Initialize registers the resumer of the key visualizer job.
func Initialize() {

can this be an init()? Any reason why it needs to be called by the Manager?


pkg/keyvisualizer/keyvisjob/job.go line 41 at r4 (raw file):

}

// Resume implements the jobs.Resumer interface.

please comment on what the thing does


pkg/keyvisualizer/keyvisjob/job.go line 41 at r4 (raw file):

}

// Resume implements the jobs.Resumer interface.

please take the opportunity to comment in the Resumer interface what the contract for the return value is. What happens if nil is returned? Very confused.


pkg/keyvisualizer/keyvisjob/job.go line 55 at r4 (raw file):

	if !execCtx.ExecCfg().NodeInfo.LogicalClusterID().Equal(r.job.Payload().CreationClusterID) {
		// When restoring a cluster, we don't want to end up with two instances of
		// the singleton reconciliation job.

what's this "reconciliation" here and below? Is this copy-pasta?


pkg/keyvisualizer/keyvisjob/job.go line 74 at r4 (raw file):

	// there is a problem starting the span stats consumer, this job will
	// aggressively restart at the job system level with no backoff.
	if err := r.job.Update(ctx, nil, func(_ *kv.Txn, md jobs.JobMetadata, ju *jobs.JobUpdater) error {

nit: pls comment the nil inline


pkg/keyvisualizer/keyvisjob/job.go line 85 at r4 (raw file):

	// TODO(zachlite): use a retry
	if err := c.DecideBoundaries(ctx); err != nil {
		log.Warningf(ctx, "decide boundaries failed with...%v", err)

with...%v is weird, and also missing a space. Do failed: %v.
And also include that in the returned error:

err = errors.Wrapf(err, "decide boundaries failed")
log.Warning(ctx, err)
return err

Also, are you sure you want to both log and return? The error seems to be logged again here


pkg/keyvisualizer/keyvisjob/job.go line 102 at r4 (raw file):

// OnFailOrCancel implements the jobs.Resumer interface.
func (r *resumer) OnFailOrCancel(ctx context.Context, _ interface{}, _ error) error {
	if jobs.HasErrJobCanceled(errors.DecodeError(ctx, *r.job.Payload().FinalResumeError)) {

what is the difference between this FinalResumeError and the function error param?


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):

)

// Manager is the coordinator of the key visualizer subsystem.

I don't understand what the deal with the jobs is. What are we getting from the jobs infra?
IIUC, this manager spawns a new, short-lived job every few seconds. Each job collects the samples and terminates. Why do we need to create jobs? Besides the complex code involved, won't we spam the jobs table and the UI with all these jobs every few seconds?

If anything, I would understand if there was a long-running job that I could pause from the UI (assuming that's a thing), and perhaps see some info for free, like when the last samples were collected.


pkg/keyvisualizer/keyvismanager/manager.go line 66 at r4 (raw file):

// Start creates a background task that starts the key visualizer job.
func (m *Manager) Start(ctx context.Context) error {
	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager",

Suggestion:

	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager", m.run)

Code quote:

	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager",
		func(ctx context.Context) {
			m.run(ctx)
		})

pkg/keyvisualizer/keyvismanager/manager.go line 126 at r4 (raw file):

}

func (m *Manager) createAndStartJobIfNoneExists(ctx context.Context) (bool, error) {

please comment the bool retval


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

Previously, zachlite wrote…

No, they're separate-ish. We ask KV for a tenant's ranges, and tell KV to collect statistics for arbitrary spans, which for now, happen to be ranges.

oh... Hmm, I find this RPC pretty bizarre then. I think there's a plan to make crdb_internal.ranges_no_leases available to tenants. We can ask around if anyone's looking at that. If not, we can go a level lower and use a kvcoord.RangeIterator. That guy already works for tenants (it's how DistSender works, and also DistSQL planning). Although now that I look at it, it seems kinda slow for iterating over many ranges, since it calls into the DistSender for every Next(). I think in the DistSender we use the range cache, and I think we read more than one range descriptor from the meta range when the cache misses, but still.
@ajwerner I see you played with it here. Do you have thoughts?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 58 at r4 (raw file):

service KeyVisualizer {

  // GetTenantRanges gets the tenant's ranges from KV as a set of roachpb

bad wrap in the middle of the type name


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 65 at r4 (raw file):

  // SaveBoundaries tells KV about the key spans that the tenant wants
  // statistics for.
  rpc SaveBoundaries(SaveBoundariesRequest) returns (SaveBoundariesResponse) {}

consider UpdateBoundaries


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 68 at r4 (raw file):

  // GetSamples gets the tenant's collected statistics from KV. It
  // initiates a fan-out to all stores.

this comment lies, I think. Not all RPCs do a fan-out.


pkg/server/server.go line 142 at r4 (raw file):

	migrationServer *migrationServer

	// XXX: keyVisualizerServer

xxx


pkg/server/span_stats_server.go line 33 at r4 (raw file):

)

func getTenantRanges(

why extract this function out of its one caller?


pkg/server/span_stats_server.go line 65 at r4 (raw file):

		return nil
	})

nit: no new line here

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):
The motivation to use the jobs infra was for its consistency and coordination guarantees. Specifically, it's active or inactive while a tenant pod is active or inactive, respectively. Because it is set up as a singleton job, we can also guarantee that the key visualizer job will only fire once per tenant, regardless of the number of pods.

IIUC, this shouldn't be too spammy because the short 10-second sample period (job invocation period) that I chose was really just for development purposes. I imagine the period will be 10+ minutes in a production setting.

If anything, I would understand if there was a long-running job that I could pause from the UI (assuming that's a thing), and perhaps see some info for free, like when the last samples were collected.

Running with my assumption that a longer sample period will yield less job table spam, I felt that a forever-running job would just be more work to introduce, because I'd need to checkpoint the last sample time, etc.
What do you think?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

bizarre

How so? Do you mean the motivation for the SpanStatsConsumer to fetch tenant ranges from KV, or do you mean the implementation that fulfills the RPC?
The fulfillment is done with kvclient.ScanMetaKVs

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):

Because it is set up as a singleton job, we can also guarantee that the key visualizer job will only fire once per tenant, regardless of the number of pods.
...
I felt that a forever-running job would just be more work to introduce, because I'd need to checkpoint the last sample time, etc.

Don't these two contradict each other? If there are no long-running jobs, then what "singletons" are we talking about?
Here's what the code looks like to me; tell me what I'm missing:

  1. Each SQLServer gets a keyvismanager.Manager. So, every SQL pod has one.
  2. Each Manager periodically checks whether a running TypeKeyVisualizer job exists and, if it doesn't, creates one. The existence check is expected to fail every time, because jobs are short-lived.
  3. Once created, a job runs for a short time, and then terminates (because it returns from Resume()).

So, the more SQLServers a tenant has, the more jobs it runs. Is this right?

Contributor Author

@zachlite zachlite left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvismanager/manager.go line 30 at r4 (raw file):

Is this right?

You're right. I see my mistake. I had confused checking whether a job exists with actually running the job. As you pointed out, the existence check will run on all SQLServers and fail every time.

Ok. So, I will reimplement this as a forever running job.


pkg/keyvisualizer/keyvismanager/manager.go line 66 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
	return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager", m.run)

Done.


pkg/keyvisualizer/keyvismanager/manager.go line 126 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment the bool retval

Done.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 58 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

bad wrap in the middle of the type name

🙀 Done.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 68 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this comment lies, I think. Not all RPCs do a fan-out.

I had hoped that "initiates" conveys the distinction between a fan-out and a non-fanout. I'll make this more explicit.


pkg/server/server.go line 142 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

xxx

Done.


pkg/server/span_stats_server.go line 33 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why extract this function out of its one caller?

Fixed.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 51 at r1 (raw file):

Previously, zachlite wrote…

bizarre

How so? Do you mean the motivation for the SpanStatsConsumer to fetch tenant ranges from KV, or do you mean the implementation that fulfills the RPC?
The fulfillment is done with kvclient.ScanMetaKVs

I mean I doubt that the right thing to do is to add a method to this otherwise obscure service for getting info about a tenant's ranges. A tenant can already read the descriptors for its ranges with RangeLookup (that's what the RangeIterator uses under the hood). Introducing a similar RPC should not be done lightly, because of security concerns if nothing else. I think that if that API is good enough for crdb_internal.ranges, it should be good enough for our purposes. The responses to those requests are larger (full descriptors), but on the flip side the API is paginated (there's a limit). If size is a concern, we can consider optionally returning only some of the descriptor fields. But size does not seem to be a concern since, on the server side, the caller of ScanMetaKVs gets all the descriptors at once.

Also, pretty much the same RPC already exists on the status server. That guy checks for an admin user, because it's meant to be called by the CLI (debug zip) apparently. But I think we could figure out how to allow it for tenants. However, its response format seems a bit idiosyncratic, so I don't know if I'd encourage using it.

Depending on how motivated you're feeling, I think implementing crdb_internal.ranges_* would be a very beneficial thing to do; there are several issues open about it. I think I would do it with the RangeIterator, but I'd teach the RangeIterator to prefetch a lot of descriptors at once, and I'd teach it to not pollute the range cache with the results if there's memory pressure on the cache.


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 46 at r5 (raw file):

message GetTenantRangesRequest {
  roachpb.TenantID tenant = 1;

I don't think we want fields like this, here or elsewhere. The tenant ID should come from the certificate. You need to do the certificate check anyway (and please do it in this PR), and then, having another TenantID field only makes me wonder if the input was properly sanitized at auth time.
I think we should hook up the auth stuff in this PR, even if other stuff around secondary tenants doesn't work.
But also, I want to check that nothing is too broken with this patch when there are secondary tenants - like for example the overlapping buckets that I was asking about before. Would it be a good idea to disable the jobs for the secondary tenants until they're ready?


pkg/keyvisualizer/keyvispb/key_visualizer.proto line 56 at r5 (raw file):

// boundary. A spanstatspb.Sample contains the statistics collected on
// behalf of a tenant during a collection period.
service KeyVisualizer {

I don't see this service whitelisted in the tenantAuthorizer. Do the RPCs work for tenants? Leave a TODO somewhere.


pkg/server/span_stats_server.go line 48 at r5 (raw file):

	tenantKeySpan := roachpb.Span{
		Key:    tenantPrefix,
		EndKey: tenantPrefix.PrefixEnd(),

This will be KeyMin,KeyMax for the system tenant, I think. Is that what you want?
If there are secondary tenants, will that result in stores having overlapping buckets?

@zachlite zachlite requested a review from a team January 6, 2023 17:04
@zachlite zachlite force-pushed the 220727.keyvis branch 3 times, most recently from 01d2c7c to 83ed00a Compare January 10, 2023 18:40
@zachlite zachlite requested a review from a team as a code owner January 10, 2023 21:12
@zachlite zachlite force-pushed the 220727.keyvis branch 2 times, most recently from 0c778ce to 3b29af8 Compare January 10, 2023 22:58
@zachlite
Contributor Author

bors r+

@zachlite zachlite changed the title introduction of the key visualizer, for internal use only. Introduction of the key visualizer Jan 11, 2023
@rail
Member

rail commented Jan 11, 2023

bors single on

@rail
Member

rail commented Jan 11, 2023

bors r-

@craig
Contributor

craig bot commented Jan 11, 2023

Canceled.

The Key Visualizer enables the visualization of historical KV traffic.
Its main objective is to enable CRDB engineers and operators to
quickly identify historical and current KV hotspots. When a hotspot is
identified, the user can immediately tell: a) Where in the keyspace the
hotspot exist(s|ed). b) The time when the hotspot appeared and
disappeared.

The Key Visualizer can be enabled via `SET CLUSTER SETTING
keyvisualizer.enabled = true;`

The visualization represents roachpb.RequestUnion.Requests destined for
a specific key-span on a log scale from black to red, representing
"cold" and "hot", respectively. The keyspace [Min,Max) is represented
on the Y-axis, and time is represented on the X-axis. By default, up to
2 weeks of historical data is persisted and visualized, with a sample
period of 1 minute.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-12182
Release note (ui change): The introduction of the Key Visualizer
makes it easy to identify historical hotspots.
Three new cluster settings are introduced:
keyvisualizer.enabled
keyvisualizer.sample_interval
keyvisualizer.max_buckets

keyvisualizer.enabled enables the key visualizer.
keyvisualizer.sample_interval controls the frequency at which
the key visualizer collects samples.
keyvisualizer.max_buckets controls the maximum number of buckets in
a sample.

passing jobs and backupccl tests

lint fixes

fix tests

test fixes

fix tests

fix tests and lints

fix tests

add key visualizer to advanced debug

tests and generated files

fix tests
@zachlite zachlite requested a review from a team as a code owner January 24, 2023 19:17
@zachlite
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jan 24, 2023

Build succeeded:

@craig craig bot merged commit 39a32c8 into cockroachdb:master Jan 24, 2023
abarganier added a commit to abarganier/cockroach that referenced this pull request May 1, 2023
This log statement was accidentally merged as part of
cockroachdb#88353.

This patch simply removes the log line.

Release note: none
craig bot pushed a commit that referenced this pull request May 1, 2023
102697: pkg/server/status: remove leftover debug log statement. r=ericharmeling a=abarganier

This log statement was accidentally merged as part of #88353.

This patch simply removes the log line.

Release note: none

Epic: CRDB-12182

Co-authored-by: Alex Barganier <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request May 1, 2023
This log statement was accidentally merged as part of
#88353.

This patch simply removes the log line.

Release note: none
blathers-crl bot pushed a commit that referenced this pull request May 1, 2023
This log statement was accidentally merged as part of
#88353.

This patch simply removes the log line.

Release note: none