Introduction of the key visualizer #88353
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)
pkg/keys/constants.go
line 470 at r1 (raw file):
SpanConfigurationsTableID = 47
RoleIDSequenceID = 48
SpanStatsTenantBoundariesTableID = 49
please do not use a constant ID here. This is our very last one, and I don't think this is a good use case for it. In the table, set the ID to be zero and let it get dynamically allocated in the relevant migration please.
I did a superficial look to get some bearings. Will do more later.
Reviewed 3 of 65 files at r1.
-- commits
line 2 at r1:
you need more in here :P
for internal use only
Unless this means something very specific, scratch it.
-- commits
line 2 at r1:
Can we get table names instead of IDs on the web page?
The Key Visualizer is disabled by default. It can be enabled via:
Why is it disabled? When will it be enabled?
pkg/jobs/jobspb/jobs.proto
line 653 at r1 (raw file):
message KeyVisualizerDetails {
being empty, is this needed?
pkg/jobs/jobspb/jobs.proto
line 658 at r1 (raw file):
message KeyVisualizerProgress {
  util.hlc.Timestamp last_sample_time = 1 [(gogoproto.nullable) = false];
please comment
pkg/jobs/jobspb/jobs.proto
line 1135 at r1 (raw file):
roachpb.Version creation_cluster_version = 36 [(gogoproto.nullable) = false]; // NEXT ID: 39.
get rid of this
pkg/keys/constants.go
line 470 at r1 (raw file):
SpanConfigurationsTableID = 47
RoleIDSequenceID = 48
SpanStatsTenantBoundariesTableID = 49
I'm pretty sure we can have dynamic IDs now for system tables. Let's not take 49 if we don't need to.
pkg/keyvisualizer/key_visualizer.go
line 11 at r1 (raw file):
// licenses/APL.txt.
package keyvisualizer
do you want to put a big comment here or in a doc.go or link to something explaining the architecture of the thing?
pkg/keyvisualizer/key_visualizer.go
line 22 at r1 (raw file):
)
// KVAccessor provides a tenant with access to the keyvispb.KeyVisualizerServer
missing period at end of sentence.
pkg/keyvisualizer/key_visualizer.go
line 33 at r1 (raw file):
GetTenantRanges( ctx context.Context, tenantID roachpb.TenantID) (*keyvispb.GetTenantRangesResponse, error)
nit: you've formatted all the others with the return values on a new line
pkg/keyvisualizer/key_visualizer.go
line 44 at r1 (raw file):
// SpanStatsCollector collects request statistics for tenants.
// SpanStatsCollector will only collect statistics for a tenant after the
This comment is pretty awkward. Consider expanding on the relationship between SpanStatsCollector and KVAccessor.
pkg/keyvisualizer/key_visualizer.go
line 46 at r1 (raw file):
// SpanStatsCollector will only collect statistics for a tenant after the
// tenant has called an implementation of KVAccessor.UpdateBoundaries.
type SpanStatsCollector interface {
This seems to have a single implementation. Does it really need to be an interface?
pkg/keyvisualizer/key_visualizer.go
line 61 at r1 (raw file):
// wishes to have request statistics collected for.
// An error is returned if the watcher is not started successfully.
Start(ctx context.Context, stopper *stop.Stopper) error
It seems weird to me that watching for updates on whatever table is bundled with something called StatsCollector. It also makes this comment awkward, since SpanStatsCollector is supposed to be a generic interface. Have you considered pulling the watch logic out of this interface, and replacing it with an UpdateBoundaries()?
pkg/keyvisualizer/key_visualizer.go
line 65 at r1 (raw file):
what's up with all these spaces? Have you run the formatter on this file?
pkg/keyvisualizer/key_visualizer.go
line 74 at r1 (raw file):
// When it receives the response, it will perform downsampling,
// normalization, and ultimately write the sample to the tenant's system tables.
FetchStats(ctx context.Context) error
nit: drop the ctx name in all these function declarations
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 33 at r1 (raw file):
message GetSamplesRequest {
  roachpb.TenantID tenant = 1;
  util.hlc.Timestamp start = 2 [(gogoproto.nullable) = false];
comment pls
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 49 at r1 (raw file):
}
service KeyVisualizer {
comment pls
What are boundaries? What are samples?
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 51 at r1 (raw file):
service KeyVisualizer {
  rpc GetTenantRanges(GetTenantRangesRequest) returns
comment
Are the "ranges" here the same as the "boundaries" in SaveBoundaries? If so, let's change one of the names.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 54 at r1 (raw file):
(GetTenantRangesResponse) {}
// SaveBoundaries is issued by the tenant. The responding node is
this comment doesn't really say what this RPC does
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 61 at r1 (raw file):
// GetSamplesFromAllNodes is issued by the tenant.
rpc GetSamplesFromAllNodes(GetSamplesRequest) returns (GetSamplesResponse) {}
Elsewhere the pattern is that there's a single RPC for getting data either from one node or from all. The request has a nodeID field, and not populating it means a fan-out. Consider that.
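To make that pattern concrete, here is a minimal, self-contained Go sketch of the dispatch logic; the GetSamplesRequest/NodeID names are illustrative stand-ins, not the actual keyvispb definitions:

```go
package main

import "fmt"

// GetSamplesRequest sketches the suggested shape: a single RPC whose
// NodeID field selects one node, with zero meaning "fan out to all".
type GetSamplesRequest struct {
	NodeID int32 // 0 => fan out to every node
}

// getSamples returns the set of node IDs the request should be sent to.
func getSamples(req GetSamplesRequest, nodes []int32) []int32 {
	if req.NodeID != 0 {
		// Populated NodeID: a single-node request.
		return []int32{req.NodeID}
	}
	// Unset NodeID: query every node in the cluster.
	out := make([]int32, 0, len(nodes))
	out = append(out, nodes...)
	return out
}

func main() {
	nodes := []int32{1, 2, 3}
	fmt.Println(getSamples(GetSamplesRequest{NodeID: 2}, nodes)) // [2]
	fmt.Println(getSamples(GetSamplesRequest{}, nodes))          // [1 2 3]
}
```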
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 23 at r1 (raw file):
)
type tenantStatsBucket struct {
say that this implements some btree interface, and add a variable declaration below also ensuring that.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 23 at r1 (raw file):
)
type tenantStatsBucket struct {
tenantStatsBucket doesn't know anything about a tenant, so consider taking "tenant" out of the name
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 40 at r1 (raw file):
type tenantStatsCollector struct {
  stashedBoundaries []*roachpb.Span
I don't know what to make of the name "stashed". Consider scratching it. Or maybe nextBoundaries?
This needs a comment about the relationship with the tree's elements, and how GetSamples() updates one from the other. I find that weird, btw. Maybe we can find another scheme.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 60 at r1 (raw file):
counter: 0,
}
err := t.Insert(&bucket, false)
comment the false inline pls
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 62 at r1 (raw file):
err := t.Insert(&bucket, false)
if err != nil {
  return nil, err
please comment on what it means for newTreeWithBoundaries to return an error. I don't think the error handling we do in GetSamples() makes sense. t.Insert() only returns errors if the type of the range is not good, right? Let's panic.
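As a self-contained sketch of the "let's panic" suggestion: when an error can only arise from a programming mistake (here, a malformed span), surface it as a panic instead of threading it through the caller. insert/mustInsert are hypothetical stand-ins for t.Insert, not the actual interval-tree API:

```go
package main

import "fmt"

// insert is a hypothetical stand-in for t.Insert: it only fails on
// input that a correct program would never produce.
func insert(span string) error {
	if span == "" {
		return fmt.Errorf("malformed span")
	}
	return nil
}

// mustInsert converts an "impossible" error into a panic, so callers
// like GetSamples don't have to handle an error that can't happen
// outside of a bug.
func mustInsert(span string) {
	if err := insert(span); err != nil {
		panic(fmt.Sprintf("programming error inserting %q: %v", span, err))
	}
}

func main() {
	mustInsert("a-b") // well-formed span: no panic
	fmt.Println("inserted")
}
```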
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 71 at r1 (raw file):
func (t *tenantStatsCollector) getStats() []*spanstatspb.SpanStats {
stats := make([]*spanstatspb.SpanStats, 0)
make(..., 0, t.tree.Len())?
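The point of the suggestion, sketched in isolated form (plain ints stand in for *spanstatspb.SpanStats): when the final length is known up front, passing it as the capacity makes the backing array allocate once instead of growing during append.

```go
package main

import "fmt"

// collect sketches make(..., 0, n): one allocation, no growth during
// the append loop, because the capacity matches the known final size.
func collect(n int) []int {
	stats := make([]int, 0, n)
	for i := 0; i < n; i++ {
		stats = append(stats, i)
	}
	return stats
}

func main() {
	s := collect(4)
	fmt.Println(len(s), cap(s)) // 4 4
}
```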
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 72 at r1 (raw file):
stats := make([]*spanstatspb.SpanStats, 0)
// TODO: acquire mutex lock?
well, what's the decision to make here?
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 75 at r1 (raw file):
it := t.tree.Iterator()
for {
  i, next := it.Next()
s/next/ok
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 102 at r1 (raw file):
rff *rangefeed.Factory
}
nit: var _ keyvisualizer.SpanStatsCollector = (*SpanStatsCollector)(nil)
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 149 at r1 (raw file):
if err := protoutil.Unmarshal(boundariesEncoded, &decoded); err != nil {
  log.Warningf(ctx, err.Error())
I think you want to print err, not err.Error(), so that redaction can do its thing. And put more text in this log pls.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 157 at r1 (raw file):
}
w := rangefeedcache.NewWatcher(
please comment all the literals inline with the parameter name
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 166 at r1 (raw file):
translateEvent,
func(ctx context.Context, update rangefeedcache.Update) {
  // noop
instead of this comment, comment the argument as /* onUpdate */
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 173 at r1 (raw file):
// Kick off the rangefeedcache which will retry until the stopper stops.
if err := rangefeedcache.Start(ctx, stopper, w, func(err error) {
  log.Errorf(ctx, "span stats collector rangefeed error: %s", err.Error())
Is Errorf here warranted? If this can happen because another node goes down, consider Warningf.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 175 at r1 (raw file):
log.Errorf(ctx, "span stats collector rangefeed error: %s", err.Error())
}); err != nil {
  return err // we're shutting down
consider scratching this comment
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 190 at r1 (raw file):
collector.increment(span)
} else {
  return errors.New("could not increment collector for tenant")
say "tenant not found" or something, and add a comment hinting to what the programmer was supposed to have done
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 196 at r1 (raw file):
}
func (s *SpanStatsCollector) saveBoundaries(id roachpb.TenantID,
I think the name "save" is weird. updateBoundaries?
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 212 at r1 (raw file):
// It implicitly starts a new collection period by clearing the old
// statistics. A sample period is therefore defined by the interval between a
// tenant requesting samples. TODO(zachlite): this will change with
either say more in the TODO, or scratch it. Also put it on a new line.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 214 at r1 (raw file):
// tenant requesting samples. TODO(zachlite): this will change with
// improved fault tolerance mechanisms.
func (s *SpanStatsCollector) GetSamples(
you definitely want to hook up your editor to crlfmt :P
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 217 at r1 (raw file):
id roachpb.TenantID) ([]spanstatspb.Sample, error) {
if collector, ok := s.collectors[id]; ok {
invert the if and short-circuit if the tenant isn't found, so you can unindent the body. And make that error message a bit better.
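The inversion being asked for, as a self-contained sketch (collectors and the string-keyed map are illustrative stand-ins for s.collectors): return early on the missing tenant so the main body stays unindented.

```go
package main

import (
	"errors"
	"fmt"
)

// collectors is a hypothetical stand-in for s.collectors.
var collectors = map[string][]int{"tenant-1": {1, 2, 3}}

// getSamples short-circuits when the tenant has no collector, so the
// success path is not nested inside an if.
func getSamples(id string) ([]int, error) {
	collector, ok := collectors[id]
	if !ok {
		return nil, errors.New("no stats collector found for tenant " + id)
	}
	return collector, nil
}

func main() {
	samples, err := getSamples("tenant-1")
	fmt.Println(samples, err)
}
```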
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 230 at r1 (raw file):
// TODO(zachlite): until the collector can stash tenant samples,
// the collector will only return one sample at a time.
// While this is the case,
bad wrapping
pkg/kv/kvserver/spanstats/span_stats.go
line 13 at r1 (raw file):
package spanstats
//
//import (
what's going on here?
pkg/kv/kvserver/spanstats/spanstatspb/span_stats.proto
line 12 at r1 (raw file):
syntax = "proto3";
is kvserver
the right place for these protos? Should they be under or around spanstatscollector
?
pkg/kv/kvserver/spanstats/spanstatspb/span_stats.proto
line 20 at r1 (raw file):
import "roachpb/data.proto";
message SpanStats {
comments all around please
pkg/sql/catalog/systemschema/system.go
line 733 at r1 (raw file):
CREATE TABLE system.tenant_settings (
  tenant_id INT8 NOT NULL,
  boundaries BYTES NOT NULL,
Does this really need to be a proto? It'd probably be much nicer if it were SQL readable (even if the keys would be bytes). I guess you need a couple of parallel arrays, since we don't have record fields. Maybe json is the ticket.
At the very least, say somewhere what proto goes in here.
See #88477
One thing I was chatting with @abarganier about is whether we should have the stats data live in the Observability Service to begin with. By now the service is ready to ingest event data. We've also decided to embed the service into CRDB for the 23.1 release, to give people confidence that "it's launching", and I'll be working on that very soon. Maybe Alex can directly help to integrate this patch with the Obs Service.
Force-pushed from 9be5015 to 68a7d8b.
a discussion (no related file):
I've gotten rid of the static table ID. As part of that effort, I realized I should not have installed a rangefeed watcher on every store. Now, the rangefeed is installed once per node in pkg/server, and communicates updates to each SpanStatsCollector via node.stores.VisitStores.
I've still got some more commenting to do.
Force-pushed from 68a7d8b to 8a8ab80.
pkg/jobs/jobspb/jobs.proto
line 653 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
being empty, is this needed?
I think so. Details and Progress protos are required by the jobs.Record struct.
pkg/jobs/jobspb/jobs.proto
line 658 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please comment
Done.
pkg/jobs/jobspb/jobs.proto
line 1135 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
get rid of this
Done.
pkg/keyvisualizer/key_visualizer.go
line 22 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
missing period at end of sentence.
Done.
pkg/keyvisualizer/key_visualizer.go
line 46 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This seems to have a single implementation. Does it really need to be an interface?
Deleted.
pkg/keyvisualizer/key_visualizer.go
line 61 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
It seems weird to me that watching for updates on whatever table is bundled with something called StatsCollector. It also makes this comment awkward, since SpanStatsCollector is supposed to be a generic interface. Have you considered pulling the watch logic out of this interface, and replacing it with an UpdateBoundaries()?
Done. Pulling out the watch logic was done as part of the refactor away from a static table ID.
pkg/keyvisualizer/key_visualizer.go
line 65 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
what's up with all these spaces? Have you run the formatter on this file?
Done.
pkg/keyvisualizer/key_visualizer.go
line 74 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: drop the ctx name in all these function declarations
Hmm, what's the motivation for this nit? These methods need a ctx as a parameter.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 33 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
comment pls
These fields weren't used, so I removed them.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 49 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
comment pls
What are boundaries? What are samples?
Done.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 51 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
comment
Are the "ranges" here the same as the "boundaries" in SaveBoundaries? If so, let's change one of the names.
No, they're separate-ish. We ask KV for a tenant's ranges, and tell KV to collect statistics for arbitrary spans, which for now, happen to be ranges.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 54 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this comment doesn't really say what this RPC does
Fixed.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 61 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Elsewhere the pattern is that there's a single RPC for getting data either from one node or from all. The request has a nodeID field, and not populating it means a fan-out. Consider that.
Nice! Done.
Force-pushed from 8a8ab80 to d5306b0.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 23 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
say that this implements some btree interface, and add a variable declaration below also ensuring that.
Do you mean this?
var _ interval.Interface = &tenantStatsBucket{}
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 23 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
tenantStatsBucket doesn't know anything about a tenant, so consider taking "tenant" out of the name
It doesn't have any tenant-specific data in it, but its purpose is to be a bucket for tenant statistics. I can take it out if you feel strongly about it.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 40 at r1 (raw file):
I find that weird, btw.
Happy to talk more about the lifecycle of a tenant's collector. In the current scheme, GetSamples terminates the current sample period for a tenant, and starts a new one with a fresh bTree. The fresh bTree is configured with the boundaries most recently sent from the tenant.
I'm now leaning towards "preferredBoundaries", or something that decouples boundaries from time. The collector should not care about the frequency of the tenant RPC'ing its preferred boundaries to KV.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 60 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
comment the false inline pls
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 62 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please comment on what it means for newTreeWithBoundaries to return an error. I don't think the error handling we do in GetSamples() makes sense. t.Insert() only returns errors if the type of the range is not good, right? Let's panic.
Right. Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 71 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
make(..., 0, t.tree.Len())?
Nice. Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 75 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/next/ok
In this instance, it.Next() only returns two values.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 102 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: var _ keyvisualizer.SpanStatsCollector = (*SpanStatsCollector)(nil)
I deleted the interface, since there's only one instance, but noted.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 149 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think you want to print err, not err.Error(), so that redaction can do its thing. And put more text in this log pls.
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 157 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please comment all the literals inline with the parameter name
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 166 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of this comment, comment the argument as /* onUpdate */
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 173 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Is Errorf here warranted? If this can happen because another node goes down, consider Warningf.
Done. Also omitted .Error().
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 175 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider scratching this comment
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 190 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
say "tenant not found" or something, and add a comment hinting to what the programmer was supposed to have done
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 196 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the name "save" is weird. updateBoundaries?
This warrants discussion. I felt that updateBoundaries implies that boundaries are installed now, whereas saveBoundaries saves boundaries for later ¯\_(ツ)_/¯
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 212 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
either say more in the TODO, or scratch it. Also put it on a new line.
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 214 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you definitely want to hook up your editor to crlfmt :P
😅 TIL
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 217 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
invert the if and short-circuit if the tenant isn't found, so you can unindent the body. And make that error message a bit better.
Done.
pkg/keyvisualizer/spanstatscollector/span_stats_collector.go
line 230 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
bad wrapping
Done.
pkg/keys/constants.go
line 470 at r1 (raw file):
Previously, ajwerner wrote…
please do not use a constant ID here. This is our very last one, and I don't think this is a good use case for it. In the table, set the ID to be zero and let it get dynamically allocated in the relevant migration please.
Done.
pkg/keys/constants.go
line 470 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm pretty sure we can have dynamic IDs now for system tables. Let's not take 49 if we don't need to.
Done.
pkg/kv/kvserver/spanstats/span_stats.go
line 13 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
what's going on here?
Oops. Deleted.
pkg/clusterversion/cockroach_versions.go
line 273 at r4 (raw file):
// RangefeedUseOneStreamPerNode changes rangefeed implementation to use 1 RPC stream per node.
RangefeedUseOneStreamPerNode
// NoNonMVCCAddSSTable adds a migration whizch waits for all
revert
pkg/jobs/jobspb/jobs.proto
line 653 at r1 (raw file):
Previously, zachlite wrote…
I think so. Details and Progress protos are required by the jobs.Record struct.
Would it be a good idea to teach WrapPayloadDetails() to handle nils (i.e. turn it into also a nil)?
Alternatively, can we use gogo/protobuf/types.Empty? And perhaps turn it to nil in WrapPayloadDetails().
pkg/keyvisualizer/key_visualizer.go
line 74 at r1 (raw file):
Previously, zachlite wrote…
Hmm, what's the motivation for this nit? These methods need a ctx as a parameter.
I'm not saying drop the parameter; I'm only saying you don't need to give it a name.
FetchStats(context.Context) error
The standard name doesn't add anything to the signature.
pkg/keyvisualizer/keyvisjob/job.go
line 32 at r4 (raw file):
// Initialize registers the resumer of the key visualizer job.
func Initialize() {
can this be an init()? Any reason why it needs to be called by the Manager?
pkg/keyvisualizer/keyvisjob/job.go
line 41 at r4 (raw file):
}
// Resume implements the jobs.Resumer interface.
please comment on what the thing does
pkg/keyvisualizer/keyvisjob/job.go
line 41 at r4 (raw file):
}
// Resume implements the jobs.Resumer interface.
please take the opportunity to comment in the Resumer interface what the contract for the return value is. What happens if nil is returned? Very confused.
pkg/keyvisualizer/keyvisjob/job.go
line 55 at r4 (raw file):
if !execCtx.ExecCfg().NodeInfo.LogicalClusterID().Equal(r.job.Payload().CreationClusterID) {
  // When restoring a cluster, we don't want to end up with two instances of
  // the singleton reconciliation job.
what's this "reconciliation" here and below? Is this copy-pasta?
pkg/keyvisualizer/keyvisjob/job.go
line 74 at r4 (raw file):
// there is a problem starting the span stats consumer, this job will
// aggressively restart at the job system level with no backoff.
if err := r.job.Update(ctx, nil, func(_ *kv.Txn, md jobs.JobMetadata, ju *jobs.JobUpdater) error {
nit: pls comment the nil inline
pkg/keyvisualizer/keyvisjob/job.go
line 85 at r4 (raw file):
// TODO(zachlite): use a retry
if err := c.DecideBoundaries(ctx); err != nil {
  log.Warningf(ctx, "decide boundaries failed with...%v", err)
with...%v is weird, and also missing a space. Do failed: %v.
And also include that in the returned error:
err = errors.Wrapf(err, "decide boundaries failed")
log.Warning(ctx, err)
return err
Also, are you sure you want to both log and return? The error seems to be logged again here
pkg/keyvisualizer/keyvisjob/job.go
line 102 at r4 (raw file):
// OnFailOrCancel implements the jobs.Resumer interface.
func (r *resumer) OnFailOrCancel(ctx context.Context, _ interface{}, _ error) error {
  if jobs.HasErrJobCanceled(errors.DecodeError(ctx, *r.job.Payload().FinalResumeError)) {
what is the difference between this FinalResumeError and the function error param?
pkg/keyvisualizer/keyvismanager/manager.go
line 30 at r4 (raw file):
)
// Manager is the coordinator of the key visualizer subsystem.
I don't understand what the deals with the jobs is. What are we getting from the jobs infra?
IIUC, this manager spawns a new, short-lived jobs every few seconds. Each job collects the samples and terminates. Why do we need to create jobs? Besides the complex code involved, won't we spam the jobs table and the UI with all these jobs every few seconds?
If anything, I would understand if there was a long-running job that I could pause from the UI (assuming that's a thing), and perhaps see some info for free, like when the last samples were collected.
pkg/keyvisualizer/keyvismanager/manager.go
line 66 at r4 (raw file):
// Start creates a background task that starts the key visualizer job.
func (m *Manager) Start(ctx context.Context) error {
  return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager",
return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager", m.run)
Code quote:
return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager",
func(ctx context.Context) {
m.run(ctx)
})
pkg/keyvisualizer/keyvismanager/manager.go
line 126 at r4 (raw file):
}
func (m *Manager) createAndStartJobIfNoneExists(ctx context.Context) (bool, error) {
please comment the bool retval
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 51 at r1 (raw file):
Previously, zachlite wrote…
No, they're separate-ish. We ask KV for a tenant's ranges, and tell KV to collect statistics for arbitrary spans, which for now, happen to be ranges.
oh... Hmm, I find this RPC pretty bizarre then. I think there's a plan to make crdb_internal.ranges_no_leases available to tenants. We can ask around if anyone's looking at that. If not, we can go a level lower and use a kvcoord.RangeIterator. That guy already works for tenants (it's how DistSender works, and also DistSQL planning). Although now that I look at it, it seems kinda slow for iterating over many ranges, since it calls into the DistSender for every Next(). I think in the DistSender we use the range cache, and I think we read more than one range descriptor from the meta range when the cache misses, but still.
@ajwerner I see you played with it here. Do you have thoughts?
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 58 at r4 (raw file):
service KeyVisualizer {
  // GetTenantRanges gets the tenant's ranges from KV as a set of roachpb
bad wrap in the middle of the type name
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 65 at r4 (raw file):
// SaveBoundaries tells KV about the key spans that the tenant wants
// statistics for.
rpc SaveBoundaries(SaveBoundariesRequest) returns (SaveBoundariesResponse) {}
consider UpdateBoundaries
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 68 at r4 (raw file):
// GetSamples gets the tenant's collected statistics from KV. It
// initiates a fan-out to all stores.
this comment lies, I think. Not all RPCs do a fan-out.
pkg/server/server.go
line 142 at r4 (raw file):
migrationServer *migrationServer
// XXX: keyVisualizerServer
xxx
pkg/server/span_stats_server.go
line 33 at r4 (raw file):
)
func getTenantRanges(
why extract this function out of its one caller?
pkg/server/span_stats_server.go
line 65 at r4 (raw file):
return nil
})
nit: no new line here
pkg/keyvisualizer/keyvismanager/manager.go
line 30 at r4 (raw file):
The motivation to use the jobs infra was for its consistency and coordination guarantees. Specifically, it's active or inactive while a tenant pod is active or inactive, respectively. Because it is set up as a singleton job, we can also guarantee that the key visualizer job will only fire once per tenant, regardless of the number of pods.
IIUC, this shouldn't be too spammy because the short 10-second sample period (job invocation period) that I chose was really just for development purposes. I imagine the period will be 10+ minutes in a production setting.
If anything, I would understand if there was a long-running job that I could pause from the UI (assuming that's a thing), and perhaps see some info for free, like when the last samples were collected.
Running with my assumption that a longer sample period will yield less job table spam, I felt that a forever-running job would just be more work to introduce, because I'd need to checkpoint the last sample time, etc.
What do you think?
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 51 at r1 (raw file):
bizarre
How so? Do you mean the motivation for the SpanStatsConsumer to fetch tenant ranges from KV, or do you mean the implementation that fulfills the RPC?
The fulfillment is done with kvclient.ScanMetaKVs.
pkg/keyvisualizer/keyvismanager/manager.go
line 30 at r4 (raw file):
> Because it is set up as a singleton job, we can also guarantee that the key visualizer job will only fire once per tenant, regardless of the number of pods.
>
> ...
>
> I felt that a forever-running job would just be more work to introduce, because I'd need to checkpoint the last sample time, etc.

Don't these two contradict each other? If there are no long-running jobs, then what "singletons" are we talking about?
Here's what the code looks like to me; tell me what I'm missing:

- Each `SQLServer` gets a `keyvismanager.Manager`. So, every SQL pod has one.
- Each `Manager` periodically checks whether any running `TypeKeyVisualizer` jobs exist and, if it doesn't, creates one. The existence check is expected to fail every time, because jobs are short lived.
- Once created, a job runs for a short time, and then terminates (because it returns from `Resume()`).

So, the more `SQLServer`s a tenant has, the more jobs it runs. Is this right?
pkg/keyvisualizer/keyvismanager/manager.go
line 30 at r4 (raw file):
> Is this right?

You're right. I see my mistake. I had confused checking to see if there's a job with running the job. As you pointed out, the existence check will run on all `SQLServer`s and fail every time.

Ok. So, I will reimplement this as a forever-running job.
pkg/keyvisualizer/keyvismanager/manager.go
line 66 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…

> `return m.stopper.RunAsyncTask(ctx, "key-visualizer-manager", m.run)`

Done.
pkg/keyvisualizer/keyvismanager/manager.go
line 126 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please comment the bool retval
Done.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 58 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
bad wrap in the middle of the type name
🙀 Done.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 68 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this comment lies, I think. Not all RPCs do a fan-out.
I had hoped that "initiates" conveys the distinction between a fan-out and a non-fanout. I'll make this more explicit.
pkg/server/server.go
line 142 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
xxx
Done.
pkg/server/span_stats_server.go
line 33 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why extract this function out of its one caller?
Fixed.
Force-pushed from d5306b0 to ef78a56.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, @andreimatei, @irfansharif, @koorosh, @Santamaura, and @zachlite)
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 51 at r1 (raw file):
Previously, zachlite wrote…

> bizarre
>
> How so? Do you mean the motivation for the `SpanStatsConsumer` to fetch tenant ranges from KV, or do you mean the implementation that fulfills the RPC? The fulfillment is done with `kvclient.ScanMetaKVs`.
I mean I doubt that the right thing to do is to add a method to this otherwise obscure service for getting info about a tenant's ranges. A tenant can already read the descriptors for its ranges with `RangeLookup` (that's what the `RangeIterator` uses under the hood). Introducing a similar RPC should not be done lightly, because of security concerns if nothing else. I think that if that API is good enough for `crdb_internal.ranges`, it should be good enough for our purposes. The responses to those requests are larger (full descriptors), but on the flip side the API is paginated (there's a limit). If size is a concern, we can consider optionally returning only some of the descriptor fields. But size does not seem to be a concern since, on the server side, the caller of `ScanMetaKVs` gets all the descriptors at once.

Also, pretty much the same RPC already exists on the status server. That guy checks for an admin user, because it's meant to be called by the CLI (debug zip) apparently. But I think we could figure out how to allow it for tenants. However, its response format seems a bit idiosyncratic, so I don't know if I'd encourage we use it.

Depending on how motivated you're feeling, I think implementing `crdb_internal.ranges_*` would be a very beneficial thing to do; there are several issues open about it. I think I would do it with the `RangeIterator`, but I'd teach the `RangeIterator` to prefetch a lot of descriptors at once, and I'd teach it to not pollute the range cache with the results if there's memory pressure on the cache.
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 46 at r5 (raw file):
```proto
message GetTenantRangesRequest {
  roachpb.TenantID tenant = 1;
```

I don't think we want fields like this, here or elsewhere. The tenant ID should come from the certificate. You need to do the certificate check anyway (and please do it in this PR), and then, having another `TenantID` field only makes me wonder if the input was properly sanitized at auth time.
I think we should hook up the auth stuff in this PR, even if other stuff around secondary tenants doesn't work.
But also, I want to check that nothing is too broken with this patch when there are secondary tenants - like for example the overlapping buckets that I was asking about before. Would it be a good idea to disable the jobs for the secondary tenants until they're ready?
pkg/keyvisualizer/keyvispb/key_visualizer.proto
line 56 at r5 (raw file):
```proto
// boundary. A spanstatspb.Sample contains the statistics collected on
// behalf of a tenant during a collection period.
service KeyVisualizer {
```

I don't see this service whitelisted in the `tenantAuthorizer`. Do the RPCs work for tenants? Leave a TODO somewhere.
pkg/server/span_stats_server.go
line 48 at r5 (raw file):
```go
tenantKeySpan := roachpb.Span{
	Key:    tenantPrefix,
	EndKey: tenantPrefix.PrefixEnd(),
```

This will be `[KeyMin, KeyMax)` for the system tenant, I think. Is that what you want? If there are secondary tenants, will that result in stores having overlapping buckets?
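For readers unfamiliar with `PrefixEnd`, the span above covers exactly the keys carrying the tenant's prefix: the end key is the first key that sorts after every key with that prefix (increment the last non-`0xff` byte and truncate). A minimal stdlib-only sketch of that semantics, with illustrative names that are not the actual CockroachDB implementation:

```go
package main

import "fmt"

// prefixEnd mirrors the semantics of roachpb.Key.PrefixEnd: scan from
// the end of the prefix, increment the first byte that is not 0xff, and
// truncate everything after it.
func prefixEnd(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	// All bytes are 0xff: no larger key of this form exists; the real
	// implementation falls back to KeyMax here.
	return prefix
}

func main() {
	tenantPrefix := []byte{0xfe, 0x8a} // hypothetical tenant prefix bytes
	fmt.Printf("span [% x, % x)\n", tenantPrefix, prefixEnd(tenantPrefix))
}
```

This is also why the system tenant is special: its "prefix" is empty, so its span degenerates to the whole keyspace, which is what the question about overlapping buckets is getting at.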
Force-pushed from 7d6d886 to b4b4580.
Force-pushed from 01d2c7c to 83ed00a.
Force-pushed from 0c778ce to 3b29af8.
bors r+
bors single on
bors r-
Canceled.
Force-pushed from 3b29af8 to 6cbeece.
The Key Visualizer enables the visualization of historical KV traffic. Its main objective is to enable CRDB engineers and operators to quickly identify historical and current KV hotspots. When a hotspot is identified, the user can immediately tell:

a) Where in the keyspace the hotspot exist(s|ed).
b) The time when the hotspot appeared and disappeared.

The Key Visualizer can be enabled via `SET CLUSTER SETTING keyvisualizer.enabled = true;`

The visualization represents roachpb.RequestUnion.Requests destined for a specific key-span on a log scale from black to red, representing "cold" and "hot", respectively. The keyspace [Min, Max) is represented on the Y-axis, and time is represented on the X-axis. By default, up to 2 weeks of historical data is persisted and visualized, with a sample period of 1 minute.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-12182

Release note (ui change): The introduction of the Key Visualizer makes it easy to identify historical hotspots. Three new cluster settings are introduced:
- `keyvisualizer.enabled` enables the key visualizer.
- `keyvisualizer.sample_interval` controls the frequency at which the key visualizer collects samples.
- `keyvisualizer.max_buckets` controls the maximum number of buckets in a sample.

Squashed fixup commits: passing jobs and backupccl tests; lint fixes; fix tests; test fixes; fix tests and lints; add key visualizer to advanced debug; tests and generated files; fix tests.
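The "log scale from black to red" mapping in the commit message can be sketched as follows. This is an illustrative formula, not the PR's actual rendering code: a bucket's request count is normalized against the hottest bucket on a log scale and used as the red channel of an RGB color, so cold buckets render black and the hottest bucket renders pure red.

```go
package main

import (
	"fmt"
	"math"
)

// heat maps a bucket's request count to a red-channel intensity on a
// log scale relative to the hottest bucket (an assumed normalization).
func heat(requests, maxRequests float64) uint8 {
	if requests <= 0 || maxRequests <= 1 {
		return 0 // cold buckets render black
	}
	frac := math.Log(1+requests) / math.Log(1+maxRequests)
	if frac > 1 {
		frac = 1
	}
	return uint8(frac * 255)
}

func main() {
	for _, r := range []float64{0, 10, 1000, 100000} {
		fmt.Printf("requests=%6.0f -> rgb(%d, 0, 0)\n", r, heat(r, 100000))
	}
}
```

The log scale matters because KV traffic is heavy-tailed: on a linear scale, everything but the single hottest span would render nearly black.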
Force-pushed from 6cbeece to 7416aab.
bors r+
Build succeeded.
This log statement was accidentally merged as part of cockroachdb#88353. This patch simply removes the log line. Release note: none
102697: pkg/server/status: remove leftover debug log statement. r=ericharmeling a=abarganier This log statement was accidentally merged as part of #88353. This patch simply removes the log line. Release note: none Epic: CRDB-12182 Co-authored-by: Alex Barganier <[email protected]>
This PR introduces the Key Visualizer 🔍 🔥 🪳
The Key Visualizer is designed for multi-tenancy, but is currently only implemented for use by the system tenant. Enabling usage by secondary tenants may be considered in the future. Read more about current limitations below.
Usage and configuration
The Key Visualizer is disabled by default. It can be enabled via `SET CLUSTER SETTING keyvisualizer.enabled = true;`
The default sample period is 1 minute, and the sample retention period is 2 weeks.
The sample period can be configured via the `keyvisualizer.sample_interval` cluster setting.
The default sample period was kept short for faster debugging cycles; the default we settle on in the future is subject to feedback. If the default sample period stays short, we'll need to lower the retention period.
The default number of buckets (think sample resolution) is 256. It can be raised via `keyvisualizer.max_buckets` to a maximum value of 1024.
Known issues, and improvements for 23.1
Downsampling strategy. The current downsampling strategy is implemented here. It's not bad, but it can be better. There's currently no guarantee of boundary stability between samples, and there are other heuristics to explore to prioritize preserving resolution in "important" parts of the keyspace.
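One family of heuristics hinted at above can be sketched as follows: repeatedly merge the adjacent pair of buckets with the smallest combined request count until at most `max_buckets` remain, which preserves resolution in hot regions at the cost of cold ones. This is a hedged illustration of one possible strategy, not the downsampling the PR actually implements, and the `bucket` type here is invented for the example.

```go
package main

import "fmt"

// bucket is an illustrative stand-in for a key-span sample bucket.
type bucket struct {
	startKey string
	requests int
}

// downsample greedily merges the adjacent pair of buckets with the
// smallest combined weight until at most maxBuckets remain.
func downsample(buckets []bucket, maxBuckets int) []bucket {
	for len(buckets) > maxBuckets {
		best := 0 // index of the left bucket of the cheapest adjacent pair
		for i := 1; i < len(buckets)-1; i++ {
			if buckets[i].requests+buckets[i+1].requests <
				buckets[best].requests+buckets[best+1].requests {
				best = i
			}
		}
		// Fold the right bucket into the left one and drop it.
		buckets[best].requests += buckets[best+1].requests
		buckets = append(buckets[:best+1], buckets[best+2:]...)
	}
	return buckets
}

func main() {
	in := []bucket{{"a", 100}, {"b", 1}, {"c", 2}, {"d", 500}}
	fmt.Println(downsample(in, 2))
}
```

Note this greedy merge still gives no boundary stability across samples, which is the other half of the issue noted above: two consecutive samples can end up with different bucket boundaries even under identical traffic.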
UI improvements