server: support multi-span statistics endpoint #101378
Conversation
Force-pushed 963ca08 to 94a160e
Awesome performance gains!
Reviewed 1 of 18 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @renatolabs, @THardy98, and @zachlite)
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 77 at r1 (raw file):
} // Dial a node for span stats on startup. All nodes will be on the previous version, this is a sanity check. l.Printf("Fetch span stats from node (start).")
Would it be useful to add more context for troubleshooting, like the node, start key, end key, etc.?
pkg/roachpb/span_stats.go
line 17 at r1 (raw file):
// Put span statistics cluster settings here to avoid import cycle. const DefaultSpanStatsSpanLimit = 500
What is the reason for having a const instead of just specifying it in the register setting? I find it harder to read because you have to look where the const is specified.
Why is this one public, and the setting below private?
pkg/server/span_stats_server.go
line 142 at r1 (raw file):
ri.Seek(ctx, rSpan.Key, kvcoord.Ascending) spanStats := &roachpb.SpanStats{} var fullyContainedKeysBatch []roachpb.Key
Any way to know what the count is to create the exact size array to avoid resizing?
pkg/server/span_stats_server.go
line 318 at r1 (raw file):
func isLegacyRequest(req *roachpb.SpanStatsRequest) bool { // If the start/end key fields are not nil, we have a request using the old request format. return req.StartKey != nil && req.EndKey != nil
Any chance only one value is set? Wondering if this should be an || instead of &&.
pkg/server/status.go
line 3482 at r1 (raw file):
return nil, errors.New(MixedVersionErr) } if len(req.Spans) > int(roachpb.SpanStatsBatchLimit.Get(&s.st.SV)) {
What is the reason for failing the request? This might occur if the limit changes in the middle of a request.
The mixed version test is awesome. Nice work.
Reviewed 16 of 18 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @j82w, @renatolabs, and @THardy98)
pkg/roachpb/span_stats.proto
line 25 at r1 (raw file):
// result from across the cluster. message SpanStatsRequest { string node_id = 1 [(gogoproto.customname) = "NodeID"];
So these fields are legacy now? See comment below.
pkg/roachpb/span_stats.proto
line 43 at r1 (raw file):
message SpanStatsResponse { cockroach.storage.enginepb.MVCCStats total_stats = 1 [(gogoproto.nullable) = false]; // range_count measures the number of ranges that the request span falls within.
Duplicate range_count comment?
pkg/roachpb/span_stats.proto
line 47 at r1 (raw file):
// key sorts after the range start, and whose end key sorts before the // range end, will have a range_count value of 1. int32 range_count = 2;
So we've only added new fields to SpanStatsResponse?
Can you add some comments marking fields as legacy, and can you include a brief explanation in the comment about why the field is legacy but not also reserved.
pkg/server/span_stats_server.go
line 318 at r1 (raw file):
Previously, j82w (Jake) wrote…
Any chance only one value is set? Wondering if this be an || instead of &&
I'll +1 this. StartKey and EndKey will almost always be set together, but in case just one is set we should treat the request as malformed / legacy.
pkg/server/status.go
line 3473 at r1 (raw file):
} // If we receive a request using the old format. if isLegacyRequest(req) {
Excellent. Thanks for pulling this logic out of span_stats_server.
Force-pushed 94a160e to 13f1c9b
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @j82w, @renatolabs, and @zachlite)
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 77 at r1 (raw file):
Previously, j82w (Jake) wrote…
Would it be useful to add more context for troubleshooting like the node, startkey, endkey, etc...?
We use the same start/end key for all queries.
pkg/roachpb/span_stats.go
line 17 at r1 (raw file):
Previously, j82w (Jake) wrote…
What is the reason for having a const instead of just specifying it in the register setting? I find it harder to read because you have to look where the const is specified.
Why is this one public, and the setting below private?
I find it convenient for resetting the cluster setting back to its default value (now you have a var instead of a literal). The const is specified in the same file as the cluster setting, so it shouldn't be difficult to find.
This one is public so it can be used in a test.
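The pattern being discussed can be sketched as below. This is a hypothetical stand-in for CockroachDB's settings package, not the actual API; only the exported const mirrors the PR:

```go
package main

import "fmt"

// DefaultSpanStatsSpanLimit mirrors the exported const from the PR: public
// so tests can reset the cluster setting to its default without a literal.
const DefaultSpanStatsSpanLimit = 500

// setting is a hypothetical stand-in for a registered cluster setting;
// the real code uses CockroachDB's settings package.
type setting struct{ value int64 }

func (s *setting) set(v int64)     { s.value = v }
func (s *setting) get() int64      { return s.value }
func (s *setting) resetToDefault() { s.value = DefaultSpanStatsSpanLimit }

func main() {
	s := &setting{value: DefaultSpanStatsSpanLimit}
	s.set(1000) // a test overrides the limit...
	s.resetToDefault() // ...then restores it via the exported const
	fmt.Println(s.get())
}
```

Referencing the const in both the registration site and the test avoids the magic number drifting out of sync.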
pkg/roachpb/span_stats.proto
line 25 at r1 (raw file):
Previously, zachlite wrote…
So these fields are legacy now? See comment below.
Left a comment.
pkg/roachpb/span_stats.proto
line 43 at r1 (raw file):
Previously, zachlite wrote…
Duplicate range_count comment?
Removed.
pkg/roachpb/span_stats.proto
line 47 at r1 (raw file):
Previously, zachlite wrote…
So we've only added new fields to SpanStatsResponse? Can you add some comments marking fields as legacy - and can you include a brief explanation in the comment about why the field is legacy but not also reserved.
I've left a comment designating these fields as legacy and why we don't reserve them. Left a TODO for myself for when the time comes that we can reserve these fields.
pkg/server/span_stats_server.go
line 142 at r1 (raw file):
Previously, j82w (Jake) wrote…
Any way to know what the count is to create the exact size array to avoid resizing?
I don't think so. That would require us knowing beforehand which spans are fully contained by a range and which spans are not (i.e. split across multiple ranges).
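The trade-off here can be sketched as follows. Names are illustrative, not the real span_stats_server code: the exact count of fully contained spans isn't knowable up front, but the number of requested spans is a safe capacity upper bound:

```go
package main

import "fmt"

// batchFullyContained collects the spans fully contained by a single range.
// We can't know the count in advance, but it's at most len(spans), so
// preallocating that capacity avoids most append regrowth.
// (Hypothetical sketch, not the actual implementation.)
func batchFullyContained(spans []string, contained func(string) bool) []string {
	batch := make([]string, 0, len(spans))
	for _, sp := range spans {
		if contained(sp) {
			batch = append(batch, sp)
		}
	}
	return batch
}

func main() {
	spans := []string{"a-b", "c-d", "e-f", "g-h"}
	// Pretend one span is split across multiple ranges.
	isContained := func(sp string) bool { return sp != "c-d" }
	batch := batchFullyContained(spans, isContained)
	fmt.Println(len(batch), cap(batch))
}
```

The slice never reallocates because the capacity bound holds, at the cost of possibly over-allocating when few spans are fully contained.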
pkg/server/span_stats_server.go
line 318 at r1 (raw file):
Previously, zachlite wrote…
I'll +1 this. StartKey and EndKey will almost always be set together, but in case just one is set we should treat the request as malformed / legacy.
Done.
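The agreed-upon change can be sketched as below. The struct is a simplified stand-in for roachpb.SpanStatsRequest; only the || semantics from the discussion are the point:

```go
package main

import "fmt"

// SpanStatsRequest is a simplified stand-in for roachpb.SpanStatsRequest.
type SpanStatsRequest struct {
	StartKey []byte
	EndKey   []byte
}

// isLegacyRequest treats a request as legacy if *either* key field is set,
// so a malformed request carrying only one key also takes the legacy path,
// per the review discussion (|| instead of &&).
func isLegacyRequest(req *SpanStatsRequest) bool {
	return req.StartKey != nil || req.EndKey != nil
}

func main() {
	both := &SpanStatsRequest{StartKey: []byte("a"), EndKey: []byte("z")}
	onlyStart := &SpanStatsRequest{StartKey: []byte("a")}
	neither := &SpanStatsRequest{}
	fmt.Println(isLegacyRequest(both), isLegacyRequest(onlyStart), isLegacyRequest(neither))
}
```

With &&, the onlyStart case would have fallen through to the new-format path despite being malformed.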
pkg/server/status.go
line 3473 at r1 (raw file):
Previously, zachlite wrote…
Excellent. Thanks for pulling this logic out of span_stats_server.
Done.
pkg/server/status.go
line 3482 at r1 (raw file):
Previously, j82w (Jake) wrote…
What is the reason for failing the request? This might occur if the limit changes in the middle of a request.
We want to limit callers from requesting stats for an excessive number of spans.
I hadn't considered the cluster setting limit being changed mid-request (i.e in a fanout), but I think it's okay if the request errors, reflecting that the limit has changed. I don't see this happening often in practice.
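The guard under discussion can be sketched as below. Identifiers are illustrative stand-ins (the real code reads roachpb.SpanStatsBatchLimit from the cluster settings), but the shape of the check is the same:

```go
package main

import (
	"errors"
	"fmt"
)

// spanStatsBatchLimit stands in for the cluster setting read per request;
// the name is illustrative, not the actual CockroachDB identifier.
var spanStatsBatchLimit = 500

// checkSpanLimit rejects requests for more spans than the current limit,
// keeping callers from requesting stats for an excessive number of spans.
func checkSpanLimit(numSpans int) error {
	if numSpans > spanStatsBatchLimit {
		return errors.New("request exceeds the span batch limit")
	}
	return nil
}

func main() {
	fmt.Println(checkSpanLimit(500) == nil, checkSpanLimit(501) == nil)
}
```

Because the setting is re-read on each node, lowering it mid-fanout can fail an in-flight request, which the reply above accepts as a rare and acceptable outcome.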
We're planning to backport this to 23.1, right? The test added here will start failing very soon (we're in the process of making master 23.2, e.g., #101252).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @j82w, @THardy98, and @zachlite)
pkg/cmd/roachtest/registry/owners.go
line 33 at r2 (raw file):
OwnerDevInf Owner = `dev-inf` OwnerMultiTenant Owner = `multi-tenant` OwnerClusterObs Owner = `cluster-observability`
You might be interested in updating TEAMS.yaml as well to include a GitHub project ID so that any issues created by roachtest are automatically added to the project board.
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 84 at r2 (raw file):
// Ensure the result can be marshalled into a valid span stats response. var spanStats roachpb.SpanStatsResponse err = json.Unmarshal([]byte(res.Stdout), &spanStats)
Nit: this could just be return json.Unmarshal
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 135 at r2 (raw file):
expectedUnknown := assertExpectedError(errOutput.Message, unknownFieldError) if !expectedCurrToPrev && !expectedUnknown { return errors.Newf("expected '%s' or '%s' in error message, got: '%v'", prevToCurrentError, unknownFieldError, errOutput.Error)
Curious about the "or" here: in what conditions will one error be returned vs the other? Also, the error message mentions prevToCurrentError, but the actual test is checking for currentToPrevError.
(Same question and note applies to the assertion below, in the "Fanout from current version node" section.)
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 156 at r2 (raw file):
return errors.Newf("expected '%s' or '%s' in error message, got: '%v'", prevToCurrentError, unknownFieldError, errOutput.Error) } }
I feel like there are also assertions we should make in this branch, which corresponds to the scenario where every node is running 23.1 but the cluster version is still at 22.2.
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 169 at r2 (raw file):
// Ensure the result can be marshalled into a valid span stats response. var spanStats roachpb.SpanStatsResponse err = json.Unmarshal([]byte(res.Stdout), &spanStats)
Nit: return json.Unmarshal
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 202 at r2 (raw file):
l.Printf("fetchSpanStatsFromNode - curl result", res.Stdout) return &res, nil }
In all the errors returned by this function, I'd suggest making the error itself more descriptive (return fmt.Errorf("failed to do X Y Z: %w", err)) instead of returning err and logging the context. The error itself is logged when the test fails, so you achieve the same purpose but make the test failure report more understandable.
Force-pushed 8d78ea1 to bd1d853
We're planning to backport this to 23.1, right? The test added here will start failing very soon (we're in the process of making master 23.2, e.g., #101252).
Yes.
Is there anything I should do now to account for this?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @j82w, @renatolabs, and @zachlite)
pkg/cmd/roachtest/registry/owners.go
line 33 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
You might be interested in updating TEAMS.yaml as well to include a GitHub project ID so that any issues created by roachtest are automatically added to the project board.
Looks like there is already an entry for cluster-observability; would this suffice:
Line 38 in a65c986
cockroachdb/cluster-observability:
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 84 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: this could just be return json.Unmarshal
Done.
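The nit boils down to returning json.Unmarshal's error directly instead of assigning it and returning on the next line. A minimal sketch, with a simplified stand-in for roachpb.SpanStatsResponse:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spanStatsResponse is a simplified stand-in for roachpb.SpanStatsResponse.
type spanStatsResponse struct {
	RangeCount int32 `json:"range_count"`
}

// validateResponse returns json.Unmarshal's error directly, as suggested:
// a nil error means the curl output parsed as a valid span stats response.
func validateResponse(stdout string) error {
	var resp spanStatsResponse
	return json.Unmarshal([]byte(stdout), &resp)
}

func main() {
	fmt.Println(validateResponse(`{"range_count": 3}`) == nil)
	fmt.Println(validateResponse(`not json`) == nil)
}
```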
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 135 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Curious about the "or" here: in what conditions will one error be returned vs the other? Also, the error message mentions prevToCurrentError, but the actual test is checking for currentToPrevError.
(Same question and note applies to the assertion below, in the "Fanout from current version node" section.)
Oops, fixed the error messages.
These cases both occur when the previous versioned node receives a request with a new format.
More specifically, I believe that in some cases in the test when we want to test current version node -> previous version node (whether we specifically dial a previous version node or fan out in a mixed cluster from a current version node), c.Node(currVersNodeID) was still a 22.2 node, and issuing a request from the 22.2 node with the new format caused unknownFieldError. But in cases when we were issuing the request from a 23.1 node, forwarding it to a 22.2 node, we were getting the currentToPrevError.
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 156 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I feel like there are also assertions we should make in this branch, which corresponds to the scenario where every node is running 23.1 but the cluster version is still at 22.2.
Added a fanout where all nodes are on 23.1 in the mixed version state.
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 169 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: return json.Unmarshal
Done.
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 202 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
In all the errors returned by this function, I'd suggest making the error itself more descriptive (return fmt.Errorf("failed to do X Y Z: %w", err)) instead of returning err and logging the context. The error itself is logged when the test fails, so you achieve the same purpose but make the test failure report more understandable.
Wrapped the returned errors with the text that was formerly being logged.
Not really, just mentioning for awareness. We can backport and then delete on master soon after once it becomes 23.2. The important thing is that this test continues to run on the 23.1 branch.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @j82w, @THardy98, and @zachlite)
a discussion (no related file): Roachtest
pkg/cmd/roachtest/registry/owners.go
line 33 at r2 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
Looks like there is already an entry for cluster-observability; would this suffice:
Line 38 in a65c986
cockroachdb/cluster-observability:
My bad, somehow I missed it!
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 156 at r2 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
Added a fanout where all nodes are on 23.1 in the mixed version state.
Does this pass reliably? I just realized that my comment was incomplete. Since the framework also performs downgrades, it's possible that all nodes are running 22.2 here. Did you verify this test passes in that case as well?
Force-pushed b6593fa to 9d142ac
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian, @j82w, @renatolabs, and @zachlite)
pkg/cmd/roachtest/tests/mixed_version_tenant_span_stats.go
line 156 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Does this pass reliably? I just realized that my comment was incomplete. Since the framework also performs downgrades, it's possible that all nodes are running 22.2 here. Did you verify this test passes in that case as well?
Opened this test branch to test the mixed cluster version state when all nodes are 22.2 or 23.1. Passes from my testing.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian, @j82w, @renatolabs, and @THardy98)
TYFR :)
bors r+
Build failed (retrying...)
Build succeeded
Epic: None
Extends: #96223
This PR extends the implementation of our SpanStats RPC endpoint to fetch stats for multiple spans at once. By extending the endpoint, we amortize the cost of the RPC's node fanout across all requested spans, whereas previously, we were issuing a fanout per span requested. Additionally, this change batches KV layer requests for ranges fully contained by the span, instead of issuing a request per fully contained range.
Note that we do not deprecate the start_key and end_key fields, as they're used to determine whether an old node is calling out to a node using the new proto format. The changes here explicitly do not support mixed-version clusters.
BENCHMARK RESULTS
Here are some benchmark results from running:
Note that HEAD is actually a temp change to revert to the old logic (request per span) and HEAD^ is the new logic (multi-span request). As such, the increases in latency/memory are actually reductions. Some notable improvements, particularly with requests for spans with fewer ranges. After a point, the raw number of ranges becomes the bottleneck, despite reducing the number of fanouts. Not sure if there is a better way to fetch range statistics, but I think the improvement here is enough for this PR. If improvements for fetching range statistics are identified, they can be done in a follow-up PR and backported.
Release note (sql change): span statistics are unavailable on mixed-version clusters
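The amortization the PR description claims can be sketched with a back-of-the-envelope RPC count. This is an illustrative model, not the actual server code: with N nodes and S spans, the old behavior issues one fanout per span while the new endpoint carries all spans in a single fanout:

```go
package main

import "fmt"

// rpcCount models the number of node-to-node RPCs for a stats request:
// one fanout touches every node, and the old path repeated the fanout
// once per requested span. (Illustrative model, not the real code.)
func rpcCount(nodes, spans int, batched bool) int {
	if batched {
		return nodes // a single fanout carries all requested spans
	}
	return nodes * spans // one fanout per span
}

func main() {
	nodes, spans := 9, 100
	fmt.Println(rpcCount(nodes, spans, false), rpcCount(nodes, spans, true))
}
```

Under this model the fanout cost drops from nodes*spans to nodes, which matches the benchmark observation that gains are largest when per-span range counts are small and RPC overhead dominates.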