-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: make the span stats fan-out more fault tolerant #108456
server: make the span stats fan-out more fault tolerant #108456
Conversation
cc968fd
to
bc57125
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali and @zachlite)
-- commits
line 9 at r1:
What is the reason for removing the error from the response? The error in the response allows users to see what the issue is and to avoids ambiguity with a node missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali and @zachlite)
-- commits
line 11 at r1:
How about? Errors that are encountered are logged, and also returned on a property in the response along with the successful requests.
pkg/roachpb/span_stats.go
line 41 at r1 (raw file):
time.Minute, settings.NonNegativeDuration, ).WithPublic()
Does this really need to be public? We shouldn't make cluster setting public unless we expect users to change it.
bc57125
to
30c0898
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the response to include the errors encountered on each node. I've also updated the test to make sure that in the extreme case of failure on each node, stats for each span requested is still accessible.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @nkodali)
Previously, j82w (Jake) wrote…
How about? Errors that are encountered are logged, and also returned on a property in the response along with the successful requests.
Done.
pkg/roachpb/span_stats.go
line 41 at r1 (raw file):
Previously, j82w (Jake) wrote…
Does this really need to be public? We shouldn't make cluster setting public unless we expect users to change it.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 14 files at r1, 5 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @nkodali, and @zachlite)
pkg/server/span_stats_server.go
line 137 at r2 (raw file):
} res.Errors = strings.Join(errorMessages, ",")
Not in love with this. Why not make the proto a repeated string errors
?
30c0898
to
e8d6e16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, and @nkodali)
pkg/server/span_stats_server.go
line 137 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Not in love with this. Why not make the proto a
repeated string errors
?
You're right. Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 14 files at r1, 2 of 7 files at r2, 4 of 5 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, @nkodali, and @zachlite)
pkg/server/span_stats_test.go
line 200 at r3 (raw file):
defer log.Scope(t).Close(t) ctx := context.Background() const numNodes = 5
For the test to be more complete, at least of of the nodes should return something valid, to make sure that it's working and we get some info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, @maryliag, and @nkodali)
pkg/server/span_stats_test.go
line 200 at r3 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
For the test to be more complete, at least of of the nodes should return something valid, to make sure that it's working and we get some info.
I see what you're saying. This test is meant to verify the specific code path where there's never a valid result returned from any node. All of the other tests in this file already test the error-free code paths.
I can add an additional test that expects a mix of some errors, some valid results.
e8d6e16
to
5713064
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, @maryliag, and @nkodali)
pkg/server/span_stats_test.go
line 200 at r3 (raw file):
Previously, zachlite wrote…
I see what you're saying. This test is meant to verify the specific code path where there's never a valid result returned from any node. All of the other tests in this file already test the error-free code paths.
I can add an additional test that expects a mix of some errors, some valid results.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @knz, @nkodali, and @zachlite)
pkg/server/span_stats_test.go
line 290 at r4 (raw file):
require.Equal(t, true, containsError(res.Errors, "node 4 timed out")) // There should not be any errors for node 2 or node 5.
you already tested that the length of error message is 3 and you checked for all of them, so there is no need to check these 2, specially because is not checking with right thing, there could have been a different error type on node 2 that is not dialing (same for node 5).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @knz, @maryliag, and @nkodali)
pkg/server/span_stats_test.go
line 290 at r4 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
you already tested that the length of error message is 3 and you checked for all of them, so there is no need to check these 2, specially because is not checking with right thing, there could have been a different error type on node 2 that is not dialing (same for node 5).
Discussed offline. Keeping for documentation purposes.
bors r+ |
bors r- |
Canceled. |
FYI the failure on Extended CI:
is a dup of #108534. |
5713064
to
b742181
Compare
This commit adds improved fault tolerance to the span stats fan-out: 1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field. 2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute. Resolves cockroachdb#106097 Epic: none Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.
b742181
to
5ddcdd1
Compare
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 5ddcdd1 to blathers/backport-release-23.1-108456: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit adds improved fault tolerance to the span stats fan-out:
Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.
Errors that are encountered are logged, and then appended to the response
in the newly added
Errors
field.Nodes must service requests within the timeout passed to
iterateNodes
.For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.
Resolves #106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.