-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CRDB-12226] server, ui: display circuit breakers in problem ranges and range status #75809
Conversation
1e4f82e
to
b17b74e
Compare
@tbg for testing this it seems like the |
b17b74e
to
b1acf17
Compare
nwm, it will cause circuit breaker event, but not an circuit breaker error. |
I think you might have to wait a bit longer. The error you're seeing is likely caused by the problem ranges endpoint trying to reach the down nodes. After ~10s, the nodes should be marked as suspect (not live) so the problem ranges endpoint should be ignoring them. If this doesn't work, then there is a problem with the problem ranges that separately would need to be addressed. But I'm relatively confident that this will work. |
Thanks guys I'll test both methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg hmm I killed the 2 nodes and waited ~30 seconds and the problem ranges endpoint still returns with an error message..after waiting some more time (a few mins) it seems like the entire dashboard doesn't even load anymore.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh, @tbg, and @zachlite)
Wait five minutes first, then kill two nodes. I want to rule out that you're accidentally losing quorum on the system ranges, in which case the cluster is pretty dysfunctional. |
Yeah waiting 5 mins then killing the 2 nodes seems to result in the same things observed that were mentioned ^^ |
Let me make sure I understand correctly. I set up a local 3 node cluster, waited for full replication, and brought down n1. Now the problem range endpoint gives me this, which is fine. It doesn't have any data for n1, but how could it, that node is down. It does report data for n2 and n3. (This is on In your experiment, you would look for open circuit breakers reported by nodes that are not down (the circuit breakers are tripped since replicas on live nodes are unable to make progress, as they would have to rely on the down nodes to do so at the replication layer). I have a demo script at 16m30s here that you could copy (it's in the right half of my terminal). When the workload returns all circuit breaker errors, the problem ranges endpoint (if the circuit_breaker_error field was plumbed around correctly) should be populated. The Let me know if it still doesn't work, and I'll take a look at the PR tomorrow and try to repro this for myself. (I'm unable to meet synchronously this week, or I would've offered that; if this drags out until next then we can go through it together). |
b1acf17
to
f72f7a1
Compare
Thanks for the followup! I noticed I was missing an addition in cockroach/pkg/server/status.go Line 1860 in 9ef0fed
|
f72f7a1
to
cbe7628
Compare
So on a whim I followed Andrii's steps for his test (as mentioned above) and upon the workload encountering an error ( If I increase the |
I just tried the above with this PR and it "just worked", not sure what went differently when you tried: https://www.loom.com/share/f560561bae6a42e58a34951226ed54da I created the cluster via:
|
Ah I was starting my cluster the non-roachprod way because I wanted to see hot reloads to the UI. I tried via the roachprod way and confirm it works 👍 thanks Tobias! Strange that it doesn't work when I start cluster directly from binary |
718391a
to
0b35df8
Compare
This PR adds changes to the reports/problemranges and reports/range pages. Ranges with replicas that have a circuit breaker will show up as problem ranges and the circuit breaker error will show up as a row on the status page. Release note (ui change): display circuit breakers in problems ranges and range status
0b35df8
to
0ffc720
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 7 files at r1, 3 of 5 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Santamaura, @tbg, and @zachlite)
pkg/server/serverpb/status.proto, line 392 at r4 (raw file):
// When the raft log is too large, it can be a symptom of other issues. bool raft_log_too_large = 7; bool circuit_breaker_error = 9;
@Santamaura , what is the main purpose of this field? In case it is used as a flag to indicate presence of circuit breaker errors in response, than would it be possible to check circuit_breaker_error_range_ids
(that's defined below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh, @tbg, and @zachlite)
pkg/server/serverpb/status.proto, line 392 at r4 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
@Santamaura , what is the main purpose of this field? In case it is used as a flag to indicate presence of circuit breaker errors in response, than would it be possible to check
circuit_breaker_error_range_ids
(that's defined below)?
It is used to check if there is a circuit breaker error on a per range basis. This flag actually determines whether we add a range id to the circuit_breaker_range_ids
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Santamaura, @tbg, and @zachlite)
pkg/server/serverpb/status.proto, line 392 at r4 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
It is used to check if there is a circuit breaker error on a per range basis. This flag actually determines whether we add a range id to the
circuit_breaker_range_ids
.
Is it possible to do the same validation on front-end side to remove this field? Also naming of this field intends to an error
instead of bool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh, @tbg, and @zachlite)
pkg/server/serverpb/status.proto, line 392 at r4 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
Is it possible to do the same validation on front-end side to remove this field? Also naming of this field intends to an
error
instead ofbool
.
Oh sorry to clarify, the boolean isn't used on the FE, it's used to determine whether to add the range id in the problemRanges
function which will just return the list of id's to the FE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 7 files at r1, 2 of 5 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Santamaura, @tbg, and @zachlite)
a discussion (no related file):
pkg/server/serverpb/status.proto, line 392 at r4 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
Oh sorry to clarify, the boolean isn't used on the FE, it's used to determine whether to add the range id in the
problemRanges
function which will just return the list of id's to the FE
Got it! Thanks for making it clear!
bors r+ |
Build failed (retrying...): |
Build succeeded: |
This PR adds changes to the reports/problemranges and reports/range pages.
Ranges with replicas that have a circuit breaker will show up as problem ranges and
the circuit breaker error will show up as a row on the status page.
Release note (ui change): display circuit breakers in problems ranges and range status
Problem Ranges page:
Range status page: