roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] #78284
It's hard to tell from the test setup (since there's a lot going on), but a simple explanation would be that the timeseries query above hangs because we lost quorum. Timeseries are also stored in the KV store, so there's no reason to expect them to stay queryable. They are 3x replicated and we're killing three out of six nodes. Unless the test is being clever about making sure the timeseries ranges are in the surviving region, this kind of problem is expected. We're definitely seeing unexpected request durations on those ranges, indicating that they are indeed unavailable.
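To make the quorum arithmetic concrete, here is a minimal sketch in Go. It is not code from the test or from CockroachDB; `quorumSize` and `hasQuorum` are hypothetical helpers that only encode the majority rule for a replicated range.

```go
package main

import "fmt"

// quorumSize returns the number of live replicas a range with the given
// replication factor needs in order to make progress (a strict majority).
// Hypothetical helper for illustration only.
func quorumSize(replicas int) int {
	return replicas/2 + 1
}

// hasQuorum reports whether a range still has a majority when only `live`
// of its `replicas` replicas sit on surviving nodes.
func hasQuorum(replicas, live int) bool {
	return live >= quorumSize(replicas)
}

func main() {
	const replicas = 3 // timeseries ranges are 3x replicated
	for live := 0; live <= replicas; live++ {
		fmt.Printf("live=%d of %d: quorum=%t\n", live, replicas, hasQuorum(replicas, live))
	}
}
```

With 3 replicas the range needs 2 survivors, so a timeseries range with two or more replicas on the three killed nodes becomes unavailable.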
Replica circuit breakers should've let this test fail "gracefully" with a loss-of-quorum error. However, the SHA hadn't picked up #76146 yet, so breakers were disabled. Going to toss this over to KV for further investigation.
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 834eaa0e83350486830867b5edd6e8809b52aa55:
Same failure on other branches
This comment was marked as off-topic.
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:
We are attempting to be smart about this. See this code: cockroach/pkg/cmd/roachtest/tests/follower_reads.go Lines 529 to 566 in 7506753
That logic should wait until all ranges other than the range in the database with ZONE survivability have upreplicated across regions. But this isn't what we see in the logs you posted. Notice the range descriptor in

I reproduced this and confirmed that the unavailable timeseries range never achieved region survivability:

Something must be going wrong here with the replication reports. I wonder if it's related to async span config reconciliation in some form.
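As a rough illustration of what "achieved region survivability" means for a single range, the sketch below checks whether a majority of replicas remains after losing any one region. This is an assumption-laden model, not the check in `follower_reads.go` (which relies on the replication reports); `survivesRegionLoss` and the region names are hypothetical.

```go
package main

import "fmt"

// survivesRegionLoss reports whether a range keeps a majority of its
// replicas after losing every replica in any single region.
// replicaRegions holds one entry per replica, naming the region it lives in.
// Hypothetical helper for illustration only.
func survivesRegionLoss(replicaRegions []string) bool {
	total := len(replicaRegions)
	if total == 0 {
		return false
	}
	perRegion := make(map[string]int)
	for _, region := range replicaRegions {
		perRegion[region]++
	}
	quorum := total/2 + 1
	for _, n := range perRegion {
		// Losing this region leaves total-n replicas alive.
		if total-n < quorum {
			return false
		}
	}
	return true
}

func main() {
	// Region names are made up for the example.
	spread := []string{"region-a", "region-b", "region-c"}
	skewed := []string{"region-a", "region-a", "region-b"}
	fmt.Println("spread across three regions:", survivesRegionLoss(spread))
	fmt.Println("two replicas in one region:", survivesRegionLoss(skewed))
}
```

A 3x replicated range with two replicas in the same region fails this check, which matches the symptom of a timeseries range that never upreplicated across regions.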
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ cc07b5e7e670097560cb8412b380484773df1e96:
I'm pretty sure I've got this one. My suspicion is that it was caused by #76279, which meant that the system config was not immediately available when we first set the cluster setting to trigger the report, but that's just a hunch.

I'm not totally clear on why it was okay to look at only one report before this change. Maybe report generation timing was such that we always did one iteration of the retry loop and the report was there, and now, for whatever reason, some timing issue involving the rangefeed means we have to do more than one iteration.

What I do know is that when I added code to print out the state of a bunch of the tables inside the critical-localities check, before we actually did the scan, the test ran 60 times without failing, where previously I was getting 1-2 failures in 20 runs. That led me to wonder if we just weren't waiting for the right thing. Indeed, it seems we weren't: we were waiting for just one report to be written, but there are 3 reports in total and we write the critical localities report second.

I'm running it more; I'm at 55 successes with #79977 and it feels right to me. I've removed the release blocker label. My working theory is just that that change altered the timing and exposed the bug. I don't feel super eager to prove this out further right now, but I'm pretty happy with the answer for the moment.
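The shape of "wait for all of the reports, not just the first one" can be sketched as a retry loop like the one below. This is not the change in #79977; `reportTimes`, `allReportsFresh`, `waitForReports`, the report IDs 1 through 3, and the simulated fetcher in `demo` are all hypothetical and only illustrate the idea.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// reportTimes maps a (hypothetical) report ID to when it was last generated.
type reportTimes map[int]time.Time

// allReportsFresh reports whether every expected report exists and was
// generated at or after `since`.
func allReportsFresh(got reportTimes, expected []int, since time.Time) bool {
	for _, id := range expected {
		ts, ok := got[id]
		if !ok || ts.Before(since) {
			return false
		}
	}
	return true
}

// waitForReports polls fetch until all expected reports are fresh, giving up
// after the given number of attempts.
func waitForReports(fetch func() (reportTimes, error), expected []int,
	since time.Time, attempts int, backoff time.Duration) error {
	for i := 0; i < attempts; i++ {
		got, err := fetch()
		if err != nil {
			return err
		}
		if allReportsFresh(got, expected, since) {
			return nil
		}
		time.Sleep(backoff)
	}
	return errors.New("not all reports were regenerated in time")
}

// demo simulates reports being written one per poll and returns how many
// polls were needed before all three were fresh.
func demo() (int, error) {
	since := time.Unix(100, 0)
	polls := 0
	fetch := func() (reportTimes, error) {
		polls++
		got := reportTimes{}
		for id := 1; id <= polls && id <= 3; id++ {
			got[id] = since.Add(time.Second)
		}
		return got, nil
	}
	err := waitForReports(fetch, []int{1, 2, 3}, since, 5, 0)
	return polls, err
}

func main() {
	polls, err := demo()
	fmt.Println("polls:", polls, "err:", err)
}
```

Checking only the first report would return after a single poll in this simulation, before the second and third reports exist, which is the kind of race described above.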
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 10e0c5d92f8ef953d6b497b448893bb5044cdd31:
Jira issue: CRDB-14044