roachtest: kv/restart/nodes=12 failed #102655
It looks like the same failure as #98928. I'm looking at n12, which was restarted, and it goes into IO overload. One thing I noticed is that the DescriptorTable range was moved to that node, so maybe it is causing more of an outage than other ranges would. We removed the release blocker from it last time; do we need to revisit this?
@kvoli, assigning to you as the owner of the previous instance of this failure, to make a call.
Removed the release blocker, as this is a known issue which affects previous versions as well.
The lease transfers towards the restarted node began shortly after it rejoined.

Details (collapsed logs): rejoining gossip; lease transfers towards the node.

However, that shouldn't be possible, since we consider re-joining stores as suspect for 30s after they rejoin the cluster (#97532), and the leases do eventually stop attempting to transfer towards it. It seems as though even if those leases hadn't transferred during the suspect period (as they shouldn't have), the store would still have received a couple of leases, because the IO overload signal was slow to appear. I haven't dug into why the stores didn't consider it suspect after startup. I'll keep looking Monday. @andrewbaptist.
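As a rough illustration of the intended behavior described above, here is a minimal Go sketch: a store that recently rejoined (suspect for 30s, per #97532) or is IO-overloaded should be excluded as a lease-transfer target. The type and function names are hypothetical, not CockroachDB's actual allocator code.

```go
package main

import (
	"fmt"
	"time"
)

// storeState is an illustrative stand-in for the allocator's view of a store;
// it is not CockroachDB's actual data structure.
type storeState struct {
	id         int
	rejoinedAt time.Time // when the store rejoined gossip after a restart
	ioOverload bool      // whether the store currently reports IO overload
}

// suspectWindow mirrors the 30s suspect period applied to re-joining stores (#97532).
const suspectWindow = 30 * time.Second

// isValidLeaseTarget reports whether a store should be considered as a
// lease-transfer target at time now: it must be neither suspect (recently
// rejoined) nor IO-overloaded.
func isValidLeaseTarget(s storeState, now time.Time) bool {
	if now.Sub(s.rejoinedAt) < suspectWindow {
		return false // still within the suspect window after rejoining
	}
	if s.ioOverload {
		return false // IO-overloaded stores should not receive leases
	}
	return true
}

func main() {
	now := time.Now()
	n12 := storeState{id: 12, rejoinedAt: now.Add(-10 * time.Second)}
	fmt.Println(isValidLeaseTarget(n12, now))                     // false: only 10s since rejoin
	fmt.Println(isValidLeaseTarget(n12, now.Add(40*time.Second))) // true: past the 30s window
}
```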
The `kv/restart/nodes=12` test relies on correct allocation decisions to pass. This commit increases the logging verbosity of related components so that if/when the test fails, a root cause is easier to establish.

Informs: cockroachdb#102655
Epic: none
Release note: None
Stressing now with increased verbosity.
Stressed overnight (12 times) without failure on master. We may have just gotten unlucky here, and the test may have failed even with the suspect period being respected (IO overload didn't occur until 50s after the node restarted, so there's a 20s gap where it could receive leases regardless). I've opened #102915 to increase the logging verbosity so that if this fails again we have more material to investigate. Going to let this sit for now and wait for future failures.
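To make the timing argument above concrete, a small sketch using the numbers from the comment (30s suspect window, IO overload appearing roughly 50s after the restart) to show the window in which the restarted store is neither suspect nor marked overloaded, and could therefore still receive leases. The values are taken from the comment, not measured here.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Numbers from the comment above; the IO overload onset is approximate.
	suspectWindow := 30 * time.Second   // re-joining stores are suspect for 30s
	ioOverloadOnset := 50 * time.Second // IO overload observed ~50s after the restart

	// Between the end of the suspect window and the onset of IO overload,
	// neither mechanism prevents the store from receiving leases.
	gap := ioOverloadOnset - suspectWindow
	fmt.Printf("lease-eligible gap after restart: %s\n", gap) // prints "20s"
}
```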
102915: roachtest: increase kv/restart/nodes=12 logging verbosity r=andrewbaptist a=kvoli

The `kv/restart/nodes=12` test relies on correct allocation decisions to pass. This commit increases the logging verbosity of related components so that if/when the test fails, a root cause is easier to establish.

Informs: #102655
Epic: none
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
roachtest.kv/restart/nodes=12 failed with artifacts on master @ 9d6e7baebdc0d8b1a57ce8dc158fec35b48e9e2a:
Parameters:
The most recent failure is an unrelated panic due to #104007, fixed by #104082.
Full Stack (collapsed)
There have been no related failures since Apr 29th (70 days). The last failure, on May 27th, was due to the unrelated bug mentioned in the above comment. Closing.
roachtest.kv/restart/nodes=12 failed with artifacts on master @ 4619767bb0b75ac85ddcb76e33c73211e369afed:
Parameters:
ROACHTEST_cloud=gce
,ROACHTEST_cpu=8
,ROACHTEST_encrypted=false
,ROACHTEST_fs=ext4
,ROACHTEST_localSSD=true
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-27571