backupccl: TestDataDriven/restore-grants is occasionally hanging #87129
cc @cockroachdb/bulk-io
Informs: cockroachdb#87129
Release note: None
Release justification: low risk test only change
87231: backupccl: skipping restore-grant datadriven test r=adityamaru a=adityamaru
Informs: #87129
Release note: None
Release justification: low risk test only change
Co-authored-by: adityamaru
@adityamaru did you ever get a stack trace from a hang?
Nope, didn't 😞 I haven't been able to reproduce this locally either the few times I tried stressing it.
Hi @ajwerner, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
@adityamaru it seems that this fell off your plate. AFAICT this is new functionality and it needs to get sorted or we need to decide that it's not a problem. Can you take a look and decide whether this matters or not?
Yup, looking at it now. Last I looked I wasn't able to reproduce this locally to grab stacks, but let me stress this again.
1500 stress runs over an hour and it hasn't hung. I've opened #88851 in the hope that CI times out and dumps its stacks.
I can occasionally get this test to hang for a few minutes. For example, if I set the per-test timeout to 3 minutes, I can typically get a timeout eventually. In the cases I've caught so far, it looks like we are waiting on leases:
But I suppose that should resolve itself within about 5 minutes at worst, so I'm not sure it would time out the test overall.
Adding some logging, it looks like the transaction to insert the lease is retrying repeatedly because of an intent that touches the parent ID.
I'm not 100% sure yet what involves 107 in this transaction, but from a brief peek at the code it looks like it could get read during the Validate call in MustGetDescriptorsByID. 107 is the database that we created right before the stalled grant statement, and we have an enqueued schema change job for it. Even in successful cases, I see a few of these retries, so it seems plausible that whatever is causing this other transaction to sit around for so long could cause this test to hang for a fairly long time. This is fairly reproducible on master for me just with:
Lowering the timeout here probably helps because, most of the time, the conflicting transaction eventually does get pushed by the high-priority txn used by the leasing code.
@ajwerner I'm going to leave GA-blocker on this for now. It seems to be a real issue based on what I've seen this morning.
I ran this again with some additional logging. It definitely looks like we are caught in a loop where the lease manager's txn pushes the txn that updated the descriptor. But, it doesn't appear to be pushing it far enough, so it just gets conflicted again:
If I comment out this code to force it to be pushed a bit further out: cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go Lines 784 to 813 in 1c37771
Then I am no longer able to reproduce the issue. On the DR side, we can probably re-enable this test so that we have coverage for restoring grants. We can lower the probability of it hanging substantially by setting
cockroach/pkg/ccl/backupccl/datadriven_test.go Lines 165 to 166 in 4c2c7da
89901: backupccl: unskip restore-grants r=adityamaru a=stevendanna
This test was skipped because it timed out. I believe #89900 is the likely cause of the timeout. Since this test doesn't depend on the shorter closed timestamp setting, we can reset them to make the timeout much less likely.
Fixes #87129
Release note: None
Co-authored-by: Steven Danna
Example build: https://teamcity.cockroachdb.com/viewLog.html?buildId=6293928&buildTypeId=Cockroach_UnitTests_BazelUnitTests&tab=artifacts#%2Ftmp%2F_tmp%2F73ed31399472b19fb4ec4de3ae2e0b2c%2FlogTestDataDriven2672313397;%2Ftmp%2F_tmp%2F73ed31399472b19fb4ec4de3ae2e0b2c%2FTestBackupRestoreDataDriven2133966575;%2Ftmp%2F_tmp%2F73ed31399472b19fb4ec4de3ae2e0b2c%2FStartServer3283493967;%2Fbazel-testlogs%2Fpkg%2Fccl%2Fbackupccl%2Fbackupccl_test (shard 4 of 16)
It looks like we're hanging on these statements
cockroach/pkg/ccl/backupccl/testdata/backup-restore/restore-grants
Line 37 in 4389df9
In a local, successful run of this test we do see these jobs queued as well but quickly see them resuming execution and completing.
Jira issue: CRDB-19166