-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
circleci: failed tests: TestSplitSnapshotRace_SplitWins #9204
Comments
Timeout during shutdown. Relevant stacks: https://gist.github.com/3b1ed2aa474014205d79fc7518374fe2 Lots of tasks stuck in dist sender. @tschottdorf? |
Reassigning given that @spencerkimball was recently deep in DistSender. |
TestStoreMetrics had become flaky at some point due to a change in the way intents are resolved. The change to resolution did not introduce the issue, it simply exacerbated an existing problem in the test. The issue occurred because intent resolution is asynchronous for "remote" intents (not on the same range as the transaction record). This test does create a few such intents; this test also stops stores before verifying metrics. With the new resolution strategy, stopping a store was likely to create an unresolved intent, meaning the intent count was 1. This was incompatible with an existing "sanity check" that the intent count was zero, a regression check for cockroachdb#4624. In order to keep the sanity check, I have moved it into a succeedsSoon block *before* the stores are stopped. This fixes cockroachdb#8437; however, the test now appears to be broken due to the same underlying issue as cockroachdb#9204 (an infinite retry in distSender that prevents servers from stopping). The test remains skipped, but the skip message has been modified.
TestStoreMetrics is also failing with this reliably: https://circleci.com/gh/cockroachdb/cockroach/22634 That test stops and restarts stores (using multiTestContext); this issue occurs when one of the stores is unable to close. The stuck task does appear to be in DistSender. TestStoreMetrics is currently skipped; #9311 is fixing another issue specific to that test, but leaves the test skipped with a new skip message. |
Reproduces in 15 seconds on GCE with
The logs are full of:
|
@tamird the above invocation
times out in |
I'll take a look at this test. |
|
But it does get a response and become leader. This triggers a reproposal without waiting for ticks.
The reproposal is rejected, which tries to trigger a snapshot send but it can't, probably because there's already a pending snapshot to the downed node:
It looks like what's happening here is that we're starting the snapshot process to a node that is stopped (either just before or just after we start snapshotting), and when the target node stops, the sending node is not able to fail-fast and release the snapshot so it can proceed to talk to other nodes. |
Wanted to look into this more, but realistically won't be able to. Unassigning. |
@bdarnell, assigning you since you've already dug into it a fair bit. |
The first step for both of these tests is deflaking the setup function itself (many but not all of the failures we see appear to be in the setup function). I first managed to deflake it by dropping in a few random calls to The reason is this TODO. All of those actions are performed on the proposer (i.e. lease holder), except for addToReplicaGCQueue, which should be performed on the node that set it. This test passes most of the time because, surprisingly, the node we're removing is normally the lease holder (everything in this setup function operates in increasing store ID order, so the store with the lowest ID holds the lease except when some rare event causes another node to grab the lease first, and it's always the lowest store ID that we remove). Expiring the leases allows the to-be-removed node to grab the lease and propose its own removal.
|
LocalProposalData is processed only on the proposer, which is unlikely in practice to be the node being removed (but was apparently the case in all of our tests except in rare cases). This deflakes the SplitSnapshotRace tests, which would occasionally end up with the lease on a different store and miss the automatic GC (and because the test uses a manual clock, the background replicaGCQueue scan wouldn't GC them either). Fixes cockroachdb#8416 Fixes cockroachdb#9204
The following test appears to have failed:
#22431:
Please assign, take a look and update the issue accordingly.
The text was updated successfully, but these errors were encountered: