-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
teamcity: failed test: TestConcurrentRaftSnapshots #39652
Comments
Repros easily with
I'll take a look |
The problem seems to be that we're returning here without an error cockroach/pkg/storage/replica_command.go Lines 1256 to 1262 in c80412a
But the transaction at that point is cockroach/pkg/internal/client/txn.go Line 696 in 48465c5
Now on to finding out how we're losing the error that we should see when trying to commit but don't. |
|
At init() time, the txn proto has not been populated yet. Found while investigating cockroachdb#39652. Release note: None
Ok, I understand this now. This is a race between the txn's heartbeat loop and the commit. It looks like we're managing to commit but before the response enters the txn coord sender, its heartbeat loop misses a heartbeat and marks the txn as aborted here:
However, seems that txcoordsender still returns the (successful) commit response to the client here (except that the cockroach/pkg/storage/replica_command.go Lines 1255 to 1264 in 66acbc0
and so we hit the cockroach/pkg/internal/client/txn.go Lines 693 to 695 in 3d26bc5
which then fails because the txn coord sender was already finalized here cockroach/pkg/kv/txn_coord_sender.go Line 806 in 3d26bc5
The fundamental problem here seems with the heartbeat loop. If the heartbeat loop changes the txn while there is an inflight commit attempt, the client must receive an ambiguous result. In general it seems fraught to have the heartbeats muddle with the state like that; if a heartbeat fails we just want to notify the client. We can do that more easily by setting a flag that is queried only when the client tries to make another request and that, if hit, returns a transaction aborted error. @nvanbenschoten could I throw this your way? I'm going to send a PR to skip this test (though the problem ultimately isn't specific to the test, it seems to hit it reliably now that learners are in there, for whatever reason). |
The test highlights a bug in production code and is too flaky to remain unskipped while we fix said bug. See cockroachdb#39652. Release note: None
39657: storage: skip TestConcurrentRaftSnapshots r=ajwerner a=tbg The test highlights a bug in production code and is too flaky to remain unskipped while we fix said bug. See #39652. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
So it sounds like what's happening is exactly this: cockroach/pkg/kv/txn_interceptor_heartbeater.go Lines 357 to 364 in dc3686f
I really should have taken time to address this back when I was originally thinking through it. I guess I didn't want to start pulling on another thread while in the middle of trying to figure out everything with parallel commits. I think that was broken by this change, which prevents the COMMITTED status from overwriting the ABORTED status. We should log a loud error if anyone tries to update a transaction proto with that transition. I like your idea about the heartbeat loop setting a flag that is queried only when the client tries to make another request. That way, the transaction is always properly updated by its main goroutine through the normal mechanisms. |
39640: storage: unify replica addition and removal paths r=nvanbenschoten a=tbg This continues the reworking of the various replication change APIs with the goal of allowing a) testing of general atomic replication changes b) issuing replica swaps from the replicate queue (in 19.2). For previous steps, see: #39485 #39611 This change is not a pure plumbing PR. Instead, it unifies `(*Replica).addReplica` and `(*Replica).removeReplica` into a method that can do both, `(*Replica).addAndRemoveReplicas`. Given a slice of ReplicationChanges, this method first adds learner replicas corresponding to the desired new voters. After having sent snapshots to all of them, the method issues a configuration change that atomically - upgrades all learners to voters - removes any undesired replicas. Note that no atomic membership changes are *actually* carried out yet. This is because the callers of `addAndRemoveReplicas` pass in only a single change (i.e. an addition or removal), which the method also verifies. Three pieces are missing after this PR: First, we need to be able to instruct raft to carry out atomic configuration changes: https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451 which in particular requires being able to store the ConfState corresponding to a joint configuration in the unreplicated local state (under a new key). Second, we must pass the slice of changes handed to `AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling it first, see: https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891 and https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325 Third, we must to teach the replicate queue to issue the "atomic swaps"; this is the reason we're introducing atomic membership changes in the first place. Release note: None 39656: kv: init heartbeat txn log tag later r=nvanbenschoten a=tbg At init() time, the txn proto has not been populated yet. Found while investigating #39652. This change strikes me as clunky, but I don't have the bandwidth to dig deeper right now. Release note: None 39666: testutils/lint/passes: disable under nightly stress r=mjibson a=mjibson Under stress these error with "go build a: failed to cache compiled Go files". Fixes #39616 Fixes #39541 Fixes #39479 Release note: None 39669: rpc: use gRPC enforced minimum keepalive timeout r=knz a=ajwerner Before this commit we'd experience the following annoying log message from gRPC every time we create a new connection telling us that our setting is being ignored. ``` Adjusting keepalive ping interval to minimum period of 10s ``` Release note: None Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: Matt Jibson <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
Fixes cockroachdb#39652. Fixes cockroachdb#39661. Fixes cockroachdb#35144. This commit fixes the referenced issues by eliminating the practice of updating the transaction coordinator's proto directly from its heartbeat loop. This was problematic because of the race described in https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364. The heartbeat loop doesn't know if it's racing with an EndTransaction request and it could incorrectly interpret a missing transaction record if it is. The safest thing to do is to limit the path in which it informs the TxnCoordSender of updates. This limits the responsibility of the heartbeat loop. Its job is now only to: 1. update the transaction record to maintain liveness 2. eagerly clean up a transaction if it is found to be aborted 3. inform the transaction coordinator about an aborted transaction record IF the transaction coordinator is continuing to send requests through the interceptor. Notably, the heartbeat loop no longer blindly updates the transaction coordinator's transaction proto. There wasn't a strong reason for it to be able to do so, especially now that we no longer push transactions or ratchet their priority frequently. Moreover, even if those were still frequent occurrences, updating the proto from the heartbeat loop prevented usual restart handling from being used. For instance, doing so might prevent us from refreshing the transaction. All in all, allowing this didn't seem worth the complexity. This commit also includes some cleanup. For instance, it removes a confusing dependency where the txnHeartbeater called back into the TxnCoordSender. It also addresses a longstanding TODO to actually unit test the txnHeartbeater. Release note: None
Fixes cockroachdb#39652. Fixes cockroachdb#39661. Fixes cockroachdb#35144. This commit fixes the referenced issues by eliminating the practice of updating the transaction coordinator's proto directly from its heartbeat loop. This was problematic because of the race described in https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364. The heartbeat loop doesn't know if it's racing with an EndTransaction request and it could incorrectly interpret a missing transaction record if it is. The safest thing to do is to limit the path in which it informs the TxnCoordSender of updates. This limits the responsibility of the heartbeat loop. Its job is now only to: 1. update the transaction record to maintain liveness 2. eagerly clean up a transaction if it is found to be aborted 3. inform the transaction coordinator about an aborted transaction record IF the transaction coordinator is continuing to send requests through the interceptor. Notably, the heartbeat loop no longer blindly updates the transaction coordinator's transaction proto. There wasn't a strong reason for it to be able to do so, especially now that we no longer push transactions or ratchet their priority frequently. Moreover, even if those were still frequent occurrences, updating the proto from the heartbeat loop prevented usual restart handling from being used. For instance, doing so might prevent us from refreshing the transaction. All in all, allowing this didn't seem worth the complexity. This commit also includes some cleanup. For instance, it removes a confusing dependency where the txnHeartbeater called back into the TxnCoordSender. It also addresses a longstanding TODO to actually unit test the txnHeartbeater. Release note: None
39699: kv: don't update transaction proto directly from heartbeat loop r=nvanbenschoten a=nvanbenschoten Fixes #39652. Fixes #39661. Fixes #35144. This commit fixes the referenced issues by eliminating the practice of updating the transaction coordinator's proto directly from its heartbeat loop. This was problematic because of the race described in https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364 The heartbeat loop doesn't know if it's racing with an EndTransaction request and it could incorrectly interpret a missing transaction record if it is. The safest thing to do is to limit the path in which it informs the TxnCoordSender of updates. This limits the responsibility of the heartbeat loop. Its job is now only to: 1. update the transaction record to maintain liveness 2. eagerly clean up a transaction if it is found to be aborted 3. inform the transaction coordinator about an aborted transaction record IF the transaction coordinator is continuing to send requests through the interceptor. Notably, the heartbeat loop no longer blindly updates the transaction coordinator's transaction proto. There wasn't a strong reason for it to be able to do so, especially now that we no longer push transactions or ratchet their priority frequently. Moreover, even if those were still frequent occurrences, updating the proto from the heartbeat loop prevented usual restart handling from being used. For instance, doing so might prevent us from refreshing the transaction. All in all, allowing this didn't seem worth the complexity. This commit also includes some cleanup. For instance, it removes a confusing dependency where the txnHeartbeater called back into the TxnCoordSender. It also addresses a longstanding TODO to actually unit test the txnHeartbeater. Co-authored-by: Nathan VanBenschoten <[email protected]>
The following tests appear to have failed on master (testrace): TestConcurrentRaftSnapshots
You may want to check for open issues.
#1437952:
Please assign, take a look and update the issue accordingly.
The text was updated successfully, but these errors were encountered: