kvcoord: `EndTxn` elision is vulnerable to race conditions #65587
Labels
A-kv-transactions: Relating to MVCC and the transactional model.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
The transaction coordinator elides `EndTxn` requests for transactions that have not acquired any locks in at least two places: https://github.com/cockroachdb/cockroach/blob/10c71993d2d339119aa26c4eeefc7025aac2bfe7/pkg/kv/kvclient/kvcoord/txn_coord_sender.go#L478-L480 and cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go, lines 130 to 135 in ee23325.
However, this logic is incorrect and vulnerable to race conditions, because the `txnPipeliner`, which is responsible for keeping track of locks and in-flight writes, only updates this state after the write requests have completed and the responses have been received: cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, lines 254 to 259 in 1bfd8cc.
Thus, if an `EndTxn` request is sent while the write requests are in flight, the coordinator can incorrectly elide the `EndTxn` request despite the transaction having taken out locks and written intents.

Luckily, I don't believe this affects commits, because these are synchronous (i.e. the client will await the result of the previous command before sending the `EndTxn(commit=true)`). This is even enforced in the `txnGatekeeper`: cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go, lines 66 to 80 in 73154b3.
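To illustrate the kind of check referred to above, here is a minimal sketch of a concurrency gate. It is not the real `txnGatekeeper` code: the `gatekeeper` and `request` types, the `asyncExempt` flag, and the `allowConcurrent` field are hypothetical names chosen for this example, and the assumption is that heartbeats and rollbacks are exempt from the check while commits are not.

```go
package main

import (
	"errors"
	"fmt"
)

// request is a hypothetical stand-in for a batch sent through the coordinator.
type request struct {
	name        string
	asyncExempt bool // assumed true for heartbeats and EndTxn(commit=false)
}

// gatekeeper is a simplified model of a concurrent-use check.
// It is an illustration only, not the real txnGatekeeper.
type gatekeeper struct {
	allowConcurrent bool // assumed to disable the check entirely
	inFlight        bool
}

var errConcurrentUse = errors.New("concurrent txn use detected")

// begin admits a request, or rejects it if another non-exempt request
// is already in flight.
func (g *gatekeeper) begin(r request) error {
	if g.allowConcurrent || r.asyncExempt {
		return nil
	}
	if g.inFlight {
		return errConcurrentUse
	}
	g.inFlight = true
	return nil
}

// end marks a previously admitted, non-exempt request as finished.
func (g *gatekeeper) end(r request) {
	if !g.allowConcurrent && !r.asyncExempt {
		g.inFlight = false
	}
}

func main() {
	g := &gatekeeper{}
	put := request{name: "Put"}
	_ = g.begin(put) // the Put is now in flight

	// A commit while the Put is in flight is rejected...
	fmt.Println(g.begin(request{name: "EndTxn(commit=true)"}))
	// ...but an exempt rollback is let through.
	fmt.Println(g.begin(request{name: "EndTxn(commit=false)", asyncExempt: true}))
}
```

Under these assumptions, a commit can never overlap an in-flight write, whereas a rollback can.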
However, this does not apply to `EndTxn(commit=false)`, i.e. rollbacks, which can be sent asynchronously, e.g. when the `txnHeartbeater` finds an aborted txn or when the client disconnects and cancels the context. This vulnerability has been confirmed with a test, where `finalizeNonLockingTxnLocked()` is called even though a `Put` request is sent before the `EndTxn`.

At the very least, this affects cleanup of aborted transactions. However, there may well be other effects of this (or related issues) in more complex scenarios that I haven't discovered yet. For example, I see that the concurrent request check above is disabled for leaf transactions.
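The race can be modelled in a few lines. The following is a deterministic sketch, not the test mentioned above: `racyCoordinator`, `sendWrite`, and `rollbackElided` are hypothetical names, and the only behaviour assumed is the one described in this issue, namely that the lock footprint is updated when the write response arrives rather than when the write is sent.

```go
package main

import "fmt"

// racyCoordinator is a hypothetical, heavily simplified model of the
// coordinator state discussed above; it is not the real TxnCoordSender.
type racyCoordinator struct {
	lockFootprint []string // keys recorded only once a write response arrives
}

// sendWrite models dispatching a write. The returned function models the
// response arriving; only then is the key added to the lock footprint.
func (c *racyCoordinator) sendWrite(key string) (onResponse func()) {
	return func() { c.lockFootprint = append(c.lockFootprint, key) }
}

// rollbackElided reports whether a rollback would skip sending EndTxn
// because the coordinator believes no locks were acquired.
func (c *racyCoordinator) rollbackElided() bool {
	return len(c.lockFootprint) == 0
}

func main() {
	c := &racyCoordinator{}
	onResponse := c.sendWrite("a") // Put is sent, response still pending

	// Rollback arrives while the Put is in flight: wrongly elided.
	fmt.Println("elided while Put in flight:", c.rollbackElided())

	onResponse() // Put response arrives, footprint is updated

	// The same check after the response gives the correct answer.
	fmt.Println("elided after Put response:", c.rollbackElided())
}
```

In this model the first check reports `true` even though a write has already been sent, which is the incorrect elision described above.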
As part of fixing #65458 (which has the same root cause), I am implementing an `EndTxn` barrier interceptor that blocks until pending requests complete. Moving the `finalizeNonLockingTxnLocked` mechanism into or below this barrier interceptor should avoid the race.

/cc @nvanbenschoten @aliher1911
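As a rough illustration of the barrier idea, the sketch below waits for pending writes before deciding whether to elide. It is not the interceptor being implemented for #65458: `barrierCoordinator` and its methods are hypothetical, and a `sync.WaitGroup` is just one assumed way to express "block until pending requests complete".

```go
package main

import (
	"fmt"
	"sync"
)

// barrierCoordinator is a hypothetical model in which the elision decision
// is made only after all pending writes have completed.
type barrierCoordinator struct {
	mu      sync.Mutex
	locks   []string       // keys recorded once a write response arrives
	pending sync.WaitGroup // counts writes that are still in flight
}

// sendWrite starts a write and returns immediately. The write "completes"
// when release is closed, at which point the key is recorded.
func (c *barrierCoordinator) sendWrite(key string, release <-chan struct{}) {
	c.pending.Add(1)
	go func() {
		defer c.pending.Done()
		<-release
		c.mu.Lock()
		c.locks = append(c.locks, key)
		c.mu.Unlock()
	}()
}

// rollbackElided first blocks until pending writes finish (the barrier),
// then reports whether EndTxn could be skipped.
func (c *barrierCoordinator) rollbackElided() bool {
	c.pending.Wait()
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.locks) == 0
}

func main() {
	c := &barrierCoordinator{}
	release := make(chan struct{})
	c.sendWrite("a", release)

	// The rollback is issued while the write is still in flight.
	done := make(chan bool)
	go func() { done <- c.rollbackElided() }()

	close(release) // the write response arrives
	fmt.Println("EndTxn elided:", <-done)
}
```

Because the check runs below the barrier, it sees the completed write and does not elide. A real interceptor would presumably also need to keep new requests from starting once the barrier is reached, which this sketch does not model.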