kv: divergent ideas about whether a txn is "read-only" between the Txn and the txnIntentCollector #28256

andreimatei · 2018-08-03T19:28:58Z

The intent collector interceptor complains if it receives an EndTransaction without having collected any intents beforehand. It returns an error saying that this is a "readonly txn" and that the EndTxn should have been elided above it. That error is supposed to signal bugs.

cockroach/pkg/kv/txn_interceptor_intent_collector.go

Line 102 in 8044dea

return nil, roachpb.NewErrorf("cannot commit a read-only transaction")

Unfortunately, that assumption does not quite match what the layers above it do. There are currently two problems:

1. If the txn ever encountered an error, the Txn moves to the txnError state, at which point it forgets whether it ever performed any writes (i.e. whether it came from state txnReadonly or txnWriting, etc.), and so it will never elide future EndTransactions (rollbacks).
2. If the BeginTxn batch is rejected below the Txn layer, in the TxnCoordSender, then the client.Txn considers the transaction to be writing (because it sent a BeginTxn), but the intent collector considers it read-only (because it never saw said BeginTxn). This happens, for example, if the Stopper is stopped when the BeginTransaction is sent: the TCS fails to start the heartbeat loop and rejects the batch.
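
As a rough illustration of the state-loss problem, here is a minimal toy model. All type and method names below (mergedTxn, splitTxn, canElideEndTxn, and so on) are hypothetical and only mimic the behavior described above; this is not the real client.Txn code, nor the actual change in #28185.

```go
package main

import "fmt"

// txnState is a hypothetical, simplified stand-in for the client-side txn
// state discussed above. It is not the real client.Txn type.
type txnState int

const (
	txnReadOnly txnState = iota
	txnWriting
	txnError
)

// mergedTxn keeps a single state field, so entering txnError overwrites
// the information about whether the txn ever wrote.
type mergedTxn struct{ state txnState }

func (m *mergedTxn) onWrite() {
	if m.state == txnReadOnly {
		m.state = txnWriting
	}
}

func (m *mergedTxn) onError() { m.state = txnError }

// canElideEndTxn can only answer "yes" while the state is still txnReadOnly.
func (m *mergedTxn) canElideEndTxn() bool { return m.state == txnReadOnly }

// splitTxn tracks "wrote" and "failed" independently (an assumed shape of
// a fix, not the actual code in #28185).
type splitTxn struct{ wrote, failed bool }

func (s *splitTxn) onWrite()             { s.wrote = true }
func (s *splitTxn) onError()             { s.failed = true }
func (s *splitTxn) canElideEndTxn() bool { return !s.wrote }

func main() {
	merged, split := &mergedTxn{}, &splitTxn{}
	// A read-only txn hits an error, then wants to roll back.
	merged.onError()
	split.onError()
	fmt.Println("merged state elides rollback:", merged.canElideEndTxn())
	fmt.Println("split state elides rollback:", split.canElideEndTxn())
}
```

With the single merged state, the rollback of a txn that never wrote is no longer elided once an error has been seen; keeping the two facts separate preserves the ability to elide it.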

The first problem goes away in #28185 because that PR breaks apart the different txn states and correctly tracks whether a BeginTxn was ever sent.

The second problem is more fundamental: it is caused by the BeginTxn being tracked separately in both the TCS and the intent collector.
It's not very clear to me what to do about it. One option is to share the "did we send a BeginTxn" state. This is a bit complicated by the fact that, in #28185, the BeginTxn tracking is not done by the TCS directly, but by the heartbeat interceptor, so we'd need to create a communication channel between two interceptors. I think what we'd want is for the txnLockGatekeeper to keep track of whether a BeginTxn is really sent to the server, and for the heartbeat interceptor to use that to dictate whether the EndTransaction can be elided. Then any interceptor anywhere in the stack can retain the right to reject batches (as they tend to do already), and the intent collector can be left alone with its current assumption: if it sees an EndTxn, it must have accumulated some intents.
Alternatively, the intent collector could get its own logic for eliding the EndTransaction, duplicating the existing one.
Yet alternatively, the intent collector could stop assuming anything and conservatively forward unnecessary EndTxns.
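
To make the first option more concrete, here is a small, self-contained sketch of the idea that the bottom of the stack records whether a BeginTxn really went out and the top of the stack uses that to elide rollbacks. Everything in it (toyStack, batch, reqKind, the field names, the collapsed single send method) is a hypothetical simplification for illustration; the real interceptors are separate types with different interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// reqKind and batch are hypothetical stand-ins for requests in a batch.
type reqKind int

const (
	beginTxn reqKind = iota
	put
	endTxn
)

type batch []reqKind

func (b batch) has(k reqKind) bool {
	for _, r := range b {
		if r == k {
			return true
		}
	}
	return false
}

var (
	errRejected = errors.New("batch rejected above the gatekeeper")
	errReadOnly = errors.New("cannot commit a read-only transaction")
)

// toyStack collapses the interceptor stack into one struct (an assumption
// made purely to keep the sketch short).
type toyStack struct {
	rejectAll   bool // a middle layer rejects everything, e.g. Stopper stopped
	intents     int  // intent collector: writes seen so far
	beginSent   bool // gatekeeper: a BeginTxn really went out
	serverCalls int  // gatekeeper: batches that reached the "server"
}

func (s *toyStack) send(b batch) error {
	// Top layer: elide a lone EndTxn when the gatekeeper never sent a BeginTxn.
	if len(b) == 1 && b[0] == endTxn && !s.beginSent {
		return nil
	}
	// Middle layer: any interceptor may still reject a batch.
	if s.rejectAll {
		return errRejected
	}
	// Intent collector: keeps its current assumption about EndTxn.
	if b.has(put) {
		s.intents++
	}
	if b.has(endTxn) && s.intents == 0 {
		return errReadOnly
	}
	// Gatekeeper: last stop before the server, so its view is authoritative.
	if b.has(beginTxn) {
		s.beginSent = true
	}
	s.serverCalls++
	return nil
}

func main() {
	s := &toyStack{rejectAll: true}
	fmt.Println("begin+put:", s.send(batch{beginTxn, put}))
	fmt.Println("rollback:", s.send(batch{endTxn}))
	fmt.Println("batches sent to server:", s.serverCalls)
}
```

In this toy model the rejected BeginTxn never flips beginSent, so the subsequent rollback is elided before it can reach the intent collector, and the collector's assumption is never violated.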

@nvanbenschoten for thoughts, if any.
Thanks @jordanlewis for seeing something and saying something.

The error can be demonstrated with

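func TestXXX(t *testing.T) {
	defer leaktest.AfterTest(t)()

	s, _, db := serverutils.StartServer(t, base.TestServerArgs{})
	ctx := context.Background()
	// Stop the Stopper up front so that the TCS fails to start the heartbeat
	// loop and rejects the batch carrying the BeginTxn.
	s.Stopper().Stop(ctx)

	err := db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
		key := roachpb.Key("a")
		// The Txn considers itself writing after this Put, but the intent
		// collector never sees the BeginTxn.
		return txn.Put(ctx, key, "val")
	})
	// Fail the test with whatever error the txn returned.
	if err != nil {
		t.Fatal(err)
	}
}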

andreimatei · 2018-08-03T21:32:24Z

I've done something in #28185 - the gatekeeper is now in charge of tracking whether we ever sent a BeginTxn. Wasn't that bad.

28185: client, kv: move logic out of Txn, rewrite some of the TxnCoordSender r=andreimatei a=andreimatei

This patch moves most of the logic from the client.Txn into the kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the process.

The split between the client.Txn and the TxnCoordSender caused a lot of grief historically. The main problem is that both the Txn and the TCS each have their own copy of the roachpb.Transaction proto. They both use their copy for different things. We attempt to keep the two protos in sync, but we can't ensure that as there's no common locking between the two layers.

This patch keeps the client.Txn as a mostly stateless shim, allowing one to mock everything underneath. This is nice, as previously "mocking KV" was a less clear proposition - does one mock all the logic in the Txn or just the TCS? Now the TCS has all the logic and all the locking necessary for serializing accesses to the "transaction state" - notably the proto. The Txn and TCS communicate through a (now expanded) client.TxnSender interface.

Within the TCS, the biggest change is that everything that has to do with the heartbeat loop has been moved to a new interceptor. The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are no longer started for (what the TCS hopes will be) 1PC txns. The motivation was concern over the price of spawning a (shortlived) heartbeat goroutine per txn in the 1PC-heavy "kv" workload.

Another one is that the TxnCoordSender doesn't inherit the old Txn logic for swallowing errors on rollbacks. Instead, we're relying on a recent server change to not return errors on rollbacks when the txn record is missing - which was the reason for said swallowing.

Fixes #28256

Release note: none

Co-authored-by: Andrei Matei

andreimatei self-assigned this Aug 3, 2018

andreimatei mentioned this issue Aug 3, 2018

client, kv: move logic out of Txn, rewrite some of the TxnCoordSender #28185

Merged

tbg added this to the 2.1 milestone Aug 8, 2018

craig bot closed this as completed in #28185 Aug 9, 2018
