kv,client,storage: rationalize TxnCoordSender/client.Txn redundant states #25541

andreimatei · 2018-05-16T03:15:59Z

client.Txn and TCS try to maintain a bunch of logically-redundant state
about whether a transaction is "writing" - essentially whether an
EndTransaction needs to be sent to clean up the TCS heartbeat loop and
the server's txn record.
The logic that both parties used for this was complex (e.g. it involved
updates in both Txn and TCS both on the outgoing path and on the
returning path of a batch) and not in sync - sometimes the TCS would
consider the txn as "writing" and the client.Txn wouldn't (e.g. in case
the first writing batch got an ambiguous error).

This patch simplifies things: the idea is that, if a BeginTxn has been
sent, an EndTransaction needs to be sent, period. The client.Txn thus
only keeps track of whether a BeginTxn was sent (except for a 1PC
batch), and it takes charge of starting the TCS' heartbeat loop (by
explicitly instructing the TCS to start it before the BeginTxn is
sent). The TCS is no longer burdened with maintaining any state about
whether there is a txn record or not.
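
A minimal sketch of the ordering described above, not the code in this patch. The `coordinator` and `txn` types and their methods are hypothetical stand-ins for the TCS and client.Txn. In the real code the BeginTxn travels in the same batch as the first write, and the 1PC exception is left out here.

```go
package main

import "fmt"

// coordinator is a hypothetical stand-in for the TCS. It only records what
// it is asked to do, so that the ordering is visible.
type coordinator struct {
	events []string
}

func (c *coordinator) startHeartbeat() { c.events = append(c.events, "start heartbeat") }
func (c *coordinator) send(req string) { c.events = append(c.events, "send "+req) }

// txn is a hypothetical stand-in for client.Txn. Its only write-related
// state is whether a BeginTxn has been sent.
type txn struct {
	tc           *coordinator
	beginTxnSent bool
}

// write sends a writing request. On the first write, the client tells the
// coordinator to start heartbeating before the BeginTxn goes out.
func (t *txn) write(req string) {
	if !t.beginTxnSent {
		t.tc.startHeartbeat()
		t.tc.send("BeginTxn")
		t.beginTxnSent = true
	}
	t.tc.send(req)
}

// finish sends an EndTransaction if and only if a BeginTxn was sent.
func (t *txn) finish() {
	if t.beginTxnSent {
		t.tc.send("EndTransaction")
	}
}

// run executes the given writes in a fresh txn and returns the recorded
// events in order.
func run(writes ...string) []string {
	tc := &coordinator{}
	t := &txn{tc: tc}
	for _, w := range writes {
		t.write(w)
	}
	t.finish()
	return tc.events
}

func main() {
	fmt.Println(run("Put", "Put"))
	fmt.Println(len(run()))
}
```

A txn that never writes records no events at all in this sketch, which is the point: no BeginTxn, so no heartbeat loop and no EndTransaction.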

As a byproduct, the proto Transaction.Writing flag, which used to have
an unclear meaning, becomes straightforward: if set, the server needs
to check batches against the abort cache. The client is the only one
setting it, the server is the only one checking it. It used to be used
for different purposes by both the client and server.

Release note: none

bdarnell · 2018-05-16T20:24:58Z

LGTM, but I'm concerned about how this will work in mixed-version clusters. Is it safe to just make the change? (It might be. It's hard to tell how much we were relying on the server-set Writing flag before)

Reviewed 19 of 20 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.

pkg/internal/client/txn.go, line 658 at r2 (raw file):

// readOnlyRes is returned by maybeFinishReadonly, informing the caller of what
// cleanup needs to take place.
type readOnlyRes int

Should we just reuse the txnState enum for this? If we do need a new enum, we should give them all a common prefix - writing is a pretty generic name to use at package scope. And a value of writing for readOnlyRes is an odd name.

pkg/sql/sem/tree/stmt.go, line 652 at r2 (raw file):

func (*ShowCreateTable) hiddenFromStats()                   {}
func (*ShowCreateTable) independentFromParallelizedPriors() {}

Removing this from ShowCreateTable without also removing it from ShowCreateView and other statements seems like a bad idea. @nvanbenschoten should weigh in on why the show statements were exempted here.

pkg/storage/batcheval/cmd_begin_transaction.go, line 134 at r2 (raw file):

	// Write the txn record.
	reply.Txn.Writing = true

I don't think it's safe to remove this without a cluster version check. Won't 2.0 nodes rely on this flag being set in BeginTransaction?

Comments from Reviewable

andreimatei · 2018-05-16T22:25:46Z

I think you're right; I've added back the server-side code that sets the flag, gated on the new cluster version not being active.
PTAL

Review status: 17 of 26 files reviewed at latest revision, 3 unresolved discussions.

pkg/internal/client/txn.go, line 658 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Should we just reuse the txnState enum for this? If we do need a new enum, we should give them all a common prefix - writing is a pretty generic name to use at package scope. And a value of writing for readOnlyRes is an odd name.

ok, shuffled things around and this is gone

pkg/sql/sem/tree/stmt.go, line 652 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Removing this from ShowCreateTable without also removing it from ShowCreateView and other statements seems like a bad idea. @nvanbenschoten should weigh in on why the show statements were exempted here.

ok, added a commit removing most of these

pkg/storage/batcheval/cmd_begin_transaction.go, line 134 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think it's safe to remove this without a cluster version check. Won't 2.0 nodes rely on this flag being set in BeginTransaction?

put it back behind a version check
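
Roughly, as a sketch of the gating rather than the actual diff: `transaction`, `evalBeginTxn` and the `newVersionActive` parameter are hypothetical names, and the boolean stands in for whatever cluster version check the patch really uses.

```go
package main

import "fmt"

// transaction is a hypothetical stand-in for the proto Transaction; only the
// Writing flag matters for this sketch.
type transaction struct {
	Writing bool
}

// evalBeginTxn returns the txn to put in the reply. While the new cluster
// version is not yet active, the server keeps setting Writing, since older
// nodes may still rely on it. Once the version is active, the flag is left
// exactly as the client set it.
func evalBeginTxn(txn transaction, newVersionActive bool) transaction {
	if !newVersionActive {
		txn.Writing = true
	}
	return txn
}

func main() {
	fmt.Println(evalBeginTxn(transaction{}, false).Writing)
	fmt.Println(evalBeginTxn(transaction{}, true).Writing)
}
```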

nvanbenschoten · 2018-05-17T18:10:13Z

LGTM (mod comments), since it now looks like you're handling things correctly with mixed-version clusters.

Reviewed 16 of 20 files at r1, 10 of 10 files at r3, 3 of 3 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.

pkg/internal/client/sender.go, line 112 at r3 (raw file):

// as TxnSenders with GetMeta or AugmentMeta panicing with unimplemented.
// This is a helper mechanism to facilitate testing.
type TxnSenderFunc struct {

I'd keep this as is and add a second TxnSenderStruct type (maybe with a different name). The <InterfaceName>Func pattern is very common.

pkg/internal/client/txn.go, line 72 at r3 (raw file):

		// be initially set when the transaction sends its first batch, but is
		// reset if the transaction is aborted.
		active bool

Did you look into merging this status into the txnState enum?

pkg/internal/client/txn.go, line 952 at r3 (raw file):

		needBeginTxn = haveTxnWrite && (txn.mu.state != txnWriting)
		// We need the EndTxn if we're ever written before or if we're writing now.

s/we're/we've/

pkg/internal/client/txn.go, line 953 at r3 (raw file):

		needBeginTxn = haveTxnWrite && (txn.mu.state != txnWriting)
		// We need the EndTxn if we're ever written before or if we're writing now.
		needEndTxn := txn.mu.state != txnReadOnly || haveTxnWrite

nit: flip this around so that it's easier to read and compare to the previous line:

needEndTxn := haveTxnWrite || (txn.mu.state != txnReadOnly)
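
The flip is purely cosmetic. A quick sketch that checks both forms agree on every input, assuming for illustration that only the two states named above exist (the real txnState enum may have more values):

```go
package main

import "fmt"

// txnState here is a two-value stand-in for the enum in txn.go.
type txnState int

const (
	txnReadOnly txnState = iota
	txnWriting
)

// needEndTxnOriginal is the expression as written at r3.
func needEndTxnOriginal(state txnState, haveTxnWrite bool) bool {
	return state != txnReadOnly || haveTxnWrite
}

// needEndTxnFlipped is the suggested form, with the operands in the same
// order as the needBeginTxn line.
func needEndTxnFlipped(state txnState, haveTxnWrite bool) bool {
	return haveTxnWrite || (state != txnReadOnly)
}

// formsAgree exhaustively compares the two forms.
func formsAgree() bool {
	for _, s := range []txnState{txnReadOnly, txnWriting} {
		for _, w := range []bool{false, true} {
			if needEndTxnOriginal(s, w) != needEndTxnFlipped(s, w) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(formsAgree())
}
```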

pkg/kv/dist_sender_server_test.go, line 2529 at r3 (raw file):

				b := txn.NewBatch()
				b.CPut("a", "cput", "value")
				err := txn.CommitInBatch(ctx, b)

?

pkg/kv/txn_coord_sender.go, line 979 at r3 (raw file):

			log.VEventf(ctx, 2, "transaction heartbeat stopped: %s", ctx.Err())

			// Check if the closer channel had also been closed; in that case, that

Why do we suddenly need this?

pkg/kv/txn_coord_sender.go, line 1288 at r3 (raw file):

	// the current transaction timestamp.
	//
	// A tricky edge case is that of a transaction which "fails" on the

💯

pkg/kv/txn_coord_sender.go, line 1314 at r3 (raw file):

		tc.appendAndCondenseIntentsLocked(ctx, ba, br)

		// Initialize the first update time and maybe start the heartbeat.

:101:

pkg/roachpb/batch.go, line 409 at r3 (raw file):

		} else if et, ok := req.(*EndTransactionRequest); ok {
			h := req.Header()
			str = append(str, fmt.Sprintf("%s(commit:%t) [%s,%s)", req.Method(), et.Commit, h.Key, h.EndKey))

Will an EndTransactionRequest ever have an EndKey?

pkg/roachpb/data.proto, line 338 at r3 (raw file):

  // Transaction.UpdateObservedTimestamp to maintain the sorted order.
  repeated ObservedTimestamp observed_timestamps = 8 [(gogoproto.nullable) = false];
  // Writing is true if the transaction has previously executed a successful

This first sentence is no longer accurate.

andreimatei · 2018-05-17T18:42:58Z

Review status: 22 of 26 files reviewed at latest revision, 13 unresolved discussions.

pkg/internal/client/sender.go, line 112 at r3 (raw file):