kvcoord: merge budgets for in-flight writes and lock spans #66915

andreimatei · 2021-06-25T22:18:01Z

Before this patch, the txnPipeliner maintained two different memory
budgets:

kv.transaction.max_intents_bytes - a limit on a txn's lock spans
kv.transaction.write_pipelining_max_outstanding_size - a limit on a
txn's in-flight writes

Besides protecting memory usage, these guys also prevent the commit's
Raft command from becoming too big.

Having two budgets for very related things is unnecessary. In-flight
writes frequently turn into lock spans, and so thinking about how one of
the budget feeds into the other is confusing. The exhaustion of the
in-flight budget also had a hand in this, by turning in-flight writes
into locks immediately.

This patch makes write_pipelining_max_outstanding_bytes a no-op.
max_intent_bytes takes on also tracking in-flight writes. A request
whose async consensus writes would push this budget over the limit is
not allowed to perform async-consensus, on the argument that performing
sync writes is better because the locks from those writes can be
condensed.

This patch is also done with an eye towards offering an option to reject
transactions that are about to go over budget. Having a single budget to
think about makes that conceptually simpler.

Release note (general change): The setting
kv.transaction.write_pipelining_max_outstanding_size becomes a no-op.
Its function is folded into the kv.transaction.max_intents_bytes
setting.

cc @cockroachdb/kv

cockroach-teamcity · 2021-06-25T22:18:10Z

This change is

andreimatei · 2021-06-25T22:18:33Z

The commit kvcoord: refactor: extract pipeliner function is #66908.

andreimatei · 2021-06-25T22:24:39Z

One thing I've been thinking is that we might want to take this occasion to introduce a limit on the number of in-flight keys that we attach to a parallel-commit EndTxn, for stability purposes - an abandoned STAGING txn record forces conflicting txns to perform recovery, which consists of work on the order of record's in-flight writes. This seems like potentially a major unexpected burden on that other txn, externalized by the committing txn. Thoughts?

aliher1911

Does it mean roughly that we will check limit in txn_interceptor_committer.canCommitInParallel and stop that if certain limit is exceeded? That makes sense to me as it will only touch large transactions that has less benefits from parallel commits.

Just some nits from my side regarding readability, otherwise looks good.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)

pkg/kv/kvclient/kvcoord/condensable_span_set.go, line 80 at r3 (raw file):

//
// maxBytes may be zero, or even negative. In that case, each range will be
// maximally condensed.

Since there's no difference between 0 and negative size maybe we can say maxBytes less than 1 will force maximum possible condencing? Otherwise I read it as 0 and negative have some difference.

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, line 390 at r3 (raw file):

"cannot perform async consensus because memory budget limit exceeded"

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner_test.go, line 1424 at r3 (raw file):

// along range boundaries when they exceed the maximum intent bytes threshold.
//
// TODO(andrei): Merge this test into TestTxnPipelinerCondenseLockSpans2, which

Is it a long term TODO? I think it would be helpful to at least hightlight what is the difference between checks they make just in case it fails and someone needs to debug it.

nvanbenschoten

One thing I've been thinking is that we might want to take this occasion to introduce a limit on the number of in-flight keys that we attach to a parallel-commit EndTxn, for stability purposes - an abandoned STAGING txn record forces conflicting txns to perform recovery, which consists of work on the order of record's in-flight writes. This seems like potentially a major unexpected burden on that other txn, externalized by the committing txn. Thoughts?

I thought that's what kv.transaction.write_pipelining_max_batch_size did, but it looks like you're right that it's not. The limits in the pipeline do help bound the effective maximum because we'll only let the inFlightWriteSet get so large, but that doesn't help in the case of a single large batch with an EndTxn. So I agree that we should limit parallel commits to some maximum number of in-flight writes to avoid an unbounded fanout factor during txn recovery. I'm thinking somewhere around 1024 keys.

Does it mean roughly that we will check limit in txn_interceptor_committer.canCommitInParallel and stop that if certain limit is exceeded?

Yes, I think that's right. We'll compare len(et.InFlightWrites) against a new cluster setting.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 4 of 4 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)

pkg/kv/kvclient/kvcoord/condensable_span_set.go, line 170 at r3 (raw file):

}

func (s *condensableSpanSet) asSortedSlice() []roachpb.Span {

Consider moving this to condensable_span_set_test.go with a comment that we'll want to make this more efficient if it ever needs to be used in production code.

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, line 514 at r1 (raw file):

	ctx context.Context, ba roachpb.BatchRequest, br *roachpb.BatchResponse,
) {
	// After adding new writes to the lock footprint, check whether we need to

Did you mean to remove this?

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, line 251 at r2 (raw file):

	}

	ba.AsyncConsensus = tp.canUseAsyncConsensus(ba)

Want to rebase this second commit away?

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, line 35 at r3 (raw file):

	true,
)
var _ = settings.RegisterByteSizeSetting(

Can't we just delete this and add it to retiredSettings?

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, line 340 at r3 (raw file):

// canUseAsyncConsensus checks the conditions necessary for this batch to be
// allowed to set the AsyncConsensus flag.
func (tp *txnPipeliner) canUseAsyncConsensus(ctx context.Context, ba roachpb.BatchRequest) bool {

While we're here, should we short-circuit this method on if _, hasET := ba.GetArg(roachpb.EndTxn); hasET { return false }? Now that this is factored out, it seems like a valuable fast-path.

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go, line 390 at r3 (raw file):

Previously, aliher1911 (Oleg) wrote…

"cannot perform async consensus because memory budget limit exceeded"

Also, does v4 seem too low? I was thinking v2.

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner_test.go, line 81 at r3 (raw file):

//
// iter is the iterator to use for condensing the lock spans. It can be nil, in
// which case the pipeliner will panic if it ever needs to condense.

"needs to condense intent spans"

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner_test.go, line 84 at r3 (raw file):

func makeMockTxnPipeliner(iter condensableSpanSetRangeIterator) (txnPipeliner, *mockLockedSender) {
	mockSender := &mockLockedSender{}
	metrics := MakeTxnMetrics(time.Hour)

nit: just inline these.

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner_test.go, line 583 at r3 (raw file):

	tp, mockSender := makeMockTxnPipeliner(nil /* iter */)

	// Disable write_pipelining_max_outstanding_size, and max_intents_bytes limits.

nit: no comma

pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner_test.go, line 970 at r3 (raw file):

// TestTxnPipelinerMaxInFlightSize tests that batches are not pipelined if doing
// so would push the memory used to track locks and in-flight writes over the
// limit allowed by the kv.transaction.max_intents_bytes.

nit: remove the period and re-wrap.

Lift a defer() into a wrapper function. Release note: None

This patch short-circuits canUseAyncConsensus for EndTxn batches. Release note: None

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @nvanbenschoten)

pkg/kv/kvclient/kvcoord/condensable_span_set.go, line 80 at r3 (raw file):

Previously, aliher1911 (Oleg) wrote…

Since there's no difference between 0 and negative size maybe we can say maxBytes less than 1 will force maximum possible condencing? Otherwise I read it as 0 and negative have some difference.

done

pkg/kv/kvclient/kvcoord/condensable_span_set.go, line 170 at r3 (raw file):