
concurrency,kvserver: limited scans optimistically check for locks #58670

Merged 1 commit into cockroachdb:master from the opt_locks branch on May 20, 2021

Conversation

@sumeerbhola (Collaborator) commented Jan 8, 2021

This optimistic checking happens after the evaluation. The evaluation
will already discover any conflicting intents, so the subsequent
checking of the lock table is primarily to find conflicting
unreplicated locks.

This structure should be easy to extend to optimistic latching.
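A toy sketch of the flow described above, using simplified stand-ins rather than the real kvserver/concurrency APIs (the actual names involved, such as CheckOptimisticNoConflicts and OptimisticEvalConflictsError, come up in the review below):

package main

import "fmt"

// span is a toy half-open key interval [key, endKey).
type span struct{ key, endKey string }

// lockTable is a toy stand-in for the concurrency manager's lock table.
type lockTable struct{ locked map[string]bool }

// conflicts reports whether any locked key falls inside sp.
func (lt *lockTable) conflicts(sp span) bool {
	for k := range lt.locked {
		if k >= sp.key && k < sp.endKey {
			return true
		}
	}
	return false
}

// limitedScan models the optimistic path: evaluate first, then check only the
// span that was actually read (derived from the resume span); on conflict,
// fall back to a pessimistic check over the full declared span. Intents are
// not modeled because MVCC evaluation discovers them regardless.
func limitedScan(lt *lockTable, full span, resumeKey string) string {
	read := span{key: full.key, endKey: resumeKey} // trimmed to what was scanned
	if !lt.conflicts(read) {
		return "optimistic evaluation succeeded"
	}
	// The real code returns OptimisticEvalConflictsError and re-sequences the
	// request pessimistically, waiting on conflicting locks over the full span.
	if lt.conflicts(full) {
		return "pessimistic retry must wait on a conflicting lock"
	}
	return "pessimistic retry succeeded"
}

func main() {
	lt := &lockTable{locked: map[string]bool{"m": true}} // lock beyond the scan limit
	fmt.Println(limitedScan(lt, span{"a", "z"}, "c"))    // optimistic evaluation succeeded
}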

Benchmark numbers from the new kv roachtest that does SFU to
introduce false contention:

name                                           old ops/sec  new ops/sec  delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq   5.68k ± 0%   9.96k ± 1%  +75.46%  (p=0.000 n=8+9)

name                                           old p50      new p50      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq    13.1 ± 0%     5.5 ± 0%  -58.02%  (p=0.000 n=8+8)

name                                           old p95      new p95      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq    18.9 ± 0%    16.8 ± 0%  -11.11%  (p=0.001 n=8+9)

name                                           old p99      new p99      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq    22.0 ± 0%    25.6 ± 2%  +16.57%  (p=0.000 n=8+9)

Microbenchmark numbers for the OptimisticEval microbenchmark,
where the real-contention=true case typically causes optimistic
evaluation to retry.

name                                     old time/op    new time/op    delta
OptimisticEval/real-contention=false-16    5.76ms ± 5%    0.03ms ± 2%  -99.49%  (p=0.000 n=8+10)
OptimisticEval/real-contention=true-16     5.75ms ± 4%    5.63ms ± 5%     ~     (p=0.393 n=10+10)

name                                     old alloc/op   new alloc/op   delta
OptimisticEval/real-contention=false-16    34.3kB ±24%     9.9kB ± 2%  -71.29%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16     33.0kB ±20%    35.7kB ± 9%     ~     (p=0.065 n=8+8)

name                                     old allocs/op  new allocs/op  delta
OptimisticEval/real-contention=false-16       273 ±19%        83 ± 0%  -69.56%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16        268 ± 2%       308 ± 2%  +14.90%  (p=0.000 n=8+8)

Fixes #49973
Informs #9521

Release note (performance improvement): A limited scan now checks
for conflicting locks in an optimistic manner, which means it will
not conflict with locks (typically unreplicated locks) that were
held in the scan's full spans, but were not in the spans that were
scanned until the limit was reached. This behavior can be turned
off by changing the value of the cluster setting
kv.concurrency.optimistic_eval_limited_scans.enabled to false.


@sumeerbhola (Collaborator, Author) left a comment

This is missing tests. I wanted to get your opinion on the structure first.
I used your #33373 in constructing this PR.

If the structure makes sense, I would also need input on what benchmarks to run, in case we are regressing performance due to failed optimistic evaluations. And I wonder whether there should be a cluster setting that can disable this in case a customer has a regression.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten (Member) left a comment

Exciting! This is going with option 2 from #49973 (comment), right? What was the reason for the switch? Just that this approach seemed easier to implement and easier to extend to #9521?

I would also need input on what benchmarks to run, in case we are regressing performance due to failed optimistic evaluations.

I think something like the benchmarks run in #33373 would make sense, though we'd need to modify things a little bit. First, we'd need to make sure that the writing transactions were actually grabbing locks. Moving them from implicit SQL txns to explicit SQL txns would do the trick. Next, we'd need to make sure that these locks ended up in the lock table. For the sake of the experiment, consider removing this condition:

if durability == lock.Replicated {

On that note, how will this change interact with locks that need to be discovered and are not in the lockTable? Is the idea that we simply won't hit them?

And I wonder whether there should be a cluster setting that can disable this in case a customer has a regression.

This sounds like a good idea.

Reviewed 6 of 9 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/replica_read.go, line 81 at r1 (raw file):

	}

	if pErr == nil && g.Req.Optimistic {

The pErr == nil part concerns me. In #49973, we discuss correctness in the face of semantic errors observed during optimistic evaluation. For instance, we may throw a ConditionFailedError that should not otherwise be possible because we peek below a lock. Checking for locking conflicts as we scan would avoid this issue, but we aren't doing that here.


pkg/kv/kvserver/replica_read.go, line 103 at r1 (raw file):

	// NB: For optimistic evaluation, used for limited scans, the update to the
	// timestamp cache limits itself to the spans that were read.
	// TODO(sumeer): why are we updating the ts cache if pErr != nil?

See UpdatesTimestampCacheOnError. There are some semantic errors (the same that I'm concerned about above) that still require us to update the timestamp cache.


pkg/kv/kvserver/replica_send.go, line 407 at r1 (raw file):

			}
		case *roachpb.OptimisticEvalConflictsError:
			optimistic = false

Give this a comment that indicates that you're intentionally not dropping latches.


pkg/kv/kvserver/replica_send.go, line 424 at r1 (raw file):

//   error is handled, re-sequencing through the concurrency manager, and
//   executing the request again.
// - Client-side concurrency retry errors are currently limited to conflicts

What does client-side mean in this context? All of this code is on the server.


pkg/kv/kvserver/replica_send.go, line 452 at r1 (raw file):

		// with an error that will propagate back to the client.
	case *roachpb.OptimisticEvalConflictsError:
	// Optimistic evaluation encountered a conflict. The request will

nit: Indentation


pkg/kv/kvserver/replica_send.go, line 763 at r1 (raw file):

		// in the lock table.
		//
		// The heuristic is limit < k * liveCount, where k <= 1. The use of k=1

Is this k coefficient something you still intend to introduce?
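For reference, the heuristic under discussion boils down to a check like the following (a sketch with k pinned to 1; the coefficient name is hypothetical, and the actual check in replica_send.go is shown in a later excerpt):

// optimisticEvalK is the k coefficient in the heuristic limit < k * liveCount.
// k = 1 for now; a smaller k would make optimistic evaluation more conservative.
const optimisticEvalK = 1

// useOptimisticEval returns true when the batch's key limit is small enough,
// relative to the live keys in its declared spans, that the scan will likely
// stop well before reading its full spans.
func useOptimisticEval(maxSpanRequestKeys, liveCount int64) bool {
	return maxSpanRequestKeys > 0 && maxSpanRequestKeys < optimisticEvalK*liveCount
}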


pkg/kv/kvserver/concurrency/concurrency_control.go, line 181 at r1 (raw file):

	SequenceReq(context.Context, *Guard, Request) (*Guard, Response, *Error)

	// SequenceOptimisticRetryingAsPessimistic is akin to SequenceReq, but used

Is the only difference between SequenceReq and SequenceOptimisticRetryingAsPessimistic that the latter expects the request's latches to already be held while the former does not? Will this still be the case even when we optimistically latch? In that case, how much does it buy us to re-use the existing latches vs. drop latches and re-acquire pessimistically?

I ask because I'm hoping we can avoid a second entrance into this package. If we figured out a way to get rid of this method then we could align the handling of OptimisticEvalConflictsError much more closely with the other errors in executeBatchWithConcurrencyRetries.

If there's still a good reason to hold on to latches across retries, maybe we can make SequenceReq work when latches are already held. Perhaps if the provided guard's Req.Optimistic flag is set?


pkg/kv/kvserver/concurrency/concurrency_control.go, line 381 at r1 (raw file):

}

func (g *Guard) CheckOptimisticNoConflicts(lockSpansRead *spanset.SpanSet) (ok bool) {

nit: I've been putting these Guard methods in concurrency_manager.go.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 145 at r1 (raw file):

		panic("retry should have non-nil guard")
	}
	if g.Req.Optimistic {

I must be misunderstanding something. Shouldn't this be inverted? Shouldn't the guard have been optimistic?

@sumeerbhola (Collaborator, Author) left a comment

TFTR! I'll work on measurements after I fix the tests.

This is going with option 2 from #49973 (comment), right? What was the reason for the switch? Just that this approach seemed easier to implement and easier to extend to #9521?

It is option 1 since the MVCC code will find the intents during evaluation. We additionally need to check the unreplicated locks and that is what happens after evaluation (though CheckOptimisticNoConflicts doesn't care whether a lock is replicated or unreplicated).

On that note, how will this change interact with locks that need to be discovered and are not in the lockTable? Is the idea that we simply won't hit them?

If they need to be discovered the code will see them in the evaluation, add them, and retry pessimistically.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_read.go, line 81 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The pErr == nil part concerns me. In #49973, we discuss correctness in the face of semantic errors observed during optimistic evaluation. For instance, we may throw a ConditionFailedError that should not otherwise be possible because we peek below a lock. Checking for locking conflicts as we scan would avoid this issue, but we aren't doing that here.

There is no change to the MVCC scanning code, so it will find intents.


pkg/kv/kvserver/replica_read.go, line 103 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

See UpdatesTimestampCacheOnError. There are some semantic errors (the same that I'm concerned about above) that still require us to update the timestamp cache.

Thanks. I've enhanced the comment here.


pkg/kv/kvserver/replica_send.go, line 407 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment that indicates that you're intentionally not dropping latches.

Done


pkg/kv/kvserver/replica_send.go, line 424 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What does client-side mean in this context? All of this code is on the server.

I was not thinking clearly. Fixed.

I mildly dislike OptimisticEvalConflictsError being a proto even though it is not going to flow over the network. I'm assuming the others are part of an RPC payload somewhere.


pkg/kv/kvserver/replica_send.go, line 452 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Indentation

Done


pkg/kv/kvserver/replica_send.go, line 763 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this k coefficient something you still intend to introduce?

Not unless we observe regression that can be fixed with a lower k.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 181 at r1 (raw file):

Is the only difference between SequenceReq and SequenceOptimisticRetryingAsPessimistic that the latter expects the request's latches to already be held while the former does not?

Yes, that is the primary difference. There are also some assertions.

Will this still be the case even when we optimistically latch?

I was expecting yes. We've already added the latches to the btree, so we may as well use them and wait until they are "held" for this case (where we are evaluating pessimistically immediately after the optimistic evaluation failed).

In that case, how much does it buy us to re-use the existing latches vs. drop latches and re-acquire pessimistically?

I am wary of adding more work when optimistic evaluation fails.

I ask because I'm hoping we can avoid a second entrance into this package. If we figured out a way to get rid of this method then we could align the handling of OptimisticEvalConflictsError much more closely with the other errors in executeBatchWithConcurrencyRetries.

The change in executeBatchWithConcurrencyRetries is not significant. Is there something particular about it that is bothersome? I am willing to be convinced otherwise.

If there's still a good reason to hold on to latches across retries, maybe we can make SequenceReq work when latches are already held. Perhaps if the provided guard's Req.Optimistic flag is set?

We could make this work if we had an enum in Request that said: PessimisticEval, PessimisticAfterFailedOptimisticEval, OptimisticEval. That would subsume Request.Optimistic and also replace the 2 bools in executeBatchWithConcurrencyRetries. I've gone ahead and made that change.
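A sketch of the enum described above (names as stated here; the exact declaration in the concurrency package may differ):

// RequestEvalKind says how a request should be sequenced. It subsumes the
// earlier Request.Optimistic bool and the two bools that were threaded
// through executeBatchWithConcurrencyRetries.
type RequestEvalKind int

const (
	// PessimisticEval: check latches and locks for the full declared spans
	// before evaluating (the default).
	PessimisticEval RequestEvalKind = iota
	// OptimisticEval: evaluate first, then check only the spans actually read.
	OptimisticEval
	// PessimisticAfterFailedOptimisticEval: a pessimistic retry that can reuse
	// the latches already inserted by the failed optimistic attempt.
	PessimisticAfterFailedOptimisticEval
)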


pkg/kv/kvserver/concurrency/concurrency_control.go, line 381 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I've been putting these Guard methods in concurrency_manager.go.

Done


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 145 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I must be misunderstanding something. Shouldn't this be inverted? Shouldn't the guard have been optimistic?

The retry is pessimistic. This method is used when the preceding attempt was optimistic.
Maybe it should be called SequenceFailedOptimisticRetryingAsPessimistic?

Now removed.

@sumeerbhola sumeerbhola force-pushed the opt_locks branch 4 times, most recently from 95d9b42 to 880ef76 on January 12, 2021 19:44
@sumeerbhola (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_test.go, line 13139 at r2 (raw file):

	s, _, db := serverutils.StartServer(b, args)
	defer s.Stopper().Stop(ctx)

Made-up unrepresentative benchmark. My main goal was to get a feel for the real-contention=true case since the evaluation will typically need to retry.

name                                     old time/op    new time/op    delta
OptimisticEval/real-contention=false-16    5.76ms ± 5%    0.03ms ± 2%  -99.49%  (p=0.000 n=8+10)
OptimisticEval/real-contention=true-16     5.75ms ± 4%    5.63ms ± 5%     ~     (p=0.393 n=10+10)

name                                     old alloc/op   new alloc/op   delta
OptimisticEval/real-contention=false-16    34.3kB ±24%     9.9kB ± 2%  -71.29%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16     33.0kB ±20%    35.7kB ± 9%     ~     (p=0.065 n=8+8)

name                                     old allocs/op  new allocs/op  delta
OptimisticEval/real-contention=false-16       273 ±19%        83 ± 0%  -69.56%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16        268 ± 2%       308 ± 2%  +14.90%  (p=0.000 n=8+8)

@sumeerbhola sumeerbhola force-pushed the opt_locks branch 2 times, most recently from 09f644b to 2d2fdf1 on January 12, 2021 21:42
@sumeerbhola (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/workload/kv/kv.go, line 440 at r4 (raw file):

	start := timeutil.Now()
	var err error
	if o.config.writesUseSelectForUpdate {

Instead of making temporary changes to lockTable to make it remember uncontended replicated locks, I am changing this to do "select for update", since we need to remember unreplicated locks.
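Roughly, the explicit-transaction shape this points at looks like the following (a sketch against the kv workload's schema, written with database/sql for brevity even though the workload itself uses pgx; needs the context and database/sql imports):

// sfuWrite runs an explicit transaction whose SELECT ... FOR UPDATE acquires
// an unreplicated lock on the row before the write. Limited scans whose full
// spans cover this key, but whose limit stops them before reaching it, are
// the "false contention" the roachtest is designed to create.
func sfuWrite(ctx context.Context, db *sql.DB, k int64, v []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit
	if _, err := tx.ExecContext(ctx, `SELECT v FROM kv WHERE k = $1 FOR UPDATE`, k); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, `UPSERT INTO kv (k, v) VALUES ($1, $2)`, k, v); err != nil {
		return err
	}
	return tx.Commit()
}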

@sumeerbhola sumeerbhola force-pushed the opt_locks branch 3 times, most recently from d4a6ca4 to e237568 on January 13, 2021 14:59
@nvanbenschoten (Member) left a comment

Reviewed 4 of 6 files at r2, 1 of 3 files at r3, 1 of 1 files at r4, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/replica_read.go, line 103 at r1 (raw file):

Previously, sumeerbhola wrote…

Thanks. I've enhanced the comment here.

Thanks. Mind mentioning in the second bullet that this is done using the ResumeSpans?


pkg/kv/kvserver/replica_send.go, line 763 at r1 (raw file):

Previously, sumeerbhola wrote…

Not unless we observe regression that can be fixed with a lower k.

Should we define a const k = 1 then?


pkg/kv/kvserver/replica_send.go, line 735 at r5 (raw file):

			cmd.DeclareKeys(desc, ba.Header, inner, latchSpans, lockSpans)
		} else {
			return nil, nil, concurrency.PessimisticEval, errors.Errorf("unrecognized command %s", inner.Method())

nit: can we just return 0 in these error cases?


pkg/kv/kvserver/replica_send.go, line 769 at r5 (raw file):

		limit := ba.Header.MaxSpanRequestKeys
		if limit > 0 && limit < liveCount {
			requestEvalKind = concurrency.OptimisticEval

We aren't actually setting requestEvalKind to concurrency.PessimisticEval, so we're relying on its default value. That might not be the intention.


pkg/kv/kvserver/replica_test.go, line 13139 at r2 (raw file):

Previously, sumeerbhola wrote…

Made-up unrepresentative benchmark. My main goal was to get a feel for the real-contention=true case since the evaluation will typically need to retry.

name                                     old time/op    new time/op    delta
OptimisticEval/real-contention=false-16    5.76ms ± 5%    0.03ms ± 2%  -99.49%  (p=0.000 n=8+10)
OptimisticEval/real-contention=true-16     5.75ms ± 4%    5.63ms ± 5%     ~     (p=0.393 n=10+10)

name                                     old alloc/op   new alloc/op   delta
OptimisticEval/real-contention=false-16    34.3kB ±24%     9.9kB ± 2%  -71.29%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16     33.0kB ±20%    35.7kB ± 9%     ~     (p=0.065 n=8+8)

name                                     old allocs/op  new allocs/op  delta
OptimisticEval/real-contention=false-16       273 ±19%        83 ± 0%  -69.56%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16        268 ± 2%       308 ± 2%  +14.90%  (p=0.000 n=8+8)

This is a very nice validation!


pkg/kv/kvserver/replica_test.go, line 13068 at r5 (raw file):

}

func TestOptimisticEvalRetry(t *testing.T) {

nit: these two tests might be better situated in client_replica_test.go, since they don't need to touch unexported parts of the kvserver package.


pkg/kv/kvserver/replica_test.go, line 13089 at r5 (raw file):

		return txn.Commit(ctx)
	}))
	time.Sleep(10 * time.Millisecond)

Why are these sleeps needed?


pkg/kv/kvserver/replica_test.go, line 13098 at r5 (raw file):

	txn1 := db.NewTxn(ctx, "locking txn")
	_, err = txn1.ScanForUpdate(ctx, "a", "c", 0)

Should we write a second version of this test where we only lock "b" and show that the limited scan does not get stuck because of optimistic evaluation?


pkg/kv/kvserver/concurrency/concurrency_control.go, line 181 at r1 (raw file):

The change in executeBatchWithConcurrencyRetries is not significant. Is there something particular about it that is bothersome? I am willing to be convinced otherwise.

The change you made looks good.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 514 at r5 (raw file):

func (g *Guard) AssertLatches() {
	if !g.HoldingLatches() {
		// TODO: undo debugging change

Note to remember this TODO.


pkg/workload/kv/kv.go, line 440 at r4 (raw file):

Previously, sumeerbhola wrote…

Instead of making temporary changes to lockTable to make it remember uncontended replicated locks, I am changing this to do "select for update", since we need to remember unreplicated locks.

👍

@sumeerbhola (Collaborator, Author) left a comment

I ran with roachprod run sumeer-1610576725-05-n2cpu8:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=64 --splits=1000 --duration=10m0s --span-percent=95 --span-limit=1 --sfu-writes=true {pgurl:1-1} on a single node running crdb, and there isn't any performance difference. At least there is no regression.
I'll try with a higher write fraction, and/or tweak some other things.

Do we need an observable perf improvement on a benchmark as a prerequisite to merging this, or is no regression sufficient, given that a follow-up PR will introduce optimistic latching?

old is master and new is this PR.

name                                        old ops/sec  new ops/sec  delta
kv95limited-spans/enc=false/nodes=1          12.8k ± 1%   12.7k ± 1%    ~     (p=0.161 n=9+9)
kv95limited-spans/enc=false/nodes=1/cpu=32   45.5k ± 1%   45.7k ± 1%  +0.52%  (p=0.029 n=10+10)

name                                        old p50      new p50      delta
kv95limited-spans/enc=false/nodes=1           3.10 ± 0%    3.10 ± 0%    ~     (all equal)
kv95limited-spans/enc=false/nodes=1/cpu=32    1.00 ± 0%    1.00 ± 0%    ~     (all equal)

name                                        old p95      new p95      delta
kv95limited-spans/enc=false/nodes=1           16.5 ± 2%    16.6 ± 2%    ~     (p=0.484 n=10+9)
kv95limited-spans/enc=false/nodes=1/cpu=32    4.50 ± 0%    4.50 ± 0%    ~     (all equal)

name                                        old p99      new p99      delta
kv95limited-spans/enc=false/nodes=1           30.4 ± 0%    30.4 ± 0%    ~     (all equal)
kv95limited-spans/enc=false/nodes=1/cpu=32    10.3 ± 3%    10.5 ± 0%    ~     (p=0.108 n=10+9)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

@nvanbenschoten (Member) commented:

You might want to try removing the --splits=1000 and maybe even disable load-based splitting. The benefit of this change will be most pronounced when false contention takes place on a single Range. Otherwise, the Range boundaries will protect against the false contention because DistSender will short circuit the scan early.

@sumeerbhola (Collaborator, Author) left a comment

Thanks for the suggestion. I tried that and there was still no difference.
Then I went with the extreme of (a) making the span read not specify any WHERE clause so it is reading everything with LIMIT 1, and (b) making the transactional write sleep for 10ms after the SELECT FOR UPDATE. That didn't show any difference either.
I then realized that with the ~500K write requests each writing to one key, and uniformly random keys in the int64 domain, they are not likely to touch the same key more than once (i.e., only inserts, no updates). So SFU will not acquire any locks.
So I changed the writes to write to key 0 initially and then to key 1. Repeated writes to the same key also result in a few "ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD - WriteTooOld ..." errors. Those are now ignored.
Then I discovered the optimistic eval path was not being taken because ba.Header.MaxSpanRequestKeys was equal to 2, which was also the LiveCount. Would you know why MaxSpanRequestKeys would be 2 when the LIMIT is 1 -- is this because there is something else in the same batch?
Then I added more keys for the writers.

Now the workload shows a significant latency difference for reads, as expected. I am not sure what to make of the write latency being worse -- I'll need to run more times to see if it is a stray occurrence. The number of writes in the optimistic case was much higher, so maybe there was more write-write contention. If I am reading correctly, the reads should be scoping their ts cache update to the key they actually read, so there shouldn't be any false read-write contention.
And I'll need to clean up the workload hacks here to make it a reasonable-sounding workload.

optimistic
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         588413          980.7      1.2      0.5      4.7      7.3    151.0  span

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          31046           51.7   1213.1   1208.0   1744.8   1811.9   1946.2  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0             15            0.0    110.5    117.4    142.6    142.6    142.6  write-write-err

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0         619474         1032.5     61.9      0.6     75.5   1543.5   1946.2  

pessimistic
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         537133          895.2     28.2     25.2     65.0    100.7    637.5  span

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          28048           46.7    828.4    805.3   1342.2   1543.5   1811.9  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0              8            0.0     71.8     79.7    100.7    100.7    100.7  write-write-err

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0         565189          942.0     67.9     26.2    151.0   1140.9   1811.9  

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)

@nvanbenschoten (Member) commented:

Then discovered the optimistic eval path was not being taken because ba.Header.MaxSpanRequestKeys was equal to 2, which was also the LiveCount. Would you know why MaxSpanRequestKeys would be 2 when the LIMIT is 1 -- is this because there is something else in the same batch?

Hm, I don't know. One reason might be column families, but there's only a single column family in kv. So I really don't know. Would be interesting to track this back up the stack and figure out why that is.

Now the workload shows a significant latency difference for reads, as expected. I am not sure what to make of the write latency being worse -- I'll need to run more times to see if it is a stray occurrence.

These numbers look more reasonable, though the write latency is extremely high in both cases. I guess that's because of the severe contention.

For the sake of the experiment, and to avoid some of the issues you hit here with SFU on missing keys (which forced you down to a single key for the entire workload), I still think it would be worth removing this condition and then going back to the INSERTs for one of your trials:

if durability == lock.Replicated {

That should avoid write-write contention issues.

Part of why this PR is so important is because I think we will eventually want to remove that condition, which would allow us to get rid of the "lock discovery" mechanism. Does that match your thinking as well or do you think we'll always avoid placing uncontended replicated locks in the lock table?

@sumeerbhola (Collaborator, Author) left a comment

Part of why this PR is so important is because I think we will eventually want to remove that condition, which would allow us to get rid of the "lock discovery" mechanism. Does that match your thinking as well or do you think we'll always avoid placing uncontended replicated locks in the lock table?

I agree that we should remove that condition. But I want to postpone that removal for a future PR. The problem is that maxLocks is a rather crude way of limiting memory, and when we hit that limit we drop all locks. I've added some comments around those code blocks about this.

I cleaned up the benchmark to remove the key generation hacks. It now uses sequential keys with a cycle of 1000, so real contention is low. With this change the overall numbers are sane wrt the 10ms sleep after SFU:

name                                           old ops/sec  new ops/sec  delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq   5.68k ± 0%   9.96k ± 1%  +75.46%  (p=0.000 n=8+9)

name                                           old p50      new p50      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq    13.1 ± 0%     5.5 ± 0%  -58.02%  (p=0.000 n=8+8)

name                                           old p95      new p95      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq    18.9 ± 0%    16.8 ± 0%  -11.11%  (p=0.001 n=8+9)

name                                           old p99      new p99      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq    22.0 ± 0%    25.6 ± 2%  +16.57%  (p=0.000 n=8+9)

The breakdown (from on 1 run) is

pess
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        2984204         4973.7     12.0     13.6     21.0     25.2     75.5  span

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         156473          260.8     16.7     16.3     22.0     28.3     75.5  write

opt
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        5708856         9514.8      5.6      5.2     10.0     13.1     35.7  span

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         301406          502.3     22.2     22.0     29.4     32.5     48.2  write

I plan to ignore the higher write latency for opt given that the throughput for writes is much higher.

I'll start working on

  • unit tests
  • play around with remembering replicated locks and report some numbers.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/replica_read.go, line 103 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks. Mind mentioning in the second bullet that this is done using the ResumeSpans?

Done


pkg/kv/kvserver/replica_send.go, line 763 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we define a const k = 1 then?

Done


pkg/kv/kvserver/replica_send.go, line 735 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: can we just return 0 in these error cases?

Done (I was getting carried away with type conformance in Go).


pkg/kv/kvserver/replica_send.go, line 769 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We aren't actually setting requestEvalKind to concurrency.PessimisticEval, so we're relying on its default value. That might not be the intention.

That was intentional but I agree it is error prone. I've made it explicit now.


pkg/kv/kvserver/replica_test.go, line 13068 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: these two tests might be better situated in client_replica_test.go, since they don't need to touch unexported parts of the kvserver package.

Done


pkg/kv/kvserver/replica_test.go, line 13089 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why are these sleeps needed?

I wasn't sure about our situation with monotonic clocks in tests. I don't want the SFU transaction to slip in earlier than these writes, and the scan to slip in earlier than the SFU transaction.


pkg/kv/kvserver/replica_test.go, line 13098 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we write a second version of this test where we only lock "b" and show that the limited scan does not get stuck because of optimistical evaluation?

Done


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 514 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Note to remember this TODO.

Done

@nvanbenschoten (Member) left a comment

I agree that we should remove that condition. But I want to postpone that removal for a future PR. The problem is that maxLocks is a rather crude way of limiting memory, and when we hit that limit we drop all locks. I've added some comments around those code blocks about this.

I think there was a misunderstanding. I don't at all think that we should remove that condition for real here. I just thought it would be an easy way to test out the benefit of this change without needing to jump through the hoops presented by SFU. But I guess the idea here is to get something that we can actually merge, right?

Regardless, those new benchmarks look very promising! Thanks for running them.

I plan to ignore the higher write latency for opt given that the throughput for writes is much higher.

That seems like a reasonable explanation.

Reviewed 8 of 8 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/client_replica_test.go, line 3838 at r6 (raw file):

	for !done {
		select {
		case err := <-readDone:

Should we assert that the read does not finish before txn1 commits?


pkg/kv/kvserver/replica_read.go, line 260 at r6 (raw file):

		req = req.ShallowCopy()
		req.SetHeader(header)
		baReq.MustSetInner(req)

This line and the next look like they are doing more than they are. Does baCopy.Requests[j].MustSetInner(req) work?


pkg/kv/kvserver/replica_test.go, line 13089 at r5 (raw file):

Previously, sumeerbhola wrote…

I wasn't sure about our situation with monotonic clocks in tests. I don't want the SFU transaction to slip in earlier than these writes, and the scan to slip in earlier than the SFU transaction.

Got it. Clocks in tests should all be monotonic, so this shouldn't be needed.


pkg/workload/kv/kv.go, line 435 at r6 (raw file):

	var err error
	if o.config.writesUseSelectForUpdate {
		var tx *pgx.Tx

This could use the crdb.ExecuteTx helper to handle retryable errors transparently. Then you wouldn't need the tryHandleWriteErr logic.
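For reference, the suggested pattern looks roughly like this (the database/sql flavor of cockroach-go's crdb.ExecuteTx; the workload's pgx-based code would need the corresponding pgx variant, and the table and variables here are only illustrative):

// crdb.ExecuteTx opens a transaction, runs the closure, and transparently
// retries it when CockroachDB reports a retryable error (such as the
// RETRY_WRITE_TOO_OLD errors seen earlier), which is what tryHandleWriteErr
// was doing by hand.
err := crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *sql.Tx) error {
	if _, err := tx.ExecContext(ctx, `SELECT v FROM kv WHERE k = $1 FOR UPDATE`, k); err != nil {
		return err
	}
	_, err := tx.ExecContext(ctx, `UPSERT INTO kv (k, v) VALUES ($1, $2)`, k, v)
	return err
})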

@sumeerbhola sumeerbhola force-pushed the opt_locks branch 2 times, most recently from c6ad473 to 5ce2d60 on May 12, 2021 16:08
@sumeerbhola (Collaborator, Author) left a comment

I've finally come back to this and won't let it slide again.

I've addressed the comments, and done some minor cleanup in the interface: Request no longer stores RequestEvalKind since it should be immutable; instead it is passed explicitly in SequenceReq and remembered in Guard.

There is also a cluster setting to disable optimistic evaluation -- the followup PR to do optimistic latching will be controlled by the same setting.

I'll start working on the remaining unit tests and ping when they are ready.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/client_replica_test.go, line 3838 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we assert that the read does not finish before txn1 commits?

Ah yes. It was an oversight -- the purpose of removedLocks was to do exactly that. Done.


pkg/kv/kvserver/replica_read.go, line 260 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This line and the next look like they are doing more than they are. Does baCopy.Requests[j].MustSetInner(req) work?

Done


pkg/workload/kv/kv.go, line 435 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This could use the crdb.ExecuteTx helper to handle retryable errors transparently. Then you wouldn't need the tryHandleWriteErr logic.

I didn't know about the existence of crdb.ExecuteTx. I've left this unchanged and added a comment with justification.

@sumeerbhola (Collaborator, Author) left a comment

I've added tests to the concurrency package. PTAL

I am unsure whether or how to directly test the changes to executeReadOnlyBatch (no test calls it directly) or executeBatchWithConcurrencyRetries, since I'll need to create some contention too. Are the new higher level tests in client_replica_test.go sufficient?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)

@nvanbenschoten (Member) left a comment

This is looking good! I'm glad we're coming back to it.

Are the new higher level tests in client_replica_test.go sufficient?

I think the new tests are sufficient for what we're trying to test.

Reviewed 21 of 21 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/client_replica_test.go, line 3784 at r7 (raw file):

}

func TestOptimisticEvalRetry(t *testing.T) {

nit: consider adding comments here and on the next test. Also, there's a decent amount of redundancy. Should we make them a single test with a pair of subtests? That would also make it more clear what changes between them.


pkg/kv/kvserver/replica_read.go, line 97 at r7 (raw file):

		// Gather the spans that were read. For now we ignore the latch spans, but
		// when we stop waiting for latches in optimistic evaluation we will use
		// these to check latches first.

Could you add a note here that we distinguish the spans in the request from the spans that were actually read using resume spans in the response?


pkg/kv/kvserver/replica_read.go, line 215 at r7 (raw file):

}

func (r *Replica) collectSpansRead(

Let's give this a comment.
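As a rough illustration of what collectSpansRead does with resume spans (a simplified stand-in covering only a forward scan; the real helper also handles reverse scans, point requests, and header spans):

// spanActuallyRead returns the prefix of a forward scan's request span that
// was read before the limit was reached. A nil resume span means the request
// was fully evaluated; otherwise the resume span covers the untouched suffix,
// so [req.Key, resume.Key) is what was read, and therefore what needs to be
// checked against the lock table and recorded in the timestamp cache.
func spanActuallyRead(req roachpb.Span, resume *roachpb.Span) roachpb.Span {
	if resume == nil {
		return req
	}
	return roachpb.Span{Key: req.Key, EndKey: resume.Key}
}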


pkg/kv/kvserver/replica_send.go, line 33 at r7 (raw file):

	"when true, limited scans are optimistically evaluated in the sense of not checking for "+
		"conflicting locks up front for the full key range of the scan, and instead subsequently "+
		"checked for conflicts only over the key range that was read",

s/checked/checking/


pkg/kv/kvserver/replica_send.go, line 851 at r7 (raw file):

	requestEvalKind = concurrency.PessimisticEval
	if isReadOnly && optimisticEvalLimitedScans.Get(&r.ClusterSettings().SV) {

Now that we're (as of #61583) issuing more point reads, I wonder whether we should gate this logic on cases where we can't guarantee that the MaxSpanRequestKeys won't be exceeded ahead of time. For instance, we probably don't want to optimistically evaluate a single GetRequest with a MaxSpanRequestKeys = 10.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 324 at r7 (raw file):

// evaluation attempt. Optimistic evaluation is used for requests involving
// limited scans, where the checking of locks and latches may be (partially)
// postponed until after evaluation, using Guard.CheckOptimisticNoConflicts.

"after evaluation, once the limit has been applied and the key spans have been constrained,"


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 192 at r7 (raw file):

	for {
		if !holdsLatches {

Is holdsLatches equivalent to g.HoldingLatches()? If so, do we need this variable throughout these methods?


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 217 at r7 (raw file):

				panic("Optimistic locking should not have a non-nil lockTableGuard")
			}
			log.Event(ctx, "scanning lock table for conflicting locks")

"optimistically scanning ..."?


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 530 at r7 (raw file):

	if g.ltg == nil {
		return true
	}

Should we assert that g.EvalKind is what we expect?


pkg/kv/kvserver/concurrency/concurrency_manager_test.go, line 524 at r7 (raw file):

}

func scanRequests(

nit: move this above testReq.


pkg/kv/kvserver/concurrency/lock_table.go, line 463 at r7 (raw file):

	g.spans = spanSet
	g.sa = spanset.NumSpanAccess - 1
	g.ss = spanset.SpanScope(0)

Do we need to set g.index?


pkg/sql/logictest/logic_test.go, line 29 at r7 (raw file):

func TestLogic(t *testing.T) {
	defer leaktest.AfterTest(t)()
	RunLogicTest(t, TestServerArgs{}, "testdata/logic_test/select_for_update")

Did you mean to revert this?

@sumeerbhola (Collaborator, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @yuzefovich)


pkg/kv/kvserver/client_replica_test.go, line 3784 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider adding comments here and on the next test. Also, there's a decent amount of redundancy. Should we make them a single test with a pair of subtests? That would also make it more clear what changes between them.

Added comments for both tests. I've pulled out the setup code. The remaining stuff has small differences (different spans; timer and select in one test) so I think it is better for understandability to not abstract that out.


pkg/kv/kvserver/replica_read.go, line 97 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a note here that we distinguish the spans in the request from the spans that were actually read using resume spans in the response?

Done


pkg/kv/kvserver/replica_read.go, line 215 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's give this a comment.

Done


pkg/kv/kvserver/replica_send.go, line 33 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/checked/checking/

Done


pkg/kv/kvserver/replica_send.go, line 851 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Now that we're (as of #61583) issuing more point reads, I wonder whether we should gate this logic on cases where we can't guarantee that the MaxSpanRequestKeys won't be exceeded ahead of time. For instance, we probably don't want to optimistically evaluate a single GetRequest with a MaxSpanRequestKeys = 10.

Good point. It looks like GetRequests are not permitted based on the comment at

// If a limit is provided, the batch can contain only the following range
// request types:
// - ScanRequest
// - ReverseScanRequest
// - DeleteRangeRequest
// - RevertRangeRequest
// - ResolveIntentRangeRequest
//
// The following two requests types are also allowed in the batch, although
// the limit has no effect on them:
// - QueryIntentRequest
// - EndTxnRequest
which is enforced at
// If a limit is provided, the batch can contain only the following range
// request types:
// - ScanRequest
// - ReverseScanRequest
// - DeleteRangeRequest
// - RevertRangeRequest
// - ResolveIntentRangeRequest
//
// The following two requests types are also allowed in the batch, although
// the limit has no effect on them:
// - QueryIntentRequest
// - EndTxnRequest

I've added another line to the comment below about why this heuristic is quite crude.


pkg/kv/kvserver/replica_test.go, line 13089 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Got it. Clocks in tests should all be monotonic, so this shouldn't be needed.

Removed sleeps


pkg/kv/kvserver/concurrency/concurrency_control.go, line 324 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"after evaluation, once the limit has been applied and the key spans have been constrained,"

Done


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 192 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is holdsLatches equivalent to g.HoldingLatches()? If so, do we need this variable throughout these methods?

Removed


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 217 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"optimistically scanning ..."?

Done


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 530 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we assert that g.EvalKind is what we expect?

Done


pkg/kv/kvserver/concurrency/concurrency_manager_test.go, line 524 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this above testReq.

Done


pkg/kv/kvserver/concurrency/lock_table.go, line 463 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to set g.index?

Good catch. Done. I think it wasn't a real bug since ScanOptimistic had it set to -1 already.


pkg/sql/logictest/logic_test.go, line 29 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to revert this?

oops. Fixed.

I am unsure about what is happening with the test failures in https://teamcity.cockroachdb.com/viewLog.html?buildId=2974289&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test

E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +testdata/logic_test/select_for_update:459: SELECT s, s1 FROM p2 JOIN p1_0 USING (i) ORDER BY s FOR SHARE NOWAIT
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +expected:
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +could not obtain lock on row in interleaved table
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +got:
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +pq: could not obtain lock on row (i)=(5) in p1_0@primary

I added an EXPLAIN ANALYZE, which shows a merge join between the parent and the child, as expected. The lock is held on p1_0, so the error statement is reasonable. Looking at NewLockNotAvailableError, it seems that the "could not obtain lock on row in interleaved table" error is used when there is no table descriptor. Nothing should have changed in this PR to affect that.

@yuzefovich for suggestions on how to get to the bottom of this.

@yuzefovich (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/sql/logictest/logic_test.go, line 29 at r7 (raw file):

Previously, sumeerbhola wrote…

oops. Fixed.

I am unsure about what is happening with the test failures in https://teamcity.cockroachdb.com/viewLog.html?buildId=2974289&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test

E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +testdata/logic_test/select_for_update:459: SELECT s, s1 FROM p2 JOIN p1_0 USING (i) ORDER BY s FOR SHARE NOWAIT
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +expected:
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +could not obtain lock on row in interleaved table
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +got:
E210512 22:03:38.412844 22946 sql/logictest/logic.go:3556  [-] 7 +pq: could not obtain lock on row (i)=(5) in p1_0@primary

I added an EXPLAIN ANALYZE, which shows a merge join between the parent and the child, as expected. The lock is held on p1_0, so the error statement is reasonable. Looking at NewLockNotAvailableError, it seems that the "could not obtain lock on row in interleaved table" error is used when there is no table descriptor. Nothing should have changed in this PR to affect that.

@yuzefovich for suggestions on how to get to the bottom of this.

The way merge join works is that it reads some rows from the left input, then some from the right input. It's possible that the order in which the conflicting locks are encountered is non-deterministic, so we might hit an error on the left input or on the right input. My intuition is to adjust the error matching pattern to allow for both errors since it doesn't seem important to me to get this error perfect.

Here is a diff with which I had 40 successful runs of the logic test on your branch:

diff --git a/pkg/sql/logictest/testdata/logic_test/select_for_update b/pkg/sql/logictest/testdata/logic_test/select_for_update
index 5c697d8fcf..251eab1eac 100644
--- a/pkg/sql/logictest/testdata/logic_test/select_for_update
+++ b/pkg/sql/logictest/testdata/logic_test/select_for_update
@@ -433,7 +433,7 @@ SELECT * FROM p1_0 FOR SHARE NOWAIT
 query error pgcode 55P03 could not obtain lock on row in interleaved table
 SELECT * FROM p1_1 FOR SHARE NOWAIT
 
-query error pgcode 55P03 could not obtain lock on row \(i\)=\(7\) in p2@primary
+query error pgcode 55P03 could not obtain lock on row (\(i\)=\(7\) in p2@primary|in interleaved table)
 SELECT s, s1 FROM p2 JOIN p1_0 USING (i) ORDER BY s FOR SHARE NOWAIT
 
 user root
@@ -456,7 +456,7 @@ SELECT * FROM p1_0 FOR SHARE NOWAIT
 query error pgcode 55P03 could not obtain lock on row in interleaved table
 SELECT * FROM p1_1 FOR SHARE NOWAIT
 
-query error pgcode 55P03 could not obtain lock on row in interleaved table
+query error pgcode 55P03 could not obtain lock on row (\(i\)=\(5\) in p1_0@primary|in interleaved table)
 SELECT s, s1 FROM p2 JOIN p1_0 USING (i) ORDER BY s FOR SHARE NOWAIT
 
 user root

It's possible that we have some kind of bug in attempting to guess the table descriptor which the conflicting key belongs to, but because both row-by-row and vectorized engines can hit this flake, the bug is probably pretty deep. I'm not sure whether it is worth getting to the bottom of this since KeyToDescTranslator interface seems like "best effort" to me.

@sumeerbhola (Collaborator, Author) commented:

Thanks Yahor!

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 14 of 14 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/replica_send.go, line 851 at r7 (raw file):

Previously, sumeerbhola wrote…

Good point. It looks like GetRequests are not permitted based on the comment at

// If a limit is provided, the batch can contain only the following range
// request types:
// - ScanRequest
// - ReverseScanRequest
// - DeleteRangeRequest
// - RevertRangeRequest
// - ResolveIntentRangeRequest
//
// The following two requests types are also allowed in the batch, although
// the limit has no effect on them:
// - QueryIntentRequest
// - EndTxnRequest
which is enforced at
// If a limit is provided, the batch can contain only the following range
// request types:
// - ScanRequest
// - ReverseScanRequest
// - DeleteRangeRequest
// - RevertRangeRequest
// - ResolveIntentRangeRequest
//
// The following two requests types are also allowed in the batch, although
// the limit has no effect on them:
// - QueryIntentRequest
// - EndTxnRequest

I've added another line to the comment below about why this heuristic is quite crude.

I think that comment is stale. GetRequest learned how to properly handle MaxSpanRequestKeys in #61583.

@sumeerbhola (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/replica_send.go, line 851 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think that comment is stale. GetRequest learned how to properly handle MaxSpanRequestKeys in #61583.

I put in the wrong link for where the enforcement was happening -- I had intended dist_sender.go. But on reading again, I realized I overlooked the GetRequest in the switch statement.
I've fixed the code comment in api.proto and changed the logic here. PTAL

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 5 of 5 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

@sumeerbhola (Collaborator, Author) commented:

TFTR!

@sumeerbhola (Collaborator, Author) commented:

bors r+

craig bot (@craig, Contributor) commented May 20, 2021

Build succeeded.

craig bot merged commit 66336f2 into cockroachdb:master on May 20, 2021
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Jul 1, 2021
In cockroachdb#58670 we overloaded `splits: -1` to also disable load-based
splitting. This rather dramatically changes the req/sec these tests
can achieve, producing worrying drops in our historical performance
monitoring.

This adds a new option for disabling load splitting so that we can
keep the behaviour of the previous tests.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 5, 2021
67107: roachtest: re-enable load-splitting for some kv tests r=tbg a=stevendanna

In #58670 we overloaded `splits: -1` to also disable load-based
splitting. This rather dramatically changes the req/sec these tests
can achieve, producing worrying drops in our historical performance
monitoring.

This adds a new option for disabling load splitting so that we can
keep the behaviour of the previous tests.

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Successfully merging this pull request may close these issues.

kv: Scans with limit scan too much of lockTable