kv: Scans with limit scan too much of lockTable #49973
Fixes cockroachdb#49658. Informs cockroachdb#9521. Informs cockroachdb#49973. Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile latency regression observed in cockroachdb#49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible.

If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial contention discussed in cockroachdb#49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like cockroachdb#33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change. I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run.

Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.
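The core of the change described above can be sketched as a tiny decision function. This is a hypothetical simplification, not CockroachDB's actual `lockTable` code: on an Unreplicated-to-Replicated durability upgrade with no queued writers, the lock is dropped from the in-memory table, because the MVCC intent makes it independently discoverable.

```go
// Hypothetical sketch (invented types, not the real lockTable): decide
// whether a lock should remain tracked in the in-memory lock table
// after an acquisition that upgrades its durability.
package main

import "fmt"

type durability int

const (
	unreplicated durability = iota
	replicated
)

type lockState struct {
	durability    durability
	queuedWriters int // writers actively waiting on this lock
}

// acquire reports whether the lock should remain in the lock table after
// the acquisition. A write-uncontended Unreplicated -> Replicated upgrade
// returns false: the MVCC intent alone is enough to rediscover the lock.
func acquire(l *lockState, newDur durability) (keep bool) {
	upgrade := l.durability == unreplicated && newDur == replicated
	l.durability = newDur
	if upgrade && l.queuedWriters == 0 {
		return false // drop: rely on the MVCC intent instead
	}
	return true
}

func main() {
	uncontended := &lockState{durability: unreplicated, queuedWriters: 0}
	contended := &lockState{durability: unreplicated, queuedWriters: 2}
	fmt.Println(acquire(uncontended, replicated)) // dropped from the table
	fmt.Println(acquire(contended, replicated))   // kept: writers are queued
}
```

If a reader later truly contends with the dropped lock, it rediscovers the intent during its MVCC scan and re-adds it (the `AddDiscoveredLock` path mentioned above).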
This is relevant for the separated lock table prototype that I am working on. So I'd like to clarify the approach and correctness.
Do you have a preference? Correctness:
Does this sound right?
I think I prefer the first option for two reasons:
I had to go back and remind myself of that context. Your point about concurrent releases of replicated locks being harmful because of the adjusted provisional value is interesting. We ideally want to get to a point where we can perform intent resolution without acquiring latches, through the use of some indirection. So in that sense, I think the first option you listed above might make it easier to guarantee this, because we can enforce the invariant at a single atomicity point (during iteration) instead of trying to coordinate between what the iterator returns and what we later see in the lock table when we perform the optimistic conflict check. In other words, option 1 still ensures that we check the lock table for a given key before ever observing the corresponding MVCC state; it just does so dynamically while scanning instead of upfront like we do today. Option 2 inverts the order of these two operations, which I suspect will lead to complexity here. Does that line up with your thinking?
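Option 1 can be illustrated with a toy scan loop (all names here are invented for illustration, not the real `lockTable` or MVCC iterator code): the lock check for each key happens at the same point the key's value would be read, so a limited scan never consults lock state beyond the keys it actually returns.

```go
// Toy model of per-key interleaved lock checking during a limited scan.
package main

import "fmt"

type scanner struct {
	keys  []string          // keys in scan order
	vals  map[string]string // committed values
	locks map[string]bool   // keys with conflicting locks
}

// scan returns values up to limit, stopping at the first conflicting
// lock; the caller would then queue on that lock. The lock table is
// consulted for a key immediately before its MVCC state is observed.
func (s *scanner) scan(limit int) (out []string, conflict string) {
	for _, k := range s.keys {
		if len(out) == limit {
			break
		}
		if s.locks[k] { // lock check precedes the read of k
			return out, k
		}
		out = append(out, s.vals[k])
	}
	return out, ""
}

func main() {
	s := &scanner{
		keys:  []string{"a", "b", "c"},
		vals:  map[string]string{"a": "1", "b": "2", "c": "3"},
		locks: map[string]bool{"c": true},
	}
	vals, conflict := s.scan(2)
	// The limit is reached before the lock on "c" is ever consulted,
	// so no conflict is reported.
	fmt.Println(vals, conflict)
}
```

The key property is the single atomicity point per key: the scan never reads MVCC state for a key whose lock-table entry it has not just checked.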
That makes sense. Option 1 it is.
I wasn't thinking about intent resolution without latches -- I'll keep it in mind.
49891: physicalplan: preevaluate subqueries on LocalExprs and always set LocalExprs r=yuzefovich a=yuzefovich

**physicalplan: preevaluate subqueries on LocalExprs** When the plan is local, we do not serialize expressions. Previously, in such a case we would also not preevaluate the subqueries in the expressions, which made `EXPLAIN (VEC)` return an unexpected plan (there would be a `tree.Subquery` in the expression, which we don't support in the vectorized engine, so we would wrap the plan). Now we preevaluate the subqueries before storing them in the processor spec. AFAICT it affects only this explain variant and nothing else. Release note: None

**colexec: improve expression parsing** This commit introduces `colexec.ExprHelper` that helps with expression processing. Previously, we were allocating a new `execinfra.ExprHelper` and calling `Init` on it in order to get the typed expression from the possibly serialized representation of each expression. Now, this new expression helper is reused between all expressions in the flow on a single node. There is one caveat, however: we need to make sure that we force deserialization of the expressions during the `SupportsVectorized` check if the flow is scheduled to be run on a remote node (different from the one that is performing the check). This is necessary to make sure that the remote nodes will be able to deserialize the expressions without encountering errors (if we don't force the serialization during the check, we will use `LocalExpr` - if available - and might not catch things that we don't support). Release note: None

**physicalplan: always store LocalExpr** Previously, we would set either `LocalExpr` (unserialized expression, only when we have the full plan on a single node) or `Expr` (serialized expression, when we have a distributed plan as well as in some tests). However, we could be setting both and making a best effort to reuse the unserialized `LocalExpr` on the gateway even if the plan is distributed. This commit adds such behavior. Fixes: #49810. Release note: None

49966: roachtest: adjust tpchvec and tpcdsvec r=yuzefovich a=yuzefovich

**roachtest: add new tpchvec config** This commit adds a new `tpchvec/perf_no_stats` config that is the same as `tpchvec/perf` except for the fact that stats creation is disabled. The plans without stats are likely to be different, so it gives us an easy way to get more test coverage. One caveat here is that query 9 without stats takes insanely long to run, so some new plumbing has been added to skip that query. Additionally, `tpcdsvec` has been adjusted. The test runs all queries with and without stats present with `on` and `off` vectorize options. However, when stats are not present, the `on` config will be reduced to `off` because of the `vectorize_row_count_threshold` heuristic. This commit disables that heuristic. Release note: None

**roachtest: switch the config order in tpchvec/perf** Let's see whether it makes a difference to the occasional failures of `tpchvec/perf`, which are very hard to explain. This commit also changes the workload command for the `perf` config to run only against node 1, thus eliminating one possible source of "randomness" for the failures. Addresses: #49955. Release note: None

49980: kv/concurrency: drop uncontended replicated lock on unreplicated upgrade r=nvanbenschoten a=nvanbenschoten

Fixes #49658. Informs #9521. Informs #49973. Related to #49684. This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for #49973 and avoids the 99th percentile latency regression observed in #49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible. If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial contention discussed in #49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like #33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change. I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run: #49658. Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.

Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
Yes, we'll need to ensure this for the reasons you listed. I think that's how things would naturally be ordered, anyway, so we shouldn't have to do anything special here.
(There are bugs here, since existing tests fail. And I have not added new tests yet -- I would like an opinion on the high-level approach before fixing those.)

This is a cleanup in preparation for the future, and also has some, probably minor, immediate benefits. In the future, the lock table will support multiple intents for the same key if all but one are known to be finalized. So the finalizedTxnCache belongs in the lock table data-structure. Additionally, we will support intent resolution without holding latches, which has some implications for data-structure consistency: request evaluation will not be allowed to add discovered intents to the lock table, since the discovery may be stale. This PR is not changing this discovery behavior, since we need it for now (due to interleaved intents), but it moves us along the path towards the lock table data-structure not relying on external behavior for maintaining its in-memory "cache" of locks. Specifically, removing intents from the lock table when the intent is still present in the engine is not principled. We currently do this in two places:

- for optimizing limited scans: a later PR will fix this properly by checking the lock table after request evaluation, as outlined in cockroachdb#49973.
- using the finalizedTxnCache in the lockTableWaiterImpl: this use is changed in this PR.

The code in the lock table also does removal of intents before resolution, but there is a TODO to fix that in the future. It should be easier to do this with the behavior contained in the lock table. The immediate benefits, which may not have any practical significance, are:

- We no longer resolve unreplicated locks -- they are simply removed.
- A replicated lock is removed from the lock table data-structure only when the requester has finished a scan and is in a position to do resolution. Earlier, one could remove the lock but block on another lock, and not do intent resolution on the first lock. This would cause wasteful evaluation of other requests.

Release note: None
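A minimal sketch of the idea of housing the finalized-transaction cache inside the lock table (names like `tryClear` are invented, not the real API): when a waiter hits a lock whose holder is known to be finalized, the lock is removed and queued for deferred resolution rather than waited on.

```go
// Hypothetical model of a lock table that consults its own
// finalized-transaction cache instead of relying on the waiter to do so.
package main

import "fmt"

type lockTable struct {
	locks     map[string]string // key -> holder txn ID
	finalized map[string]bool   // finalizedTxnCache: txn ID -> finalized
	toResolve []string          // intents queued for later resolution
}

// tryClear removes the lock on key if its holder is finalized, recording
// it for deferred resolution, and reports whether the waiter must still
// wait on the lock.
func (lt *lockTable) tryClear(key string) (mustWait bool) {
	holder, ok := lt.locks[key]
	if !ok {
		return false // no lock at all
	}
	if lt.finalized[holder] {
		delete(lt.locks, key)
		lt.toResolve = append(lt.toResolve, key)
		return false // cleared; resolve later, once scanning is done
	}
	return true // live holder: the waiter must queue
}

func main() {
	lt := &lockTable{
		locks:     map[string]string{"a": "txn1", "b": "txn2"},
		finalized: map[string]bool{"txn1": true},
	}
	fmt.Println(lt.tryClear("a")) // txn1 finalized: lock cleared, no wait
	fmt.Println(lt.tryClear("b")) // txn2 still live: must wait
	fmt.Println(lt.toResolve)
}
```

Deferring the resolution until the requester has finished its scan matches the second "immediate benefit" above: the lock is only removed when the requester is actually in a position to resolve it.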
57947: concurrency: move finalizedTxnCache into lock table r=sumeerbhola a=sumeerbhola

Informs #41720 Release note: None Co-authored-by: sumeerbhola <[email protected]>
This optimistic checking happens after the evaluation. The evaluation will already discover any conflicting intents, so the subsequent checking of the lock table is primarily to find conflicting unreplicated locks. This structure should be easy to extend to optimistic latching. Fixes cockroachdb#49973 Informs cockroachdb#9521 Release note: None
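The flow described above can be modeled in a few lines (hypothetical API, not the real `concurrency` package): evaluate the limited scan first, then run the lock-table check only over the span the scan actually touched, falling back to pessimistic evaluation on a conflict.

```go
// Toy model of optimistic evaluation for limited scans: the lock-table
// check happens after evaluation and only covers the scanned span.
package main

import "fmt"

type result struct {
	rows        []string
	scannedUpTo string // end of the span actually touched by the scan
}

// evaluate performs a limited scan; assumes limit >= 1 and len(keys) >= 1.
func evaluate(keys []string, limit int) result {
	n := limit
	if n > len(keys) {
		n = len(keys)
	}
	return result{rows: keys[:n], scannedUpTo: keys[n-1]}
}

// conflicts reports whether any lock falls within [start, end].
func conflicts(locks []string, start, end string) bool {
	for _, l := range locks {
		if l >= start && l <= end {
			return true
		}
	}
	return false
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	locks := []string{"d"} // an unreplicated lock beyond the limit
	res := evaluate(keys, 2)
	// Optimistic check is restricted to the scanned span ["a", "b"],
	// so the lock on "d" does not force a retry.
	fmt.Println(conflicts(locks, keys[0], res.scannedUpTo))
}
```

When the check does find a conflict in the scanned span, the request would be re-evaluated pessimistically, which is what the `real-contention=true` microbenchmark case below exercises.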
58670: concurrency,kvserver: limited scans optimistically check for locks r=sumeerbhola a=sumeerbhola

This optimistic checking happens after the evaluation. The evaluation will already discover any conflicting intents, so the subsequent checking of the lock table is primarily to find conflicting unreplicated locks. This structure should be easy to extend to optimistic latching.

Benchmark numbers from the new kv roachtest that does SFU to introduce false contention:

```
name                                            old ops/sec  new ops/sec  delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq  5.68k ± 0%   9.96k ± 1%   +75.46%  (p=0.000 n=8+9)

name                                            old p50      new p50      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq  13.1 ± 0%    5.5 ± 0%     -58.02%  (p=0.000 n=8+8)

name                                            old p95      new p95      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq  18.9 ± 0%    16.8 ± 0%    -11.11%  (p=0.001 n=8+9)

name                                            old p99      new p99      delta
kv95limited-spans/enc=false/nodes=1/splt=0/seq  22.0 ± 0%    25.6 ± 2%    +16.57%  (p=0.000 n=8+9)
```

Microbenchmark numbers for the OptimisticEval microbenchmark, where the real-contention=true case typically causes optimistic evaluation to retry.

```
name                                     old time/op    new time/op    delta
OptimisticEval/real-contention=false-16  5.76ms ± 5%    0.03ms ± 2%    -99.49%  (p=0.000 n=8+10)
OptimisticEval/real-contention=true-16   5.75ms ± 4%    5.63ms ± 5%    ~        (p=0.393 n=10+10)

name                                     old alloc/op   new alloc/op   delta
OptimisticEval/real-contention=false-16  34.3kB ±24%    9.9kB ± 2%     -71.29%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16   33.0kB ±20%    35.7kB ± 9%    ~        (p=0.065 n=8+8)

name                                     old allocs/op  new allocs/op  delta
OptimisticEval/real-contention=false-16  273 ±19%       83 ± 0%        -69.56%  (p=0.000 n=9+10)
OptimisticEval/real-contention=true-16   268 ± 2%       308 ± 2%       +14.90%  (p=0.000 n=8+8)
```

Fixes #49973 Informs #9521

Release note (performance improvement): A limited scan now checks for conflicting locks in an optimistic manner, which means it will not conflict with locks (typically unreplicated locks) that were held in the scan's full spans, but were not in the spans that were scanned until the limit was reached. This behavior can be turned off by changing the value of the cluster setting kv.concurrency.optimistic_eval_limited_scans.enabled to false.

Co-authored-by: sumeerbhola <[email protected]>
This is analogous to #9521, but for the `lockTable`. Now that the `lockTable` is scanned ahead of evaluation, limited scans observe locks over their entire key range instead of just up to the point where they hit their limit. This can artificially increase contention in much the same way it does in #9521. The difference is that in #9521, limited scans will wait excessively long for artificially contending operations to complete. In this issue, limited scans will wait excessively long for the entire transaction that issued the artificially contending operations to complete.

The design of the `lockTable` took this into consideration, though we haven't implemented a solution for this issue yet. Namely, the `lockTable` is built using an immutable btree which supports O(1) snapshots in the same way that the `latchManager` does. The point of this is 1) to be able to scan the `lockTable` incrementally once we introduce a `lockAwareIterator` and 2) to potentially do exactly what we're doing in #33373 and push that prototype over the finish line. #33373 evaluates limited scans optimistically without latching until after the fact, when it has determined the full bounds of the scan. The ability to snapshot the `lockTable` means that we could do the same thing there.

The saving grace for v20.1 is that most intents don't actually end up in the `lockTable`, because replicated locks are only pulled into the `lockTable` when they are discovered in the MVCC keyspace. This means that in most cases, this is a non-issue under low to moderate contention. However, it can become an issue under either high contention or when unreplicated locks are used heavily. A patch I'm about to push will mitigate the latter concern.
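The O(1)-snapshot property mentioned above can be demonstrated with a plain copy-on-write pointer swap (a stand-in for the real immutable btree, with invented names): writers install a fresh tree, so snapshots taken by readers are never mutated.

```go
// Copy-on-write sketch of O(1) snapshots: a snapshot is just a pointer
// capture, and writers replace the tree wholesale instead of mutating it.
package main

import "fmt"

type tree struct{ locks map[string]bool } // stand-in for the immutable btree

type lockTable struct{ current *tree }

// snapshot is O(1): it captures the current immutable tree pointer.
func (lt *lockTable) snapshot() *tree { return lt.current }

// add copies the tree before mutating, leaving existing snapshots intact.
// (A real immutable btree shares unchanged nodes instead of copying all.)
func (lt *lockTable) add(key string) {
	next := &tree{locks: map[string]bool{}}
	for k := range lt.current.locks {
		next.locks[k] = true
	}
	next.locks[key] = true
	lt.current = next
}

func main() {
	lt := &lockTable{current: &tree{locks: map[string]bool{"a": true}}}
	snap := lt.snapshot()
	lt.add("b")
	// The snapshot still sees one lock; the table now holds two.
	fmt.Println(len(snap.locks), len(lt.current.locks))
}
```

This is what would let a `lockAwareIterator` scan a stable view of the lock table incrementally, or let optimistic evaluation (as in #33373) check a consistent snapshot after the fact.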