
storage: evaluate limited scans optimistically without latching #33373

Conversation

nvanbenschoten (Member)

Fixes #9521.
Supersedes #31904.

SQL has a tendency to create scans that cover a range's entire key span even though they only need to return a limited number of results. These requests end up blocking on writes that hold latches over keys the scan will never actually touch. In reality, when a scan has a limit, the affected key span usually ends up being a small subset of the range's key span.

This change creates a new "optimistic evaluation" path for read-only requests. When evaluating optimistically, the batch sequences itself with the latch manager but does not wait to acquire all of its latches. Instead, it begins evaluating immediately and verifies after the fact that it would not have needed to wait on any latch acquisitions. This verification uses knowledge of the limited set of keys that the scan actually looked at. If there are no conflicts, the scan succeeds. If there are conflicts, the request waits for all of its latch acquisition attempts to finish and re-evaluates.

This PR replaces #31904. The major difference between the two is that this PR exploits the structure of the latch manager to efficiently perform optimistic latch acquisition and after-the-fact verification of conflicts. Doing this requires keeping no extra state because it uses the immutable snapshots that the latch manager now captures during sequencing. The other major difference is that this PR does not release latches after a failed optimistic evaluation.
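
For illustration, here is a rough sketch of that evaluation flow. The interfaces and names are hypothetical stand-ins, not the actual storage package APIs:

```go
package sketch

// latchGuard stands in for the state the latch manager hands back at
// sequencing time (a snapshot of the latches the request would conflict with).
type latchGuard interface {
	// Wait blocks until all of the request's latch acquisitions have completed.
	Wait()
	// Conflicted reports whether any conflicting latch captured in the snapshot
	// taken at sequencing time overlaps the given key span.
	Conflicted(startKey, endKey string) bool
}

// evalResult carries the response and the key span the limited scan actually
// read, which is typically much smaller than the declared span.
type evalResult struct {
	resp                     string
	touchedStart, touchedEnd string
}

// evalOptimistically sequences without waiting, evaluates immediately, and
// falls back to a pessimistic re-evaluation only if the keys that were
// actually read overlap a latch the request skipped over.
func evalOptimistically(sequence func() latchGuard, eval func() evalResult) string {
	g := sequence() // sequence with the latch manager, but do not block
	res := eval()   // evaluate immediately
	if !g.Conflicted(res.touchedStart, res.touchedEnd) {
		return res.resp // no conflicting latch over the keys we touched
	}
	g.Wait()           // wait for all latch acquisitions to finish
	return eval().resp // re-evaluate pessimistically
}
```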

NOTE: a prevalent theory about the pathological case addressed in this PR was that overestimated read latches would serialize with write latches, and write latches would serialize with overestimated read latches, causing all requests on a range to serialize. I wasn't seeing this "chaining" occur in practice. It turns out that the "timestamp awareness" in the latch manager should avoid this behavior in most cases: later writes will have higher timestamps than earlier reads, so the writes won't wait on the reads to execute. An argument could be made that because the hypothesized serialization of requests doesn't actually exist, the added complexity of this change isn't justified. I'm curious what reviewers think.
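
For reference, the timestamp-aware conflict rule referred to above amounts to something like the following simplified sketch (placeholder types; the real latch manager operates on roachpb spans and hlc timestamps):

```go
package sketch

// latch is a placeholder for a latched key span at a timestamp.
type latch struct {
	startKey, endKey string // [startKey, endKey)
	write            bool
	ts               int64
}

func overlaps(a, b latch) bool {
	return a.startKey < b.endKey && b.startKey < a.endKey
}

// conflicts reports whether two latches over overlapping spans must serialize.
// Reads never conflict with each other, writes always conflict with writes,
// and a read and a write only conflict if the write is at or below the read's
// timestamp, since a write above the read's timestamp cannot change what the
// read observes.
func conflicts(a, b latch) bool {
	if !overlaps(a, b) {
		return false
	}
	if a.write && b.write {
		return true
	}
	if !a.write && !b.write {
		return false
	}
	read, write := a, b
	if a.write {
		read, write = b, a
	}
	return write.ts <= read.ts
}
```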

Benchmark Results

name                                   old ops/sec  new ops/sec  delta
kv95/cores=16/nodes=3/splits=3          51.9k ± 0%   51.7k ± 1%     ~     (p=0.400 n=3+3)
kvS70-L1/cores=16/nodes=3/splits=3      24.1k ± 4%   27.7k ± 1%  +14.75%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3      24.5k ± 1%   27.5k ± 1%  +12.08%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3   16.0k ± 1%   16.6k ± 2%   +3.79%  (p=0.100 n=3+3)

name                                   old p50(ms)  new p50(ms)  delta
kv95/cores=16/nodes=3/splits=3           0.70 ± 0%    0.70 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       1.07 ± 6%    0.90 ± 0%  -15.62%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       1.10 ± 0%    0.90 ± 0%  -18.18%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    1.80 ± 0%    1.67 ± 4%   -7.41%  (p=0.100 n=3+3)

name                                   old p99(ms)  new p99(ms)  delta
kv95/cores=16/nodes=3/splits=3           1.80 ± 0%    1.80 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       5.77 ±32%    4.70 ± 0%     ~     (p=0.400 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    6.90 ± 3%    7.33 ± 8%     ~     (p=0.400 n=3+3)

S<num> = --span-percent=<num>, L<num> = --span-limit=<num>

Release note (performance improvement): improved performance on workloads which mix OLAP queries with inserts and updates.

@nvanbenschoten nvanbenschoten requested review from spencerkimball and a team December 27, 2018 07:33

@tbg tbg (Member) left a comment

I have to review the main commit more closely, but on the face of it the complexity seems manageable. Whether this should be merged or not depends on the impact it would have on real-world workloads. Intuitively the change seems to make a lot of sense to me, though I don't know all the situations in which suitable scans are emitted by the SQL layer. My expectation is that we'll see them during DistSQL scans (with a relatively large limit), though in that case the scans would eventually scan all of the range and so would likely conflict with any writes each individual scan would "avoid". OLAP mixed with updates seemed to have been the original motivator in #31904, though no data was provided to back this up.

Does this change move the needle on TPCC at all? What about TPCH (or are we in no position to even consider that?)

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 2284 at r4 (raw file):

				resp := br.Responses[i].GetInner().(*roachpb.ScanResponse)
				if resp.ResumeSpan != nil {
					if start.Equal(resp.ResumeSpan.Key) {

Annoying that there's no "better" way to catch this. Can you add a unit test, though?

@petermattis petermattis (Collaborator) left a comment

> Does this change move the needle on TPCC at all? What about TPCH (or are we in no position to even consider that?)

TPCH is read-only, so this change shouldn't affect it. What it could affect is running TPCC and then periodically issuing an OLAP-style query? I believe a workload like that is what motivated this change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvanbenschoten (Member, Author)

> My expectation is that we'll see them during DistSQL scans (with a relatively large limit), though in that case the scans would eventually scan all of the range and so would likely conflict with any writes each individual scan would "avoid".

Yeah, this is what inspired the "if the limit is smaller than the number of live keys in the range" check.
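
Roughly, and with made-up names, that gate amounts to the following check:

```go
package sketch

// canEvalOptimistically is a hypothetical helper illustrating the gate
// described above: only take the optimistic path when the scan's key limit is
// small relative to the range's live key count, since a scan whose limit
// exceeds the number of live keys will end up reading the whole range anyway.
func canEvalOptimistically(maxKeys, liveKeysInRange int64) bool {
	return maxKeys > 0 && maxKeys < liveKeysInRange
}
```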

I ran a tpccbench run with this change and one without it and ended up getting the exact same warehouse count (1535). I'll run a few more and then move on to other workloads like Peter's TPC-C + periodic TPC-H suggestion.

@tbg tbg added the X-noremind label (bots won't notify about PRs with X-noremind) on Jun 19, 2019
This does not escape.

Release note: None
The test was asserting that interfering requests interfered, but it
wasn't asserting that non-interfering requests didn't interfere.

Release note: None
Not related to cockroachdb#32149.

Before this change, we would treat a scan with limit 0 as a point
lookup when updating the timestamp cache. After the change, we no
longer update the timestamp cache if a scan had a limit of 0 and
never looked at any keys.

Release note: None
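
Sketched with a hypothetical helper (not the actual timestamp cache code), the new rule is roughly:

```go
package sketch

// tsCacheSpanForScan returns the span to record in the timestamp cache for a
// completed scan. A scan that was limited to zero keys and therefore observed
// nothing should not leave a read footprint at all.
func tsCacheSpanForScan(reqStart, reqEnd string, maxKeys, keysRead int64) (start, end string, ok bool) {
	if maxKeys == 0 && keysRead == 0 {
		return "", "", false // nothing was read; skip the tscache update
	}
	return reqStart, reqEnd, true
}
```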
Fixes cockroachdb#9521.
Supersedes cockroachdb#31904.

SQL has a tendency to create scans which cover a range's entire key
span, though looking only to return a finite number of results. These
requests end up blocking on writes that are holding latches over keys
that the scan will not actually touch. In reality, when there is a
scan with a limit, the actual affected key span ends up being a small
subset of the range's key span.

This change creates a new "optimistic evaluation" path for read-only
requests. When evaluating optimistically, the batch will sequence itself
with the latch manager, but will not wait to acquire all of its latches.
Instead, it begins evaluating immediately and verifies that it would not
have needed to wait on any latch acquisitions after-the-fact. When
performing this verification, it uses knowledge about the limited set
of keys that the scan actually looked at. If there are no conflicts, the
scan succeeds. If there are conflicts, the request waits for all of its
latch acquisition attempts to finish and re-evaluates.

This PR replaces cockroachdb#31904. The major difference between the two is that
this PR exploits the structure of the latch manager to efficiently
perform optimistic latch acquisition and after-the-fact verification
of conflicts. Doing this requires keeping no extra state because it
uses the immutable snapshots that the latch manager now captures during
sequencing. The other major difference is that this PR does not release
latches after a failed optimistic evaluation.

NOTE: a prevalent theory of the pathological case with this behavior
was that overestimated read latches would serialize with write latches,
causing all requests on a range to serialize. I wasn't seeing this
in practice. It turns out that the "timestamp awareness" in the
latch manager should avoid this behavior in most cases because later
writes will have higher timestamps than earlier reads. The effect of
this is that they won't be considered to interfere by the latch manager.
Still, large clusters with a high amount of clock skew could see a
bounded variant of this situation.

### Benchmark Results

```
name                                   old ops/sec  new ops/sec  delta
kv95/cores=16/nodes=3/splits=3          51.9k ± 0%   51.7k ± 1%     ~     (p=0.400 n=3+3)
kvS70-L1/cores=16/nodes=3/splits=3      24.1k ± 4%   27.7k ± 1%  +14.75%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3      24.5k ± 1%   27.5k ± 1%  +12.08%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3   16.0k ± 1%   16.6k ± 2%   +3.79%  (p=0.100 n=3+3)

name                                   old p50(ms)  new p50(ms)  delta
kv95/cores=16/nodes=3/splits=3           0.70 ± 0%    0.70 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       1.07 ± 6%    0.90 ± 0%  -15.62%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       1.10 ± 0%    0.90 ± 0%  -18.18%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    1.80 ± 0%    1.67 ± 4%   -7.41%  (p=0.100 n=3+3)

name                                   old p99(ms)  new p99(ms)  delta
kv95/cores=16/nodes=3/splits=3           1.80 ± 0%    1.80 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       5.77 ±32%    4.70 ± 0%     ~     (p=0.400 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    6.90 ± 3%    7.33 ± 8%     ~     (p=0.400 n=3+3)
```

_S<num> = --span-percent=<num>, L<num> = --span-limit=<num>_

Release note (performance improvement): improved performance on workloads
which mix OLAP queries with inserts and updates.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/optimisticScan branch from 91c2060 to cc26940 on July 11, 2019 20:48
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 8, 2020
Fixes cockroachdb#49658.
Informs cockroachdb#9521.
Informs cockroachdb#49973.
Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to
drop write-uncontended locks when upgraded from the Unreplicated to
Replicated durability in much the same way we drop Replicated locks when
first acquired. This is possible because a Replicated lock is also
stored as an MVCC intent, so it does not need to also be stored in the
lockTable if writers are not queuing on it. This is beneficial because
it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile
latency regression observed in cockroachdb#49658. Since we aren't currently great
at avoiding excessive contention on limited scans when locks are in the
lockTable, it's better to keep locks out of the lockTable when
possible.

If any of the readers do truly contend with this lock even after their
limit has been applied, they will notice during their MVCC scan and
re-enter the queue (possibly recreating the lock through
AddDiscoveredLock). Still, in practice this seems to work well in
avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a
bit of a hack and I am very interested in fixing this fully in the
future (through an approach like cockroachdb#33373 or by incrementally consulting
the lockTable in a `lockAwareIterator`), but for now, I don't see a
downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one
of the demos we like to run.

Release note (performance improvement): limited SELECT statements
now do a better job avoiding unnecessary contention with UPDATE and
SELECT FOR UPDATE statements.
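
A minimal sketch of that idea, with hypothetical types and fields rather than the actual `lockTable` code:

```go
package sketch

type durability int

const (
	unreplicated durability = iota
	replicated
)

// lockState stands in for the lockTable's per-key lock bookkeeping.
type lockState struct {
	dur           durability
	queuedWriters int
}

// acquireReplicated models the upgrade path described above: when an
// unreplicated lock is re-acquired with replicated durability and no writers
// are queued on it, the in-memory entry can simply be dropped, because the
// replicated lock is already persisted as an MVCC intent and any reader that
// truly conflicts will rediscover it during its MVCC scan.
func (l *lockState) acquireReplicated() (dropFromLockTable bool) {
	if l.dur == unreplicated && l.queuedWriters == 0 {
		return true
	}
	l.dur = replicated
	return false
}
```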
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 10, 2020
craig bot pushed a commit that referenced this pull request Jun 10, 2020
49891: physicalplan: preevaluate subqueries on LocalExprs and always set LocalExprs r=yuzefovich a=yuzefovich

**physicalplan: preevaluate subqueries on LocalExprs**

When the plan is local, we do not serialize expressions. Previously, in
such a case we would also not preevaluate the subqueries in the
expressions, which made `EXPLAIN (VEC)` return an unexpected plan (there
would be a `tree.Subquery` in the expression, which we don't support in
the vectorized engine, so we would wrap the plan). Now we preevaluate
the subqueries before storing them in the processor spec. AFAICT this
affects only this explain variant and nothing else.

Release note: None

**colexec: improve expression parsing**

This commit introduces `colexec.ExprHelper` that helps with expression
processing. Previously, we were allocating a new `execinfra.ExprHelper`
and were calling `Init` on it in order to get the typed expression from
possibly serialized representation of each expression. Now, this new
expression helper is reused between all expressions in the flow on
a single node.

There is one caveat, however: we need to make sure that we force
deserialization of the expressions during the `SupportsVectorized` check if
the flow is scheduled to be run on a remote node (different from the one
that is performing the check). This is necessary to make sure that the
remote nodes will be able to deserialize the expressions without
encountering errors (if we don't force the serialization during the
check, we will use `LocalExpr` - if available - and might not catch
things that we don't support).

Release note: None

**physicalplan: always store LocalExpr**

Previously, we would set either `LocalExpr` (unserialized expression,
only when we have the full plan on a single node) or `Expr` (serialized
expression, when we have a distributed plan as well as in some tests).
However, we could be setting both and making a best effort to reuse the
unserialized `LocalExpr` on the gateway even if the plan is distributed.
This commit adds such behavior.

Fixes: #49810.

Release note: None

49966: roachtest: adjust tpchvec and tpcdsvec r=yuzefovich a=yuzefovich

**roachtest: add new tpchvec config**

This commit adds a new `tpchvec/perf_no_stats` config that is the same
as `tpchvec/perf` except for the fact that stats creation is disabled.
The plans without stats are likely to be different, so it gives us an
easy way to get more test coverage. One caveat here is that query
9 without stats takes insanely long to run, so some new plumbing has
been added to skip that query.

Additionally, `tpcdsvec` has been adjusted. The test runs all queries
with and without stats present with `on` and `off` vectorize options.
However, when stats are not present, the `on` config will be reduced to
`off` because of the `vectorize_row_count_threshold` heuristic. This commit
disables that heuristic.

Release note: None

**roachtest: switch the config order in tpchvec/perf**

Let's see whether it makes a difference to the occasional failures of
`tpchvec/perf`, which are very hard to explain.

This commit also changes the workload command for `perf` config to run
only against node 1, thus, eliminating one possible source of
"randomness" for the failures.

Addresses: #49955.

Release note: None

49980: kv/concurrency: drop uncontended replicated lock on unreplicated upgrade r=nvanbenschoten a=nvanbenschoten

Fixes #49658.
Informs #9521.
Informs #49973.
Related to #49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for #49973 and avoids the 99th percentile latency regression observed in #49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible.

If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial concurrency discussed in #49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like #33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run: #49658.

Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 11, 2020
@nvanbenschoten (Member, Author)

Closing in favor of #66059, which implements this change in the framework of the concurrency manager.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/optimisticScan branch June 11, 2021 03:16