
storage: evaluate limited scans optimistically without latching #33373

Conversation

nvanbenschoten (Member)

Fixes #9521.
Supersedes #31904.

SQL has a tendency to create scans that cover a range's entire key span even though they only need to return a limited number of results. These requests end up blocking on writes that hold latches over keys the scan will never actually touch. In reality, when a scan has a limit, the affected key span usually ends up being a small subset of the range's key span.

This change creates a new "optimistic evaluation" path for read-only requests. When evaluating optimistically, the batch sequences itself with the latch manager but does not wait to acquire all of its latches. Instead, it begins evaluating immediately and verifies after the fact that it would not have needed to wait on any latch acquisitions. This verification uses knowledge of the limited set of keys that the scan actually looked at. If there are no conflicts, the scan succeeds. If there are conflicts, the request waits for all of its latch acquisition attempts to finish and re-evaluates.

This PR replaces #31904. The major difference between the two is that this PR exploits the structure of the latch manager to efficiently perform optimistic latch acquisition and after-the-fact verification of conflicts. Doing this requires keeping no extra state because it uses the immutable snapshots that the latch manager now captures during sequencing. The other major difference is that this PR does not release latches after a failed optimistic evaluation.
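
For illustration, here is a rough sketch of that evaluation flow. The interfaces and names are hypothetical stand-ins, not the actual storage package APIs:

```go
package sketch

// latchGuard stands in for the state the latch manager hands back at
// sequencing time (a snapshot of the latches the request would conflict with).
type latchGuard interface {
	// Wait blocks until all of the request's latch acquisitions have completed.
	Wait()
	// Conflicted reports whether any conflicting latch captured in the snapshot
	// taken at sequencing time overlaps the given key span.
	Conflicted(startKey, endKey string) bool
}

// evalResult carries the response and the key span the limited scan actually
// read, which is typically much smaller than the declared span.
type evalResult struct {
	resp                     string
	touchedStart, touchedEnd string
}

// evalOptimistically sequences without waiting, evaluates immediately, and
// falls back to a pessimistic re-evaluation only if the keys that were
// actually read overlap a latch the request skipped over.
func evalOptimistically(sequence func() latchGuard, eval func() evalResult) string {
	g := sequence() // sequence with the latch manager, but do not block
	res := eval()   // evaluate immediately
	if !g.Conflicted(res.touchedStart, res.touchedEnd) {
		return res.resp // no conflicting latch over the keys we touched
	}
	g.Wait()           // wait for all latch acquisitions to finish
	return eval().resp // re-evaluate pessimistically
}
```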

NOTE: a prevalent theory about the pathological case addressed in this PR was that overestimated read latches would serialize with write latches, and write latches would serialize with overestimated read latches, causing all requests on a range to serialize. I wasn't seeing this "chaining" occur in practice. It turns out that the "timestamp awareness" in the latch manager should avoid this behavior in most cases: later writes will have higher timestamps than earlier reads, so the writes won't wait on the reads to execute. An argument could be made that because the hypothesized serialization of requests doesn't actually exist, the added complexity of this change isn't justified. I'm curious what reviewers think.
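
For reference, the timestamp-aware conflict rule referred to above amounts to something like the following simplified sketch (placeholder types; the real latch manager operates on roachpb spans and hlc timestamps):

```go
package sketch

// latch is a placeholder for a latched key span at a timestamp.
type latch struct {
	startKey, endKey string // [startKey, endKey)
	write            bool
	ts               int64
}

func overlaps(a, b latch) bool {
	return a.startKey < b.endKey && b.startKey < a.endKey
}

// conflicts reports whether two latches over overlapping spans must serialize.
// Reads never conflict with each other, writes always conflict with writes,
// and a read and a write only conflict if the write is at or below the read's
// timestamp, since a write above the read's timestamp cannot change what the
// read observes.
func conflicts(a, b latch) bool {
	if !overlaps(a, b) {
		return false
	}
	if a.write && b.write {
		return true
	}
	if !a.write && !b.write {
		return false
	}
	read, write := a, b
	if a.write {
		read, write = b, a
	}
	return write.ts <= read.ts
}
```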

Benchmark Results

name                                   old ops/sec  new ops/sec  delta
kv95/cores=16/nodes=3/splits=3          51.9k ± 0%   51.7k ± 1%     ~     (p=0.400 n=3+3)
kvS70-L1/cores=16/nodes=3/splits=3      24.1k ± 4%   27.7k ± 1%  +14.75%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3      24.5k ± 1%   27.5k ± 1%  +12.08%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3   16.0k ± 1%   16.6k ± 2%   +3.79%  (p=0.100 n=3+3)

name                                   old p50(ms)  new p50(ms)  delta
kv95/cores=16/nodes=3/splits=3           0.70 ± 0%    0.70 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       1.07 ± 6%    0.90 ± 0%  -15.62%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       1.10 ± 0%    0.90 ± 0%  -18.18%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    1.80 ± 0%    1.67 ± 4%   -7.41%  (p=0.100 n=3+3)

name                                   old p99(ms)  new p99(ms)  delta
kv95/cores=16/nodes=3/splits=3           1.80 ± 0%    1.80 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       5.77 ±32%    4.70 ± 0%     ~     (p=0.400 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    6.90 ± 3%    7.33 ± 8%     ~     (p=0.400 n=3+3)

S<num> = --span-percent=<num>, L<num> = --span-limit=<num>

Release note (performance improvement): improved performance on workloads which mix OLAP queries with inserts and updates.

@nvanbenschoten nvanbenschoten requested review from spencerkimball and a team December 27, 2018 07:33

@tbg tbg (Member) left a comment

I have to review the main commit more closely, but on the face of it the complexity seems manageable. Whether this should be merged or not depends on the impact it would have on real-world workloads. Intuitively the change seems to make a lot of sense to me, though I don't know all the situations in which suitable scans are emitted by the SQL layer. My expectation is that we'll see them during DistSQL scans (with a relatively large limit), though in that case the scans would eventually scan all of the range and so would likely conflict with any writes each individual scan would "avoid". OLAP mixed with updates seemed to have been the original motivator in #31904, though no data was provided to back this up.

Does this change move the needle on TPCC at all? What about TPCH (or are we in no position to even consider that?)

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 2284 at r4 (raw file):

				resp := br.Responses[i].GetInner().(*roachpb.ScanResponse)
				if resp.ResumeSpan != nil {
					if start.Equal(resp.ResumeSpan.Key) {

Annoying that there's no "better" way to catch this. Can you add a unit test, though?

@petermattis petermattis (Collaborator) left a comment

> Does this change move the needle on TPCC at all? What about TPCH (or are we in no position to even consider that?)

TPCH is read-only, so this change shouldn't affect it. What it could affect is running TPCC and then periodically issuing an OLAP-style query? I believe a workload like that is what motivated this change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvanbenschoten (Member, Author)

> My expectation is that we'll see them during DistSQL scans (with a relatively large limit), though in that case the scans would eventually scan all of the range and so would likely conflict with any writes each individual scan would "avoid".

Yeah, this is what inspired the "if the limit is smaller than the number of live keys in the range" check.
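
Roughly, and with made-up names, that gate amounts to the following check:

```go
package sketch

// canEvalOptimistically is a hypothetical helper illustrating the gate
// described above: only take the optimistic path when the scan's key limit is
// small relative to the range's live key count, since a scan whose limit
// exceeds the number of live keys will end up reading the whole range anyway.
func canEvalOptimistically(maxKeys, liveKeysInRange int64) bool {
	return maxKeys > 0 && maxKeys < liveKeysInRange
}
```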

I ran a tpccbench run with this change and one without it and ended up getting the exact same warehouse count (1535). I'll run a few more and then move on to other workloads like Peter's TPC-C + periodic TPC-H suggestion.

@tbg tbg added the X-noremind label (bots won't notify about PRs with X-noremind) on Jun 19, 2019
This does not escape.

Release note: None
The test was asserting that interfering requests interfered, but it
wasn't asserting that non-interfering requests didn't interfere.

Release note: None
Not related to cockroachdb#32149.

Before this change, we would treat a scan with limit 0 as a point
lookup when updating the timestamp cache. After the change, we no
longer update the timestamp cache if a scan had a limit of 0 and
never looked at any keys.

Release note: None
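
Sketched with a hypothetical helper (not the actual timestamp cache code), the new rule is roughly:

```go
package sketch

// tsCacheSpanForScan returns the span to record in the timestamp cache for a
// completed scan. A scan that was limited to zero keys and therefore observed
// nothing should not leave a read footprint at all.
func tsCacheSpanForScan(reqStart, reqEnd string, maxKeys, keysRead int64) (start, end string, ok bool) {
	if maxKeys == 0 && keysRead == 0 {
		return "", "", false // nothing was read; skip the tscache update
	}
	return reqStart, reqEnd, true
}
```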
Fixes cockroachdb#9521.
Supersedes cockroachdb#31904.

SQL has a tendency to create scans which cover a range's entire key
span, though looking only to return a finite number of results. These
requests end up blocking on writes that are holding latches over keys
that the scan will not actually touch. In reality, when there is a
scan with a limit, the actual affected key span ends up being a small
subset of the range's key span.

This change creates a new "optimistic evaluation" path for read-only
requests. When evaluating optimistically, the batch will sequence itself
with the latch manager, but will not wait to acquire all of its latches.
Instead, it begins evaluating immediately and verifies that it would not
have needed to wait on any latch acquisitions after-the-fact. When
performing this verification, it uses knowledge about the limited set
of keys that the scan actually looked at. If there are no conflicts, the
scan succeeds. If there are conflicts, the request waits for all of its
latch acquisition attempts to finish and re-evaluates.

This PR replaces cockroachdb#31904. The major difference between the two is that
this PR exploits the structure of the latch manager to efficiently
perform optimistic latch acquisition and after-the-fact verification
of conflicts. Doing this requires keeping no extra state because it
uses the immutable snapshots that the latch manager now captures during
sequencing. The other major difference is that this PR does not release
latches after a failed optimistic evaluation.

NOTE: a prevalent theory of the pathological case with this behavior
was that overestimated read latches would serialize with write latches,
causing all requests on a range to serialize. I wasn't seeing this
in practice. It turns out that the "timestamp awareness" in the
latch manager should avoid this behavior in most cases because later
writes will have higher timestamps than earlier reads. The effect of
this is that they won't be considered to interfere by the latch manager.
Still, large clusters with a high amount of clock skew could see a
bounded variant of this situation.

### Benchmark Results

```
name                                   old ops/sec  new ops/sec  delta
kv95/cores=16/nodes=3/splits=3          51.9k ± 0%   51.7k ± 1%     ~     (p=0.400 n=3+3)
kvS70-L1/cores=16/nodes=3/splits=3      24.1k ± 4%   27.7k ± 1%  +14.75%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3      24.5k ± 1%   27.5k ± 1%  +12.08%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3   16.0k ± 1%   16.6k ± 2%   +3.79%  (p=0.100 n=3+3)

name                                   old p50(ms)  new p50(ms)  delta
kv95/cores=16/nodes=3/splits=3           0.70 ± 0%    0.70 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       1.07 ± 6%    0.90 ± 0%  -15.62%  (p=0.100 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       1.10 ± 0%    0.90 ± 0%  -18.18%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    1.80 ± 0%    1.67 ± 4%   -7.41%  (p=0.100 n=3+3)

name                                   old p99(ms)  new p99(ms)  delta
kv95/cores=16/nodes=3/splits=3           1.80 ± 0%    1.80 ± 0%     ~     (all equal)
kvS70-L1/cores=16/nodes=3/splits=3       5.77 ±32%    4.70 ± 0%     ~     (p=0.400 n=3+3)
kvS70-L5/cores=16/nodes=3/splits=3       5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.100 n=3+3)
kvS70-L1000/cores=16/nodes=3/splits=3    6.90 ± 3%    7.33 ± 8%     ~     (p=0.400 n=3+3)
```

_S<num> = --span-percent=<num>, L<num> = --span-limit=<num>_

Release note (performance improvement): improved performance on workloads
which mix OLAP queries with inserts and updates.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/optimisticScan branch from 91c2060 to cc26940 on July 11, 2019 20:48
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 8, 2020
Fixes cockroachdb#49658.
Informs cockroachdb#9521.
Informs cockroachdb#49973.
Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to
drop write-uncontended locks when upgraded from the Unreplicated to
Replicated durability in much the same way we drop Replicated locks when
first acquired. This is possible because a Replicated lock is also
stored as an MVCC intent, so it does not need to also be stored in the
lockTable if writers are not queuing on it. This is beneficial because
it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile
latency regression observed in cockroachdb#49658. Since we aren't currently great
at avoiding excessive contention on limited scans when locks are in the
lockTable, it's better to keep locks out of the lockTable when
possible.

If any of the readers do truly contend with this lock even after their
limit has been applied, they will notice during their MVCC scan and
re-enter the queue (possibly recreating the lock through
AddDiscoveredLock). Still, in practice this seems to work well in
avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a
bit of a hack and I am very interested in fixing this fully in the
future (through an approach like cockroachdb#33373 or by incrementally consulting
the lockTable in a `lockAwareIterator`), but for now, I don't see a
downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one
of the demos we like to run.

Release note (performance improvement): limited SELECT statements
now do a better job avoiding unnecessary contention with UPDATE and
SELECT FOR UPDATE statements.
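
A minimal sketch of that idea, with hypothetical types and fields rather than the actual `lockTable` code:

```go
package sketch

type durability int

const (
	unreplicated durability = iota
	replicated
)

// lockState stands in for the lockTable's per-key lock bookkeeping.
type lockState struct {
	dur           durability
	queuedWriters int
}

// acquireReplicated models the upgrade path described above: when an
// unreplicated lock is re-acquired with replicated durability and no writers
// are queued on it, the in-memory entry can simply be dropped, because the
// replicated lock is already persisted as an MVCC intent and any reader that
// truly conflicts will rediscover it during its MVCC scan.
func (l *lockState) acquireReplicated() (dropFromLockTable bool) {
	if l.dur == unreplicated && l.queuedWriters == 0 {
		return true
	}
	l.dur = replicated
	return false
}
```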
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 10, 2020
craig bot pushed a commit that referenced this pull request Jun 10, 2020
49891: physicalplan: preevaluate subqueries on LocalExprs and always set LocalExprs r=yuzefovich a=yuzefovich

**physicalplan: preevaluate subqueries on LocalExprs**

When the plan is local, we do not serialize expressions. Previously, in
such a case we would also not preevaluate the subqueries in the
expressions, which made `EXPLAIN (VEC)` return an unexpected plan (there
would be a `tree.Subquery` in the expression, which we don't support in
the vectorized engine, so we would wrap the plan). Now we preevaluate
the subqueries before storing them in the processor spec. AFAICT this
affects only this explain variant and nothing else.

Release note: None

**colexec: improve expression parsing**

This commit introduces `colexec.ExprHelper` that helps with expression
processing. Previously, we were allocating a new `execinfra.ExprHelper`
and were calling `Init` on it in order to get the typed expression from
possibly serialized representation of each expression. Now, this new
expression helper is reused between all expressions in the flow on
a single node.

There is one caveat, however: we need to make sure that we force
deserialization of the expressions during the `SupportsVectorized` check if
the flow is scheduled to be run on a remote node (different from the one
that is performing the check). This is necessary to make sure that the
remote nodes will be able to deserialize the expressions without
encountering errors (if we don't force the serialization during the
check, we will use `LocalExpr` - if available - and might not catch
things that we don't support).

Release note: None

**physicalplan: always store LocalExpr**

Previously, we would set either `LocalExpr` (unserialized expression,
only when we have the full plan on a single node) or `Expr` (serialized
expression, when we have a distributed plan as well as in some tests).
However, we could be setting both and making a best effort to reuse the
unserialized `LocalExpr` on the gateway even if the plan is distributed.
This commit adds such behavior.

Fixes: #49810.

Release note: None

49966: roachtest: adjust tpchvec and tpcdsvec r=yuzefovich a=yuzefovich

**roachtest: add new tpchvec config**

This commit adds a new `tpchvec/perf_no_stats` config that is the same
as `tpchvec/perf` except for the fact that stats creation is disabled.
The plans without stats are likely to be different, so it gives us an
easy way to get more test coverage. One caveat here is that query
9 without stats takes insanely long to run, so some new plumbing has
been added to skip that query.

Additionally, `tpcdsvec` has been adjusted. The test runs all queries
with and without stats present with `on` and `off` vectorize options.
However, when stats are not present, the `on` config will be reduced to
`off` because of the `vectorize_row_count_threshold` heuristic. This commit
disables that heuristic.

Release note: None

**roachtest: switch the config order in tpchvec/perf**

Let's see whether it makes a difference to the occasional failures of
`tpchvec/perf`, which are very hard to explain.

This commit also changes the workload command for `perf` config to run
only against node 1, thus, eliminating one possible source of
"randomness" for the failures.

Addresses: #49955.

Release note: None

49980: kv/concurrency: drop uncontended replicated lock on unreplicated upgrade r=nvanbenschoten a=nvanbenschoten

Fixes #49658.
Informs #9521.
Informs #49973.
Related to #49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for #49973 and avoids the 99th percentile latency regression observed in #49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible.

If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial concurrency discussed in #49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like #33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run: #49658.

Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 11, 2020
@nvanbenschoten (Member, Author)

Closing in favor of #66059, which implements this change in the framework of the concurrency manager.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/optimisticScan branch June 11, 2021 03:16