Performance regression in covering/duplicate indexes in 20.1 #49658

Closed
keith-mcclellan opened this issue May 28, 2020 · 24 comments · Fixed by #49980
Labels: A-partitioning · C-investigation (Further steps needed to qualify. C-label will change.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)

@keith-mcclellan (Contributor) commented May 28, 2020:

Describe the problem
Tested using MovR: when duplicate covering indexes are applied to the promo_codes table for fast reads across multiple datacenters, the database appears to use only the primary index rather than the index closest to the read.

To Reproduce
Step through MovR in a multi-region config (easier to do in the 19.2 version), create duplicate indexes, and pin them to multiple regions. p99 on 20.1 is ~20ms; on 19.2 it was <5ms.

geo-partitioned-replicas-demo.sh in https://github.com/cockroachlabs/scripted-demos
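
For reference, the duplicate-index setup being exercised looks roughly like the sketch below. This is a minimal sketch assuming hypothetical index and region names (and the MovR promo_codes columns); it is not the exact statements run by the demo script.

```sql
-- One covering secondary index per region, with its leaseholder pinned to
-- that region so reads can be served by the nearest copy.
CREATE INDEX promo_codes_idx_us_west ON promo_codes (code)
    STORING (description, creation_time, expiration_time, rules);

ALTER INDEX promo_codes@promo_codes_idx_us_west
    CONFIGURE ZONE USING
        num_replicas = 3,
        constraints = '{"+region=us-west1": 1}',
        lease_preferences = '[[+region=us-west1]]';
```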

keith-mcclellan added the C-performance and A-partitioning labels on May 28, 2020
awoods187 added the C-investigation label on May 28, 2020
cockroachdb deleted a comment from the blathers-crl bot on May 28, 2020
@RaduBerinde (Member):

What is the query which always uses the primary index? I ran the script and looked at the SELECT FROM promo_codes query and it does look like it's using a duplicated index.

@RaduBerinde (Member):

Here's an EXPLAIN ANALYZE for the query on n9: linky
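
(The actual output was linked above; for context, the statement being analyzed is roughly of the following shape. The exact query text and literal are assumptions, not copied from the workload.)

```sql
EXPLAIN ANALYZE
SELECT code, description, rules
  FROM promo_codes
 WHERE code = 'some_promo_code';  -- illustrative value
```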

@RaduBerinde (Member):

It's worth noting that the p90 latency I see is <2ms while the p99 indeed is ~20ms.

The KV execution latencies are very interesting. p90 is <2ms across the board but p99 is ~30ms on a bunch of nodes (and very small on others).

[screenshots: SQL service latency and per-node KV execution latency percentile charts]

Looping in @nvanbenschoten who is on call for KV.

@RaduBerinde (Member):

The cluster I created is at http://radu-multi-region-georeplicas-0001.roachprod.crdb.io:26258/

@nvanbenschoten (Member):

I'm probably missing something, but how do we know that the impact on p99 latency is due to the SELECT FROM promo_codes query? It looks like we're just watching top-line p99 latency.

Also, we are inserting into this table in an explicit transaction (INSERT INTO promo_codes ...), so this could be responsible. If one of these reads hits an intent, it's going to have to wait for the intent to be resolved. That would be expected.
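
A minimal illustration of that interaction, with made-up values and an assumed column list:

```sql
-- Session 1: explicit transaction that leaves a provisional write (an intent)
-- on a promo_codes row until it commits or aborts.
BEGIN;
INSERT INTO promo_codes (code, description) VALUES ('promo_xyz', 'illustrative row');

-- Session 2: a read that encounters that intent has to wait for session 1 to
-- finish before it can return, which shows up as tail latency on the read.
SELECT * FROM promo_codes WHERE code = 'promo_xyz';
```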

@RaduBerinde (Member):

Yes, I think the database not using duplicate indexes properly (which would only be relevant for that particular query) was just a guess. The problem (as far as I understand) is that the overall p99 latency is 20ms when it was <5ms in 19.2.

@nvanbenschoten (Member):

Got it. @keith-mcclellan Has anything changed in the workload? Have periodic INSERT INTO promo_codes statements always been part of it? If so then we'd always expect some high latency in the tail.

> on 19.2 it was <5ms

Have we validated this recently or is this coming from memory?

@nvanbenschoten (Member):

I've been looking at traces from this cluster and nothing sticks out. The only time I see latency jump above 3ms for any query is when it hits contention and needs to wait on the completion of another transaction.

Given that there is contention in the workload and a few transactions perform a series of statements and take 20-30ms in aggregate, I'm not surprised that we see p99 latency in the 10s of ms, even with partitioning.

@nvanbenschoten (Member):

I just ran the demo with v19.2.4 and confirmed that I do see p99 latencies below 5ms. Interestingly, p90 and p99.9 latencies appear to be almost identical between v19.2.4 and v20.1.1, so it's just p99 that differs.

@nvanbenschoten (Member):

SET CLUSTER SETTING "sql.defaults.implicit_select_for_update.enabled" = false; appears to fix the issue in v20.1.1.

I think what's happening is that the first and third most frequent statements are SELECT ... FROM vehicles ... LIMIT ... and the fourth most frequent statement is UPDATE vehicles ... and implicit SFU changes the way these statements interact. Specifically, it causes the UPDATE statement to acquire a lock on its row earlier, which evens out the odds of it winning in situations where these statements contend. This improved fairness is obviously a trade-off, and in this case, it appears to push enough contention under the p99 threshold that we see this jump in latency. On the other hand, it doesn't appear to affect p90 or p99.9 and we see the improved fairness pull down the rest of the tail latencies significantly - p99.99 drops from 230ms to 66ms with implicit SFU.
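
The shapes of the contending statements, paraphrased from the workload (exact columns, predicates, and values are assumptions):

```sql
-- Readers: limited scans over vehicles that touch the same rows.
SELECT id, city, status FROM vehicles WHERE city = 'new york' LIMIT 25;

-- Writer: with implicit SELECT FOR UPDATE (new in 20.1), the UPDATE's initial
-- row fetch acquires an exclusive lock up front, so it wins ties against the
-- readers above more often than it did in 19.2.
UPDATE vehicles SET status = 'in_use' WHERE city = 'new york' AND id = 'some-uuid';
```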

So I'm not really sure what the next step here is. I wouldn't really consider this a "bug" in the database, but it's certainly a behavioral change in v20.1 that's having an adverse effect on the metric we're interested in for this demo. We are aware of longer-term improvements in the database that may help retain most of the benefits of SFU without the downsides (#49684), but they are not on the roadmap for the next release. We could shift around the workload distribution in movr a little bit to push the contention back above the 99th percentile (e.g. halve the number of UPDATE vehicle statements we run). Or we could disable implicit SFU in movr using the enable_implicit_select_for_update session variable until we do get around to addressing #49684.
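
For completeness, the two knobs mentioned above look like this; this is a sketch of the workaround, not a recommendation:

```sql
-- Cluster-wide (the setting used in the experiment above; changes the default
-- for every session on the cluster):
SET CLUSTER SETTING sql.defaults.implicit_select_for_update.enabled = false;

-- Per-session, e.g. scoped to just the demo's connections:
SET enable_implicit_select_for_update = false;
```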

@nstewart may have opinions here.

@keith-mcclellan (Contributor, author):

@nvanbenschoten Thanks for looking into this... so to summarize, I can get the pre-20.1 results by changing that cluster setting, but we really wouldn't want to do that in a real-world scenario in most cases. Correct?

@nstewart what do you suggest? Should we modify the demo to change the query balance or should we consider setting this DCL command ahead of running the demo? Or can we get #49684 prioritized?

@nstewart (Contributor):

I definitely wouldn't change the cluster setting, @keith-mcclellan. To your point, we wouldn't want to do this in a real-world scenario. I will change the query balance as a stop-gap, but obviously we wouldn't have this luxury if this were a customer workload, which is what MovR is attempting to replicate.

@johnrk @nvanbenschoten do we have any way to see from telemetry if actual customers are running into this? I'm inclined to revisit the priority of #49684 if so, but I don't know anything about the cost or actual customer benefit so I'll defer to the KV team here.

@nvanbenschoten would TPC-E help us catch these types of regressions? If not, would it be helpful to have MovR as a non-micro benchmark, serving as a canary in the coal mine?

@jseldess (Contributor):

In my opinion, this type of change should be called out as "backward-incompatible" in our release notes: https://www.cockroachlabs.com/docs/releases/v20.1.0.html#backward-incompatible-changes. Even though it didn't "break" anything, application changes are required to maintain consistent, expected performance. @johnrk, @nvanbenschoten, if you agree, please work with @rmloveland to update those docs.

@RaduBerinde (Member):

I think calling this "backward incompatible" is a bit much. Depending on what you look at (p99 vs p99.99), it looks like either a regression or a significant improvement. Which of those figures matters more depends on the user.

@jseldess (Contributor):

Yeah, "backward incompatible" may be too much, but it does seem like we should communicate changes like this, with suggested steps, if we know about them.

@ajwerner (Contributor):

More than the question of calling this backwards-incompatible, I find even calling it a regression controversial. Slight changes to latency distributions (which is what this is) are generally not documented changes. Furthermore, I suspect this change is an improvement in average and maximum latency. My guess is that, in practice, nobody would actually choose to implement the proposed "workaround".

If anything, I think this highlights that our choice to display arbitrary percentiles of a distribution is a mistake. If we showed the full distribution of latencies then the user would likely be happy with the upgrade rather than alarmed. The reason we choose to display these arbitrary percentiles is a limitation of the way we down-sample histograms for long-term storage. Ideally we'd encode and store the entire histogram (as happens when Prometheus is used to monitor a cockroach cluster) then we could use something like a heatmap or a ridgeplot to visualize the entire latency distribution.

[screenshots: example heatmap and ridgeline latency-distribution visualizations]

https://github.com/ryantxu/grafana-ridgeline-panel
https://www.circonus.com/2018/03/grafana-heatmaps-with-irondb-have-landed/

@nstewart (Contributor):

cc @piyush-singh

@jseldess (Contributor) commented May 29, 2020 via email.

@RaduBerinde (Member):

Maybe we can update the doc to look at p90?

@nvanbenschoten (Member):

> so to summarize, I can get the pre-20.1 results by changing that cluster setting, but we really wouldn't want to do that in a real-world scenario in most cases.

Yes, that's correct. But as Nate mentioned, that's not a great solution.

> do we have any way to see from telemetry if actual customers are running into this

Do we track full-system latency percentiles in telemetry? If not, I don't know how we would be able to figure this out. And even with that, it wouldn't tell us the whole story.

> would TPC-E help us catch these types of regressions

Interestingly, TPC-E does have a much more realistic read/write ratio than TPC-C, so it might pick this kind of thing up. If we're running MovR in front of customers, though, I don't think there's a substitute for monitoring its performance directly.

I tend to agree with what @RaduBerinde and @ajwerner said above. "Backwards incompatible" is not the right term here. "Regression" is more appropriate, but only if it's properly contextualized as a regression in a portion of the latency distribution for certain workloads. Unfortunately, when it comes to contention handling, you're rarely going to make progress that is an improvement across the board because, at the end of the day, you have multiple transactions competing for the same limited shared resource.

To this point, I was just talking to @sumeerbhola about #49684. There might be a way to make this change less invasive by coupling the "upgrade" locking strength with the "unreplicated" lock durability. This should move the needle on p99 here, but it would also make changes elsewhere. For one, it would re-introduce a large class of transaction retries that we considered important to get rid of. It would probably also hurt top-end throughput on a workload like YCSB-A. So it's not the kind of thing that's appropriate for a point release.

@nvanbenschoten (Member):

I think I found a small patch inside the lockTable implementation that we can make in v20.1.x to recover the p99 latency on this workload (and others like it) without needing to make much of a compromise anywhere else. It's a bit hacky but appears to be working well. The patch drives p99 latencies on the workload back down to ~3.3ms.

By the way, I meant to thank whoever put this script together (@keith-mcclellan?). These kinds of investigations are orders of magnitude easier to work through with an automated reproduction.

@awoods187 (Contributor):

Nicely done @nvanbenschoten!

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 8, 2020
Fixes cockroachdb#49658.
Informs cockroachdb#9521.
Informs cockroachdb#49973.
Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to
drop write-uncontended locks when upgraded from the Unreplicated to
Replicated durability in much the same way we drop Replicated locks when
first acquired. This is possible because a Replicated lock is also
stored as an MVCC intent, so it does not need to also be stored in the
lockTable if writers are not queuing on it. This is beneficial because
it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile
latency regression observed in cockroachdb#49658. Since we aren't currently great
at avoiding excessive contention on limited scans when locks are in the
lockTable, it's better to keep locks out of the lockTable when
possible.

If any of the readers do truly contend with this lock even after their
limit has been applied, they will notice during their MVCC scan and
re-enter the queue (possibly recreating the lock through
AddDiscoveredLock). Still, in practice this seems to work well in
avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a
bit of a hack and I am very interested in fixing this fully in the
future (through an approach like cockroachdb#33373 or by incrementally consulting
the lockTable in a `lockAwareIterator`), but for now, I don't see a
downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one
of the demos we like to run.

Release note (performance improvement): limited SELECT statements
now do a better job avoiding unnecessary contention with UPDATE and
SELECT FOR UPDATE statements.
craig bot pushed a commit that referenced this issue Jun 10, 2020
49891: physicalplan: preevaluate subqueries on LocalExprs and always set LocalExprs r=yuzefovich a=yuzefovich

**physicalplan: preevaluate subqueries on LocalExprs**

When the plan is local, we do not serialize expressions. Previously, in
such a case we would also not preevaluate the subqueries in the
expressions, which made `EXPLAIN (VEC)` return an unexpected plan (there
would be a `tree.Subquery` in the expression, which we don't support in
the vectorized engine, so we would wrap the plan). Now we preevaluate
the subqueries before storing them in the processor spec. AFAICT it
affects only this explain variant and nothing else.

Release note: None

**colexec: improve expression parsing**

This commit introduces `colexec.ExprHelper`, which helps with expression
processing. Previously, we were allocating a new `execinfra.ExprHelper`
and calling `Init` on it in order to get the typed expression from the
possibly serialized representation of each expression. Now, this new
expression helper is reused across all expressions in the flow on
a single node.

There is one caveat, however: we need to make sure that we force
deserialization of the expressions during the `SupportsVectorized` check
if the flow is scheduled to be run on a remote node (different from the
one that is performing the check). This is necessary to make sure that
the remote nodes will be able to deserialize the expressions without
encountering errors (if we don't force the serialization during the
check, we will use `LocalExpr`, if available, and might not catch
things that we don't support).

Release note: None

**physicalplan: always store LocalExpr**

Previously, we would set either `LocalExpr` (unserialized expression,
only when we have the full plan on a single node) or `Expr` (serialized
expression, when we have a distributed plan, as well as in some tests).
However, we could set both and make a best effort to reuse the
unserialized `LocalExpr` on the gateway even if the plan is distributed.
This commit adds that behavior.

Fixes: #49810.

Release note: None

49966: roachtest: adjust tpchvec and tpcdsvec r=yuzefovich a=yuzefovich

**roachtest: add new tpchvec config**

This commit adds a new `tpchvec/perf_no_stats` config that is the same
as `tpchvec/perf` except for the fact that stats creation is disabled.
The plans without stats are likely to be different, so it gives us an
easy way to get more test coverage. One caveat here is that query
9 without stats takes insanely long to run, so some new plumbing has
been added to skip that query.

Additionally, `tpcdsvec` has been adjusted. The test runs all queries
with and without stats present, under both the `on` and `off` vectorize
options. However, when stats are not present, the `on` config would be
reduced to `off` because of the `vectorize_row_count_threshold`
heuristic. This commit disables that heuristic.

Release note: None

**roachtest: switch the config order in tpchvec/perf**

Let's see whether it makes a difference to the occasional failures of
`tpchvec/perf`, which are very hard to explain.

This commit also changes the workload command for the `perf` config to
run only against node 1, thus eliminating one possible source of
"randomness" in the failures.

Addresses: #49955.

Release note: None

49980: kv/concurrency: drop uncontended replicated lock on unreplicated upgrade r=nvanbenschoten a=nvanbenschoten

Fixes #49658.
Informs #9521.
Informs #49973.
Related to #49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for #49973 and avoids the 99th percentile latency regression observed in #49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible.

If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial concurrency discussed in #49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like #33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run: #49658.

Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in 03b374a on Jun 10, 2020
@nvanbenschoten (Member):

Let's keep this open until the backport lands.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 11, 2020
Fixes cockroachdb#49658.
Informs cockroachdb#9521.
Informs cockroachdb#49973.
Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to
drop write-uncontended locks when upgraded from the Unreplicated to
Replicated durability in much the same way we drop Replicated locks when
first acquired. This is possible because a Replicated lock is also
stored as an MVCC intent, so it does not need to also be stored in the
lockTable if writers are not queuing on it. This is beneficial because
it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile
latency regression observed in cockroachdb#49658. Since we aren't currently great
at avoiding excessive contention on limited scans when locks are in the
lockTable, it's better to keep locks out of the lockTable when
possible.

If any of the readers do truly contend with this lock even after their
limit has been applied, they will notice during their MVCC scan and
re-enter the queue (possibly recreating the lock through
AddDiscoveredLock). Still, in practice this seems to work well in
avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a
bit of a hack and I am very interested in fixing this fully in the
future (through an approach like cockroachdb#33373 or by incrementally consulting
the lockTable in a `lockAwareIterator`), but for now, I don't see a
downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one
of the demos we like to run.

Release note (performance improvement): limited SELECT statements
now do a better job avoiding unnecessary contention with UPDATE and
SELECT FOR UPDATE statements.
@nvanbenschoten (Member):

This is now fixed on the release-20.1 branch.
