Performance regression in covering/duplicate indexes in 20.1 #49658

Closed
keith-mcclellan opened this issue May 28, 2020 · 24 comments · Fixed by #49980
Labels: A-partitioning · C-investigation (Further steps needed to qualify. C-label will change.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)

@keith-mcclellan (Contributor) commented May 28, 2020:

Describe the problem
Tested using MovR: when duplicate covering indexes are applied to the promo_codes table for fast reads across multiple datacenters, the database appears to use only the primary index rather than the index closest to the read.

To Reproduce
Step through MovR in a multi-region config (easier to do in the 19.2 version), create duplicate indexes, and pin them to multiple regions. p99 on 20.1 is ~20ms; on 19.2 it was <5ms.

geo-partitioned-replicas-demo.sh in https://github.com/cockroachlabs/scripted-demos
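
For reference, the duplicate-index setup being exercised looks roughly like the sketch below. This is a minimal sketch assuming hypothetical index and region names (and the MovR promo_codes columns); it is not the exact statements run by the demo script.

```sql
-- One covering secondary index per region, with its leaseholder pinned to
-- that region so reads can be served by the nearest copy.
CREATE INDEX promo_codes_idx_us_west ON promo_codes (code)
    STORING (description, creation_time, expiration_time, rules);

ALTER INDEX promo_codes@promo_codes_idx_us_west
    CONFIGURE ZONE USING
        num_replicas = 3,
        constraints = '{"+region=us-west1": 1}',
        lease_preferences = '[[+region=us-west1]]';
```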

keith-mcclellan added the C-performance and A-partitioning labels on May 28, 2020
awoods187 added the C-investigation label on May 28, 2020
cockroachdb deleted a comment from the blathers-crl bot on May 28, 2020
@RaduBerinde (Member):

What is the query which always uses the primary index? I ran the script and looked at the SELECT FROM promo_codes query and it does look like it's using a duplicated index.

@RaduBerinde (Member):

Here's an EXPLAIN ANALYZE for the query on n9: linky
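
(The actual output was linked above; for context, the statement being analyzed is roughly of the following shape. The exact query text and literal are assumptions, not copied from the workload.)

```sql
EXPLAIN ANALYZE
SELECT code, description, rules
  FROM promo_codes
 WHERE code = 'some_promo_code';  -- illustrative value
```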

@RaduBerinde (Member):

It's worth noting that the p90 latency I see is <2ms while the p99 indeed is ~20ms.

The KV execution latencies are very interesting. p90 is <2ms across the board but p99 is ~30ms on a bunch of nodes (and very small on others).

[screenshots: SQL service latency and per-node KV execution latency percentile charts]

Looping in @nvanbenschoten who is on call for KV.

@RaduBerinde (Member):

The cluster I created is at http://radu-multi-region-georeplicas-0001.roachprod.crdb.io:26258/

@nvanbenschoten (Member):

I'm probably missing something, but how do we know that the impact on p99 latency is due to the SELECT FROM promo_codes query? It looks like we're just watching top-line p99 latency.

Also, we are inserting into this table in an explicit transaction (INSERT INTO promo_codes ...), so this could be responsible. If one of these reads hits an intent, it's going to have to wait for the intent to be resolved. That would be expected.
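
A minimal illustration of that interaction, with made-up values and an assumed column list:

```sql
-- Session 1: explicit transaction that leaves a provisional write (an intent)
-- on a promo_codes row until it commits or aborts.
BEGIN;
INSERT INTO promo_codes (code, description) VALUES ('promo_xyz', 'illustrative row');

-- Session 2: a read that encounters that intent has to wait for session 1 to
-- finish before it can return, which shows up as tail latency on the read.
SELECT * FROM promo_codes WHERE code = 'promo_xyz';
```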

@RaduBerinde (Member):

Yes, I think the database not using duplicate indexes properly (which would only be relevant for that particular query) was just a guess. The problem (as far as I understand) is that the overall p99 latency is 20ms when it was <5ms in 19.2.

@nvanbenschoten (Member):

Got it. @keith-mcclellan Has anything changed in the workload? Have periodic INSERT INTO promo_codes statements always been part of it? If so then we'd always expect some high latency in the tail.

> on 19.2 it was <5ms

Have we validated this recently or is this coming from memory?

@nvanbenschoten (Member):

I've been looking at traces from this cluster and nothing sticks out. The only time I see latency jump above 3ms for any query is when it hits contention and needs to wait on the completion of another transaction.

Given that there is contention in the workload and a few transactions perform a series of statements and take 20-30ms in aggregate, I'm not surprised that we see p99 latency in the 10s of ms, even with partitioning.

@nvanbenschoten (Member):

I just ran the demo with v19.2.4 and confirmed that I do see p99 latencies below 5ms. Interestingly, p90 and p99.9 latencies appear to be almost identical between v19.2.4 and v20.1.1, so it's just p99 that differs.

@nvanbenschoten (Member):

SET CLUSTER SETTING "sql.defaults.implicit_select_for_update.enabled" = false; appears to fix the issue in v20.1.1.

I think what's happening is that the first and third most frequent statements are SELECT ... FROM vehicles ... LIMIT ... and the fourth most frequent statement is UPDATE vehicles ... and implicit SFU changes the way these statements interact. Specifically, it causes the UPDATE statement to acquire a lock on its row earlier, which evens out the odds of it winning in situations where these statements contend. This improved fairness is obviously a trade-off, and in this case, it appears to push enough contention under the p99 threshold that we see this jump in latency. On the other hand, it doesn't appear to affect p90 or p99.9 and we see the improved fairness pull down the rest of the tail latencies significantly - p99.99 drops from 230ms to 66ms with implicit SFU.
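
The shapes of the contending statements, paraphrased from the workload (exact columns, predicates, and values are assumptions):

```sql
-- Readers: limited scans over vehicles that touch the same rows.
SELECT id, city, status FROM vehicles WHERE city = 'new york' LIMIT 25;

-- Writer: with implicit SELECT FOR UPDATE (new in 20.1), the UPDATE's initial
-- row fetch acquires an exclusive lock up front, so it wins ties against the
-- readers above more often than it did in 19.2.
UPDATE vehicles SET status = 'in_use' WHERE city = 'new york' AND id = 'some-uuid';
```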

So I'm not really sure what the next step here is. I wouldn't really consider this a "bug" in the database, but it's certainly a behavioral change in v20.1 that's having an adverse effect on the metric we're interested in for this demo. We are aware of longer-term improvements in the database that may help retain most of the benefits of SFU without the downsides (#49684), but they are not on the roadmap for the next release. We could shift around the workload distribution in movr a little bit to push the contention back above the 99th percentile (e.g. halve the number of UPDATE vehicle statements we run). Or we could disable implicit SFU in movr using the enable_implicit_select_for_update session variable until we do get around to addressing #49684.
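
For completeness, the two knobs mentioned above look like this; this is a sketch of the workaround, not a recommendation:

```sql
-- Cluster-wide (the setting used in the experiment above; changes the default
-- for every session on the cluster):
SET CLUSTER SETTING sql.defaults.implicit_select_for_update.enabled = false;

-- Per-session, e.g. scoped to just the demo's connections:
SET enable_implicit_select_for_update = false;
```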

@nstewart may have opinions here.

@keith-mcclellan (Contributor, author):

@nvanbenschoten Thanks for looking into this... so to summarize, I can get the pre-20.1 results by changing that cluster setting, but we really wouldn't want to do that in a real-world scenario in most cases. Correct?

@nstewart what do you suggest? Should we modify the demo to change the query balance or should we consider setting this DCL command ahead of running the demo? Or can we get #49684 prioritized?

@nstewart (Contributor):

I definitely wouldn't change the cluster setting, @keith-mcclellan. To your point, we wouldn't want to do this in a real-world scenario. I will change the query balance as a stop-gap, but obviously we wouldn't have this luxury if this were a customer workload, which is what MovR is attempting to replicate.

@johnrk @nvanbenschoten do we have any way to see from telemetry if actual customers are running into this? I'm inclined to revisit the priority of #49684 if so, but I don't know anything about the cost or actual customer benefit so I'll defer to the KV team here.

@nvanbenschoten would TPC-E help us catch these types of regressions? If not, would it be helpful to have MovR as a non-micro benchmark, serving as a canary in the coal mine?

@jseldess (Contributor):

In my opinion, this type of change should be called out as "backward-incompatible" in our release notes: https://www.cockroachlabs.com/docs/releases/v20.1.0.html#backward-incompatible-changes. Even though it didn't "break" anything, application changes are required to maintain consistent, expected performance. @johnrk, @nvanbenschoten, if you agree, please work with @rmloveland to update those docs.

@RaduBerinde (Member):

I think calling this "backward incompatible" is a bit much. Depending on what you look at (p99 vs p99.99), it looks like either a regression or a significant improvement. Which of those figures matters more depends on the user.

@jseldess (Contributor):

Yeah, "backward incompatible" may be too much, but it does seem like we should communicate changes like this, with suggested steps, if we know about them.

@ajwerner (Contributor):

More than the question of calling this backwards-incompatible, I find even calling it a regression controversial. Slight changes to latency distributions (which is what this is) are generally not documented changes. Furthermore, I suspect this change is an improvement in average and maximum latency. My guess is that, in practice, nobody would actually choose to implement the proposed "workaround".

If anything, I think this highlights that our choice to display arbitrary percentiles of a distribution is a mistake. If we showed the full distribution of latencies then the user would likely be happy with the upgrade rather than alarmed. The reason we choose to display these arbitrary percentiles is a limitation of the way we down-sample histograms for long-term storage. Ideally we'd encode and store the entire histogram (as happens when Prometheus is used to monitor a cockroach cluster) then we could use something like a heatmap or a ridgeplot to visualize the entire latency distribution.

[screenshots: example heatmap and ridgeline latency-distribution visualizations]

https://github.com/ryantxu/grafana-ridgeline-panel
https://www.circonus.com/2018/03/grafana-heatmaps-with-irondb-have-landed/

@nstewart (Contributor):

cc @piyush-singh

@jseldess (Contributor) commented May 29, 2020 via email.

@RaduBerinde (Member):

Maybe we can update the doc to look at p90?

@nvanbenschoten (Member):

> so to summarize, I can get the pre-20.1 results by changing that cluster setting, but we really wouldn't want to do that in a real-world scenario in most cases.

Yes, that's correct. But as Nate mentioned, that's not a great solution.

> do we have any way to see from telemetry if actual customers are running into this

Do we track full-system latency percentiles in telemetry? If not, I don't know how we would be able to figure this out. And even with that, it wouldn't tell us the whole story.

> would TPC-E help us catch these types of regressions

Interestingly, TPC-E does have a much more realistic read/write ratio than TPC-C, so it might pick this kind of thing up. If we're running MovR in front of customers, though, I don't think there's a substitute for monitoring its performance directly.

I tend to agree with what @RaduBerinde and @ajwerner said above. "Backwards incompatible" is not the right term here. "Regression" is more appropriate, but only if it's properly contextualized as a regression in a portion of the latency distribution for certain workloads. Unfortunately, when it comes to contention handling, you're rarely going to make progress that is an improvement across the board because, at the end of the day, you have multiple transactions competing for the same limited shared resource.

To this point, I was just talking to @sumeerbhola about #49684. There might be a way to make this change less invasive by coupling the "upgrade" locking strength with the "unreplicated" lock durability. This should move the needle on p99 here, but it would also make changes elsewhere. For one, it would re-introduce a large class of transaction retries that we considered important to get rid of. It would probably also hurt top-end throughput on a workload like YCSB-A. So it's not the kind of thing that's appropriate for a point release.

@nvanbenschoten (Member):

I think I found a small patch inside the lockTable implementation that we can make in v20.1.x to recover the p99 latency on this workload (and others like it) without needing to make much of a compromise anywhere else. It's a bit hacky but appears to be working well. The patch drives p99 latencies on the workload back down to ~3.3ms.

By the way, I meant to thank whoever put this script together (@keith-mcclellan?). These kinds of investigations are orders of magnitude easier to work through with an automated reproduction.

@awoods187 (Contributor):

Nicely done @nvanbenschoten!

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 8, 2020
Fixes cockroachdb#49658.
Informs cockroachdb#9521.
Informs cockroachdb#49973.
Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to
drop write-uncontended locks when upgraded from the Unreplicated to
Replicated durability in much the same way we drop Replicated locks when
first acquired. This is possible because a Replicated lock is also
stored as an MVCC intent, so it does not need to also be stored in the
lockTable if writers are not queuing on it. This is beneficial because
it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile
latency regression observed in cockroachdb#49658. Since we aren't currently great
at avoiding excessive contention on limited scans when locks are in the
lockTable, it's better to keep locks out of the lockTable when
possible.

If any of the readers do truly contend with this lock even after their
limit has been applied, they will notice during their MVCC scan and
re-enter the queue (possibly recreating the lock through
AddDiscoveredLock). Still, in practice this seems to work well in
avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a
bit of a hack and I am very interested in fixing this fully in the
future (through an approach like cockroachdb#33373 or by incrementally consulting
the lockTable in a `lockAwareIterator`), but for now, I don't see a
downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one
of the demos we like to run.

Release note (performance improvement): limited SELECT statements
now do a better job avoiding unnecessary contention with UPDATE and
SELECT FOR UPDATE statements.
craig bot pushed a commit that referenced this issue Jun 10, 2020
49891: physicalplan: preevaluate subqueries on LocalExprs and always set LocalExprs r=yuzefovich a=yuzefovich

**physicalplan: preevaluate subqueries on LocalExprs**

When the plan is local, we do not serialize expressions. Previously, in
such a case we would also not preevaluate the subqueries in the
expressions, which made `EXPLAIN (VEC)` return an unexpected plan (there
would be a `tree.Subquery` in the expression, which we don't support in
the vectorized engine, so we would wrap the plan). Now we preevaluate
the subqueries before storing them in the processor spec. AFAICT it
affects only this explain variant and nothing else.

Release note: None

**colexec: improve expression parsing**

This commit introduces `colexec.ExprHelper`, which helps with expression
processing. Previously, we were allocating a new `execinfra.ExprHelper`
and calling `Init` on it in order to get the typed expression from the
possibly serialized representation of each expression. Now, this new
expression helper is reused across all expressions in the flow on
a single node.

There is one caveat, however: we need to make sure that we force
deserialization of the expressions during the `SupportsVectorized` check
if the flow is scheduled to be run on a remote node (different from the
one that is performing the check). This is necessary to make sure that
the remote nodes will be able to deserialize the expressions without
encountering errors (if we don't force the serialization during the
check, we will use `LocalExpr`, if available, and might not catch
things that we don't support).

Release note: None

**physicalplan: always store LocalExpr**

Previously, we would set either `LocalExpr` (unserialized expression,
only when we have the full plan on a single node) or `Expr` (serialized
expression, when we have a distributed plan, as well as in some tests).
However, we could set both and make a best effort to reuse the
unserialized `LocalExpr` on the gateway even if the plan is distributed.
This commit adds that behavior.

Fixes: #49810.

Release note: None

49966: roachtest: adjust tpchvec and tpcdsvec r=yuzefovich a=yuzefovich

**roachtest: add new tpchvec config**

This commit adds a new `tpchvec/perf_no_stats` config that is the same
as `tpchvec/perf` except for the fact that stats creation is disabled.
The plans without stats are likely to be different, so it gives us an
easy way to get more test coverage. One caveat here is that query
9 without stats takes insanely long to run, so some new plumbing has
been added to skip that query.

Additionally, `tpcdsvec` has been adjusted. The test runs all queries
with and without stats present, under both the `on` and `off` vectorize
options. However, when stats are not present, the `on` config would be
reduced to `off` because of the `vectorize_row_count_threshold`
heuristic. This commit disables that heuristic.

Release note: None

**roachtest: switch the config order in tpchvec/perf**

Let's see whether it makes a difference to the occasional failures of
`tpchvec/perf`, which are very hard to explain.

This commit also changes the workload command for the `perf` config to
run only against node 1, thus eliminating one possible source of
"randomness" in the failures.

Addresses: #49955.

Release note: None

49980: kv/concurrency: drop uncontended replicated lock on unreplicated upgrade r=nvanbenschoten a=nvanbenschoten

Fixes #49658.
Informs #9521.
Informs #49973.
Related to #49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for #49973 and avoids the 99th percentile latency regression observed in #49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible.

If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial concurrency discussed in #49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like #33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run: #49658.

Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in 03b374a on Jun 10, 2020
@nvanbenschoten (Member):

Let's keep this open until the backport lands.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 11, 2020
Fixes cockroachdb#49658.
Informs cockroachdb#9521.
Informs cockroachdb#49973.
Related to cockroachdb#49684.

This commit tweaks the `lockTable`'s handling of lock acquisition to
drop write-uncontended locks when upgraded from the Unreplicated to
Replicated durability in much the same way we drop Replicated locks when
first acquired. This is possible because a Replicated lock is also
stored as an MVCC intent, so it does not need to also be stored in the
lockTable if writers are not queuing on it. This is beneficial because
it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile
latency regression observed in cockroachdb#49658. Since we aren't currently great
at avoiding excessive contention on limited scans when locks are in the
lockTable, it's better to keep locks out of the lockTable when
possible.

If any of the readers do truly contend with this lock even after their
limit has been applied, they will notice during their MVCC scan and
re-enter the queue (possibly recreating the lock through
AddDiscoveredLock). Still, in practice this seems to work well in
avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a
bit of a hack and I am very interested in fixing this fully in the
future (through an approach like cockroachdb#33373 or by incrementally consulting
the lockTable in a `lockAwareIterator`), but for now, I don't see a
downside to making this change.

I intend to backport this change to v20.1, as it's causing issues in one
of the demos we like to run.

Release note (performance improvement): limited SELECT statements
now do a better job avoiding unnecessary contention with UPDATE and
SELECT FOR UPDATE statements.
@nvanbenschoten (Member):

This is now fixed on the release-20.1 branch.
