Performance regression in covering/duplicate indexes in 20.1 #49658
What is the query which always uses the primary index? I ran the script and looked at the […]
Here's an EXPLAIN ANALYZE for the query on n9: linky
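For anyone reproducing this, the check itself is just an EXPLAIN / EXPLAIN ANALYZE of the promo-code lookup; which index the scan resolves to shows up directly in the plan. The sketch below assumes the stock MovR `promo_codes` schema and a placeholder code value, not the exact statement behind the link above:

```sql
-- Which index does the lookup plan against? The index named in the output
-- should be one of the region-pinned duplicate indexes, not the primary index.
EXPLAIN SELECT code, description, rules
  FROM promo_codes
 WHERE code = 'promo_code_0';

-- EXPLAIN ANALYZE additionally runs the statement and collects execution
-- statistics (in 20.1 it returns a link to a DistSQL diagram, presumably
-- the "linky" above).
EXPLAIN ANALYZE SELECT code, description, rules
  FROM promo_codes
 WHERE code = 'promo_code_0';
```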
It's worth noting that the p90 latency I see is <2ms while the p99 indeed is ~20ms. The KV execution latencies are very interesting. p90 is <2ms across the board but p99 is ~30ms on a bunch of nodes (and very small on others). Looping in @nvanbenschoten who is on call for KV.
The cluster I created is at http://radu-multi-region-georeplicas-0001.roachprod.crdb.io:26258/
I'm probably missing something, but how do we know that the impact on p99 latency is due to the duplicate indexes not being used? Also, we are inserting into this table in an explicit transaction […].
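For context on the write side: the insert referenced here is presumably MovR's promo-code creation running in an explicit transaction. A rough sketch of its shape, with column names taken from the stock MovR schema and placeholder values (the exact statement is the elided snippet above):

```sql
-- Approximate shape of the promo-code insert; values are placeholders.
BEGIN;
INSERT INTO promo_codes (code, description, creation_time, expiration_time, rules)
  VALUES ('promo_code_0',
          'placeholder description',
          now(),
          now() + INTERVAL '30 days',
          '{"type": "percent_discount", "value": "10%"}');
COMMIT;
```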
Yes, I think the database not using duplicate indexes properly (which would only be relevant for that particular query) was just a guess. The problem (as far as I understand) is that the overall p99 latency is 20ms when it was <5ms in 19.2.
Got it. @keith-mcclellan Has anything changed in the workload? Have periodic […]? Have we validated this recently, or is this coming from memory?
I've been looking at traces from this cluster and nothing sticks out. The only time I see latency jump above 3ms for any query is when it hits contention and needs to wait on the completion of another transaction. Given that there is contention in the workload and a few transactions perform a series of statements and take 20-30ms in aggregate, I'm not surprised that we see p99 latency in the 10s of ms, even with partitioning.
I just ran the demo with v19.2.4 and confirmed that I do see p99 latencies below 5ms. Interestingly, p90 and p99.9 latencies appear to be almost identical between v19.2.4 and v20.1.1, so it's just p99 that differs.
I think what's happening is that the first and third most frequent statements are […]. So I'm not really sure what the next step here is. I wouldn't really consider this a "bug" in the database, but it's certainly a behavioral change in v20.1 that's having an adverse effect on the metric we're interested in for this demo. We are aware of longer-term improvements in the database that may help retain most of the benefits of SFU without the downsides (#49684), but they are not on the roadmap for the next release. We could shift the workload distribution in movr a little to push the contention back above the 99th percentile (e.g. halve the number of […]). @nstewart may have opinions here.
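For concreteness, the cluster setting referred to in the next comment (and presumably in the elided portion above) is the knob for the implicit SELECT FOR UPDATE behavior that 20.1 added to UPDATE statements. The names below are from memory and should be double-checked against the v20.1 docs, and, as discussed below, disabling them is not something anyone actually recommends for real workloads:

```sql
-- Per-session: UPDATEs stop acquiring locks during their initial row fetch
-- (the implicit SELECT FOR UPDATE introduced in 20.1).
SET enable_implicit_select_for_update = false;

-- Cluster-wide default for new sessions.
SET CLUSTER SETTING sql.defaults.implicit_select_for_update.enabled = false;
```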
@nvanbenschoten Thanks for looking into this... so to summarize: I can get the pre-20.1 results by changing that cluster setting, but in most cases we really wouldn't want to do that in a real-world scenario. Correct? @nstewart what do you suggest? Should we modify the demo to change the query balance, or should we consider setting this DCL command ahead of running the demo? Or can we get #49684 prioritized?
I definitely wouldn't change the cluster setting, @keith-mcclellan. To your point, we wouldn't want to do this in a real-world scenario. I will change the query balance as a stop-gap, but obviously we wouldn't have this luxury if this were a customer workload, which MovR is attempting to replicate. @johnrk @nvanbenschoten do we have any way to see from telemetry if actual customers are running into this? I'm inclined to revisit the priority of #49684 if so, but I don't know anything about the cost or actual customer benefit, so I'll defer to the KV team here. @nvanbenschoten would TPC-E help us catch these types of regressions? If not, would it be helpful to have MovR as a non-micro benchmark as a canary in the coal mine?
In my opinion, this type of change should be called out as "backward-incompatible" in our release notes: https://www.cockroachlabs.com/docs/releases/v20.1.0.html#backward-incompatible-changes. Even though it didn't "break" anything, application changes are required to maintain consistent, expected performance. @johnrk, @nvanbenschoten, if you agree, please work with @rmloveland to update those docs.
I think calling this "backward incompatible" is a bit much. Depending on what you look at (p99 vs p99.99) it looks like either a regression or a significant improvement. Which of those figures matters more to a user - it depends.
Yeah, "backward incompatible" may be too much, but it does seem like we should communicate changes like this, with suggested steps, if we know about them.
Even more than terming this backwards incompatible, I find the idea of terming this a regression at all to be controversial. Slight changes to latency distributions (which this is) are generally not documented changes. Furthermore, I suspect this indicates an improvement in average and maximum latency. My guess is that, in practice, nobody would actually choose to implement the "workaround" that was proposed. If anything, I think this highlights that our choice to display arbitrary percentiles of a distribution is a mistake. If we showed the full distribution of latencies, the user would likely be happy with the upgrade rather than alarmed. The reason we choose to display these arbitrary percentiles is a limitation of the way we down-sample histograms for long-term storage. Ideally we'd encode and store the entire histogram (as happens when Prometheus is used to monitor a cockroach cluster); then we could use something like a heatmap or a ridgeplot to visualize the entire latency distribution. https://github.com/ryantxu/grafana-ridgeline-panel
Ok, thanks for this perspective, Andrew. This is bubbling up from the docs side because we couldn't get the expected p99 in the 20.1 version of the multi-region topology tutorial. Perhaps we should be describing latency improvements differently in that doc.
Maybe we can update the doc to look at p90?
Yes, that's correct. But as Nate mentioned, that's not a great solution.
Do we track full-system latency percentiles in telemetry? If not, I don't know how we would be able to figure this out. And even with that, it wouldn't tell us the whole story.
Interestingly, TPC-E does have a much more realistic read/write ratio than TPC-C, so it might pick this kind of thing up. If we're running MovR in front of customers, though, I don't think there's a substitute for monitoring its performance directly. I tend to agree with what @RaduBerinde and @ajwerner said above. "Backwards incompatible" is not the right term here. "Regression" is more appropriate, but only if it's properly contextualized as a regression in a portion of the latency distribution for certain workloads. Unfortunately, when it comes to contention handling, you're rarely going to make a change that is an improvement across the board because, at the end of the day, you have multiple transactions competing for the same limited shared resource. To this point, I was just talking to @sumeerbhola about #49684. There might be a way to make this change less invasive by coupling the "upgrade" locking strength with the "unreplicated" lock durability. This should move the needle on p99 here, but it would also make changes elsewhere. For one, it would re-introduce a large class of transaction retries that we considered important to get rid of. It would probably also hurt top-end throughput on a workload like YCSB-A. So it's not the kind of thing that's appropriate for a point release.
I think I found a small patch inside the `lockTable` […]. By the way, I meant to thank whoever put this script together (@keith-mcclellan?). These kinds of investigations are orders of magnitude easier to work through with an automated reproduction.
Nicely done @nvanbenschoten!
Fixes cockroachdb#49658. Informs cockroachdb#9521. Informs cockroachdb#49973. Related to cockroachdb#49684. This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for cockroachdb#49973 and avoids the 99th percentile latency regression observed in cockroachdb#49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible. If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial concurrency discussed in cockroachdb#49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like cockroachdb#33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change. I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run. Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements.
49891: physicalplan: preevaluate subqueries on LocalExprs and always set LocalExprs r=yuzefovich a=yuzefovich

**physicalplan: preevaluate subqueries on LocalExprs**

When the plan is local, we do not serialize expressions. Previously, in such a case we would also not preevaluate the subqueries in the expressions, which made `EXPLAIN (VEC)` return an unexpected plan (there would be a `tree.Subquery` in the expression, which we don't support in the vectorized engine, so we would wrap the plan). Now we will preevaluate the subqueries before storing them in the processor spec. AFAICT it affects only this explain variant and nothing else. Release note: None

**colexec: improve expression parsing**

This commit introduces `colexec.ExprHelper` that helps with expression processing. Previously, we were allocating a new `execinfra.ExprHelper` and were calling `Init` on it in order to get the typed expression from the possibly serialized representation of each expression. Now, this new expression helper is reused between all expressions in the flow on a single node. There is one caveat, however: we need to make sure that we force deserialization of the expressions during the `SupportsVectorized` check if the flow is scheduled to be run on a remote node (different from the one that is performing the check). This is necessary to make sure that the remote nodes will be able to deserialize the expressions without encountering errors (if we don't force the serialization during the check, we will use `LocalExpr` - if available - and might not catch things that we don't support). Release note: None

**physicalplan: always store LocalExpr**

Previously, we would set either `LocalExpr` (unserialized expression, only when we have the full plan on a single node) or `Expr` (serialized expression, when we have a distributed plan as well as in some tests). However, we could be setting both and making a best effort to reuse the unserialized `LocalExpr` on the gateway even if the plan is distributed, and this commit adds such behavior. Fixes: #49810. Release note: None

49966: roachtest: adjust tpchvec and tpcdsvec r=yuzefovich a=yuzefovich

**roachtest: add new tpchvec config**

This commit adds a new `tpchvec/perf_no_stats` config that is the same as `tpchvec/perf` except for the fact that stats creation is disabled. The plans without stats are likely to be different, so it gives us an easy way to get more test coverage. One caveat here is that query 9 without stats takes insanely long to run, so some new plumbing has been added to skip that query. Additionally, `tpcdsvec` has been adjusted. The test runs all queries with and without stats present with `on` and `off` vectorize options. However, when stats are not present, the `on` config will be reduced to `off` because of the `vectorize_row_count_threshold` heuristic. This commit disables that heuristic. Release note: None

**roachtest: switch the config order in tpchvec/perf**

Let's see whether it makes a difference to the occasional failures of `tpchvec/perf`, which are very hard to explain. This commit also changes the workload command for the `perf` config to run only against node 1, thus eliminating one possible source of "randomness" for the failures. Addresses: #49955. Release note: None

49980: kv/concurrency: drop uncontended replicated lock on unreplicated upgrade r=nvanbenschoten a=nvanbenschoten

Fixes #49658. Informs #9521. Informs #49973. Related to #49684.
This commit tweaks the `lockTable`'s handling of lock acquisition to drop write-uncontended locks when upgraded from the Unreplicated to Replicated durability in much the same way we drop Replicated locks when first acquired. This is possible because a Replicated lock is also stored as an MVCC intent, so it does not need to also be stored in the lockTable if writers are not queuing on it. This is beneficial because it serves as a mitigation for #49973 and avoids the 99th percentile latency regression observed in #49658. Since we aren't currently great at avoiding excessive contention on limited scans when locks are in the lockTable, it's better to keep locks out of the lockTable when possible. If any of the readers do truly contend with this lock even after their limit has been applied, they will notice during their MVCC scan and re-enter the queue (possibly recreating the lock through AddDiscoveredLock). Still, in practice this seems to work well in avoiding most of the artificial concurrency discussed in #49973. It's a bit of a hack and I am very interested in fixing this fully in the future (through an approach like #33373 or by incrementally consulting the lockTable in a `lockAwareIterator`), but for now, I don't see a downside to making this change. I intend to backport this change to v20.1, as it's causing issues in one of the demos we like to run: #49658. Release note (performance improvement): limited SELECT statements now do a better job avoiding unnecessary contention with UPDATE and SELECT FOR UPDATE statements. Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
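To illustrate the interaction the commit message describes (this example is not from the issue; the `kv` table and key values are made up), the problematic pattern is a LIMITed scan whose declared key span covers a row that another transaction has locked via an UPDATE, even though the limit means the scan never actually needs that row:

```sql
-- Generic setup; table and values are illustrative only.
CREATE TABLE IF NOT EXISTS kv (k INT PRIMARY KEY, v INT);
INSERT INTO kv (k, v) VALUES (1, 0), (2, 0), (3, 0), (900, 0);

-- Session 1: with implicit SELECT FOR UPDATE, the UPDATE first takes an
-- unreplicated lock on k = 900, then writes an intent, upgrading the lock
-- to replicated durability. The transaction is left open.
BEGIN;
UPDATE kv SET v = v + 1 WHERE k = 900;

-- Session 2: a limited scan over the same span. Its LIMIT is satisfied long
-- before k = 900, so it never truly contends with the lock. Before this
-- change, the upgraded lock could remain tracked in the lockTable and the
-- scan could queue behind it; with the change, a write-uncontended lock is
-- dropped from the lockTable on upgrade and the scan proceeds, falling back
-- to the MVCC intent only if it actually reaches k = 900.
SELECT k, v FROM kv WHERE k > 0 ORDER BY k LIMIT 2;
```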
Let's keep this open until the backport lands.
This is now fixed on the `release-20.1` branch.
Describe the problem
Tested using MovR: when duplicate covering indexes are applied to the promo codes table for fast reads in multiple datacenters, the database seems to use only the primary index and not the index closest to the read.
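For readers unfamiliar with the pattern being tested: "duplicate covering indexes" means one covering secondary index on `promo_codes` per region, each pinned to its region with a zone config, so the optimizer can serve the lookup from the index whose leaseholder is local to the gateway. A minimal sketch for a single hypothetical region (index and region names are placeholders; the stock MovR schema is assumed):

```sql
-- One covering index per region (only us-east1 shown here).
CREATE INDEX promo_codes_idx_us_east1 ON promo_codes (code)
    STORING (description, creation_time, expiration_time, rules);

-- Pin the index's replicas and leaseholder to that region.
ALTER INDEX promo_codes@promo_codes_idx_us_east1 CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '{"+region=us-east1": 1}',
    lease_preferences = '[[+region=us-east1]]';
```

The expectation is that a `SELECT ... FROM promo_codes WHERE code = ...` issued through a us-east1 gateway then plans against `promo_codes_idx_us_east1` rather than the primary index, which is what the EXPLAIN ANALYZE earlier in the thread was checking.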
To Reproduce
Step through MovR in a multi-region config (easier to do in the 19.2 version), create duplicate indexes, and pin them to the regions. p99 on 20.1 is ~20ms; on 19.2 it was <5ms.
geo-partitioned-replicas-demo.sh in https://github.com/cockroachlabs/scripted-demos