sql: performance regression in Movr #56923
@nvanbenschoten can you PTAL here and help route as needed?
The first step here would be to determine which queries are slower and whether any of the query plans have changed. From the referenced issue, I can't get a sense for that. The only thing we see is that an INSERT statement got a ReadWithinUncertaintyIntervalError, which is a little strange and may indicate that the SQL optimizer is choosing a bad plan for the INSERT's initial row scan (see #56661).
@keith-mcclellan can you show us the
@awoods187 how can I do that? I don't actually run the queries; the movr-load process generates them dynamically.
You can do it from the DB Console or by looking at the queries via
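For readers following along, here is a minimal sketch of how one might surface the slow queries and capture a diagnostics bundle. The column names and the example statement are assumptions (an illustrative movr-style lookup), not taken from this thread:

```sql
-- Hedged sketch: surface the slowest statement fingerprints on this gateway
-- node, then collect a statement diagnostics bundle for one of them.
-- Column names reflect crdb_internal as of 20.x and may differ by version.
SELECT key, count, service_lat_avg
FROM crdb_internal.node_statement_statistics
ORDER BY service_lat_avg DESC
LIMIT 10;

-- EXPLAIN ANALYZE (DEBUG) produces a downloadable bundle (plan, stats, trace).
-- The query below is illustrative, not one captured from the workload.
EXPLAIN ANALYZE (DEBUG)
  SELECT id, status FROM vehicles WHERE city = 'seattle' LIMIT 5;
```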
Do you want it before or after applying the duplicate indexes? Or both? The perf regressed both before and after optimization.
These are all from after applying the duplicate indexes on a fresh roachprod cluster... performance was better than the first time I tried it; most of these are in the 20-60 ms range (generally they're all sub-10ms after adding the partitioned duplicate indexes). As an aside, ALL of the queries are running slower on 20.2, not just some of them.
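For context, the "partitioned duplicate indexes" mentioned here follow the duplicate-index pattern for reference tables, with leaseholders pinned per region. A rough, illustrative sketch follows; the region name, index name, and stored columns are assumptions, and the real statements live in the demo script referenced later in the thread:

```sql
-- Illustrative only: a duplicate secondary index on a movr reference table,
-- with its leaseholder pinned to one region via zone configs.
CREATE INDEX promo_codes_idx_us_west ON promo_codes (code)
    STORING (description, creation_time, expiration_time, rules);

ALTER INDEX promo_codes@promo_codes_idx_us_west
    CONFIGURE ZONE USING
      num_replicas = 3,
      constraints = '{"+region=us-west1": 1}',
      lease_preferences = '[[+region=us-west1]]';
```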
None of those statement bundles show queries that take more than 4.5ms. This might be a good test of whether we can use our observability tools to determine which queries are slower and why.
The UI was showing much higher query latency than what you're seeing in the bundle. You can reproduce my environment using https://github.com/cockroachlabs/scripted-demos/blob/master/geo-partitioned-replicas-demo.sh
Note: we should probably add movr performance to roachperf given its importance.
I'm not able to reproduce this. Note that I'm running without duplicate indexes (per the issue description, it doesn't seem like a requirement to reproduce the performance regression). I first load v20.1.8 and look at the SQL p99 graph:
@asubiotto can we do a quick call? Your 20.1 results don't look right; p99 should be sub-10ms and this is showing 500+ms (which is what I expect before the indexes and leaseholder pinning are applied).
I just did a long-running test of both 20.1.7 and 20.2.0 using Movr. The good news is that if you let it run long enough, 20.2 does get to the expected sub-10ms p99 performance, but it takes 10 minutes or so after applying the schema changes. In 20.1.7 it only takes 3 to 5 minutes. The weird thing is, I went hunting for the long-running queries in 20.2 at the 5 minute mark and couldn't find them in the statements or transactions pages... did we change the way we're rolling up these aggregate stats in the admin UI? Perhaps the performance hasn't gotten worse and we're just presenting the data differently?
We changed how statement fingerprinting works, but I don't think that should have changed the metrics you are looking at. CC @solongordon to correct me if that isn't right.
That's correct, no changes to statement statistic aggregation I can think of.
The problem is that we're trying to understand which queries account for the p99 latency, while the statements page focuses on average query performance. Maybe the standard deviation could be of some help here, but it probably includes the query performance from before the schema change. My repro attempt did not perform any schema changes; I guess I misunderstood the repro instructions. This does seem like it might be an optimizer issue (reminiscent of bad plans before creating statistics). I see a couple of next steps:
Let me know how it goes, and I'm also happy to jump on a call with you tomorrow.
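As a sketch of the kind of check implied by these next steps (assuming the standard movr schema; the specific query shape is illustrative, not taken from the workload logs):

```sql
-- Hedged sketch: verify which statistics the optimizer currently has for a
-- hot table, and eyeball the plan it produces for a movr-style lookup.
SHOW STATISTICS FOR TABLE rides;

EXPLAIN SELECT id, status
FROM vehicles
WHERE city = 'seattle' AND status = 'available';
```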
OK, so I took some more time today to reproduce this, and the TL;DR is that when explicit stats collection is run after the schema change, latencies stabilize much sooner. Each of the following graphs starts when the schema change finishes.

First off, we have 20.1.8 for reference behavior (note that the scales between 20.1.8 and 20.2.0 are different, so 20.1.8 looks a lot smoother due to the larger scale but isn't necessarily so).

20.2.0 without explicit stats collection:

Compare this with 20.2.0 with explicit stats collection after the schema change, which brings this back down to around 4-5 minutes.

It seems like @nvanbenschoten's hypothesis is correct and we seem to be generating bad plans because of incomplete statistics. I guess the question now is what changed in statistics collection in 20.2 and whether there is anything we can do about it. cc @RaduBerinde @rytaft. I would expect that we could refresh stats after a schema change.
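The "explicit stats collection" step presumably amounts to something like the following. The table list here is illustrative; the exact statements are in the updated repro script attached in the next comment:

```sql
-- Hedged sketch: manually refresh statistics on the movr tables right after
-- the schema change, instead of waiting for automatic collection.
CREATE STATISTICS users_stats FROM users;
CREATE STATISTICS vehicles_stats FROM vehicles;
CREATE STATISTICS rides_stats FROM rides;
```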
Here's an updated repro script with the create statistics statements and an option to reuse a cluster: geo-partitioned-replicas-demo.sh.txt
Hi @asubiotto -- have you verified that auto stats is definitely not getting triggered (i.e., there is no auto stats job running on the table after the schema change)? We should be triggering auto stats collection after every schema change, so unless the schema team changed some logic there, it should be triggered. The only change we made recently is that after auto stats completes, we don't immediately invalidate the stats cache on each node. The cache update now happens asynchronously so that existing queries aren't blocked on the refresh. But the asynchronous refresh definitely should not take 10 minutes, so I doubt that's the problem. Also, we backported that change to 20.1 a while ago, so there shouldn't be any difference on 20.1.8.
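One way to check that looks roughly like the sketch below. It assumes auto stats jobs carry the __auto__ statistic name in their description, which may vary by version:

```sql
-- Confirm automatic statistics collection is enabled, then look for the
-- auto stats job on the table after the schema change.
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;

SELECT job_id, status, created, finished, left(description, 80) AS description
FROM [SHOW AUTOMATIC JOBS]
WHERE description LIKE '%__auto__%'
ORDER BY created DESC
LIMIT 10;
```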
Confirmed that the backport with asynchronous cache updates was added in 20.1.5.
It looks like this logic is new in 20.2, which could be the culprit: cockroach/pkg/sql/schema_changer.go, line 610 at 2b38158
@ajwerner or @lucy-zhang do you think this could prevent stats from getting updated after the schema change described in this issue?
That call to cockroach/pkg/sql/schema_changer.go, line 679 at 2b38158
What are the relevant schema changes where we were relying on stats updates at the end? Is this about adding the foreign keys, or the initial table creation? I assume it's the foreign keys, but I want to make sure. Also, do you know exactly which tables have the out-of-date stats? We did make some changes to waiting for leases when updating foreign key backreferences, IIRC, but that's the only remotely related thing I can think of.
The schema changes that happen are creating indexes on
Ran it again, and it seems like the auto stats job runs, but much later than the schema change. For example, this is
The table that is subject to the index creation,
It's also possible that a manual create statistics run has some other side effects I'm not considering.
I'm not sure what you mean. Could you post the full contents of your

Are you expecting
No, that should only get called for non-table schema changes (i.e., on databases, user-defined schemas, and enums, which don't have stats). I think to make comparisons on timing (including when the various schema change jobs actually ended), we just need to see all the jobs first.
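Something like the following would put all the jobs on one timeline. This is a hedged sketch; the job_type labels are from memory and may differ by version:

```sql
-- List schema change and statistics jobs side by side, ordered by creation
-- time, to see how far apart they actually ran.
SELECT job_type, status, created, started, finished,
       left(description, 60) AS description
FROM crdb_internal.jobs
WHERE job_type IN ('SCHEMA CHANGE', 'CREATE STATS', 'AUTO CREATE STATS')
ORDER BY created;
```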
This is not
No, this is unrelated to |
Sorry, I read too quickly and didn't realize there were separate indexes to be added beyond the default workload initialization. I just added some logging on master and confirmed (on a local single-node cluster) that the schema changer for
Unfortunately I haven't made much progress on this, but I need to move on to other stuff, so I'm writing down my progress. I think I agree that statistics seem to be unrelated; I recently had a "bad" run with statistics triggered relatively quickly after the

I took a step back and tried to confirm which queries seem to cause this high p99 latency, and two queries seemed to be repeatedly logged:
Since this p99 metric is only service+execution time of the |
I filed #57634 to extend statement diagnostics to only collect them when exec time is above a given value.
Added the mentioned cluster setting to trace statements above a certain threshold. This is what a
The important part here is:
This shows ~24ms of waiting in a lock wait queue. @nvanbenschoten, is that contention? It doesn't look like it's a plan issue, and the workload is the same, so this is curious. For reference, the schema change happened at ~15:22:
And the above trace was collected ~5 minutes after this, when latencies had not yet stabilized.

At this stage I'd prefer to hand this off to someone in KV for further investigation since this is outside my area of expertise. @nvanbenschoten are you interested in taking it on / could you find someone to take this on? Happy to collect any more information you might need.
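For anyone reproducing this, the threshold tracing mentioned earlier looks roughly like the sketch below. It assumes the setting being referred to is sql.trace.stmt.enable_threshold, and the threshold value is illustrative:

```sql
-- Dump a trace to the logs for any statement slower than the threshold,
-- then turn it back off once enough samples have been collected.
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '20ms';
-- ... run the movr workload and inspect the node logs for the traces ...
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '0s';
```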
Yes, that looks like contention. The workload seems to be updating the

This may be impacted by #9521 / #49973, which we saw cause some issues with movr in 20.1 before a partial fix, but that wouldn't have gotten worse in 20.2. @sumeerbhola is also looking to improve that in this release.

Taking a step back, where are we on this issue? Is that slow query unique to 20.2? Is your statement in #56923 (comment) about the regression being transient no longer true?
No, the workload is exactly the same between both versions. This query is also slow/contended on 20.1.8, but the mystery is why, with the exact same workload, 20.2.0 takes twice the time to stabilize to these low p99 latencies. Here are the logs for both versions with a statement tracing threshold of

One unscientific observation is that in 20.1.8, the
The 50ms before push is probably because of the

Trace 1
Trace 2
Linking issue for regression testing movr performance: #56922
This issue got buried, and the CRDB versions discussed here are no longer supported, so I don't think it's worthwhile to leave this issue open. #56922 should cover the work to more easily discover issues like this in the future.
See details here https://cockroachlabs.atlassian.net/browse/CRDB-1662
Jira issue: CRDB-2891