sql: performance regression in Movr #56923
@nvanbenschoten can you PTAL here and help route as needed?
The first step here would be to determine which queries are slower and whether any of the query plans have changed. From the referenced issue, I can't get a sense for that. The only thing we see is that an INSERT statement got a ReadWithinUncertaintyIntervalError, which is a little strange and may indicate that the SQL optimizer is choosing a bad plan for the INSERT's initial row scan (see #56661).
@keith-mcclellan can you show us the
@awoods187 how can I do that? I don't actually run the queries; the movr-load process generates them dynamically.
You can do it from the DB Console or by looking at the queries via
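For readers following along, here is a minimal sketch of how one might surface the slow queries and capture a diagnostics bundle. The column names and the example statement are assumptions (an illustrative movr-style lookup), not taken from this thread:

```sql
-- Hedged sketch: surface the slowest statement fingerprints on this gateway
-- node, then collect a statement diagnostics bundle for one of them.
-- Column names reflect crdb_internal as of 20.x and may differ by version.
SELECT key, count, service_lat_avg
FROM crdb_internal.node_statement_statistics
ORDER BY service_lat_avg DESC
LIMIT 10;

-- EXPLAIN ANALYZE (DEBUG) produces a downloadable bundle (plan, stats, trace).
-- The query below is illustrative, not one captured from the workload.
EXPLAIN ANALYZE (DEBUG)
  SELECT id, status FROM vehicles WHERE city = 'seattle' LIMIT 5;
```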
Do you want it before or after applying the duplicate indexes? Or both? The perf regressed both before and after optimization.
These are all from after applying the duplicate indexes on a fresh roachprod cluster... performance was better than the first time I tried it; most of these are in the 20-60 ms range (generally they're all sub-10ms after adding the partitioned duplicate indexes). As an aside, ALL of the queries are running slower on 20.2, not just some of them.
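For context, the "partitioned duplicate indexes" mentioned here follow the duplicate-index pattern for reference tables, with leaseholders pinned per region. A rough, illustrative sketch follows; the region name, index name, and stored columns are assumptions, and the real statements live in the demo script referenced later in the thread:

```sql
-- Illustrative only: a duplicate secondary index on a movr reference table,
-- with its leaseholder pinned to one region via zone configs.
CREATE INDEX promo_codes_idx_us_west ON promo_codes (code)
    STORING (description, creation_time, expiration_time, rules);

ALTER INDEX promo_codes@promo_codes_idx_us_west
    CONFIGURE ZONE USING
      num_replicas = 3,
      constraints = '{"+region=us-west1": 1}',
      lease_preferences = '[[+region=us-west1]]';
```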
None of those statement bundles show queries that take more than 4.5ms. This might be a good test of whether we can use our observability tools to determine which queries are slower and why.
The UI was showing much higher query latency than what you're seeing in the bundle. You can reproduce my environment using https://github.com/cockroachlabs/scripted-demos/blob/master/geo-partitioned-replicas-demo.sh
Note: we should probably add movr performance to roachperf given its importance.
I'm not able to reproduce this. Note that I'm running without duplicate indexes (per the issue description, it doesn't seem like a requirement to reproduce the performance regression). I first load v20.1.8 and look at the SQL p99 graph:
@asubiotto can we do a quick call? Your 20.1 results don't look right; p99 should be sub-10ms and this is showing 500+ms (which is what I expect before the indexes and leaseholder pinning are applied).
I just did a long-running test of both 20.1.7 and 20.2.0 using Movr. The good news is that if you let it run long enough, 20.2 does get to the expected sub-10ms p99 performance, but it takes 10 minutes or so after applying the schema changes. In 20.1.7 it only takes 3 to 5 minutes. The weird thing is, I went hunting for the long-running queries in 20.2 at the 5 minute mark and couldn't find them in the statements or transactions pages... did we change the way we're rolling up these aggregate stats in the admin UI? Perhaps the performance hasn't gotten worse and we're just presenting the data differently?
We changed how statement fingerprinting works, but I don't think that should have changed the metrics you are looking at. CC @solongordon to correct me if that isn't right.
That's correct, no changes to statement statistic aggregation I can think of.
The problem is that we're trying to understand which queries account for the p99 latency, while the statements page focuses on average query performance. Maybe the standard deviation could be of some help here, but it probably includes the query performance from before the schema change. My repro attempt did not perform any schema changes; I guess I misunderstood the repro instructions. This does seem like it might be an optimizer issue (reminiscent of bad plans before creating statistics). I see a couple of next steps:
Let me know how it goes, and I'm also happy to jump on a call with you tomorrow.
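As a sketch of the kind of check implied by these next steps (assuming the standard movr schema; the specific query shape is illustrative, not taken from the workload logs):

```sql
-- Hedged sketch: verify which statistics the optimizer currently has for a
-- hot table, and eyeball the plan it produces for a movr-style lookup.
SHOW STATISTICS FOR TABLE rides;

EXPLAIN SELECT id, status
FROM vehicles
WHERE city = 'seattle' AND status = 'available';
```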
OK, so I took some more time today to reproduce this, and the TL;DR is that when explicit stats collection is run after the schema change, latencies stabilize much sooner. Each of the following graphs starts when the schema change finishes.

First off, we have 20.1.8 for reference behavior (note that the scales between 20.1.8 and 20.2.0 are different, so 20.1.8 looks a lot smoother due to the larger scale but isn't necessarily so).

20.2.0 without explicit stats collection:

Compare this with 20.2.0 with explicit stats collection after the schema change, which brings this back down to around 4-5 minutes.

It seems like @nvanbenschoten's hypothesis is correct and we seem to be generating bad plans because of incomplete statistics. I guess the question now is what changed in statistics collection in 20.2 and whether there is anything we can do about it. cc @RaduBerinde @rytaft. I would expect that we could refresh stats after a schema change.
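The "explicit stats collection" step presumably amounts to something like the following. The table list here is illustrative; the exact statements are in the updated repro script attached in the next comment:

```sql
-- Hedged sketch: manually refresh statistics on the movr tables right after
-- the schema change, instead of waiting for automatic collection.
CREATE STATISTICS users_stats FROM users;
CREATE STATISTICS vehicles_stats FROM vehicles;
CREATE STATISTICS rides_stats FROM rides;
```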
Here's an updated repro script with the create statistics statements and an option to reuse a cluster: geo-partitioned-replicas-demo.sh.txt
Hi @asubiotto -- have you verified that auto stats is definitely not getting triggered (i.e., there is no auto stats job running on the table after the schema change)? We should be triggering auto stats collection after every schema change, so unless the schema team changed some logic there, it should be triggered. The only change we made recently is that after auto stats completes, we don't immediately invalidate the stats cache on each node. The cache update now happens asynchronously so that existing queries aren't blocked on the refresh. But the asynchronous refresh definitely should not take 10 minutes, so I doubt that's the problem. Also, we backported that change to 20.1 a while ago, so there shouldn't be any difference on 20.1.8.
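One way to check that looks roughly like the sketch below. It assumes auto stats jobs carry the __auto__ statistic name in their description, which may vary by version:

```sql
-- Confirm automatic statistics collection is enabled, then look for the
-- auto stats job on the table after the schema change.
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;

SELECT job_id, status, created, finished, left(description, 80) AS description
FROM [SHOW AUTOMATIC JOBS]
WHERE description LIKE '%__auto__%'
ORDER BY created DESC
LIMIT 10;
```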
Confirmed that the backport with asynchronous cache updates was added in 20.1.5.
It looks like this logic is new in 20.2, which could be the culprit: cockroach/pkg/sql/schema_changer.go, line 610 at 2b38158
@ajwerner or @lucy-zhang do you think this could prevent stats from getting updated after the schema change described in this issue?
That call to cockroach/pkg/sql/schema_changer.go, line 679 at 2b38158
What are the relevant schema changes where we were relying on stats updates at the end? Is this about adding the foreign keys, or the initial table creation? I assume it's the foreign keys, but I want to make sure. Also, do you know exactly which tables have the out-of-date stats? We did make some changes to waiting for leases when updating foreign key backreferences, IIRC, but that's the only remotely related thing I can think of.
The schema changes that happen are creating indexes on
Ran it again, and it seems like the auto stats job runs, but much later than the schema change. For example, this is
The table that is subject to the index creation,
It's also possible that a manual create statistics run has some other side effects I'm not considering.
I'm not sure what you mean. Could you post the full contents of your

Are you expecting
No, that should only get called for non-table schema changes (i.e., on databases, user-defined schemas, and enums, which don't have stats). I think to make comparisons on timing (including when the various schema change jobs actually ended), we just need to see all the jobs first.
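Something like the following would put all the jobs on one timeline. This is a hedged sketch; the job_type labels are from memory and may differ by version:

```sql
-- List schema change and statistics jobs side by side, ordered by creation
-- time, to see how far apart they actually ran.
SELECT job_type, status, created, started, finished,
       left(description, 60) AS description
FROM crdb_internal.jobs
WHERE job_type IN ('SCHEMA CHANGE', 'CREATE STATS', 'AUTO CREATE STATS')
ORDER BY created;
```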
This is not
No, this is unrelated to |
Sorry, I read too quickly and didn't realize there were separate indexes to be added beyond the default workload initialization. I just added some logging on master and confirmed (on a local single-node cluster) that the schema changer for
Unfortunately I haven't made much progress on this, but I need to move on to other stuff, so I'm writing down my progress. I think I agree that statistics seem to be unrelated; I recently had a "bad" run with statistics triggered relatively quickly after the

I took a step back and tried to confirm which queries seem to cause this high p99 latency, and two queries seemed to be repeatedly logged:
Since this p99 metric is only service+execution time of the |
I filed #57634 to extend statement diagnostics to only collect them when exec time is above a given value.
Added the mentioned cluster setting to trace statements above a certain threshold. This is what a
The important part here is:
This shows ~24ms of waiting in a lock wait queue. @nvanbenschoten, is that contention? It doesn't look like it's a plan issue, and the workload is the same, so this is curious. For reference, the schema change happened at ~15:22:
And the above trace was collected ~5 minutes after this, when latencies had not yet stabilized.

At this stage I'd prefer to hand this off to someone in KV for further investigation since this is outside my area of expertise. @nvanbenschoten are you interested in taking it on / could you find someone to take this on? Happy to collect any more information you might need.
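For anyone reproducing this, the threshold tracing mentioned earlier looks roughly like the sketch below. It assumes the setting being referred to is sql.trace.stmt.enable_threshold, and the threshold value is illustrative:

```sql
-- Dump a trace to the logs for any statement slower than the threshold,
-- then turn it back off once enough samples have been collected.
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '20ms';
-- ... run the movr workload and inspect the node logs for the traces ...
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '0s';
```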
Yes, that looks like contention. The workload seems to be updating the

This may be impacted by #9521 / #49973, which we saw cause some issues with movr in 20.1 before a partial fix, but that wouldn't have gotten worse in 20.2. @sumeerbhola is also looking to improve that in this release.

Taking a step back, where are we on this issue? Is that slow query unique to 20.2? Is your statement in #56923 (comment) about the regression being transient no longer true?
No, the workload is exactly the same between both versions. This query is also slow/contended on 20.1.8, but the mystery is why, with the exact same workload, 20.2.0 takes twice the time to stabilize to these low p99 latencies. Here are the logs for both versions with a statement tracing threshold of

One unscientific observation is that in 20.1.8, the
The 50ms before push is probably because of the

Trace 1
Trace 2
Linking issue for regression testing movr performance: #56922
This issue got buried, and the CRDB versions discussed here are no longer supported, so I don't think it's worthwhile to leave this issue open. #56922 should cover the work to more easily discover issues like this in the future.
See details here https://cockroachlabs.atlassian.net/browse/CRDB-1662
Jira issue: CRDB-2891