
sql: performance regression in Movr #56923

Closed
awoods187 opened this issue Nov 19, 2020 · 34 comments
Assignees: asubiotto
Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), C-investigation (Further steps needed to qualify. C-label will change.), T-sql-queries (SQL Queries Team)

Comments

@awoods187
Contributor

awoods187 commented Nov 19, 2020

See details here https://cockroachlabs.atlassian.net/browse/CRDB-1662

Jira issue: CRDB-2891

awoods187 added the C-bug and C-investigation labels on Nov 19, 2020
@awoods187
Contributor Author

@nvanbenschoten can PTAL here and help route as needed?

@nvanbenschoten
Member

The first step here would be to determine which queries are slower and whether any of the query plans have changed. From the referenced issue, I can't get a sense of that. The only thing we see is that an INSERT statement got a ReadWithinUncertaintyIntervalError, which is a little strange and may indicate that the SQL optimizer is choosing a bad plan for the INSERT's initial row scan (see #56661).

@awoods187
Contributor Author

@keith-mcclellan can you show us the EXPLAIN and EXPLAIN ANALYZE (DEBUG) output for the slow queries?

@keith-mcclellan
Contributor

@awoods187 how can I do that? I don't actually run the queries; the movr-load process generates them dynamically.

@awoods187
Contributor Author

You can do it from the DB Console, or by grabbing the query text via SHOW QUERIES or the logs and running EXPLAIN / EXPLAIN ANALYZE (DEBUG) on it yourself.
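As a rough sketch from a SQL shell (the SELECT is just an example query from the movr workload; substitute whatever shows up as slow):

  -- See what is currently running and copy the text of a slow statement.
  SHOW QUERIES;

  -- Inspect the plan the optimizer chose for that statement.
  EXPLAIN SELECT * FROM vehicles WHERE city = 'new york' LIMIT 25;

  -- Collect a statement diagnostics bundle (plan, trace, stats) for it.
  EXPLAIN ANALYZE (DEBUG) SELECT * FROM vehicles WHERE city = 'new york' LIMIT 25;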

@keith-mcclellan
Contributor

Do you want it before or after applying the duplicate indexes, or both? The performance regressed both before and after optimization.

@keith-mcclellan
Contributor

keith-mcclellan commented Nov 19, 2020

These are all from after applying the duplicate indexes on a fresh roachprod cluster... performance was better than the first time I tried it; most of these are in the 20-60 ms range (generally they're all sub-10 ms after adding the partitioned duplicate indexes).
stmt-bundle-608675081702768642.zip
stmt-bundle-608675041611087878.zip
stmt-bundle-608674970782269446.zip
stmt-bundle-608674836224376838.zip
stmt-bundle-608674548608794629.zip

As an aside, ALL of the queries are running slower on 20.2, not just some of them.

@nvanbenschoten
Member

None of those statement bundles show queries that take more than 4.5ms. This might be a good test of whether we can use our observability tools to determine which queries are slower and why.

@keith-mcclellan
Contributor

The UI was showing much higher query latency than what you're seeing in the bundle. You can reproduce my environment using https://github.com/cockroachlabs/scripted-demos/blob/master/geo-partitioned-replicas-demo.sh

asubiotto self-assigned this on Nov 23, 2020
@asubiotto
Contributor

Note: we should probably add movr performance to roachperf, given its importance.

@asubiotto
Contributor

I'm not able to reproduce this. Note that I'm running without duplicate indexes (per the issue description, they don't seem to be a requirement to reproduce the performance regression). I first load v20.1.8 and look at the SQL p99 graph:
[Screenshot: SQL p99 latency, 2020-11-24 3:56 PM]
I then stop the nodes, stage 20.2, and restart the nodes:
[Screenshot: SQL p99 latency, 2020-11-24 4:01 PM]
The p99 performance seems to improve. @keith-mcclellan what am I missing?

@keith-mcclellan
Contributor

@asubiotto can we do a quick call? Your 20.1 results don't look right; p99 should be sub-10 ms and this is showing 500+ ms (which is what I expect before the indexes and leaseholder pinning are applied).

@keith-mcclellan
Contributor

keith-mcclellan commented Nov 30, 2020

I just did a long-running test of both 20.1.7 and 20.2.0 using Movr. The good news is that if you let it run long enough, 20.2 does get to the expected sub-10 ms p99 performance, but it takes 10 minutes or so after applying the schema changes. In 20.1.7 it only takes 3 to 5 minutes.

20.1.7:
[Screenshot: p99 latency graph, 2020-11-30 12:40 PM]

20.2.0:
[Screenshot: p99 latency graph, 2020-11-30 12:39 PM]

The weird thing is, I went hunting for the long-running queries in 20.2 at the 5 minute mark and couldn't find them in the statements or transactions pages... did we change the way we're rolling up these aggregate stats in the admin UI? Perhaps the performance hasn't gotten worse and we're just presenting the data differently?

@awoods187
Contributor Author

We changed how statement fingerprinting works but I don't think that should have changed the metrics you are looking at. CC @solongordon to correct me if that isn't right

@solongordon
Contributor

That's correct, no changes to statement statistic aggregation I can think of.

@asubiotto
Contributor

asubiotto commented Dec 1, 2020

The weird thing is, I went hunting for the long-running queries in 20.2 at the 5 minute mark and couldn't find them in the statements or transactions pages... did we change the way we're rolling up these aggregate stats in the admin UI? Perhaps the performance hasn't gotten worse and we're just presenting the data differently?

The problem is that we're trying to understand which queries make up the p99 latency, while the statements page focuses on average query performance. Maybe the standard deviation could help a little here, but it probably includes the query performance from before the schema change.

My repro attempt did not perform any schema changes. I guess I misunderstood the repro instructions.

This does seem like it might be an optimizer issue (reminiscent of bad plans before statistics are created). I see a couple of next steps:

  1. Add an explicit CREATE STATISTICS statement on the tables the workload touches in your script and see if anything changes. If nothing changes, that would rule out a stats-related issue.
  2. Check the KV Transaction Restart graph in the Distributed metrics tab. These retries might also be shown in the statements/transactions pages. What are the numbers there?
  3. Set sql.log.slow_query.latency_threshold to a desired threshold to log queries that take longer than that amount of time, as well as sql.log.slow_query.experimental_full_table_scans.enabled to see whether any bad plans are generated (the latter has no effect unless the former is set). This will probably affect general workload performance, but if we set the threshold to a minimum p99 threshold we expect to achieve (say 20ms), we should be able to observe which queries are exceeding that threshold (see the sketch at the end of this comment).

Let me know how it goes and I'm also happy to jump on a call with you tomorrow.
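For reference, a minimal sketch of the settings from step 3 (the 20ms value is just the example threshold mentioned above):

  -- Log any statement that takes longer than the target p99 threshold...
  SET CLUSTER SETTING sql.log.slow_query.latency_threshold = '20ms';
  -- ...and flag statements in that log that perform full table scans.
  SET CLUSTER SETTING sql.log.slow_query.experimental_full_table_scans.enabled = true;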

@asubiotto
Contributor

asubiotto commented Dec 2, 2020

OK, so I took some more time today to reproduce this and the TL;DR is that when explicit CREATE STATISTICS statements are issued, 20.2.0 behaves like 20.1.8. To be clear (because I was confused), the issue is not that we experience 600ms latencies, the issue is the amount of time it takes to achieve sub-10ms latencies in movr after performance-optimizing schema changes are completed.

Each of the following graphs starts when the schema change finishes. First off, we have 20.1.8 for reference behavior (note that the scales between 20.1.8 and 20.2.0 are different, so 20.1.8 looks a lot smoother due to the larger scale but isn't necessarily so):
[Screenshot: 20.1.8 p99 latency graph]
This takes around 4 minutes to reach sub-10 ms p99 performance.

20.2.0 without explicit stats collection:
[Screenshot: 20.2.0 p99 latency graph]
This takes around 10 minutes to reach sub-10 ms p99 performance.

Compare this with 20.2.0 with explicit stats collection after the schema change, which brings this back down to around 4-5 minutes:
[Screenshot: 20.2.0 p99 latency graph, with explicit stats collection]

It seems like @nvanbenschoten's hypothesis is correct and we seem to be generating bad plans because of incomplete statistics. I guess the question now is what changed in statistics collection in 20.2, and is there anything we can do about it? cc @RaduBerinde @rytaft. I would expect that we refresh stats after a schema change.
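For concreteness, the explicit stats collection here is just plain CREATE STATISTICS statements issued right after the schema change, along these lines (a sketch; the statistic name is arbitrary and the exact table list comes from the script):

  CREATE STATISTICS post_schema_change FROM movr.public.promo_codes;
  CREATE STATISTICS post_schema_change FROM movr.public.vehicles;
  CREATE STATISTICS post_schema_change FROM movr.public.rides;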

asubiotto changed the title from "sql: performance regression in Movr" to "sql: performance regression in Movr due to auto stats not triggered" on Dec 2, 2020
@asubiotto
Contributor

Here's an updated repro script with the CREATE STATISTICS statements and an option to reuse a cluster: geo-partitioned-replicas-demo.sh.txt

@rytaft
Collaborator

rytaft commented Dec 2, 2020

Hi @asubiotto -- have you verified that auto stats is definitely not getting triggered (i.e., there is no auto stats job running on the table after the schema change)? We should be triggering auto stats collection after every schema change, so unless the schema team changed some logic there, it should be triggered.

The only change we made recently is that after auto stats completes, we don't immediately invalidate the stats cache on each node. The cache update now happens asynchronously so that existing queries aren't blocked on the refresh. But the asynchronous refresh definitely should not take 10 minutes, so I doubt that's the problem. Also, we backported that change to 20.1 a while ago, so there shouldn't be any difference on 20.1.8.
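As a sketch, one way to check that from SQL (the LIKE filter on the job description is just a convenience and assumes the description mentions the table name):

  -- List recent automatic statistics jobs and when they ran.
  SELECT job_id, description, status, created, finished
    FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO CREATE STATS'
     AND description LIKE '%promo_codes%'
   ORDER BY created DESC;

  -- Check how fresh the statistics on the table actually are.
  SHOW STATISTICS FOR TABLE movr.public.promo_codes;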

@rytaft
Collaborator

rytaft commented Dec 2, 2020

Confirmed that the backport with asynchronous cache updates was added in 20.1.5.

@rytaft
Collaborator

rytaft commented Dec 2, 2020

It looks like this logic is new in 20.2, which could be the culprit:

if err := waitToUpdateLeases(false /* refreshStats */); err != nil {

@ajwerner or @lucy-zhang do you think this could prevent stats from getting updated after the schema change described in this issue?

@thoszhang
Contributor

That call to waitToUpdateLeases is only for non-table descriptors. We should still refresh stats in the usual place, at the end of the schema change:

if err := waitToUpdateLeases(err == nil /* refreshStats */); err != nil && !errors.Is(err, catalog.ErrDescriptorNotFound) {

What are the relevant schema changes where we were relying on stats updates at the end? Is this about adding the foreign keys, or the initial table creation? I assume it's the foreign keys, but I want to make sure. Also, do you know exactly which tables have the out-of-date stats?

We did make some changes to waiting for leases when updating foreign key backreferences, IIRC, but that's the only remotely related thing I can think of.

@asubiotto
Contributor

I assume it's the foreign keys, but I want to make sure. Also, do you know exactly which tables have the out-of-date stats?

The schema changes that happen are creating indexes on promo_codes (the other operations are partitioning and zone config changes), so I assume it's the stats on this table that are stale (will double check). I assume we will call waitToUpdateLeases(false /* refreshStats */) in this case, right?

have you verified that auto stats is definitely not getting triggered (i.e., there is no auto stats job running on the table after the schema change)? We should be triggering auto stats collection after every schema change, so unless the schema team changed some logic there, it should be triggered.

Ran it again, and it seems like the auto stats job does run, but much later than the schema change. For example, this is the SHOW AUTOMATIC JOBS output after a schema change that finished at 10:18:

  612501019249442818 | AUTO CREATE STATS | Table statistics refresh for movr.public.promo_codes                | CREATE STATISTICS __auto__ FROM [55] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:21:17.065855+00:00 | 2020-12-03 10:21:17.847+00:00    | 2020-12-03 10:21:25.769157+00:00 | 2020-12-03 10:21:25.429564+00:00 |                  1 |       |           NULL
  612500920620154881 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicle_location_histories | CREATE STATISTICS __auto__ FROM [54] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:20:46.966593+00:00 | 2020-12-03 10:20:47.721721+00:00 | 2020-12-03 10:20:57.243853+00:00 | 2020-12-03 10:20:56.916104+00:00 |                  1 |       |           NULL
  612500887426859009 | AUTO CREATE STATS | Table statistics refresh for movr.public.users                      | CREATE STATISTICS __auto__ FROM [53] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:20:36.8368+00:00   | 2020-12-03 10:20:37.591598+00:00 | 2020-12-03 10:20:46.080833+00:00 | 2020-12-03 10:20:45.753184+00:00 |                  1 |       |           NULL
  612500844115394561 | AUTO CREATE STATS | Table statistics refresh for movr.public.rides                      | CREATE STATISTICS __auto__ FROM [58] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:20:23.619199+00:00 | 2020-12-03 10:20:24.373564+00:00 | 2020-12-03 10:20:35.885412+00:00 | 2020-12-03 10:20:35.557796+00:00 |                  1 |       |           NULL
  612500803940646913 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicles                   | CREATE STATISTICS __auto__ FROM [56] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:20:11.358836+00:00 | 2020-12-03 10:20:12.112409+00:00 | 2020-12-03 10:20:22.667373+00:00 | 2020-12-03 10:20:22.338694+00:00 |                  1 |       |           NULL
  612500682893590534 | AUTO CREATE STATS | Table statistics refresh for movr.public.user_promo_codes           | CREATE STATISTICS __auto__ FROM [57] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:19:34.41819+00:00  | 2020-12-03 10:19:35.345203+00:00 | 2020-12-03 10:19:46.4199+00:00   | 2020-12-03 10:19:45.991022+00:00 |                  1 |       |           NULL
  612500628609204228 | AUTO CREATE STATS | Table statistics refresh for movr.public.rides                      | CREATE STATISTICS __auto__ FROM [58] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:19:17.851918+00:00 | 2020-12-03 10:19:18.783895+00:00 | 2020-12-03 10:19:33.730644+00:00 | 2020-12-03 10:19:33.29866+00:00  |                  1 |       |           NULL
  612500585027960836 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicle_location_histories | CREATE STATISTICS __auto__ FROM [54] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:19:04.55198+00:00  | 2020-12-03 10:19:05.485633+00:00 | 2020-12-03 10:19:16.766945+00:00 | 2020-12-03 10:19:16.334848+00:00 |                  1 |       |           NULL
  612500466265653256 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicles                   | CREATE STATISTICS __auto__ FROM [56] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:18:28.308603+00:00 | 2020-12-03 10:18:29.666409+00:00 | 2020-12-03 10:18:51.943715+00:00 | 2020-12-03 10:18:51.282628+00:00 |                  1 |       |           NULL
  612500393172566018 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicles                   | CREATE STATISTICS __auto__ FROM [56] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:18:06.002366+00:00 | 2020-12-03 10:18:06.492608+00:00 | 2020-12-03 10:18:13.544681+00:00 | 2020-12-03 10:18:13.349177+00:00 |                  1 |       |           NULL
  612500370044518401 | AUTO CREATE STATS | Table statistics refresh for movr.public.user_promo_codes           | CREATE STATISTICS __auto__ FROM [57] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 10:17:58.944239+00:00 | 2020-12-03 10:17:59.46807+00:00  | 2020-12-03 10:18:05.390906+00:00 | 2020-12-03 10:18:05.196298+00:00 |                  1 |       |           NULL

The table that is subject to the index creation, promo_codes, doesn't get a stats run until roughly 3m30s after the schema change completed. In 20.1.8, a schema change that completes at 15:28 kicks off a run about a minute later:

  612561616761847811 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicle_location_histories | CREATE STATISTICS __auto__ FROM [54] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:29:29.959051+00:00 | 2020-12-03 15:29:30.02769+00:00  | 2020-12-03 15:29:31.704878+00:00 | 2020-12-03 15:29:31.701072+00:00 |                  1 |       |              3
  612561601749712898 | AUTO CREATE STATS | Table statistics refresh for movr.public.promo_codes                | CREATE STATISTICS __auto__ FROM [55] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:29:25.377701+00:00 | 2020-12-03 15:29:25.447617+00:00 | 2020-12-03 15:29:26.177337+00:00 | 2020-12-03 15:29:26.172998+00:00 |                  1 |       |              2
  612561567975604225 | AUTO CREATE STATS | Table statistics refresh for movr.public.promo_codes                | CREATE STATISTICS __auto__ FROM [55] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:29:15.070679+00:00 | 2020-12-03 15:29:15.137618+00:00 | 2020-12-03 15:29:15.862587+00:00 | 2020-12-03 15:29:15.861355+00:00 |                  1 |       |              1
  612561525407088646 | AUTO CREATE STATS | Table statistics refresh for movr.public.users                      | CREATE STATISTICS __auto__ FROM [53] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:29:02.079787+00:00 | 2020-12-03 15:29:02.463981+00:00 | 2020-12-03 15:29:06.771068+00:00 | 2020-12-03 15:29:06.611338+00:00 |                  1 |       |              6
  612561500135129094 | AUTO CREATE STATS | Table statistics refresh for movr.public.rides                      | CREATE STATISTICS __auto__ FROM [58] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:28:54.367398+00:00 | 2020-12-03 15:28:54.751511+00:00 | 2020-12-03 15:29:01.429136+00:00 | 2020-12-03 15:29:01.259049+00:00 |                  1 |       |              6
  612561478425542662 | AUTO CREATE STATS | Table statistics refresh for movr.public.vehicles                   | CREATE STATISTICS __auto__ FROM [56] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:28:47.742169+00:00 | 2020-12-03 15:28:48.134475+00:00 | 2020-12-03 15:28:53.722634+00:00 | 2020-12-03 15:28:53.562343+00:00 |                  1 |       |              6
  612561461178564614 | AUTO CREATE STATS | Table statistics refresh for movr.public.user_promo_codes           | CREATE STATISTICS __auto__ FROM [57] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s' | root      | succeeded | NULL           | 2020-12-03 15:28:42.478805+00:00 | 2020-12-03 15:28:42.863286+00:00 | 2020-12-03 15:28:47.096559+00:00 | 2020-12-03 15:28:46.936521+00:00 |                  1 |       |              6

It's also possible that a manual create statistics run has some other side effects I'm not considering.

@thoszhang
Contributor

The schema changes that happen are creating indexes on promo_code (the other operations that happen are partitions and zone config changes)

I'm not sure what you mean. Could you post the full contents of your crdb_internal.jobs on 20.2 and 20.1? I just started cockroach demo with the default movr database on both 20.2.0 and 20.1.8, and in both cases I only see schema change jobs to add foreign keys. All the secondary indexes for the movr tables are specified as part of CREATE TABLE, as far as I can tell. Also, did you mean user_promo_codes? promo_codes doesn't have any secondary indexes or constraints.

Are you expecting CREATE TABLE to itself trigger an auto stats job?

I assume we will call waitToUpdateLeases(false /* refreshStats */) in this case, right?

No, that should only get called for non-table schema changes (i.e., on databases, user-defined schemas, and enums, which don't have stats).

I think to make comparisons on timing (including when the various schema change jobs actually ended), we just need to see all the jobs first.

@asubiotto
Contributor

This is not cockroach demo; it's a demo run with this script to demonstrate geo-partitioned replicas (geo-partitioned-replicas-demo.sh.txt).

Are you expecting CREATE TABLE to itself trigger an auto stats job?

No, this is unrelated to CREATE TABLE. The observation is that when running the demo in the above script, after the schema changes, p99 latency in 20.1.8 drops to sub-10 ms in half the time it takes 20.2.0. This difference seems to disappear with explicit statistics creation. The hypothesis is that for some reason the auto stats jobs that should be triggered as part of the CREATE INDEX statements in the script above are not.

@thoszhang
Contributor

Sorry, I read too quickly and didn't realize there were separate indexes to be added beyond the default workload initialization.

I just added some logging on master and confirmed (on a local single-node cluster) that the schema changer for CREATE INDEX on a table does call Refresher.NotifyMutation for the table with rowsAffected set to MaxInt32, as expected, and that this does eventually manifest in an auto stats job. We didn't make any changes to this area in 20.2, so I can't imagine why it would be different in a multi-region setup. So at this point I don't think this is a schema changer problem.

asubiotto changed the title from "sql: performance regression in Movr due to auto stats not triggered" back to "sql: performance regression in Movr" on Dec 7, 2020
@asubiotto
Contributor

Unfortunately I haven't made much progress on this, but I need to move on to other things, so I'm writing down where I got to. I think I agree that statistics seem to be unrelated. I recently had a "bad" run where statistics were triggered relatively quickly after the promo_codes schema change.

I took a step back and tried to confirm which queries seem to cause this high p99 latency, and two queries were repeatedly logged: INSERT INTO promo_codes and SELECT FROM vehicles LIMIT 25. Since INSERT INTO promo_codes is expected to take a long time (duplicate indexes favor read latency over write latency), it looks like our culprit is SELECT FROM vehicles LIMIT 25. Unfortunately I haven't been able to get a trace for one of these high-p99 queries, since the only cluster setting we have to enable tracing above a certain threshold seems to be sql.trace.txn.enable_threshold, which traces the whole transaction and takes into account communication with the client, which in this case seems to be non-negligible:

‹     0.690ms      0.016ms        event:sql/conn_executor.go:1428 [n1,client=10.142.0.23:54036,hostnossl,user=root] [Open pos:873] executing Sync›
‹     8.243ms      7.554ms        === operation:exec stmt client:10.142.0.23:54036 hostnossl: node:1 user:root›
‹     8.304ms      0.060ms        event:sql/conn_executor.go:1428 [n1,client=10.142.0.23:54036,hostnossl,user=root] [Open pos:874] executing ExecStmt: SELECT vehicles.id AS vehicles_id, vehicles.city AS vehicles_city, vehicles.type AS vehicles_type, vehicles.owner_id AS vehicles_owner_id, vehicles.creation_time AS vehicles_creation_time, vehicles.status AS vehicles_status, vehicles.current_location AS vehicles_current_location, vehicles.ext AS vehicles_ext FROM vehicles WHERE vehicles.city = 'new york' LIMIT 25›

Since this p99 metric covers only the service and execution time of the exec command, the next step will be to add a cluster setting that allows tracing based on a single command's latency. Hopefully that will give us a slow trace.
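For reference, the transaction-level threshold mentioned above is enabled with a cluster setting like this (the 10ms value is only an example):

  -- Log a full trace for any transaction that takes longer than the threshold;
  -- note this includes time spent communicating with the client.
  SET CLUSTER SETTING sql.trace.txn.enable_threshold = '10ms';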

@RaduBerinde
Member

I filed #57634 to extend statement diagnostics to only collect them when exec time is above a given value.

@asubiotto
Contributor

asubiotto commented Dec 8, 2020

Added the mentioned cluster setting to trace statements above a certain threshold. This is what a trace of SELECT FROM vehicles LIMIT 25 (which we established was the source of the "surprising" latency) looks like:

I201208 15:27:23.912222 234937 sql/conn_executor_exec.go:660 ⋮ [n1,client=‹10.142.0.50:38456›,hostnossl,user=root] SQL stmt ‹SELECT vehicles.id AS vehicles_id, vehicles.city AS vehicles_city, vehicles.type AS vehicles_type, vehicles.owner_id AS vehicles_owner_id, vehicles.creation_time AS vehicles_creation_time, vehicles.status AS vehicles_status, vehicles.current_location AS vehicles_current_location, vehicles.ext AS vehicles_ext FROM vehicles WHERE vehicles.city = 'new york' LIMIT 25› took 26.384037ms, exceeding statement threshold of 10ms:
‹     0.000ms      0.000ms    === operation:exec-threshold client:10.142.0.50:38456 hostnossl: node:1 sb:1 unfinished: user:root›
‹     0.056ms      0.056ms    event:sql/catalog/descs/collection.go:295 [n1,client=10.142.0.50:38456,hostnossl,user=root] added descriptor 'movr' to collection: name:"movr" id:52 modification_time:<wall_time:1607440075481870607 > version:1 privileges:<users:<user:"admin" privileges:2 > users:<user:"root" privileges:2 > owner:"root" version:1 > state:PUBLIC offline_reason:"" ›
‹     0.072ms      0.017ms    event:sql/catalog/descs/collection.go:943 [n1,client=10.142.0.50:38456,hostnossl,user=root] found descriptor 52 in cache›
‹     0.088ms      0.016ms    event:sql/catalog/descs/collection.go:271 [n1,client=10.142.0.50:38456,hostnossl,user=root] found descriptor in collection for 'movr'›
‹     0.360ms      0.272ms    event:sql/catalog/descs/collection.go:295 [n1,client=10.142.0.50:38456,hostnossl,user=root] added descriptor 'vehicles' to collection: name:"vehicles" id:56 version:33 modification_time:<wall_time:1607440832481058971 logical:1 > parent_id:52 unexposed_parent_schema_id:29 columns:<name:"id" id:1 type:<InternalType:<family:UuidFamily width:0 precision:0 locale:"" visible_type:0 oid:2950 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:false hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"city" id:2 type:<InternalType:<family:StringFamily width:0 precision:0 locale:"" visible_type:0 oid:1043 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:false hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"type" id:3 type:<InternalType:<family:StringFamily width:0 precision:0 locale:"" visible_type:0 oid:1043 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:true hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"owner_id" id:4 type:<InternalType:<family:UuidFamily width:0 precision:0 locale:"" visible_type:0 oid:2950 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:true hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"creation_time" id:5 type:<InternalType:<family:TimestampFamily width:0 precision:0 locale:"" visible_type:0 oid:1114 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:true hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"status" id:6 type:<InternalType:<family:StringFamily width:0 precision:0 locale:"" visible_type:0 oid:1043 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:true hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"current_location" id:7 type:<InternalType:<family:StringFamily width:0 precision:0 locale:"" visible_type:0 oid:1043 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:true hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"ext" id:8 type:<InternalType:<family:JsonFamily width:0 precision:0 locale:"" visible_type:0 oid:3802 time_precision_is_set:false > TypeMeta:<Version:0 > > nullable:true hidden:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > next_column_id:9 families:<name:"primary" id:0 column_names:"id" column_names:"city" column_names:"type" column_names:"owner_id" column_names:"creation_time" column_names:"status" column_names:"current_location" column_names:"ext" column_ids:1 column_ids:2 column_ids:3 column_ids:4 column_ids:5 column_ids:6 column_ids:7 column_ids:8 default_column_id:0 > next_family_id:1 primary_index:<name:"primary" id:3 unique:true version:1 column_names:"city" column_names:"id" column_directions:ASC column_directions:ASC column_ids:2 column_ids:1 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:1 list:<name:"new_york" values:"\006\010new york" subpartitioning:<num_columns:0 > > list:<name:"chicago" values:"\006\007chicago" subpartitioning:<num_columns:0 > > list:<name:"seattle" values:"\006\007seattle" subpartitioning:<num_columns:0 > > > type:FORWARD 
created_explicitly:true encoding_type:1 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" > indexes:<name:"vehicles_city_owner_id_idx" id:6 unique:false version:1 column_names:"city" column_names:"owner_id" column_directions:ASC column_directions:ASC column_ids:2 column_ids:4 extra_column_ids:1 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:1 list:<name:"new_york" values:"\006\010new york" subpartitioning:<num_columns:0 > > list:<name:"chicago" values:"\006\007chicago" subpartitioning:<num_columns:0 > > list:<name:"seattle" values:"\006\007seattle" subpartitioning:<num_columns:0 > > > type:FORWARD created_explicitly:true encoding_type:0 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" > next_index_id:7 privileges:<users:<user:"admin" privileges:2 > users:<user:"root" privileges:2 > owner:"root" version:1 > next_mutation_id:9 format_version:3 state:PUBLIC offline_reason:"" view_query:"" is_materialized_view:false drop_time:0 replacement_of:<id:0 time:<> > audit_mode:DISABLED drop_job_id:0 gc_mutations:<index_id:1 drop_time:0 job_id:0 > gc_mutations:<index_id:2 drop_time:0 job_id:0 > gc_mutations:<index_id:5 drop_time:0 job_id:0 > gc_mutations:<index_id:4 drop_time:0 job_id:0 > create_query:"" create_as_of_time:<wall_time:1607440123025966313 logical:1 > outbound_fks:<origin_table_id:56 origin_column_ids:2 origin_column_ids:4 referenced_column_ids:2 referenced_column_ids:1 referenced_table_id:53 name:"fk_owner_id_ref_users_mr" validity:Validated on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > inbound_fks:<origin_table_id:58 origin_column_ids:2 origin_column_ids:4 referenced_column_ids:2 referenced_column_ids:1 referenced_table_id:56 name:"fk_vehicle_id_ref_vehicles_mr" validity:Validating on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > temporary:false ›
‹     0.381ms      0.020ms    event:sql/plan_opt.go:343 [n1,client=10.142.0.50:38456,hostnossl,user=root] query cache hit›
‹     0.428ms      0.047ms    event:sql/conn_executor_exec.go:654 [n1,client=10.142.0.50:38456,hostnossl,user=root] planning ends›
‹     0.446ms      0.018ms    event:sql/conn_executor_exec.go:654 [n1,client=10.142.0.50:38456,hostnossl,user=root] checking distributability›
‹     0.458ms      0.013ms    event:sql/conn_executor_exec.go:654 [n1,client=10.142.0.50:38456,hostnossl,user=root] will distribute plan: false›
‹     0.481ms      0.022ms    event:sql/conn_executor_exec.go:654 [n1,client=10.142.0.50:38456,hostnossl,user=root] execution starts: distributed engine›
‹     0.492ms      0.011ms    event:sql/distsql_running.go:982 [n1,client=10.142.0.50:38456,hostnossl,user=root] creating DistSQL plan with isLocal=true›
‹     0.513ms      0.021ms    event:sql/distsql_running.go:382 [n1,client=10.142.0.50:38456,hostnossl,user=root] running DistSQL plan›
‹     0.516ms      0.003ms        === operation:flow client:10.142.0.50:38456 hostnossl: node:1 user:root›
‹     0.546ms      0.031ms        event:sql/flowinfra/flow.go:307 [n1,client=10.142.0.50:38456,hostnossl,user=root] starting (0 processors, 0 startables)›
‹     0.547ms      0.001ms        === operation:table reader client:10.142.0.50:38456 cockroach.flowid:00000000-0000-0000-0000-000000000000 cockroach.processorid:0 hostnossl: node:1 user:root›
‹     0.558ms      0.011ms        event:sql/rowexec/tablereader.go:188 [n1,client=10.142.0.50:38456,hostnossl,user=root] starting scan with limitBatches true›
‹     0.604ms      0.046ms        event:sql/row/kv_batch_fetcher.go:334 [n1,client=10.142.0.50:38456,hostnossl,user=root] Scan /Table/56/3/"new york"{-/PrefixEnd}›
‹     0.608ms      0.004ms            === operation:txn coordinator send client:10.142.0.50:38456 hostnossl: node:1 txnID:3b4266cd-41c0-4c84-bf2c-6de5b466db7f user:root›
‹     0.616ms      0.008ms                === operation:dist sender send client:10.142.0.50:38456 hostnossl: node:1 txn:3b4266cd user:root›
‹     0.636ms      0.020ms                event:kv/kvclient/kvcoord/range_iter.go:159 [n1,client=10.142.0.50:38456,hostnossl,user=root,txn=3b4266cd] querying next range at /Table/56/3/"new york"›
‹     0.685ms      0.048ms                event:kv/kvclient/kvcoord/dist_sender.go:1846 [n1,client=10.142.0.50:38456,hostnossl,user=root,txn=3b4266cd] r80: sending batch 1 Scan to (n2,s2):4›
‹     0.698ms      0.013ms                event:rpc/nodedialer/nodedialer.go:132 [n1,client=10.142.0.50:38456,hostnossl,user=root,txn=3b4266cd] sending request to 10.142.0.38:26257›
‹     0.707ms      0.009ms                    === operation:/cockroach.roachpb.Internal/Batch component:gRPC node:1 sb:1 span.kind:client txnID:3b4266cd-41c0-4c84-bf2c-6de5b466db7f›
‹     0.857ms      0.150ms                    === operation:/cockroach.roachpb.Internal/Batch component:gRPC node:2 sb:1 span.kind:server txnid:3b4266cd-41c0-4c84-bf2c-6de5b466db7f unfinished:›
‹     0.884ms      0.027ms                    event:server/node.go:879 [n2] node received request: 1 Scan›
‹     0.960ms      0.076ms                    event:kv/kvserver/store_send.go:161 [n2,s2] executing [txn: 3b4266cd], [can-forward-ts], Scan [/Table/56/3/"new york",/Table/56/3/"new york"/PrefixEnd)›
‹     0.982ms      0.023ms                    event:kv/kvserver/replica_send.go:94 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] read-only path›
‹     1.005ms      0.023ms                    event:kv/kvserver/concurrency/concurrency_manager.go:116 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] sequencing request›
‹     1.016ms      0.012ms                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] acquiring latches›
‹     1.183ms      0.166ms                    event:kv/kvserver/concurrency/concurrency_manager.go:164 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] scanning lock table for conflicting locks›
‹     1.200ms      0.017ms                    event:kv/kvserver/replica_read.go:124 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] executing read-only batch›
‹     1.428ms      0.229ms                    event:kv/kvserver/concurrency/concurrency_manager.go:120 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] re-sequencing request›
‹     1.445ms      0.017ms                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] acquiring latches›
‹     1.462ms      0.017ms                    event:kv/kvserver/concurrency/concurrency_manager.go:164 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] scanning lock table for conflicting locks›
‹     1.478ms      0.016ms                    event:kv/kvserver/concurrency/concurrency_manager.go:171 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] waiting in lock wait-queues›
‹    25.355ms     23.877ms                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] acquiring latches›
‹    25.370ms      0.015ms                    event:kv/kvserver/concurrency/concurrency_manager.go:164 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] scanning lock table for conflicting locks›
‹    25.389ms      0.019ms                    event:kv/kvserver/replica_read.go:124 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] executing read-only batch›
‹    25.590ms      0.201ms                    event:kv/kvserver/replica_read.go:109 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] read completed›
‹    26.047ms     25.439ms            event:kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:591 [n1,client=10.142.0.50:38456,hostnossl,user=root,txn=3b4266cd] recording span to refresh: /Table/56/3/"new york"{-/"\x19=\tae\xc9Bݛ\xeab\xd6\n\xb1\xdaw"/0}›
‹    26.304ms     25.791ms    event:sql/conn_executor_exec.go:654 [n1,client=10.142.0.50:38456,hostnossl,user=root] execution ends›
‹    26.335ms      0.031ms    event:sql/conn_executor_exec.go:654 [n1,client=10.142.0.50:38456,hostnossl,user=root] rows affected: 25›

The important part here is:

‹     1.478ms      0.016ms                    event:kv/kvserver/concurrency/concurrency_manager.go:171 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] waiting in lock wait-queues›
‹    25.355ms     23.877ms                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n2,s2,r80/4:/Table/56/3/"new york"{-/…}] acquiring latches›

This shows ~24 ms spent waiting in a lock wait-queue. @nvanbenschoten, is that contention? It doesn't look like a plan issue, and the workload is the same, so this is curious. For reference, the schema change happened at ~15:22:

 613975727105638402 | SCHEMA CHANGE    | CREATE INDEX promo_codes_idx_west ON movr.public.promo_codes (code) STORING (description, creation_time, expiration_time, rules)                                                                                                                                                                                                                                                                                                        |           | root      | succeeded | NULL           | 2020-12-08 15:22:02.18674+00:00  | 2020-12-08 15:22:02.45555+00:00  | 2020-12-08 15:22:07.485331+00:00 | 2020-12-08 15:22:07.480031+00:00 |                  1 |       |           NULL
  613975706373095425 | SCHEMA CHANGE    | CREATE INDEX promo_codes_idx_east ON movr.public.promo_codes (code) STORING (description, creation_time, expiration_time, rules)                                                                                                                                                                                                                                                                                                        |           | root      | succeeded | NULL           | 2020-12-08 15:21:55.859468+00:00 | 2020-12-08 15:21:56.229605+00:00 | 2020-12-08 15:22:01.159752+00:00 | 2020-12-08 15:22:01.155939+00:00 |                  1 |       |           NULL

And the above trace was collected ~5 minutes after this when latencies had not yet stabilized:
[Screenshot: p99 latency graph, ~5 minutes after the schema change]

At this stage I'd prefer to hand this off to someone in KV for further investigation, since this is outside my area of expertise. @nvanbenschoten, are you interested in taking it on, or could you find someone to take it on? Happy to collect any more information you might need.

@nvanbenschoten
Member

Yes, that looks like contention. The workload seems to be updating the vehicles table periodically, so it's not clear that this is unexpected. If a transaction updates the vehicles table and then another scans it and conflicts, the scanning transaction is going to have to wait.

This may be impacted by #9521 / #49973, which we saw cause some issues with movr in 20.1 before a partial fix, but that wouldn't have gotten worse in 20.2. @sumeerbhola is also looking to improve that in this release.

Taking a step back, where are we on this issue? Is that slow query unique to 20.2? Is your statement in #56923 (comment) about the regression being transient no longer true?

@asubiotto
Contributor

Taking a step back, where are we on this issue? Is that slow query unique to 20.2? Is your statement in #56923 (comment) about the regression being transient no longer true?

No, the workload is exactly the same between both versions. This query is also slow/contended on 20.1.8, but the mystery is why, with the exact same workload, 20.2.0 takes twice as long to stabilize at these low p99 latencies.

Here are the logs for both versions with a statement tracing threshold of 15ms in 20.1.8 and 20ms in 20.2.0: https://drive.google.com/drive/folders/12N2k47TZsQaXR6amy7jUivFnfX9jVe5d?usp=sharing

One unscientific observation is that in 20.1.8, the SELECT vehicles transaction seems to perform way more pushes (note that the logs for 20.2.0 cover a longer period than those for 20.1.8):

$ ag "pushing timestamp" asubiotto-mrgeoreplicas-v20.1.8.logs/*.logs/*.log | ag "chicago|seattle|new york" | wc -l
    2183
$ ag "pushing timestamp" asubiotto-mrgeoreplicas-v20.2.0.logs/*.logs/*.log | ag "chicago|seattle|new york" | wc -l
      55

@sumeerbhola
Collaborator

The 50 ms before the push is probably because of the LockTableLivenessPushDelay setting.
I am not sure I am interpreting the following trace correctly, but it looks like the push is resolving the intents, so is it that the conflicting transaction did not resolve them itself? And the second trace below has a "pushee txn record not found", which is odd (both traces are from asubiotto-mrgeoreplicas-v20.2.0.logs/5.logs/cockroach.log).

Trace 1

‹    -3.927ms      0.011ms                    event:kv/kvserver/concurrency/concurrency_manager.go:171 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] waiting in lock wait-queues›
‹    46.219ms     50.146ms                    event:kv/kvserver/concurrency/lock_table_waiter.go:451 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] pushing timestamp of txn 101f7f2c above 1607441273.781834899,0›
‹    46.236ms      0.017ms                    event:kv/kvserver/intentresolver/intent_resolver.go:375 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] pushing 1 transaction(s)›
‹    46.243ms      0.007ms                        === operation:dist sender send node:4 range:82/5:/Table/56/3/"chicago"{-/P…} store:4›
‹    46.267ms      0.024ms                        event:kv/kvclient/kvcoord/range_iter.go:159 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] querying next range at /Table/56/3/"chicago"/"\x05\x05\xb1\xc2>\x1f@R\x8adU34N\xf0^"›
‹    46.324ms      0.057ms                        event:kv/kvclient/kvcoord/dist_sender.go:1846 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] r82: sending batch 1 PushTxn to (n4,s4):5›
‹    46.336ms      0.013ms                        event:rpc/nodedialer/nodedialer.go:124 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] sending request to local client›
‹    46.340ms      0.004ms                            === operation:/cockroach.roachpb.Internal/Batch node:4›
‹    46.356ms      0.016ms                            event:server/node.go:879 [n4] node received request: 1 PushTxn›
‹    46.461ms      0.105ms                            event:kv/kvserver/store_send.go:161 [n4,s4] executing PushTxn(e3a463a6->101f7f2c) [/Table/56/3/"chicago"/"\x05\x05\xb1\xc2>\x1f@R\x8adU34N\xf0^",/Min)›
‹    46.474ms      0.013ms                            event:kv/kvserver/replica_send.go:98 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] read-write path›
‹    46.494ms      0.020ms                            event:kv/kvserver/concurrency/concurrency_manager.go:116 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] sequencing request›
‹    46.503ms      0.009ms                            event:kv/kvserver/concurrency/concurrency_manager.go:152 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] acquiring latches›
‹    50.984ms      4.481ms                            event:kv/kvserver/replica_write.go:110 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] applied timestamp cache›
‹    51.001ms      0.017ms                            event:kv/kvserver/replica_write.go:355 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] executing read-write batch›
‹    51.075ms      0.074ms                            event:kv/kvserver/replica_raft.go:74 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] evaluated request›
‹    51.120ms      4.884ms                    event:kv/kvserver/intentresolver/intent_resolver.go:405 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] 101f7f2c-e7a2-48f7-a899-06f85031d50a is now COMMITTED›
‹    51.136ms      0.016ms                    event:kv/kvserver/intentresolver/intent_resolver.go:833 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] resolving intents›
‹    57.840ms      6.703ms                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] acquiring latches›
‹    57.882ms      0.042ms                    event:kv/kvserver/concurrency/concurrency_manager.go:164 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] scanning lock table for conflicting locks›
‹    57.903ms      0.021ms                    event:kv/kvserver/replica_read.go:124 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] executing read-only batch›
‹    58.089ms      0.186ms                    event:kv/kvserver/replica_read.go:109 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] read completed›

Trace 2

‹    -2.103ms      0.044ms                    event:kv/kvserver/concurrency/concurrency_manager.go:171 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] waiting in lock wait-queues›
‹    48.169ms     50.272ms                    event:kv/kvserver/concurrency/lock_table_waiter.go:451 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] pushing timestamp of txn 3f2f13e8 above 1607441306.597794051,7›
‹    48.192ms      0.023ms                    event:kv/kvserver/intentresolver/intent_resolver.go:375 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] pushing 1 transaction(s)›
‹    48.205ms      0.013ms                        === operation:dist sender send node:4 range:82/5:/Table/56/3/"chicago"{-/P…} store:4›
‹    48.264ms      0.059ms                        event:kv/kvclient/kvcoord/range_iter.go:159 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] querying next range at /Table/56/3/"chicago"/"\x02\t\x99I\xfe\xcfO\xe7\xb8ct\x0f\x98\xa2Y\x8e"›
‹    48.356ms      0.092ms                        event:kv/kvclient/kvcoord/dist_sender.go:1846 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] r82: sending batch 1 PushTxn to (n4,s4):5›
‹    48.371ms      0.014ms                        event:rpc/nodedialer/nodedialer.go:124 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] sending request to local client›
‹    48.377ms      0.006ms                            === operation:/cockroach.roachpb.Internal/Batch node:4›
‹    48.389ms      0.012ms                            event:server/node.go:879 [n4] node received request: 1 PushTxn›
‹    48.517ms      0.128ms                            event:kv/kvserver/store_send.go:161 [n4,s4] executing PushTxn(76277d81->3f2f13e8) [/Table/56/3/"chicago"/"\x02\t\x99I\xfe\xcfO\xe7\xb8ct\x0f\x98\xa2Y\x8e",/Min)›
‹    48.537ms      0.020ms                            event:kv/kvserver/replica_send.go:98 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] read-write path›
‹    48.565ms      0.028ms                            event:kv/kvserver/concurrency/concurrency_manager.go:116 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] sequencing request›
‹    48.575ms      0.010ms                            event:kv/kvserver/concurrency/concurrency_manager.go:152 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] acquiring latches›
‹    48.599ms      0.024ms                            event:kv/kvserver/replica_write.go:110 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] applied timestamp cache›
‹    48.611ms      0.012ms                            event:kv/kvserver/replica_write.go:355 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] executing read-write batch›
‹    48.680ms      0.070ms                            event:kv/kvserver/batcheval/cmd_push_txn.go:145 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] pushee txn record not found›
‹    48.731ms      0.051ms                            event:kv/kvserver/batcheval/cmd_push_txn.go:267 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] failed to push meta={id=3f2f13e8 key=/Table/56/3/"chicago"/"\x02\t\x99I\xfe\xcfO\xe7\xb8ct\x0f\x98\xa2Y\x8e" pri=0.02011331 epo=0 ts=1607441306.582972980,0 min=1607441306.582972980,0 seq=1} lock=true stat=PENDING rts=0,0 wto=false max=0,0›
‹    48.765ms      0.033ms                            event:kv/kvserver/replica_raft.go:74 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] evaluated request›
‹    48.781ms      0.016ms                            event:kv/kvserver/concurrency/concurrency_manager.go:120 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] re-sequencing request›
‹    48.796ms      0.015ms                            event:kv/kvserver/txnwait/queue.go:455 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] 76277d81 pushing 3f2f13e8 (1 pending)›
‹    48.947ms      0.152ms                            event:kv/kvserver/txnwait/queue.go:557 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] querying pushee›
‹    48.955ms      0.008ms                                === operation:dist sender send node:4 range:82/5:/Table/56/3/"chicago"{-/P…} store:4›
‹    48.980ms      0.025ms                                event:kv/kvclient/kvcoord/range_iter.go:159 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] querying next range at /Table/56/3/"chicago"/"\x02\t\x99I\xfe\xcfO\xe7\xb8ct\x0f\x98\xa2Y\x8e"›
‹    49.066ms      0.086ms                                event:kv/kvclient/kvcoord/dist_sender.go:1846 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] r82: sending batch 1 QueryTxn to (n4,s4):5›
‹    49.077ms      0.012ms                                event:rpc/nodedialer/nodedialer.go:124 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] sending request to local client›
‹    49.082ms      0.005ms                                    === operation:/cockroach.roachpb.Internal/Batch node:4›
‹    49.093ms      0.010ms                                    event:server/node.go:879 [n4] node received request: 1 QueryTxn›
‹    49.114ms      0.021ms                                    event:kv/kvserver/store_send.go:161 [n4,s4] executing QueryTxn [/Table/56/3/"chicago"/"\x02\t\x99I\xfe\xcfO\xe7\xb8ct\x0f\x98\xa2Y\x8e",/Min)›
‹    49.127ms      0.014ms                                    event:kv/kvserver/replica_send.go:94 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] read-only path›
‹    49.145ms      0.017ms                                    event:kv/kvserver/concurrency/concurrency_manager.go:116 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] sequencing request›
‹    49.154ms      0.009ms                                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] acquiring latches›
‹    49.170ms      0.017ms                                    event:kv/kvserver/replica_read.go:124 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] executing read-only batch›
‹    49.222ms      0.052ms                                    event:kv/kvserver/replica_read.go:109 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] read completed›
‹    86.415ms     37.468ms                            event:kv/kvserver/txnwait/queue.go:538 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] result of pending push: meta={id=3f2f13e8 key=/Table/56/3/"chicago"/"\x02\t\x99I\xfe\xcfO\xe7\xb8ct\x0f\x98\xa2Y\x8e" pri=0.02011331 epo=0 ts=1607441306.582972980,0 min=1607441306.582972980,0 seq=5} lock=true stat=COMMITTED rts=1607441306.582972980,0 wto=false max=1607441306.584668532,0 int=3›
‹    86.516ms      0.101ms                            event:kv/kvserver/txnwait/queue.go:549 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] push request is satisfied›
‹    86.567ms     38.374ms                    event:kv/kvserver/intentresolver/intent_resolver.go:405 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] 3f2f13e8-bfc6-4dea-9c6c-8b07891785cb is now COMMITTED›
‹    86.584ms      0.018ms                    event:kv/kvserver/intentresolver/intent_resolver.go:833 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] resolving intents›
‹    92.014ms      5.430ms                    event:kv/kvserver/concurrency/concurrency_manager.go:152 [n4,s4,r82/5:/Table/56/3/"chicago"{-/P…}] acquiring latches›

@rafiss
Collaborator

rafiss commented Feb 16, 2021

linking issue for regression testing movr performance: #56922

jlinder added the T-kv (KV Team) label on Jun 16, 2021
blathers-crl bot added the T-sql-queries (SQL Queries Team) label on Aug 20, 2021
mwang1026 removed the T-kv (KV Team) label on Aug 20, 2021
@rytaft
Collaborator

rytaft commented Aug 1, 2022

This issue got buried, and the CRDB versions discussed here are no longer supported, so I don't think it's worthwhile to leave this issue open. #56922 should cover the work needed to discover issues like this more easily in the future.

rytaft closed this as completed on Aug 1, 2022
mgartner moved this to Done in SQL Queries on Jul 24, 2023