
sql: Slow query on select multiple records statement #69763

Closed
DavidChenLondon opened this issue Sep 2, 2021 · 7 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@DavidChenLondon

Describe the problem

We hit a slow SQL query issue (up to 30+ seconds) that only occurs when selecting multiple records, not a single record, as shown in the SQL analysis below.

Our CockroachDB deployment consists of 3 regional clusters. We found that only one node showed a service latency issue on the dashboard /metrics/overview/cluster page, along with a large number of RPC errors. Strangely, the /reports/network/region page looked normal.

We found that simply restarting the pod could not fix the issue; our final solution was to reschedule the pod to another k8s node, after which our SQL queries recovered their usual speed. Later we found that another app sharing the same k8s node also had slow/broken connections, so it looks more like a hardware problem.

Although the problem is gone, we are wondering if anyone in the community could help answer the two questions below:

Question #1: if only one node has a problem, CockroachDB (as a NewSQL database) should continue to serve, right?

Question #2: why is the RPC error count so high, and why does it occur on only one node? Shouldn't an RPC error involve at least two nodes?

To Reproduce

Unfortunately it's not easy to reproduce, but we can privately provide our detailed internal analysis (a Confluence document) and a statement diagnostics bundle zip.

Expected behavior
The expected running time of this multiple-record select should be at most hundreds of milliseconds, since the table only holds a few thousand rows so far.

Additional data / screenshots

slow SQL analysis:

sfodoo_cockroach_prod=> explain analyze select * from sfodoo_orders limit 1;
                info
-------------------------------------
 planning time: 666µs
 execution time: 11.7s
 distribution: full
 vectorized: true
 rows read from KV: 1 (3.3 KiB)
 cumulative time spent in KV: 11.5s
 maximum memory usage: 20 KiB
 network usage: 5.8 KiB (3 messages)
 regions: cdb-cn-shanghai

 • scan
   nodes: n1
   regions: cdb-cn-shanghai
   actual row count: 1
   KV rows read: 1
   KV bytes read: 3.3 KiB
   missing stats
   table: sfodoo_orders@primary
   spans: LIMITED SCAN
   limit: 1
(20 rows)

sfodoo_cockroach_prod=> explain analyze select count(*) from sfodoo_orders;
                           info
----------------------------------------------------------
 planning time: 468µs
 execution time: 21.3s
 distribution: full
 vectorized: true
 rows read from KV: 6,430 (349 KiB)
 cumulative time spent in KV: 21.1s
 maximum memory usage: 2.4 MiB
 network usage: 152 B (3 messages)
 regions: cdb-cn-shanghai

 • group (scalar)
 │ nodes: n1
 │ regions: cdb-cn-shanghai
 │ actual row count: 1
 │
 └── • scan
       nodes: n1
       regions: cdb-cn-shanghai
       actual row count: 6,430
       KV rows read: 6,430
       KV bytes read: 349 KiB
       missing stats
       table: sfodoo_orders@sfodooorders_client_order_ref
       spans: FULL SCAN
(24 rows)

sfodoo_cockroach_prod=> explain analyze select * from sfodoo_orders where uuid = '0000fa7d-faf8-4abb-9a94-eadbc63358aa';
                                             info
----------------------------------------------------------------------------------------------
 planning time: 635µs
 execution time: 163ms
 distribution: local
 vectorized: true
 rows read from KV: 1 (2.1 KiB)
 cumulative time spent in KV: 163ms
 maximum memory usage: 10 KiB
 network usage: 0 B (0 messages)
 regions: cdb-eu-west-1

 • scan
   nodes: n3
   regions: cdb-eu-west-1
   actual row count: 1
   KV rows read: 1
   KV bytes read: 2.1 KiB
   missing stats
   table: sfodoo_orders@primary
   spans: [/'0000fa7d-faf8-4abb-9a94-eadbc63358aa' - /'0000fa7d-faf8-4abb-9a94-eadbc63358aa']
(19 rows)

CockroachDB error logs:

2021-09-01 13:30:10	WARNING	
[n1,summaries] health alerts detected: ‹{Alerts:[{StoreID:0 Category:METRICS Description:timeseries.write.errors Value:1}]}›
server/node.go:768


2021-09-01 13:46:30	ERROR	
[n1,s1,r140/1:‹/System/tsd/cr.node.sql.…›] context canceled
kv/kvserver/replica_proposal.go:247
2021-09-01 13:46:36	ERROR	
[n1,consistencyChecker,s1,r140/1:‹/System/tsd/cr.node.sql.…›] computing own checksum: ‹rpc error: code = Unknown desc = no checksum found (ID = d57b83fe-1a07-405d-ad35-eca74e6eb5b7)›
kv/kvserver/consistency_queue.go:191
2021-09-01 13:46:36	ERROR	
[n1,consistencyChecker,s1,r140/1:‹/System/tsd/cr.node.sql.…›] computing own checksum: ‹rpc error: code = Unknown desc = no checksum found (ID = d57b83fe-1a07-405d-ad35-eca74e6eb5b7)›
kv/kvserver/queue.go:1093

2021-09-01 13:56:42	ERROR	
[n1,timeSeriesMaintenance,s1,r137/1:‹/System/tsd/cr.node.sql.mem.…›] ‹operation "timeSeriesMaintenance queue process replica 137" timed out after 1m0s›: aborted during DistSender.Send: context deadline exceeded
kv/kvserver/queue.go:1093

Environment:

  • CockroachDB version V21.1.5
  • Server OS: Linux version 3.10.0-1127.19.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) on Kubernetes
  • Client app: Python ORM peewee==3.14.4

Additional context
We found a similar issue, #59377, which shows a similar symptom and a similar recovery.

@DavidChenLondon DavidChenLondon added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 2, 2021
@blathers-crl

blathers-crl bot commented Sep 2, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Sep 2, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 2, 2021
@rytaft
Collaborator

rytaft commented Sep 2, 2021

Hi @DavidChenLondon -- thanks for submitting this issue and sorry to hear about the slow query problem. I can answer Question 1, but I hope someone else will chime in for Question 2.

Question #1: if only one node has a problem, CockroachDB (as a NewSQL database) should continue to serve, right?

It depends on which ranges are targeted by your query. The query that targeted a specific UUID was fast because we could route it to a specific range, which was likely located on a node that didn't have connectivity issues at the time. A query that does not target a specific range will currently require communicating with all nodes containing ranges for that table (specifically the leaseholders), even if there is a limit. If any of those nodes have connectivity issues, you may see performance problems.

We have an issue open (#64862) to support locality optimized search if a limit is used on a multi-region table. Once that issue is resolved, it will enable the query to short-circuit if the limit is reached before hearing back from all regions.
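To make the distinction above concrete, here is a hedged sketch using the table from this issue (the queries mirror the EXPLAIN ANALYZE output earlier in the thread; behavior described is as of v21.1):

```sql
-- Point lookup: the predicate pins the scan to the one range holding
-- the key, so only that range's leaseholder is contacted.
SELECT * FROM sfodoo_orders
WHERE uuid = '0000fa7d-faf8-4abb-9a94-eadbc63358aa';

-- Bare LIMIT: with no constraining predicate, the scan may wait on
-- leaseholders in every region before returning even one row.
SELECT * FROM sfodoo_orders LIMIT 1;

-- EXPLAIN (DISTSQL) shows which nodes would participate in the plan,
-- which can help confirm whether a query fans out to remote regions.
EXPLAIN (DISTSQL) SELECT * FROM sfodoo_orders LIMIT 1;
```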

@DavidChenLondon
Author

DavidChenLondon commented Sep 3, 2021

Hi @rytaft, thanks for the detailed explanation; it makes sense.

A further question: does that mean we will not be able to select multiple records from a table when one of our 3 clusters is down or unreachable? Pointers to the relevant CockroachDB documentation would be appreciated.

Some background: we are just starting to adopt CockroachDB for our new services, and the data is currently very small (less than 100MB), so it could easily fit entirely on one cluster or node. We haven't really configured cluster or region information for any tables. E.g. the following is the configuration I copied from #/database/sfodoo_cockroach_prod/table/public.sfodoo_orders:

ALTER RANGE default CONFIGURE ZONE USING
  range_min_bytes = 134217728,
  range_max_bytes = 536870912,
  gc.ttlseconds = 90000,
  num_replicas = 3,
  constraints = '[]',
  lease_preferences = '[]'

@rytaft
Collaborator

rytaft commented Sep 8, 2021

A further question: does that mean we will not be able to select multiple records from a table when one of our 3 clusters is down or unreachable? Pointers to the relevant CockroachDB documentation would be appreciated.

If you specify the specific keys that you are interested in, you should have no problem as long as all of the ranges containing those keys are available.

If you want all ranges to be available even when a region becomes unavailable, I would suggest that you use the new multi-region features and set your survival goal to SURVIVE REGION FAILURE. Then you won't need to worry about setting zone configurations.

However, if for some reason you do not want to use the new multi-region features, you can still use zone configuration and add constraints to ensure there is one replica in each region.
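The two options above might look roughly like this (a sketch, not a drop-in config: the region names are taken from the EXPLAIN output earlier in the thread, `cdb-third-region` is a hypothetical stand-in for the cluster's third locality, and SURVIVE REGION FAILURE requires a primary region plus at least three database regions):

```sql
-- Option 1: multi-region features (v21.1+).
ALTER DATABASE sfodoo_cockroach_prod SET PRIMARY REGION "cdb-eu-west-1";
ALTER DATABASE sfodoo_cockroach_prod ADD REGION "cdb-cn-shanghai";
ALTER DATABASE sfodoo_cockroach_prod ADD REGION "cdb-third-region";
ALTER DATABASE sfodoo_cockroach_prod SURVIVE REGION FAILURE;

-- Option 2: manual zone configuration, pinning one replica per region
-- via per-replica constraints.
ALTER TABLE sfodoo_orders CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{"+region=cdb-eu-west-1": 1, "+region=cdb-cn-shanghai": 1}';
```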

Does this help?

@DavidChenLondon
Author

DavidChenLondon commented Sep 10, 2021

Sorry, I may not have described the situation clearly.

We adopted CockroachDB (configuring 3 CockroachDB clusters across London and Shanghai) because we want data to auto-replicate across regions, in short "a global table", where each cluster holds its own complete, up-to-date copy of the data. For writes, each region/cluster can write its own data and the other regions/clusters receive the update shortly afterwards. For reads, a cluster should not depend on other regions/clusters being available/reachable. We thought this was CockroachDB's default behavior, but it looks like it's not?

"Slow query on select multiple records statement" may not be 100% accurate: the multiple records are not identified by specific primary keys; the query is more like select * from table order by some_fields limit 100. We thought such a query could be served entirely inside one cluster/region (without contacting other clusters/regions, even if the global data in that region lags slightly behind)?

@rytaft
Collaborator

rytaft commented Sep 15, 2021

Hi @DavidChenLondon, sorry for the slow reply. I think you'll get better (and faster) help if you ask these questions on our community slack channel: https://www.cockroachlabs.com/join-community/. These types of questions are perfect for that forum.

But in the meantime, I can try to answer this. In order to avoid distributing the query to all nodes, you can try turning off DistSQL using the session setting SET distsql = off. I'm not sure this will completely fix your problem, though. It's still best to explicitly target specific ranges of primary keys rather than only using a LIMIT. It also sounds like you're interested in the ability to read data locally, even if it's slightly stale. You can do this with follower reads, although similar to the multi-region features, follower reads require an enterprise license.
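The two suggestions might be tried like this (a sketch: `some_fields` is the placeholder from the question above, standing in for the real sort columns, and follower reads require an enterprise license):

```sql
-- Keep execution on the gateway node instead of distributing it.
SET distsql = off;

-- Follower read: serve slightly stale data from the nearest replica
-- instead of the (possibly remote) leaseholder.
SELECT * FROM sfodoo_orders
AS OF SYSTEM TIME follower_read_timestamp()
ORDER BY some_fields
LIMIT 100;
```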

Hope this helps answer some of your questions. You can certainly continue to ask questions on this issue, but I do think you'll get better help from our Slack channel if you have follow-up questions.

@DavidChenLondon
Author

Hi @rytaft, thank you so much for this final answer, which I think is the right solution to the original question 👍. I'll definitely get more involved in the CockroachDB Slack channel, since we are already using CockroachDB in production.
