
sql: Slow query on select multiple records statement #69763

Closed
DavidChenLondon opened this issue Sep 2, 2021 · 7 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@DavidChenLondon

Describe the problem

We hit a slow SQL query issue (up to 30+ seconds) that only occurs when selecting multiple records, not a single record, as shown in the SQL analysis below.

Our CockroachDB deployment consists of 3 regional clusters. We found that only one node showed a service latency issue on the dashboard /metrics/overview/cluster page, along with a large number of RPC errors. Strangely, the /reports/network/region page looked normal.

We found that simply restarting the pod could not fix the issue; our final solution was to reschedule the pod to another k8s node, after which our SQL queries recovered their usual speed. Later we found that another app sharing the same k8s node also had slow/broken connections, so it looks more like a hardware problem.

Although the problem is gone, we are wondering if anyone in the community could help answer the two questions below:

Question #1: if only one node has a problem, CockroachDB (as a NewSQL database) should continue to serve, right?

Question #2: why is the RPC error count so high, and why does it occur on only one node? Shouldn't an RPC error involve at least two nodes?

To Reproduce

Unfortunately it's not easy to reproduce, but we can privately provide our detailed internal analysis (a Confluence document) and a statement diagnostics bundle zip.

Expected behavior
The expected running time of this multiple-record select should be at most hundreds of milliseconds, since the table only holds a few thousand rows so far.

Additional data / screenshots

slow SQL analysis:

sfodoo_cockroach_prod=> explain analyze select * from sfodoo_orders limit 1;
                info
-------------------------------------
 planning time: 666µs
 execution time: 11.7s
 distribution: full
 vectorized: true
 rows read from KV: 1 (3.3 KiB)
 cumulative time spent in KV: 11.5s
 maximum memory usage: 20 KiB
 network usage: 5.8 KiB (3 messages)
 regions: cdb-cn-shanghai

 • scan
   nodes: n1
   regions: cdb-cn-shanghai
   actual row count: 1
   KV rows read: 1
   KV bytes read: 3.3 KiB
   missing stats
   table: sfodoo_orders@primary
   spans: LIMITED SCAN
   limit: 1
(20 rows)

sfodoo_cockroach_prod=> explain analyze select count(*) from sfodoo_orders;
                           info
----------------------------------------------------------
 planning time: 468µs
 execution time: 21.3s
 distribution: full
 vectorized: true
 rows read from KV: 6,430 (349 KiB)
 cumulative time spent in KV: 21.1s
 maximum memory usage: 2.4 MiB
 network usage: 152 B (3 messages)
 regions: cdb-cn-shanghai

 • group (scalar)
 │ nodes: n1
 │ regions: cdb-cn-shanghai
 │ actual row count: 1
 │
 └── • scan
       nodes: n1
       regions: cdb-cn-shanghai
       actual row count: 6,430
       KV rows read: 6,430
       KV bytes read: 349 KiB
       missing stats
       table: sfodoo_orders@sfodooorders_client_order_ref
       spans: FULL SCAN
(24 rows)

sfodoo_cockroach_prod=> explain analyze select * from sfodoo_orders where uuid = '0000fa7d-faf8-4abb-9a94-eadbc63358aa';
                                             info
----------------------------------------------------------------------------------------------
 planning time: 635µs
 execution time: 163ms
 distribution: local
 vectorized: true
 rows read from KV: 1 (2.1 KiB)
 cumulative time spent in KV: 163ms
 maximum memory usage: 10 KiB
 network usage: 0 B (0 messages)
 regions: cdb-eu-west-1

 • scan
   nodes: n3
   regions: cdb-eu-west-1
   actual row count: 1
   KV rows read: 1
   KV bytes read: 2.1 KiB
   missing stats
   table: sfodoo_orders@primary
   spans: [/'0000fa7d-faf8-4abb-9a94-eadbc63358aa' - /'0000fa7d-faf8-4abb-9a94-eadbc63358aa']
(19 rows)

CockroachDB error logs:

2021-09-01 13:30:10	WARNING	
[n1,summaries] health alerts detected: ‹{Alerts:[{StoreID:0 Category:METRICS Description:timeseries.write.errors Value:1}]}›
server/node.go:768


2021-09-01 13:46:30	ERROR	
[n1,s1,r140/1:‹/System/tsd/cr.node.sql.…›] context canceled
kv/kvserver/replica_proposal.go:247
2021-09-01 13:46:36	ERROR	
[n1,consistencyChecker,s1,r140/1:‹/System/tsd/cr.node.sql.…›] computing own checksum: ‹rpc error: code = Unknown desc = no checksum found (ID = d57b83fe-1a07-405d-ad35-eca74e6eb5b7)›
kv/kvserver/consistency_queue.go:191
2021-09-01 13:46:36	ERROR	
[n1,consistencyChecker,s1,r140/1:‹/System/tsd/cr.node.sql.…›] computing own checksum: ‹rpc error: code = Unknown desc = no checksum found (ID = d57b83fe-1a07-405d-ad35-eca74e6eb5b7)›
kv/kvserver/queue.go:1093

2021-09-01 13:56:42	ERROR	
[n1,timeSeriesMaintenance,s1,r137/1:‹/System/tsd/cr.node.sql.mem.…›] ‹operation "timeSeriesMaintenance queue process replica 137" timed out after 1m0s›: aborted during DistSender.Send: context deadline exceeded
kv/kvserver/queue.go:1093

Environment:

  • CockroachDB version V21.1.5
  • Server OS: Linux version 3.10.0-1127.19.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) on Kubernetes
  • Client app: Python ORM peewee==3.14.4

Additional context
We found a similar issue, #59377, which shows a similar symptom and a similar recovery.

@DavidChenLondon DavidChenLondon added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 2, 2021
@blathers-crl

blathers-crl bot commented Sep 2, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Sep 2, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 2, 2021
@rytaft
Collaborator

rytaft commented Sep 2, 2021

Hi @DavidChenLondon -- thanks for submitting this issue and sorry to hear about the slow query problem. I can answer Question 1, but I hope someone else will chime in for Question 2.

Question #1: if only one node has a problem, CockroachDB (as a NewSQL database) should continue to serve, right?

It depends on which ranges are targeted by your query. The query that targeted a specific UUID was fast because we could route it to a specific range, which was likely located on a node that didn't have connectivity issues at the time. A query that does not target a specific range will currently require communicating with all nodes containing ranges for that table (specifically the leaseholders), even if there is a limit. If any of those nodes have connectivity issues, you may see performance problems.

We have an issue open (#64862) to support locality optimized search if a limit is used on a multi-region table. Once that issue is resolved, it will enable the query to short-circuit if the limit is reached before hearing back from all regions.
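To make the distinction above concrete, here is a hedged sketch using the table from this issue (the queries mirror the EXPLAIN ANALYZE output earlier in the thread; behavior described is as of v21.1):

```sql
-- Point lookup: the predicate pins the scan to the one range holding
-- the key, so only that range's leaseholder is contacted.
SELECT * FROM sfodoo_orders
WHERE uuid = '0000fa7d-faf8-4abb-9a94-eadbc63358aa';

-- Bare LIMIT: with no constraining predicate, the scan may wait on
-- leaseholders in every region before returning even one row.
SELECT * FROM sfodoo_orders LIMIT 1;

-- EXPLAIN (DISTSQL) shows which nodes would participate in the plan,
-- which can help confirm whether a query fans out to remote regions.
EXPLAIN (DISTSQL) SELECT * FROM sfodoo_orders LIMIT 1;
```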

@DavidChenLondon
Author

DavidChenLondon commented Sep 3, 2021

Hi @rytaft, thanks for the detailed explanation; it makes sense.

A further question: does that mean we will not be able to select multiple records from a table when one of our 3 clusters is down or unreachable? Pointers to the relevant CockroachDB documentation would be appreciated.

Some background: we are just starting to adopt CockroachDB for our new services, and the data is currently very small (less than 100MB), so it could easily fit entirely on one cluster or node. We haven't really configured cluster or region information for any tables. E.g. the following is the configuration I copied from #/database/sfodoo_cockroach_prod/table/public.sfodoo_orders:

ALTER RANGE default CONFIGURE ZONE USING
  range_min_bytes = 134217728,
  range_max_bytes = 536870912,
  gc.ttlseconds = 90000,
  num_replicas = 3,
  constraints = '[]',
  lease_preferences = '[]'

@rytaft
Collaborator

rytaft commented Sep 8, 2021

A further question: does that mean we will not be able to select multiple records from a table when one of our 3 clusters is down or unreachable? Pointers to the relevant CockroachDB documentation would be appreciated.

If you specify the specific keys that you are interested in, you should have no problem as long as all of the ranges containing those keys are available.

If you want all ranges to be available even when a region becomes unavailable, I would suggest that you use the new multi-region features and set your survival goal to SURVIVE REGION FAILURE. Then you won't need to worry about setting zone configurations.

However, if for some reason you do not want to use the new multi-region features, you can still use zone configuration and add constraints to ensure there is one replica in each region.
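The two options above might look roughly like this (a sketch, not a drop-in config: the region names are taken from the EXPLAIN output earlier in the thread, `cdb-third-region` is a hypothetical stand-in for the cluster's third locality, and SURVIVE REGION FAILURE requires a primary region plus at least three database regions):

```sql
-- Option 1: multi-region features (v21.1+).
ALTER DATABASE sfodoo_cockroach_prod SET PRIMARY REGION "cdb-eu-west-1";
ALTER DATABASE sfodoo_cockroach_prod ADD REGION "cdb-cn-shanghai";
ALTER DATABASE sfodoo_cockroach_prod ADD REGION "cdb-third-region";
ALTER DATABASE sfodoo_cockroach_prod SURVIVE REGION FAILURE;

-- Option 2: manual zone configuration, pinning one replica per region
-- via per-replica constraints.
ALTER TABLE sfodoo_orders CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{"+region=cdb-eu-west-1": 1, "+region=cdb-cn-shanghai": 1}';
```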

Does this help?

@DavidChenLondon
Author

DavidChenLondon commented Sep 10, 2021

Sorry, I may not have described the situation clearly.

We adopted CockroachDB (configuring 3 CockroachDB clusters across London and Shanghai) because we want data to auto-replicate across regions, in short "a global table", where each cluster holds its own complete, up-to-date copy of the data. For writes, each region/cluster can write its own data and the other regions/clusters receive the update shortly afterwards. For reads, a cluster should not depend on other regions/clusters being available/reachable. We thought this was CockroachDB's default behavior, but it looks like it's not?

"Slow query on select multiple records statement" may not be 100% accurate: the multiple records are not identified by specific primary keys; the query is more like select * from table order by some_fields limit 100. We thought such a query could be served entirely inside one cluster/region (without contacting other clusters/regions, even if the global data in that region lags slightly behind)?

@rytaft
Collaborator

rytaft commented Sep 15, 2021

Hi @DavidChenLondon, sorry for the slow reply. I think you'll get better (and faster) help if you ask these questions on our community slack channel: https://www.cockroachlabs.com/join-community/. These types of questions are perfect for that forum.

But in the meantime, I can try to answer this. In order to avoid distributing the query to all nodes, you can try turning off DistSQL using the session setting SET distsql = off. I'm not sure this will completely fix your problem, though. It's still best to explicitly target specific ranges of primary keys rather than only using a LIMIT. It also sounds like you're interested in the ability to read data locally, even if it's slightly stale. You can do this with follower reads, although similar to the multi-region features, follower reads require an enterprise license.
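The two suggestions might be tried like this (a sketch: `some_fields` is the placeholder from the question above, standing in for the real sort columns, and follower reads require an enterprise license):

```sql
-- Keep execution on the gateway node instead of distributing it.
SET distsql = off;

-- Follower read: serve slightly stale data from the nearest replica
-- instead of the (possibly remote) leaseholder.
SELECT * FROM sfodoo_orders
AS OF SYSTEM TIME follower_read_timestamp()
ORDER BY some_fields
LIMIT 100;
```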

Hope this helps answer some of your questions. You can certainly continue to ask questions on this issue, but I do think you'll get better help from our Slack channel if you have follow-up questions.

@DavidChenLondon
Author

Hi @rytaft, thank you so much for this final answer, which I think is the right solution to the original question 👍. I'll definitely get more involved in the CockroachDB Slack channel, since we are already using CockroachDB in production.
