sql: Slow query on select multiple records statement #69763
Comments
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you.
If we have not gotten back to your issue within a few business days, you can try the following.
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
Hi @DavidChenLondon -- thanks for submitting this issue and sorry to hear about the slow query problem. I can answer Question 1, but I hope someone else will chime in for Question 2.
It depends on which ranges are targeted by your query. The query that targeted a specific UUID was fast because we could route it to a specific range, which was likely located on a node that didn't have connectivity issues at the time. A query that does not target a specific range will currently require communicating with all nodes containing ranges for that table (specifically the leaseholders), even if there is a limit. If any of those nodes have connectivity issues, you may see performance problems. We have an issue open (#64862) to support locality optimized search if a limit is used on a multi-region table. Once that issue is resolved, it will enable the query to short-circuit if the limit is reached before hearing back from all regions.
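To make the contrast concrete, here is a rough sketch of the two query shapes. The column names and UUID value are invented for illustration; only the table name is borrowed from the admin UI path quoted in the next comment.

```sql
-- Point lookup: the primary-key UUID pins the query to one range, so it
-- is routed to a single leaseholder node.
SELECT * FROM sfodoo_orders
WHERE id = '00000000-0000-0000-0000-000000000000';

-- Multi-row read with a LIMIT: the key span is unconstrained, so the
-- gateway currently fans out to the leaseholders of every range in the
-- table, and one slow or unreachable node can stall the whole query.
SELECT * FROM sfodoo_orders
ORDER BY created_at DESC
LIMIT 20;
```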
Hi @rytaft Thanks for the detailed explanation, which makes sense. A further question: will we be unable to select multiple records from a table when one of our 3 clusters is down or unreachable? Pointers to any relevant CockroachDB documentation would be appreciated. Some background information: we are just starting to adopt CockroachDB for our new services, the data is currently very small (less than 100MB) and should fit entirely on one cluster or node, and we haven't really configured cluster or region information for any tables. E.g. the following is the configuration I copied from #/database/sfodoo_cockroach_prod/table/public.sfodoo_orders
If you specify the specific keys that you are interested in, you should have no problem as long as all of the ranges containing those keys are available. If you want all ranges to be available even when a region becomes unavailable, I would suggest that you use the new multi-region features and set your survival goal to REGION. However, if for some reason you do not want to use the new multi-region features, you can still use zone configurations and add constraints to ensure there is one replica in each region. Does this help?
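Roughly, the two options look like the sketch below. The database, table, and region names are placeholders; the region names must match the `--locality` values the nodes were started with, and REGION survival requires at least three database regions.

```sql
-- Option 1: multi-region SQL (v21.1+). Region names are examples only.
ALTER DATABASE sfodoo_cockroach_prod SET PRIMARY REGION "eu-london";
ALTER DATABASE sfodoo_cockroach_prod ADD REGION "ap-shanghai";
ALTER DATABASE sfodoo_cockroach_prod ADD REGION "us-east";
ALTER DATABASE sfodoo_cockroach_prod SURVIVE REGION FAILURE;

-- Option 2: classic zone configs with per-replica constraints, pinning
-- one replica into each region so data stays available during a
-- regional outage.
ALTER TABLE sfodoo_orders CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{"+region=eu-london": 1, "+region=ap-shanghai": 1, "+region=us-east": 1}';
```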
Sorry, I may not have described the situation clearly. We are adopting CockroachDB (and configured 3 CockroachDB clusters in London and Shanghai) because we want data to auto-replicate across regions, simply put "a global table", where each cluster holds its own up-to-date copy of the whole data set. For writing, each region/cluster can write its own data and the other regions/clusters receive the update shortly afterwards. For reading, it should not depend on another region/cluster being available/reachable. We thought this was CockroachDB's default behaviour, but it looks like it's not? "Slow query on select multiple records statement" may not sound 100% correct; what actually happens is that the multiple records are not determined by specifying a primary key.
Hi @DavidChenLondon, sorry for the slow reply. I think you'll get better (and faster) help if you ask these questions on our community Slack channel: https://www.cockroachlabs.com/join-community/. These types of questions are perfect for that forum. But in the meantime, I can try to answer this. In order to avoid distributing the query to all nodes, you can try turning off DistSQL using the session setting distsql. Hope this helps answer some of your questions? You can certainly continue to ask questions on this issue, but I do think you'll get better help from our Slack channel if you have follow-up questions.
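A minimal sketch of that session setting; the SELECT is only a placeholder for the actual slow statement.

```sql
-- Per-session: plan and execute statements on the gateway node only,
-- instead of distributing execution across the cluster.
SET distsql = off;

-- Placeholder query; substitute the real slow statement here.
SELECT * FROM sfodoo_orders LIMIT 20;

-- Revert to the default automatic behaviour.
SET distsql = auto;
```

Note that this setting controls where SQL execution runs; reads still have to reach the leaseholders of the ranges they touch.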
Hi @rytaft Thank you so much for this final answer, which I think is the right solution to the original question 👍. Yeah, I'll definitely get more involved in the CockroachDB Slack channel since we are already using CockroachDB in production.
Describe the problem
We hit a slow SQL query issue (up to 30+ seconds) which only happens when selecting multiple records, not a single record, as shown in the SQL analysis below.
Our CockroachDB deployment consists of 3 regional clusters. We checked and only one node shows a service latency issue on the /metrics/overview/cluster dashboard page, along with lots of RPC errors. However, the /reports/network/region page looks normal, which is strange!
We found that simply restarting the pod does not fix the issue; our final solution was to reschedule the pod onto another k8s node, after which our SQL query became fast as usual again. Later we found that another app sharing the same k8s node also had slow/broken connections, so it looks more like a hardware problem.
Although the problem is gone, we are wondering if anyone in the community could help answer the two questions below:
Question #1: if only one node has a problem, CockroachDB (being NewSQL) should continue to serve queries, right?
Question #2: why is the RPC error count so high, and why does it occur on only one node? Shouldn't an RPC error involve at least two nodes?
To Reproduce
Unfortunately it's not easy to reproduce, but we can privately share our internal detailed analysis (a Confluence document) and the statement diagnostics bundle zip.
Expected behavior
The expected running time of this multi-record select should be at most a few hundred milliseconds, since we only have thousands of records in that table so far.
Additional data / screenshots
Slow SQL analysis (screenshot)
CockroachDB error logs (screenshot)
Environment:
Additional context
We found a similar issue, #59377, which has similar symptoms and a similar recovery solution.