sql: olap query hitting context canceled/can't complete #36842
Comments
Can you pull the logs? This doesn't seem like a planning issue to me.
@awoods187 I deleted your comment for now; I don't think we should post logs from the reg cluster on public issues.
I put them on gdrive instead:
Action item here: determine what the root cause of the error is.
Everything looks fine in the logs until we start getting a bunch of:
This is accompanied by a breaker trip to
There might be a connection between this stuck state and the context cancellation earlier. @awoods187 as discussed, it would be interesting to get the goroutine dumps in this stuck state and maybe also
When I ran this command in the web UI, it showed that it killed a node (k8s has since restarted it), but I grabbed the logs: https://drive.google.com/open?id=1HLCeHz0X7yXxXLZHlFyOttMZxzzPZnL_
I restored this to a roachprod cluster and ran the query. Here's the debug.zip while the query was running: https://drive.google.com/open?id=1LSRdaKl7BRMafmnp-bLrXwuennEME-4L It turns out that this query causes an OOM:
Link to OOM heap profile: https://drive.google.com/open?id=1FZJDZZeZaspEWO8VyuM1U8824R50hlK3
That seems to do the trick. No crashes occur when grouping by cluster ID first. Closing this issue and will share the query/result privately.
Reopening pending further investigation.
Can confirm that doing this again still causes an OOM error.
Okay, I think I've gotten to the bottom of this. I think the root cause is that lookup joins can't impose a limit on the number of rows returned: #35950. As you can see from the plan, the first join in the query is a lookup join that scans
Because of the limitation in #35950, this means that the innocent-looking lookup join plan creates a single span that returns over 5 million rows, each of which has a sizable amount of JSON inside. Together with the fact that DistSQL doesn't let Go GC any memory from an
The resolution to this should be fixing #35950.
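To illustrate why a row limit on the lookup matters, here is a minimal Go sketch. It is not CockroachDB code: `row`, `fetchSpan`, and `peakBytes` are hypothetical names, and the row count and payload size are scaled-down stand-ins for the 5 million JSON-bearing rows mentioned above. It compares the peak number of payload bytes held at once when a span is fetched with no limit versus in bounded batches.

```go
package main

import "fmt"

// row stands in for a table row carrying a JSON payload (hypothetical type).
type row struct {
	key     int
	payload []byte
}

// fetchSpan simulates a scan over [start, end) that returns at most limit
// rows. limit <= 0 means unbounded, i.e. the whole span is materialized.
func fetchSpan(start, end, limit, payloadSize int) []row {
	n := end - start
	if limit > 0 && n > limit {
		n = limit
	}
	rows := make([]row, 0, n)
	for i := 0; i < n; i++ {
		rows = append(rows, row{key: start + i, payload: make([]byte, payloadSize)})
	}
	return rows
}

// peakBytes walks the span in batches of at most limit rows and reports the
// largest number of payload bytes held at once and the total rows processed.
func peakBytes(total, limit, payloadSize int) (peak, processed int) {
	for start := 0; start < total; {
		batch := fetchSpan(start, total, limit, payloadSize)
		held := 0
		for _, r := range batch {
			held += len(r.payload)
		}
		if held > peak {
			peak = held
		}
		processed += len(batch)
		start += len(batch)
	}
	return peak, processed
}

func main() {
	const total, payloadSize = 50000, 256 // assumed, scaled-down numbers
	unbounded, n1 := peakBytes(total, 0, payloadSize)
	bounded, n2 := peakBytes(total, 1000, payloadSize)
	fmt.Println("unbounded peak bytes:", unbounded, "rows:", n1)
	fmt.Println("bounded peak bytes:  ", bounded, "rows:", n2)
}
```

Both runs process the same number of rows; only the amount of data held at any one time differs, which is the property the fix for #35950 is expected to provide.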
@jordanlewis do you expect that this is fixed now?
It should be, yes. Retrying it would be useful to me if you have the bandwidth.
It still kills a node via OOM.
It's up on
Well, it takes an hour, but the query completes now!
Disk spilling appears to be working as expected. I kept an eye on the heap profile while the query was executing and
A bit concerningly, I did run into an OOM the first time I tried to test this. But that was on a cluster where I swapped out several different CRDB versions trying to diagnose an unrelated RESTORE issue, so it may have been in a bad state to begin with. I think I'll test this once more tomorrow to double-check that the query completes on a happy cluster. But barring further issues I'm considering this closed.
I'm trying to run an analytics query on registration cluster backup spun up in AWS (with no other traffic on m4.xlarge) but I keep seeing:
pq: communication error: rpc error: code = Canceled desc = context canceled
root@cockroachdb-public:26257/registration>
pod default/cockroachdb-22073 terminated (Error)
Here is the explain (opt, env):