sql: investigate out of memory issues on TPCC check 3.3.2.6 #47205
As I posted in Slack, this crash is a regression on …. Here are the logs I get with those two commits reverted (at …), and without the revert:

The logs seem to suggest that there is a memory leak in those two commits (the hash aggregator doesn't seem to release some of its memory), but I don't see what has changed other than us calling …
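To illustrate the kind of leak being described, here is a minimal, self-contained sketch of byte accounting in the spirit of a memory monitor. It is not CockroachDB's actual accounting API (the real `util/mon` package differs); the `account` type and sizes are hypothetical, chosen only to show how a missing release makes the accounted memory climb across batches:

```go
package main

import (
	"errors"
	"fmt"
)

// account is a toy byte-accounting structure: grow() reserves bytes
// against a budget, shrink() releases them.
type account struct {
	used  int64
	limit int64
}

func (a *account) grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

func (a *account) shrink(n int64) { a.used -= n }

func main() {
	acc := account{limit: 1 << 20} // 1 MiB budget (arbitrary)
	// Simulate an operator processing batches: if a batch's memory is
	// grown but never shrunk when the batch is released, the account
	// (and the process RSS) climbs until the budget is exhausted.
	for i := 0; i < 3; i++ {
		if err := acc.grow(400 << 10); err != nil {
			fmt.Println("budget exhausted on batch", i, ":", err)
			return
		}
		// The kind of bug being hunted here: a missing shrink at this
		// point leaks accounted (and actual) memory across batches.
		// acc.shrink(400 << 10)
	}
}
```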
I think I figured it out. We need this:

The thing is that we increased the size of …
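One plausible way an increased size escapes accounting (a general Go observation, not a claim about the specific fix above) is that allocations are sized by a slice's capacity while accounting is done by its length, so the runtime's growth rounding is invisible to the monitor:

```go
package main

import "fmt"

func main() {
	// Accounting by len() undercounts: the backing array is sized by
	// cap(), and append grows capacity in runtime-chosen steps, so the
	// cap-len gap is real memory that a len-based account never sees.
	b := make([]byte, 0)
	for i := 0; i < 1000; i++ {
		b = append(b, byte(i))
	}
	fmt.Printf("len=%d cap=%d (cap-len bytes are allocated but unaccounted)\n",
		len(b), cap(b))
}
```

The gap is small per slice, but multiplied across many batches it can add up to the kind of unexplained growth described in this issue.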
The OOM crash can be reliably reproduced on a 3-node roachprod cluster with default n1-standard-4 nodes after importing 500 warehouses and running query 3.3.2.6 manually (so the alteration of the primary key is not needed). I tried to add more logging to understand where we're not accounting for the memory properly, but I still don't see it. Here is an EXPLAIN ANALYZE of the query (from an occasion when it didn't crash), for context on the amount of data flying around:
We seem to be estimating the memory usage of the aggregators relatively well now. For example, here is one of the heap profiles taken when the first-stage aggregator finishes and the second one is about to:
And at about this time in the logs we have:
In all of the crashes I've observed we successfully get to the point where the hash join is being performed, and we actually get pretty far into the join's execution. However, RSS jumps quite significantly, say from

13 GiB RSS, 203 goroutines, 7.3 GiB/2.5 GiB/10 GiB GO alloc/idle/total

to

14 GiB RSS, 201 goroutines, 9.2 GiB/720 MiB/10 GiB GO alloc/idle/total

in 10 seconds, but this jump is not reflected in the heap profiles. It is possible that the OS reports RSS with a delay, but I'm very puzzled by this.

I'm tired of looking at this issue and am wondering whether someone else should take a stab at it. Here are a few suggestions: