sql: oom during 24 node tpcc #31458
Cluster andy-1539654686-tpccbench-nodes-24-cpu-16-partition

n1 got OOM-killed:
The binary is v2.1.0-beta.20181008 (in case you need it for your own pprof). The last runtime stats are:
Note the high number of goroutines. inuse_space shows where most of the memory goes:

More sleuthing would have to be done, but on its face this looks like we might be overloading n1 with client connections. @jordanlewis, can you take a look?
What do the numbers in 3.9 GiB/925 MiB/8.8 GiB Go alloc/idle/total mean again? The discrepancy between alloc+idle and total seems relevant.
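For reference, a minimal sketch of where numbers like these could come from, assuming the reported alloc/idle/total roughly map to Go's runtime.MemStats fields (HeapAlloc, HeapIdle/HeapReleased, and Sys); this is illustrative and not CockroachDB's actual stats code:

```go
package main

import (
	"fmt"
	"runtime"
)

// A minimal sketch of how "alloc/idle/total" style numbers can be derived,
// assuming they map to Go's runtime.MemStats fields. Illustrative only.
func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	alloc := ms.HeapAlloc                 // bytes of live heap objects
	idle := ms.HeapIdle - ms.HeapReleased // heap spans held but not in use
	total := ms.Sys                       // total bytes obtained from the OS

	// total also covers goroutine stacks, GC metadata, and allocator
	// overhead, so alloc+idle is normally smaller than total.
	fmt.Printf("%d MiB alloc / %d MiB idle / %d MiB total\n",
		alloc>>20, idle>>20, total>>20)
}
```

Under that assumption, the gap between alloc+idle and total would come from non-heap memory such as goroutine stacks and GC metadata, which would fit the observation about the very high goroutine count.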
Oh, and as a heads-up, the logs indicate that this node (or probably the whole cluster) is a mess, so it'll be difficult to pinpoint a single cause from these logs alone. I'm also seeing this in dmesg:
There's an article about this message (https://access.redhat.com/solutions/30453), but I'm not sure why that would apply here, as our TCP connections should all be long-lived. cc @bdarnell
Looks like both the optimizer and the execution engine are using up a ton of memory - this likely indicates that the cluster is accepting more and more queries without most of them making progress. That being said, I'd like to look at the pprof myself, but the one in Andy's logs.zip isn't the one you're referencing - the numbers are completely different. If you still have the profile, please upload it to this issue (if you used go tool pprof, it'll be saved locally somewhere).
@jordanlewis the cluster is still around: andy-1539654686-tpccbench-nodes-24-cpu-16-partition I suspect this is in one family of badness with #31409, though here the profile looked different enough to warrant a second look. I'm adding this conf to roachtest in #31466. |
Just once or repeatedly? During startup we allow some time to pass between the …

It looks like tpcc is opening up to 20000/24 ≈ 833 connections per node? (cockroach/pkg/workload/tpcc/tpcc.go, lines 407 to 417 at 3e7f0f0)
That's kind of nuts, and easily enough to trigger SYN flood warnings on its own when those connections are spinning up (and I expect some kinds of errors could cause the client to discard connections and reopen them). We shouldn't need anywhere near that many connections.
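To make the arithmetic concrete, here is a hypothetical sketch (not the workload's actual code; the per-node cap of 64 is an assumed number, and `<node1>` is a placeholder address) of bounding per-node connections by multiplexing workers over a fixed pool instead of giving each worker its own connection:

```go
// Hypothetical sketch, not the tpcc workload's actual code: bound the number
// of connections a single node sees by multiplexing workers over a small
// database/sql pool instead of opening one connection per worker.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB
)

func main() {
	const (
		numWorkers = 20000 // workload workers (from the thread: 20000/24 ≈ 833 per node)
		numNodes   = 24
		maxPerNode = 64 // assumed cap, far below 833
	)

	// <node1> is a placeholder for one cluster node's address.
	db, err := sql.Open("postgres",
		"postgresql://root@<node1>:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// database/sql hands out pooled connections to the worker goroutines,
	// so the node sees at most maxPerNode connections at any time.
	db.SetMaxOpenConns(maxPerNode)
	db.SetMaxIdleConns(maxPerNode)
}
```

With a pool like this, the same number of workers is served, but each node sees at most the configured cap rather than roughly 833 connections, which should also avoid the SYN backlog spikes while connections spin up.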
Here's the profile @tschottdorf was mentioning before. The data in the profile seem to support my earlier idea: there are a lot of concurrent queries in the system that aren't making progress, using up SQL memory. SQL doesn't have admission control, so this is entirely possible when there are a lot of concurrent connections to the database.

memprof.fraction_system_memory.000000013142040576_2018-10-16T06_11_31.436.zip
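Since the thread points at the lack of admission control, here is a minimal sketch of what client-side throttling could look like, assuming the client bounds in-flight statements with a semaphore; `limitedDB` and `maxInFlight` are hypothetical names, not part of CockroachDB or the tpcc workload:

```go
package workloadutil

import (
	"context"
	"database/sql"
)

// limitedDB is a hypothetical client-side throttle: it bounds the number of
// statements in flight using a buffered channel as a semaphore.
type limitedDB struct {
	db  *sql.DB
	sem chan struct{} // capacity = maximum concurrent statements
}

func newLimitedDB(db *sql.DB, maxInFlight int) *limitedDB {
	return &limitedDB{db: db, sem: make(chan struct{}, maxInFlight)}
}

// ExecContext blocks until a slot is free (or the context is canceled), so a
// saturated cluster sees back-pressure instead of an unbounded pile of queries.
func (l *limitedDB) ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error) {
	select {
	case l.sem <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	defer func() { <-l.sem }() // release the slot once the statement finishes

	return l.db.ExecContext(ctx, query, args...)
}
```

A cap like this makes an overloaded cluster push back on new work instead of accumulating thousands of queries that all hold SQL memory at once.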
Once. I'd be surprised if that wasn't at startup.
This isn't actionable anymore, but it's related to the admission control project. cc @asubiotto @ajwerner |
Describe the problem
To Reproduce
Use roachtest from two days ago
Change test to:
Run:
bin/roachtest bench '^tpccbench/nodes=24/cpu=16/partition$$' --wipe=false --user=andy
Expected behavior
Passing TPC-C
Additional data / screenshots
Logs.zip
Environment: