
sql: oom during 24 node tpcc #31458

Closed
awoods187 opened this issue Oct 16, 2018 · 12 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@awoods187
Contributor

awoods187 commented Oct 16, 2018

Describe the problem

[screenshot attached in the original issue]

To Reproduce

Use roachtest from two days ago

Change test to:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 24,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 20000,
+                       EstimatedMax:   12000,
+                       LoadConfig:     singlePartitionedLoadgen,

Run:

bin/roachtest bench '^tpccbench/nodes=24/cpu=16/partition$$' --wipe=false --user=andy

Expected behavior
Passing TPC-C

Additional data / screenshots

Logs.zip

Environment:

  • CockroachDB version 2.1 beta 1008
@awoods187
Contributor Author

Cluster andy-1539654686-tpccbench-nodes-24-cpu-16-partition

@tbg
Member

tbg commented Oct 16, 2018

F181016 11:04:04.760037 494 server/server.go:246  [n10] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (12 of 24 are within the offset)
goroutine 494 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc420057200, 0xc4200572c0, 0x3e95400, 0x10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:972 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x4788b00, 0xc400000004, 0x3e95438, 0x10, 0xf6, 0xc444c5f050, 0x8b)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:841 +0x807
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2fe0a60, 0xc4207c8810, 0x4, 0x2, 0x0, 0x0, 0xc425427c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2fe0a60, 0xc4207c8810, 0x1, 0x4, 0x0, 0x0, 0xc425427c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2fe0a60, 0xc4207c8810, 0xc425427c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/server.NewServer.func1()
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:246 +0xb0
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc4203a5cc0, 0xc420c49950, 0xc420312b40, 0x3e, 0xc420331e60, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:790 +0x667
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1(0x2fe0a60, 0xc420c479e0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:668 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42037cf50, 0xc42061a000, 0xc420c479b0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

@tbg tbg closed this as completed Oct 16, 2018
@tbg tbg reopened this Oct 16, 2018
@tbg
Member

tbg commented Oct 16, 2018

n1 got oom killed:

[15674.908291] Out of memory: Kill process 4319 (cockroach) score 941 or sacrifice child
[15674.916550] Killed process 4319 (cockroach) total-vm:32758200kB, anon-rss:13852084kB, file-rss:0kB, shmem-rss:0kB
[15676.219248] oom_reaper: reaped process 4319 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@tbg
Member

tbg commented Oct 16, 2018

binary is v2.1.0-beta.20181008 (in case you need it for your own pprof).

The last runtime stats are

I181016 06:13:16.582275 565 server/status/runtime.go:465 [n1] runtime stats: 13 GiB RSS, 56850 goroutines, 3.9 GiB/925 MiB/8.8 GiB GO alloc/idle/total, 3.8 GiB/4.6 GiB CGO alloc/total, 304972.5 CGO/sec, 1271.3/136.7 %(u/s)time, 0.1 %gc (2x), 222 MiB/105 MiB (r/w)net

Note the high number of goroutines.

inuse_space shows most memory here:

[inuse_space heap profile screenshot]

More sleuthing would have to be done, but on its face this looks like we might be overloading n1 with client connections.

@jordanlewis, can you take a look?

@tbg
Member

tbg commented Oct 16, 2018

What do the numbers in 3.9 GiB/925 MiB/8.8 GiB GO alloc/idle/total mean again? The discrepancy between alloc+idle and total seems relevant.
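
For reference, a hedged guess at the mapping (not verified against server/status/runtime.go): the three numbers plausibly correspond to Go's runtime.MemStats HeapAlloc, HeapIdle minus HeapReleased, and Sys. A minimal sketch that prints a line of the same shape under that assumption:

// Assumption: alloc ~ HeapAlloc, idle ~ HeapIdle - HeapReleased, total ~ Sys.
// This is an illustration, not the actual runtime stats code.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("GO alloc/idle/total: %d MiB / %d MiB / %d MiB\n",
		ms.HeapAlloc>>20, (ms.HeapIdle-ms.HeapReleased)>>20, ms.Sys>>20)
}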

@tbg tbg changed the title 2 dead nodes and an unavailable range when running tpcc sql: oom during 24 node tpcc Oct 16, 2018
@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 16, 2018
@tbg
Member

tbg commented Oct 16, 2018

Oh, and as a heads up, the logs indicate that this node (or probably the whole cluster) is a mess. It'll be difficult to pinpoint a single cause from these logs.

And I'm also seeing this in dmesg:

[15511.740411] TCP: request_sock_TCP: Possible SYN flooding on port 26257. Sending cookies. Check SNMP counters.

There's this article about this message: https://access.redhat.com/solutions/30453
My TL;DR is that inbound connection attempts seem to be coming in faster than they're accept()ed.

I'm not sure why that would be the case as our TCP connections should all be long-lived. cc @bdarnell
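
For illustration only (not from the issue), a minimal Go sketch of that failure mode: if Accept() lags behind incoming dials, connections pile up in the kernel's listen backlog, and once it overflows the SYN-flood warning can fire:

// Sketch of "connections arrive faster than they're accept()ed".
// The listener's kernel backlog defaults to net.core.somaxconn (commonly 128).
package main

import (
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	// Simulate a server that is slow to start accepting, e.g. busy during startup.
	go func() {
		time.Sleep(5 * time.Second)
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			c.Close()
		}
	}()
	// Many clients dialing at once queue up in the SYN/accept queues in the meantime.
	for i := 0; i < 1000; i++ {
		go func() {
			if c, err := net.Dial("tcp", ln.Addr().String()); err == nil {
				defer c.Close()
				time.Sleep(10 * time.Second)
			}
		}()
	}
	time.Sleep(15 * time.Second)
}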

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. labels Oct 16, 2018
@jordanlewis
Member

Looks like both the optimizer and the execution engine are using up a ton of memory - this likely indicates that the cluster is accepting more and more queries without most of them making progress.

That being said, I'd like to look at the pprof myself, but the one in Andy's logs.zip isn't the one you're referencing - the numbers are completely different. If you still have the profile, please upload it to this issue (if you used go tool pprof, it'll be saved locally somewhere).

@tbg
Member

tbg commented Oct 16, 2018

@jordanlewis the cluster is still around: andy-1539654686-tpccbench-nodes-24-cpu-16-partition

I suspect this is in one family of badness with #31409, though here the profile looked different enough to warrant a second look. I'm adding this conf to roachtest in #31466.

@bdarnell
Contributor

And I'm also seeing this in dmesg: [15511.740411] TCP: request_sock_TCP: Possible SYN flooding on port 26257

Just once or repeatedly? During startup we allow some time to pass between the listen() syscall and the first accept(), which could allow things to pile up. We've got the 24 intra-cluster connections plus however many connections tpcc is starting up. The default limit is 128 (hitting the limit doesn't hurt too much but it does generate this message).

It looks like tpcc is opening up to 20000/24=833 connections per node?

nConns := w.activeWarehouses / len(urls)
dbs := make([]*gosql.DB, len(urls))
for i, url := range urls {
	dbs[i], err = gosql.Open(`postgres`, url)
	if err != nil {
		return workload.QueryLoad{}, err
	}
	// Allow a maximum of concurrency+1 connections to the database.
	dbs[i].SetMaxOpenConns(nConns)
	dbs[i].SetMaxIdleConns(nConns)
}

That's kind of nuts, and easily enough to trigger syn flood warnings on its own when those connections are spinning up (and I expect some kinds of errors could cause the client to discard connections and reopen them). We shouldn't need anywhere near that many connections.
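
A hedged sketch of the kind of change that reasoning points at, i.e. capping the pool at a fixed size rather than scaling it with warehouse count. The cap of 100 and the connection URL are illustrative, not values from this issue or the actual workload code:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // postgres-wire driver, as the workload dials `postgres`
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Cap the per-node pool at a fixed size instead of activeWarehouses/len(urls);
	// 100 is an illustrative value, not a recommendation from this issue.
	db.SetMaxOpenConns(100)
	db.SetMaxIdleConns(100)
}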

@jordanlewis
Member

Here's the profile @tschottdorf was mentioning before. The data in the profile seem to support my earlier idea: there are a lot of concurrent queries in the system that aren't making progress, and they're using up SQL memory. SQL doesn't have admission control, so this is entirely possible when there are a lot of concurrent connections to the database.

memprof.fraction_system_memory.000000013142040576_2018-10-16T06_11_31.436.zip
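
For context, a minimal sketch of the kind of admission control being alluded to: bound how many queries may execute concurrently, so memory scales with the limit rather than with the number of open connections. The type and names here are illustrative, not CockroachDB's implementation:

// Gate bounds the number of queries allowed to run at once.
package admission

import "context"

type Gate struct {
	slots chan struct{}
}

// NewGate returns a Gate that admits at most maxConcurrent queries.
func NewGate(maxConcurrent int) *Gate {
	return &Gate{slots: make(chan struct{}, maxConcurrent)}
}

// Admit blocks until a slot is free or the context is canceled.
func (g *Gate) Admit(ctx context.Context) error {
	select {
	case g.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Release frees a slot once the query has finished.
func (g *Gate) Release() {
	<-g.slots
}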

@tbg
Member

tbg commented Oct 16, 2018

Just once or repeatedly?

Once. Would be surprised if that wasn't at startup.

@jordanlewis
Copy link
Member

This isn't actionable anymore, but it's related to the admission control project. cc @asubiotto @ajwerner
