
sql: oom during 24 node tpcc #31458

Closed
awoods187 opened this issue Oct 16, 2018 · 12 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@awoods187
Contributor

awoods187 commented Oct 16, 2018

Describe the problem

[screenshot attached in the original issue]

To Reproduce

Use roachtest from two days ago

Change test to:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 24,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 20000,
+                       EstimatedMax:   12000,
+                       LoadConfig:     singlePartitionedLoadgen,

Run:

bin/roachtest bench '^tpccbench/nodes=24/cpu=16/partition$$' --wipe=false --user=andy

Expected behavior
Passing TPC-C

Additional data / screenshots

Logs.zip

Environment:

  • CockroachDB version 2.1 beta 1008
@awoods187
Contributor Author

Cluster andy-1539654686-tpccbench-nodes-24-cpu-16-partition

@tbg
Member

tbg commented Oct 16, 2018

F181016 11:04:04.760037 494 server/server.go:246  [n10] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (12 of 24 are within the offset)
goroutine 494 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc420057200, 0xc4200572c0, 0x3e95400, 0x10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:972 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x4788b00, 0xc400000004, 0x3e95438, 0x10, 0xf6, 0xc444c5f050, 0x8b)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:841 +0x807
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2fe0a60, 0xc4207c8810, 0x4, 0x2, 0x0, 0x0, 0xc425427c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2fe0a60, 0xc4207c8810, 0x1, 0x4, 0x0, 0x0, 0xc425427c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2fe0a60, 0xc4207c8810, 0xc425427c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/server.NewServer.func1()
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:246 +0xb0
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc4203a5cc0, 0xc420c49950, 0xc420312b40, 0x3e, 0xc420331e60, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:790 +0x667
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1(0x2fe0a60, 0xc420c479e0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:668 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42037cf50, 0xc42061a000, 0xc420c479b0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

@tbg tbg closed this as completed Oct 16, 2018
@tbg tbg reopened this Oct 16, 2018
@tbg
Member

tbg commented Oct 16, 2018

n1 got oom killed:

[15674.908291] Out of memory: Kill process 4319 (cockroach) score 941 or sacrifice child
[15674.916550] Killed process 4319 (cockroach) total-vm:32758200kB, anon-rss:13852084kB, file-rss:0kB, shmem-rss:0kB
[15676.219248] oom_reaper: reaped process 4319 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@tbg
Member

tbg commented Oct 16, 2018

binary is v2.1.0-beta.20181008 (in case you need it for your own pprof).

The last runtime stats are

I181016 06:13:16.582275 565 server/status/runtime.go:465 [n1] runtime stats: 13 GiB RSS, 56850 goroutines, 3.9 GiB/925 MiB/8.8 GiB GO alloc/idle/total, 3.8 GiB/4.6 GiB CGO alloc/total, 304972.5 CGO/sec, 1271.3/136.7 %(u/s)time, 0.1 %gc (2x), 222 MiB/105 MiB (r/w)net

Note the high number of goroutines.

inuse_space shows most memory here:

[inuse_space heap profile screenshot]

More sleuthing would have to be done, but on its face this looks like we might be overloading n1 with client connections.

@jordanlewis, can you take a look?

@tbg
Member

tbg commented Oct 16, 2018

What do the numbers in 3.9 GiB/925 MiB/8.8 GiB GO alloc/idle/total mean again? The discrepancy between alloc+idle and total seems relevant.
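
For reference, a hedged guess at the mapping (not verified against server/status/runtime.go): the three numbers plausibly correspond to Go's runtime.MemStats HeapAlloc, HeapIdle minus HeapReleased, and Sys. A minimal sketch that prints a line of the same shape under that assumption:

// Assumption: alloc ~ HeapAlloc, idle ~ HeapIdle - HeapReleased, total ~ Sys.
// This is an illustration, not the actual runtime stats code.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("GO alloc/idle/total: %d MiB / %d MiB / %d MiB\n",
		ms.HeapAlloc>>20, (ms.HeapIdle-ms.HeapReleased)>>20, ms.Sys>>20)
}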

@tbg tbg changed the title 2 dead nodes and an unavailable range when running tpcc sql: oom during 24 node tpcc Oct 16, 2018
@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 16, 2018
@tbg
Member

tbg commented Oct 16, 2018

Oh, and as a heads up, the logs indicate that this node (or probably the whole cluster) is a mess. It'll be difficult to pinpoint a single cause from these logs.

And I'm also seeing this in dmesg:

[15511.740411] TCP: request_sock_TCP: Possible SYN flooding on port 26257. Sending cookies. Check SNMP counters.

There's this article about this message: https://access.redhat.com/solutions/30453
My TL;DR is that inbound connection attempts seem to be coming in faster than they're accept()ed.

I'm not sure why that would be the case as our TCP connections should all be long-lived. cc @bdarnell
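
For illustration only (not from the issue), a minimal Go sketch of that failure mode: if Accept() lags behind incoming dials, connections pile up in the kernel's listen backlog, and once it overflows the SYN-flood warning can fire:

// Sketch of "connections arrive faster than they're accept()ed".
// The listener's kernel backlog defaults to net.core.somaxconn (commonly 128).
package main

import (
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	// Simulate a server that is slow to start accepting, e.g. busy during startup.
	go func() {
		time.Sleep(5 * time.Second)
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			c.Close()
		}
	}()
	// Many clients dialing at once queue up in the SYN/accept queues in the meantime.
	for i := 0; i < 1000; i++ {
		go func() {
			if c, err := net.Dial("tcp", ln.Addr().String()); err == nil {
				defer c.Close()
				time.Sleep(10 * time.Second)
			}
		}()
	}
	time.Sleep(15 * time.Second)
}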

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. labels Oct 16, 2018
@jordanlewis
Member

Looks like both the optimizer and the execution engine are using up a ton of memory - this likely indicates that the cluster is accepting more and more queries without most of them making progress.

That being said, I'd like to look at the pprof myself, but the one in Andy's logs.zip isn't the one you're referencing - the numbers are completely different. If you still have the profile, please upload it to this issue (if you used go tool pprof, it'll be saved locally somewhere).

@tbg
Member

tbg commented Oct 16, 2018

@jordanlewis the cluster is still around: andy-1539654686-tpccbench-nodes-24-cpu-16-partition

I suspect this is in one family of badness with #31409, though here the profile looked different enough to warrant a second look. I'm adding this conf to roachtest in #31466.

@bdarnell
Contributor

And I'm also seeing this in dmesg: [15511.740411] TCP: request_sock_TCP: Possible SYN flooding on port 26257

Just once or repeatedly? During startup we allow some time to pass between the listen() syscall and the first accept(), which could allow things to pile up. We've got the 24 intra-cluster connections plus however many connections tpcc is starting up. The default limit is 128 (hitting the limit doesn't hurt too much but it does generate this message).

It looks like tpcc is opening up to 20000/24=833 connections per node?

nConns := w.activeWarehouses / len(urls)
dbs := make([]*gosql.DB, len(urls))
for i, url := range urls {
	dbs[i], err = gosql.Open(`postgres`, url)
	if err != nil {
		return workload.QueryLoad{}, err
	}
	// Allow a maximum of concurrency+1 connections to the database.
	dbs[i].SetMaxOpenConns(nConns)
	dbs[i].SetMaxIdleConns(nConns)
}

That's kind of nuts, and easily enough to trigger syn flood warnings on its own when those connections are spinning up (and I expect some kinds of errors could cause the client to discard connections and reopen them). We shouldn't need anywhere near that many connections.
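
A hedged sketch of the kind of change that reasoning points at, i.e. capping the pool at a fixed size rather than scaling it with warehouse count. The cap of 100 and the connection URL are illustrative, not values from this issue or the actual workload code:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // postgres-wire driver, as the workload dials `postgres`
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Cap the per-node pool at a fixed size instead of activeWarehouses/len(urls);
	// 100 is an illustrative value, not a recommendation from this issue.
	db.SetMaxOpenConns(100)
	db.SetMaxIdleConns(100)
}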

@jordanlewis
Member

Here's the profile @tschottdorf was mentioning before. The data in the profile seem to support my earlier idea: there are a lot of concurrent queries in the system that aren't making progress, and they're using up SQL memory. SQL doesn't have admission control, so this is entirely possible when there are a lot of concurrent connections to the database.

memprof.fraction_system_memory.000000013142040576_2018-10-16T06_11_31.436.zip
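
For context, a minimal sketch of the kind of admission control being alluded to: bound how many queries may execute concurrently, so memory scales with the limit rather than with the number of open connections. The type and names here are illustrative, not CockroachDB's implementation:

// Gate bounds the number of queries allowed to run at once.
package admission

import "context"

type Gate struct {
	slots chan struct{}
}

// NewGate returns a Gate that admits at most maxConcurrent queries.
func NewGate(maxConcurrent int) *Gate {
	return &Gate{slots: make(chan struct{}, maxConcurrent)}
}

// Admit blocks until a slot is free or the context is canceled.
func (g *Gate) Admit(ctx context.Context) error {
	select {
	case g.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Release frees a slot once the query has finished.
func (g *Gate) Release() {
	<-g.slots
}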

@tbg
Member

tbg commented Oct 16, 2018

Just once or repeatedly?

Once. Would be surprised if that wasn't at startup.

@jordanlewis
Copy link
Member

This isn't actionable anymore, but it's related to the admission control project. cc @asubiotto @ajwerner
