IMPORT: OOM on larger cluster sizes #35773

Closed
dt opened this issue Mar 15, 2019 · 3 comments
Labels: C-bug (code not up to spec/doc; solution expected to change code/behavior), S-2 (medium-high impact: many users impacted, risk of availability and difficult-to-fix data errors), T-disaster-recovery

Comments


dt commented Mar 15, 2019

Spent some time today with @yuzefovich looking at the OOMs reported in #34878.

Most of the time we observe an OOM, the idle Go memory is significantly more than the actual in-use memory. My limited understanding is that this implies we recently, briefly, used a significant amount of memory and the GC has not yet released it.

We're not quite sure what used (and then stopped using) so much memory, but looking at a sampling of profiles, handling AddSSTable commands dominates most of them, with the bulk of that spent just deserializing these huge (32MB) commands from gRPC. The evaluation of an AddSSTable command does make some copies of the input data -- I think I see one obvious one for leveldb SSTable iteration, plus some other steps -- so evaluating one 32MB AddSSTable command could use (and then release) a non-trivial amount of memory.
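For illustration, here is a minimal, self-contained sketch of how a single request could briefly hold several copies of the same bytes at once; the request type, function names, and the exact set of copies are assumptions based on the description above, not CockroachDB's actual code:

```go
package main

import "fmt"

// addSSTableRequest is a stand-in for the real proto request; the real one
// carries ~32MB of SST bytes plus a key span.
type addSSTableRequest struct {
	Data []byte
}

// handleAddSSTable sketches how one request can briefly keep several copies
// of the same bytes alive: the raw gRPC message, the unmarshaled proto
// field, and a copy handed to the SST iterator.
func handleAddSSTable(rawRPCBytes []byte) int {
	// Copy 1: rawRPCBytes is the buffered gRPC message.
	// Copy 2: "unmarshaling" materializes the Data field again.
	req := addSSTableRequest{Data: append([]byte(nil), rawRPCBytes...)}
	// Copy 3: give the iterator its own buffer, roughly what a
	// leveldb-style reader may effectively cause today.
	iterBuf := append([]byte(nil), req.Data...)
	// Total bytes transiently live for this one request.
	return len(rawRPCBytes) + len(req.Data) + len(iterBuf)
}

func main() {
	transient := handleAddSSTable(make([]byte, 32<<20))
	fmt.Printf("~%dMB live at once for one 32MB request\n", transient>>20) // ~96MB
}
```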

In the 60-node cluster that @yuzefovich is testing with, I think it is likely that with 60 producers all making and sending SSTs out, a given node might see a large spike in memory usage if it received e.g. 10 or 20 such SSTs at the same time. If handling each required even just a couple of copies, we could be using a couple of gigs or so just for this pretty quickly.
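To put a rough number on that (the ~3 transient copies per request is an assumption, not measured): 20 concurrent 32MB AddSSTable requests × ~3 copies each ≈ 1.9GB of short-lived allocations on a single node, which lines up with the "couple of gigs" estimate above.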

The nodes in this cluster have a total of 8GB of RAM, so with cgo/rocks sitting on 2-3GB and various other SQL and server machinery sitting on a gig, it seems easy to get into the danger zone quickly.

Reducing the SST size from 32MB to 16MB appeared to make the job complete successfully, so in the short term we may need to document that in larger clusters you may need to do that if individual nodes don't have enough headroom for the spikes that a larger number of processors making and sending SSTs can produce.

dt assigned dt and yuzefovich Mar 15, 2019

dt commented Mar 15, 2019

It would also be nice to make the node itself more resilient to this by limiting the number of expensive requests it processes concurrently -- we've done something similar with ImportRequest -- but simply dropping a limiter in, e.g., evalAddSSTable might not quite work: by that point we've already unmarshaled the request to hand it to eval, so we're holding the whole SST in memory by then. It would still likely help, particularly if combined with smaller SSTs, since it might block the caller from sending the next 16MB -- but it'd be even nicer if we could keep the larger files and somehow backpressure reading the RPC off the wire until we're ready to process it... I just don't know if/how we'd do that.
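A minimal sketch of that kind of limiter, using a buffered channel as a counting semaphore; the function names, the limit of 4, and the wrapping point are assumptions for illustration, not the actual ImportRequest limiter:

```go
package main

import "context"

// sstSemaphore caps how many expensive AddSSTable evaluations run at once.
// A buffered channel acts as a counting semaphore: acquire by sending,
// release by receiving. The limit of 4 is arbitrary for this sketch.
var sstSemaphore = make(chan struct{}, 4)

// evalAddSSTableLimited wraps a hypothetical evaluation function. Note the
// caveat above: by the time we get here the request has already been
// unmarshaled, so the full SST is in memory regardless; the limiter only
// bounds how many requests are evaluated (and copied further) concurrently.
func evalAddSSTableLimited(ctx context.Context, sst []byte) error {
	select {
	case sstSemaphore <- struct{}{}:
		defer func() { <-sstSemaphore }()
	case <-ctx.Done():
		return ctx.Err()
	}
	return evalAddSSTable(ctx, sst)
}

// evalAddSSTable stands in for the real evaluation path, which is not shown.
func evalAddSSTable(ctx context.Context, sst []byte) error {
	_ = sst
	return nil
}

func main() {}
```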


dt commented Mar 15, 2019

We still need to do more profiling to figure out exactly what about handling an AddSSTable request uses so much memory, and I think we can also trim it: e.g. I think we could trivially avoid one copy by making a leveldb db.File that sits directly on top of the []byte without an extra copy. But in the very short term, lowering the SST size might be our suggested workaround.
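A sketch of what sitting directly on top of the []byte could look like, assuming the SSTable reader can be driven by something like io.ReaderAt plus a size; the real leveldb db.File interface may require additional methods, so treat this as illustrative only:

```go
package main

import (
	"fmt"
	"io"
)

// byteFile exposes an in-memory []byte through io.ReaderAt so an SSTable
// reader could iterate the request's data in place rather than copying it
// into its own buffer first.
type byteFile struct {
	data []byte
}

// ReadAt reads into p starting at off, without copying the whole slice.
func (f *byteFile) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(f.data)) {
		return 0, io.EOF
	}
	n := copy(p, f.data[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// Size reports the length of the underlying data, which table readers
// typically need alongside ReadAt.
func (f *byteFile) Size() int64 { return int64(len(f.data)) }

func main() {
	f := &byteFile{data: []byte("sst bytes from an AddSSTable request")}
	buf := make([]byte, 9)
	n, _ := f.ReadAt(buf, 0)
	fmt.Println(string(buf[:n])) // prints "sst bytes"
}
```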

awoods187 added the C-bug and S-2 labels Apr 22, 2019

awoods187 commented Sep 10, 2019

Here is the panic I see on 9 nodes in a 60-node cluster during import:

runtime stack:
runtime.throw(0x437d603, 0x16)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0x7f19e81fd818 sp=0x7f19e81fd7e8 pc=0x78d862
runtime.sysMap(0xc144000000, 0x1c000000, 0x7a54498)
	/usr/local/go/src/runtime/mem_linux.go:170 +0xc7 fp=0x7f19e81fd858 sp=0x7f19e81fd818 pc=0x777eb7
runtime.(*mheap).sysAlloc(0x7a11d20, 0x18196000, 0x7a11d30, 0xc0cb)
	/usr/local/go/src/runtime/malloc.go:633 +0x1cd fp=0x7f19e81fd900 sp=0x7f19e81fd858 pc=0x76ae3d
runtime.(*mheap).grow(0x7a11d20, 0xc0cb, 0x0)
	/usr/local/go/src/runtime/mheap.go:1222 +0x42 fp=0x7f19e81fd958 sp=0x7f19e81fd900 pc=0x7851d2
runtime.(*mheap).allocSpanLocked(0x7a11d20, 0xc0cb, 0x7a544a8, 0x77d45d)
	/usr/local/go/src/runtime/mheap.go:1150 +0x37f fp=0x7f19e81fd990 sp=0x7f19e81fd958 pc=0x7850bf
runtime.(*mheap).alloc_m(0x7a11d20, 0xc0cb, 0x7f19dfca0101, 0x7a11d30)
	/usr/local/go/src/runtime/mheap.go:977 +0xc2 fp=0x7f19e81fd9e0 sp=0x7f19e81fd990 pc=0x784712
runtime.(*mheap).alloc.func1()
	/usr/local/go/src/runtime/mheap.go:1048 +0x4c fp=0x7f19e81fda18 sp=0x7f19e81fd9e0 pc=0x7b8e9c
runtime.(*mheap).alloc(0x7a11d20, 0xc0cb, 0x7f19e8000101, 0x7f19dfca5428)
	/usr/local/go/src/runtime/mheap.go:1047 +0x8a fp=0x7f19e81fda68 sp=0x7f19e81fda18 pc=0x7849ea
runtime.largeAlloc(0x18196000, 0xffffffffffff0100, 0x7f19dfca5428)
	/usr/local/go/src/runtime/malloc.go:1055 +0x99 fp=0x7f19e81fdaa8 sp=0x7f19e81fda68 pc=0x76c259
runtime.mallocgc.func1()
	/usr/local/go/src/runtime/malloc.go:950 +0x46 fp=0x7f19e81fdad8 sp=0x7f19e81fdaa8 pc=0x7b7e26
runtime.systemstack(0x7f19eca906d0)
	/usr/local/go/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f19e81fdae0 sp=0x7f19e81fdad8 pc=0x7bb1d6
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1153 fp=0x7f19e81fdae8 sp=0x7f19e81fdae0 pc=0x791e10

4 additional nodes OOMed.

The repro steps are in this closed issue.

yuzefovich removed their assignment Apr 5, 2021
dt closed this as completed Jun 1, 2021