xtrabackup engine can use tons of vttablet memory #5613

Closed
pH14 opened this issue Dec 23, 2019 · 24 comments · Fixed by #7037

Comments

@pH14
Contributor

pH14 commented Dec 23, 2019

Using the xtrabackup engine can skyrocket vttablet memory usage. As a result, we often run into OOM kills on the xtrabackup child process it spawns, due to container memory limits.

We can use xtrabackup_stripes as a coarse-grained control for memory usage, but it would be better to set a more precise upper bound on how much memory the backup engine can use.

When looking at the heap flamegraph, we can see that, unsurprisingly, all of our allocations are going into various byte buffers along the way.

[Screenshot: vttablet heap flamegraph, 2019-11-27]

@morgo morgo added the Type: Bug label Jan 7, 2020
@enisoc
Member

enisoc commented Jan 7, 2020

we often run into OOM kills on the xtrabackup child process it spawns

Were you able to see the mem usage of vttablet separately from xtrabackup (even though they're both in the same container) to know that the unexpected growth in usage is from vttablet and not xtrabackup itself? I noticed your flamegraph is only for vttablet.

Also, do you have any time-series graph of memory usage during a backup? It would be helpful to see if there is any sawtooth pattern, which would indicate we might be doing a lot of allocations that get cleaned up periodically by the GC.

We can use xtrabackup_stripes as a coarse-grain control for memory usage

Does this mean that you found more stripes leads to more memory usage? If so, then it seems like the main problem is the big chunk in your flamegraph attributed to pgzip, since we create a separate compressor for each stripe.

We do expose the underlying settings that are passed to pgzip as flags, so if this is the problem, it should be solvable with some tuning. The memory used by pgzip will scale like:

xtrabackup_stripes * backup_storage_block_size * backup_storage_number_blocks

If you want to increase xtrabackup_stripes while staying under the same memory limit, you can reduce backup_storage_block_size or backup_storage_number_blocks by the same factor.
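
As a rough worked example: with 6 stripes, the default backup_storage_block_size of 250000 bytes, and backup_storage_number_blocks of 6, that formula predicts about 6 * 250000 * 6 ≈ 9 MB of fixed compression buffers, which gives a baseline to compare against the RSS you're actually seeing.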

@pH14
Contributor Author

pH14 commented Jan 7, 2020

Were you able to see the mem usage of vttablet separately from xtrabackup (even though they're both in the same container) to know that the unexpected growth in usage is from vttablet and not xtrabackup itself? I noticed your flamegraph is only for vttablet.

Yes, when the OOM killer runs we see xtrabackup using ~50MB of memory, and vttablet using effectively containerMemoryLimit - xtrabackupMemory.

Also, do you have any time-series graph of memory usage during a backup? It would be helpful to see if there is any sawtooth pattern, which would indicate we might be doing a lot of allocations that get cleaned up periodically by the GC.

Sure -- here's one recent replica pod. Guess when xtrabackup started and when the OOM killer stepped in 😁

[Screenshot: replica pod memory usage during a backup, 2020-01-07]

Does this mean that you found more stripes leads to more memory usage? If so, then it seems like the main problem is the big chunk in your flamegraph attributed to pgzip, since we create a separate compressor for each stripe.

Yes, more stripes means more usage; we've avoided the problem on a few keyspaces by lowering the number of stripes. Unfortunately it's a bit of guess-and-check to find where total usage lands.

@pH14
Contributor Author

pH14 commented Jan 7, 2020

If you want to increase xtrabackup_stripes while staying under the same memory limit, you can reduce backup_storage_block_size or backup_storage_number_blocks by the same factor.

This is a good suggestion, I can look into trying that in our environment.

Admittedly I haven't dug super deep into the code here, but I'm generally a little puzzled that sync.Pool doesn't have an inherent way to set an upper bound. Seems like if you're ingesting data faster than you're compressing / uploading, the usage can balloon indefinitely.
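
For illustration, the kind of hard cap I mean could be handing buffers out of a fixed-size channel instead of sync.Pool; a minimal sketch of that pattern (hypothetical, not what Vitess currently does):

package example

// boundedPool hands out at most max buffers, so total buffer memory is
// capped at max*size bytes. Get blocks when all buffers are in use, which
// applies backpressure to whoever is producing data.
type boundedPool struct {
	bufs chan []byte
}

func newBoundedPool(max, size int) *boundedPool {
	p := &boundedPool{bufs: make(chan []byte, max)}
	for i := 0; i < max; i++ {
		p.bufs <- make([]byte, size)
	}
	return p
}

func (p *boundedPool) Get() []byte  { return <-p.bufs }
func (p *boundedPool) Put(b []byte) { p.bufs <- b }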

@enisoc
Member

enisoc commented Jan 7, 2020

I'm generally a little puzzled that sync.Pool doesn't have an inherent way to set an upper bound

At a practical level, there should be a soft bound based on how many concurrent users of the pool there are. So the backup_storage_block_size and backup_storage_number_blocks settings should still be effective at reducing usage, although the effectiveness may be less than 1x depending on how sync.Pool works under the hood.

Seems like if you're ingesting data faster than you're compressing / uploading, the usage can balloon indefinitely

The pgzip lib claims that it blocks if its fixed compression buffers fill up. Do you see a way that data throughput mismatch could still cause memory bloat?
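
For context, a sketch of how those fixed buffers get bounded, assuming the klauspost/pgzip API that the flags above feed into (simplified, not the actual backup engine code):

package example

import (
	"io"

	"github.com/klauspost/pgzip"
)

// newBoundedGzipWriter wraps w in a parallel gzip writer whose in-flight
// compression buffers are limited to roughly blockSize*blocks bytes; once
// those buffers are full, Write blocks instead of allocating more.
func newBoundedGzipWriter(w io.Writer, blockSize, blocks int) (*pgzip.Writer, error) {
	zw, err := pgzip.NewWriterLevel(w, pgzip.DefaultCompression)
	if err != nil {
		return nil, err
	}
	if err := zw.SetConcurrency(blockSize, blocks); err != nil {
		zw.Close()
		return nil, err
	}
	return zw, nil
}

With one compressor per stripe, backup_storage_block_size and backup_storage_number_blocks are the knobs that end up in that SetConcurrency call.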

@enisoc
Member

enisoc commented Jan 7, 2020

@pH14 If you have a chance, it would also be helpful if you can try a backup with this patch to see how it affects the flamegraph: #5666

That should address another, smaller section of the flamegraph, even though I still suspect the pgzip section is the main culprit.

@pH14
Contributor Author

pH14 commented Jan 9, 2020

Thanks for jumping on that -- it might be a bit before we can test it out, but I'm keen to see what happens. I'll do a bit more digging into the pgzip lib as well; perhaps it is more limited than I thought. If so, maybe we can get an estimate of the memory used, which would help us determine which settings to use.

@enisoc
Member

enisoc commented Jan 9, 2020

Can you share what values of xtrabackup_stripes, backup_storage_block_size, and backup_storage_number_blocks were in use for that graph showing 1.4G mem usage? And how many logical CPUs are accessible/visible (not necessarily reserved) to vttablet?

Also, what was the total mem usage represented by that flamegraph at the top? I'm trying to get an idea for the absolute sizes of these things to sanity check my theory. If the usage is way over the formula I proposed, you may be right that we need to fix pgzip to introduce true bounds.

@pH14
Contributor Author

pH14 commented Jan 10, 2020

For that one:

xtrabackup_stripes: 6
backup_storage_block_size: 250000 (default)
backup_storage_number_blocks: 6
CPU limit available to the vttablet container: 6

Unfortunately it doesn't look like I have the original dataset for that particular flamegraph. The runtime.MemStats from another heap profile I took on a tablet with the same configuration had:

# runtime.MemStats
# Alloc = 33034256
# TotalAlloc = 50845079400
# Sys = 1598191864
# Lookups = 0
# Mallocs = 186161662
# Frees = 186017549
# HeapAlloc = 33034256
# HeapSys = 1538981888
# HeapIdle = 1501822976
# HeapInuse = 37158912
# HeapReleased = 1427701760
# HeapObjects = 144113
# Stack = 4521984 / 4521984
# MSpan = 364344 / 1425408
# MCache = 12096 / 16384
# BuckHashSys = 1631199
# GCSys = 50413568
# OtherSys = 1201433
# NextGC = 52332416
# LastGC = 1574472023506697383
# PauseNs = [41928 219746 57021 74638 48606 54464 53100 105071 46524 50256 75897 75989 82120 69213 53516 56021 56539 59561 59536 44207 71136 70527 53860 71671 98397 77927 83243 68294 53262 62361 37154 75530 51476 56809 36191 55622 66242 52555 58215 61597 52101 35786 55041 57462 140730 53144 80121 45777 59278 77986 63235 74173 43519 60681 58208 58100 72939 41386 58819 55749 43588 47021 51543 61073 50496 63813 37998687 31089433 160585 234971 556087 554559 813087 116851 168849 30028401 178236 513240 91509 92482 8296131 248172 221647 132349 27922876 216584 246076 530352 7938011 261465 192565 7360664 232284 269826 146394 3991675 295095 8135008 171347 59378 51669 56071 53453 67189 41612 72225 66018 52339 66547 89580 62420 57923 43673 43878 52250 49427 58781 73841 69693 72881 77828 48380 50768 74521 48341 98080 50682 63601 47551 68250 51623 71675 57784 59317 48306 52762 100271 80529 59811 51879 65629 50222 54875 59038 88965 66218 73696 40007 62134 61160 53833 43927 48634 47882 60609 37865 53453 46281 43615 91949 41756 51074 48934 62479 65023 69111 51359 54280 57490 50980 37626 60979 71835 51755 59191 51489 67329 49135 52654 71479 62732 43966 52931 51340 54345 45880 43956 44496 46865 62386 58260 51760 77188 38753 62118 70365 199266 86807 72610 72541 51052 51325 69030 51481 48960 66143 63817 1294965 45935 71176 57393 52384 94340 45654 62170 59087 115320 78799 68465 65751 52693 70318 46899 58750 69468 49841 48625 51374 41500 59702 52242 55489 62194 96653 37638 70371 66405 49379 58334 69797 69963 55410 51126 54586 73746 67594 69815 66076 50083 48987 106815 58280 62347 74055 68665 58720]
# PauseEnd = [1574463553417529552 1574463673422325421 1574463793426306799 1574463913430900479 1574464033435467504 1574464153439767092 1574464273443957959 1574464393448044758 1574464513452259425 1574464633456462665 1574464753460300822 1574464873464541493 1574464993468751381 1574465113473700650 1574465233477875102 1574465353482200724 1574465473486348297 1574465593490460249 1574465713494702383 1574465833498891154 1574465953502923894 1574466073507204167 1574466193511621922 1574466313516459625 1574466433521100274 1574466553528584727 1574466673533855260 1574466793537646423 1574466913542476556 1574467033547156904 1574467153551462485 1574467273556433668 1574467393561206840 1574467513565319732 1574467633569359679 1574467753574053342 1574467873578483405 1574467993582671709 1574468113586862384 1574468233591165852 1574468353595322651 1574468473599232523 1574468593603822549 1574468713607715855 1574468833612369790 1574468953616774453 1574469073621801903 1574469193625221625 1574469313629658380 1574469433634381963 1574469553638307248 1574469673642840826 1574469793647137229 1574469913651584657 1574470033656157260 1574470153660365731 1574470273664452440 1574470393668648478 1574470513673475432 1574470633677162041 1574470753681013944 1574470873685903534 1574470993689627185 1574471113693852742 1574471233698442495 1574471353702246981 1574471357720444734 1574471357756981100 1574471358561725705 1574471360508616645 1574471363417513789 1574471364810954273 1574471365267436377 1574471365912462360 1574471366757222431 1574471367211211891 1574471368238144909 1574471369715422802 1574471370593900012 1574471371468410953 1574471371798380451 1574471372910272571 1574471374108158644 1574471375029813969 1574471375299427328 1574471376672030819 1574471377620789009 1574471378537640119 1574471378760298670 1574471380303651884 1574471381372012415 1574471382265417074 1574471382525677174 1574471384213378166 1574471385292731012 1574471385563589157 1574471387244751500 1574471388374780547 1574471388922807937 1574471508967380838 1574471628970112341 1574471748972646835 1574471838303262555 1574471925751235774 1574472023506697383 1574445432773617713 1574445552777922207 1574445672782126676 1574445792786845469 1574445912791701433 1574446032794808469 1574446152799237289 1574446272803381211 1574446392807428455 1574446512811569553 1574446632816156817 1574446752820264781 1574446872824846884 1574446992828906652 1574447112833402992 1574447232837848850 1574447352841911264 1574447472846188936 1574447592851218282 1574447712855475383 1574447832859602846 1574447952864048114 1574448072868203156 1574448192872847450 1574448312876908382 1574448432881080923 1574448552885367611 1574448672889999408 1574448792894101486 1574448912898516885 1574449032902586083 1574449152907075336 1574449272911354032 1574449392916019691 1574449512920700612 1574449632925014844 1574449752929168511 1574449872933666430 1574449992938213793 1574450112942724804 1574450232946624354 1574450352951476226 1574450472955504765 1574450592959588357 1574450712964244353 1574450832968172005 1574450952972493584 1574451072976583861 1574451192980689512 1574451312984891526 1574451432989376892 1574451552993491451 1574451672997334771 1574451793001661320 1574451913005674082 1574452033010014663 1574452153014469130 1574452273018384842 1574452393022523337 1574452513026394283 1574452633030315955 1574452753034933414 1574452873039209283 1574452993043561528 1574453113047559615 1574453233051824561 1574453353056026063 1574453473060409173 1574453593064578686 1574453713068826114 1574453833073141859 1574453953077280794 
1574454073081851837 1574454193085893926 1574454313089955318 1574454433094246873 1574454553098489477 1574454673102826654 1574454793106957589 1574454913110834515 1574455033115126261 1574455153119552608 1574455273123671469 1574455393127850580 1574455513132141207 1574455633136565536 1574455753140771235 1574455873145702274 1574455993148737973 1574456113153441281 1574456233158281318 1574456353162975182 1574456473167324908 1574456593171251216 1574456713175243676 1574456833179352074 1574456953183620875 1574457073188256681 1574457193192491909 1574457313196645788 1574457433201213095 1574457553205554724 1574457673211066261 1574457793213591316 1574457913217946262 1574458033222424169 1574458153226577787 1574458273231050506 1574458393235209150 1574458513239202493 1574458633243390947 1574458753248091997 1574458873252676546 1574458993256264378 1574459113260854908 1574459233264509335 1574459353268763055 1574459473272653008 1574459593276778699 1574459713280495984 1574459833285732397 1574459953289490762 1574460073293724788 1574460193297807889 1574460313302005293 1574460433306087028 1574460553310640749 1574460673314767743 1574460793319550978 1574460913323433035 1574461033328425738 1574461153332739982 1574461273336756352 1574461393341142965 1574461513345171266 1574461633349444855 1574461753353858902 1574461873357823879 1574461993362012217 1574462113366242831 1574462233370886503 1574462353374992956 1574462473379133292 1574462593383225711 1574462713387682889 1574462833391983157 1574462953396254097 1574463073400346544 1574463193404567830 1574463313409352247 1574463433413563403]
# NumGC = 2153
# NumForcedGC = 0
# GCCPUFraction = 5.668797717612457e-06
# DebugGC = false

@pH14
Contributor Author

pH14 commented Jan 10, 2020

But I'll see if I can pull a flamegraph with actual sizes from a tablet running into the same issues shortly.

@enisoc
Member

enisoc commented Jan 11, 2020

Hm, it definitely seems like a lot of the mem usage is not accounted for. Almost all of the heap space is released or idle. If the usage is idle heap space, the flamegraph might not even show it; the graph may only be showing us objects that are still allocated.

This might be a case of Go not releasing unused memory back to the OS. Usually that becomes a problem if you have tons of churn in objects being allocated and then collected, so it's possible that #5666 will help more than I thought it would. We'll have to wait and see.

@enisoc
Member

enisoc commented Jan 22, 2020

If we suspect pgzip is using too much memory, we could try https://godoc.org/golang.org/x/build/pargzip.

@pH14
Contributor Author

pH14 commented Jan 22, 2020

We rolled out #5666 and it hasn't changed overall memory usage, though the flamegraph looks pretty different now.

With:

xtrabackup_stripes: 4
backup_storage_block_size: 250000 (default)
backup_storage_number_blocks: 4
CPU limit available to the vttablet container: 10
Memory request for vttablet container: 2GB
Memory limit for vttablet container: 3GB

We get this flamegraph:

[Screenshot: vttablet heap flamegraph, 2020-01-22]

Total vttablet process RSS was 1.845388GB, the total heap usage there was 444MB, and the whole tree from s3backupstorage.(*S3BackupHandle).AddFile.func1 --> s3manager.(*uploader).init.func1 accounted for 393MB.

@acharis pointed out that our S3 uploader runs with a patch (!): https://gist.github.com/pH14/caee0c2be14e5db09c69e480be9f8a42 -- I don't think it affects this particular backup, since the file size is calculable + we're doing a striped backup, but including it for full disclosure.

@enisoc
Member

enisoc commented Jan 22, 2020

Given that the heap usage according to Go is only 444MB while the RSS is 1.8GB, it does feel like this might be a case of unused memory not being released to the OS. Since this memory is unused, it won't show up in a heap snapshot.

A quick-and-dirty way to test whether this is the problem would be to patch vttablet to periodically call debug.FreeOSMemory(). This will degrade performance/latency but will make the RSS more representative of actual allocated objects, which will give us evidence as to whether we're on the right track.

If you don't have a safe way to run that experimental patch, we may be able to find another way to test this but I can't think of one off the top of my head.
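
For reference, the quick-and-dirty patch I have in mind is just a ticker loop, something like this sketch (interval left as a knob, run from a side goroutine at startup):

package example

import (
	"runtime/debug"
	"time"
)

// freeOSMemoryLoop forces a GC and returns as much unused memory as possible
// to the OS on every tick, so RSS tracks live objects more closely.
// Experiment only: it trades CPU and latency for a more honest RSS.
func freeOSMemoryLoop(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		debug.FreeOSMemory()
	}
}

Started from vttablet's main as go freeOSMemoryLoop(30 * time.Second), or whatever interval you settle on.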

@pH14
Contributor Author

pH14 commented Jan 22, 2020

We'll be able to run that safely, I'll try hacking that in -- any suggested interval? Every 30s?

@enisoc
Member

enisoc commented Jan 22, 2020

If you don't care about the performance of this instance, you could do it every 1s just to be really sure that we're getting clean experimental results. If you're worried about this instance, 30s is a good compromise.

@pH14
Contributor Author

pH14 commented Feb 5, 2020

Haven't forgotten about this... hoping to test this out later this week 🤞

@deepthi
Member

deepthi commented Feb 5, 2020

@pH14 can you record the size of the database and the size of the largest table when you test this?

@pH14
Contributor Author

pH14 commented Feb 18, 2020

Phew, finally got this one running.

Settings were (accidentally) slightly different from the earlier runs, but they were the same between the run with per-second execution of debug.FreeOSMemory() and the run without.

xtrabackup_stripes: 8
backup_storage_block_size: 250000
backup_storage_number_blocks: 8
CPU limit available to the vttablet container: 10
Memory request for vttablet container: 1GB
Memory limit for vttablet container: 2GB
DB size / largest table: 90GB (all the data on this db is in a single table)

On the default run (without the FreeOSMemory calls) we can see the RSS of the container spike when the backup kicks off:

[Screenshot: container RSS without the FreeOSMemory patch]

And heap flamegraph:

[Screenshot: heap flamegraph without the FreeOSMemory patch]

Notably, the backup itself only took ~7 minutes, but it took twice as long for the vttablet memory to come back down; e.g. we can see the CPU came down 8 minutes before the memory usage did:

[Screenshot: container CPU usage without the FreeOSMemory patch]


With the patch to run FreeOSMemory every second, our RSS usage is quite different:

[Screenshot: container RSS with the FreeOSMemory patch]

Note that the absolute memory usage is about halved from before.

The flamegraph is of a similar shape, but with ~20% less overall usage (though this is also a sampled profile):

[Screenshot: heap flamegraph with the FreeOSMemory patch]

@pH14
Contributor Author

pH14 commented Feb 18, 2020

Looks like forcing debug.FreeOSMemory() pretty significantly affects the vttablet memory footprint!

@enisoc
Member

enisoc commented Feb 26, 2020

That's definitely interesting! I looked up how to see more about those not-in-use objects, and it looks like it should be as simple as grabbing pprof/allocs instead of pprof/heap. Can you try to grab both of those on a run without FreeOSMemory so we can compare in-use objects vs. all past objects?
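
(For reference, both profiles should be reachable on vttablet's debug HTTP port via the standard net/http/pprof handlers, e.g. go tool pprof http://<vttablet-host>:<port>/debug/pprof/heap and the same URL with /debug/pprof/allocs; the allocs profile samples everything allocated since process start rather than just live objects.)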

@pH14
Contributor Author

pH14 commented May 18, 2020

^ I don't have all the data for that one handy at this point, but for posterity: we discussed it offline, and it looked like pgzip was making a lot of short-lived allocations.

Interestingly, we recently upgraded our vttablets from go1.11 to go1.13 and saw memory usage degrade even further.

vttablet container memory usage, after go1.13:

[Screenshot: vttablet container memory usage on go1.13, 2020-05-15]

We can see it spike instantly, and Go never fully released the memory back to the OS, even hours after the backup finished. It remained at nearly full usage (despite very low live heap usage) until we restarted the container.

To get the old behavior back, we set GODEBUG=madvdontneed=1 when running vttablet (see https://golang.org/doc/go1.12#runtime), which restored the previous memory profile:

[Screenshot: vttablet container memory usage with GODEBUG=madvdontneed=1, 2020-05-15]

Curious if others have seen this behavior going from go1.11 to 1.12 or 1.13.

@enisoc
Member

enisoc commented May 18, 2020

We've been running on go1.13 and haven't seen this ourselves, but maybe our shards are small enough to not hit it?

At this point, I think the most promising route is to try swapping in https://godoc.org/golang.org/x/build/pargzip.

@pH14 If I made a branch with that swap, would you be willing to test it on a shard that exhibits this problem?

@pH14
Contributor Author

pH14 commented May 18, 2020

Yep! We could give that a whirl

@enisoc
Member

enisoc commented May 18, 2020

@pH14 Here's an experimental branch you could try:

master...planetscale:pargzip

In addition to memory usage, it would also be interesting to compare how long the backup takes with pargzip vs. pgzip.
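
For anyone following along, the swap is roughly this shape; a sketch assuming pargzip's NewWriter with its ChunkSize/Parallel knobs (the branch above is the actual change):

package example

import (
	"io"

	"golang.org/x/build/pargzip"
)

// compressWithPargzip streams src into dst through a parallel gzip writer.
// ChunkSize and Parallel are set before the first Write and roughly bound
// how much data is buffered in flight at once.
func compressWithPargzip(dst io.Writer, src io.Reader, chunkSize, parallel int) error {
	zw := pargzip.NewWriter(dst)
	zw.ChunkSize = chunkSize
	zw.Parallel = parallel
	if _, err := io.Copy(zw, src); err != nil {
		zw.Close()
		return err
	}
	return zw.Close()
}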
