xtrabackup engine can use tons of vttablet memory #5613

Closed
pH14 opened this issue Dec 23, 2019 · 24 comments · Fixed by #7037

Comments

@pH14
Contributor

pH14 commented Dec 23, 2019

Using the xtrabackup engine can skyrocket vttablet memory usage. As a result, we often run into OOM kills on the xtrabackup child process it spawns, due to container memory limits.

We can use xtrabackup_stripes as a coarse-grained control for memory usage, but it'd be better to set a precise upper bound on how much memory the backup engine can use.

When looking at the heap flamegraph, we can see that, unsurprisingly, all of our allocations are going into various byte buffers along the way.

Screen Shot 2019-11-27 at 5 13 27 PM

@morgo added the Type: Bug label Jan 7, 2020
@enisoc
Member

enisoc commented Jan 7, 2020

we often run into OOM kills on the xtrabackup child process it spawns

Were you able to see the mem usage of vttablet separately from xtrabackup (even though they're both in the same container) to know that the unexpected growth in usage is from vttablet and not xtrabackup itself? I noticed your flamegraph is only for vttablet.

Also, do you have any time-series graph of memory usage during a backup? It would be helpful to see if there is any sawtooth pattern, which would indicate we might be doing a lot of allocations that get cleaned up periodically by the GC.

We can use xtrabackup_stripes as a coarse-grain control for memory usage

Does this mean that you found more stripes leads to more memory usage? If so, then it seems like the main problem is the big chunk in your flamegraph attributed to pgzip, since we create a separate compressor for each stripe.

We do expose the underlying settings that are passed to pgzip as flags, so if this is the problem, it should be solvable with some tuning. The memory used by pgzip will scale like:

xtrabackup_stripes * backup_storage_block_size * backup_storage_number_blocks

If you want to increase xtrabackup_stripes while staying under the same memory limit, you can reduce backup_storage_block_size or backup_storage_number_blocks by the same factor.
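
A minimal sketch of the scaling rule above, to make the trade-off concrete. The function name and the input values are illustrative assumptions, not Vitess code, and the result only estimates pgzip buffer space, not total vttablet memory.

```go
package main

import "fmt"

// pgzipBufferEstimate applies the scaling rule quoted in the thread:
// xtrabackup_stripes * backup_storage_block_size * backup_storage_number_blocks.
// It is a hypothetical helper and only estimates compressor buffer space.
func pgzipBufferEstimate(stripes, blockSize, numBlocks int64) int64 {
	return stripes * blockSize * numBlocks
}

func main() {
	// Illustrative values only, not a tuning recommendation.
	base := pgzipBufferEstimate(4, 250000, 4)

	// Doubling the stripes while halving the number of blocks
	// leaves the estimate unchanged.
	rebalanced := pgzipBufferEstimate(8, 250000, 2)

	fmt.Printf("base=%d bytes, rebalanced=%d bytes\n", base, rebalanced)
}
```

If observed usage is far above this estimate, that suggests the memory is going somewhere other than the pgzip block buffers.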

@pH14
Contributor Author

pH14 commented Jan 7, 2020

Were you able to see the mem usage of vttablet separately from xtrabackup (even though they're both in the same container) to know that the unexpected growth in usage is from vttablet and not xtrabackup itself? I noticed your flamegraph is only for vttablet.

Yes, when the OOM killer runs we see xtrabackup running with ~50MB of memory, and vttablet with effectively containerMemoryLimit - xtrabackupMemory

Also, do you have any time-series graph of memory usage during a backup? It would be helpful to see if there is any sawtooth pattern, which would indicate we might be doing a lot of allocations that get cleaned up periodically by the GC.

Sure -- here's one recent replica pod. Guess when xtrabackup started and when the OOM killer stepped in 😁

Screen Shot 2020-01-07 at 5 12 41 PM

Does this mean that you found more stripes leads to more memory usage? If so, then it seems like the main problem is the big chunk in your flamegraph attributed to pgzip, since we create a separate compressor for each stripe.

Yes, more stripes means more usage. We've avoided the problem on a few keyspaces by lowering the number of stripes. Unfortunately it's a bit of guess-and-check for where it lands in total usage.

@pH14
Contributor Author

pH14 commented Jan 7, 2020

If you want to increase xtrabackup_stripes while staying under the same memory limit, you can reduce backup_storage_block_size or backup_storage_number_blocks by the same factor.

This is a good suggestion, I can look into trying that in our environment.

Admittedly I haven't dug super deep into the code here, but I'm generally a little puzzled that sync.Pool doesn't have an inherent way to set an upper bound. Seems like if you're ingesting data faster than you're compressing / uploading, the usage can balloon indefinitely.

@enisoc
Member

enisoc commented Jan 7, 2020

I'm generally a little puzzled that sync.Pool doesn't have an inherent way to set an upper bound

At a practical level, there should be a soft bound based on how many concurrent users of the pool there are. So the backup_storage_block_size and backup_storage_number_blocks settings should still be effective at reducing usage, although the effectiveness may be less than 1x depending on how sync.Pool works under the hood.

Seems like if you're ingesting data faster than you're compressing / uploading, the usage can balloon indefinitely

The pgzip lib claims that it blocks if its fixed compression buffers fill up. Do you see a way that data throughput mismatch could still cause memory bloat?
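
For contrast with sync.Pool, here is a rough sketch of what a hard-bounded buffer pool could look like. This is not how pgzip or Vitess manages buffers; `boundedPool` and its methods are hypothetical names used only to illustrate blocking (back-pressure) once a fixed number of buffers is in use.

```go
package main

import "fmt"

// boundedPool is a hypothetical fixed-size buffer pool. Unlike sync.Pool,
// it never hands out more than `count` buffers, so its memory is capped
// at roughly count * size bytes.
type boundedPool struct {
	free chan []byte
}

func newBoundedPool(count, size int) *boundedPool {
	p := &boundedPool{free: make(chan []byte, count)}
	for i := 0; i < count; i++ {
		p.free <- make([]byte, size)
	}
	return p
}

// Get blocks until a buffer is available, which slows producers down
// instead of allocating more memory.
func (p *boundedPool) Get() []byte { return <-p.free }

// TryGet returns a buffer and true, or nil and false if none is free.
func (p *boundedPool) TryGet() ([]byte, bool) {
	select {
	case b := <-p.free:
		return b, true
	default:
		return nil, false
	}
}

// Put returns a buffer previously obtained from this pool.
func (p *boundedPool) Put(b []byte) { p.free <- b }

func main() {
	pool := newBoundedPool(2, 250000)
	a, b := pool.Get(), pool.Get()

	// Both buffers are checked out, so a non-blocking request fails.
	_, ok := pool.TryGet()
	fmt.Println("third buffer available:", ok)

	pool.Put(a)
	pool.Put(b)
}
```

The design choice here is a buffered channel acting as both the free list and the semaphore, so the bound holds no matter how fast data arrives.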

@enisoc
Member

enisoc commented Jan 7, 2020

@pH14 If you have a chance, it would also be helpful if you can try a backup with this patch to see how it affects the flamegraph: #5666

That should address another, smaller section of the flamegraph, even though I still suspect the pgzip section is the main culprit.

@pH14
Contributor Author

pH14 commented Jan 9, 2020

Thanks for jumping on that -- it might be a bit before we can test it out, but I'm keen to see what happens. I'll do a bit more digging into the pgzip lib as well, perhaps it is more limited than I thought. If so, then maybe we can get some estimation of memory used which would help us determine which settings to use

@enisoc
Member

enisoc commented Jan 9, 2020

Can you share what values of xtrabackup_stripes, backup_storage_block_size, and backup_storage_number_blocks were in use for that graph showing 1.4G mem usage? And how many logical CPUs are accessible/visible (not necessarily reserved) to vttablet?

Also, what was the total mem usage represented by that flamegraph at the top? I'm trying to get an idea for the absolute sizes of these things to sanity check my theory. If the usage is way over the formula I proposed, you may be right that we need to fix pgzip to introduce true bounds.

@pH14
Contributor Author

pH14 commented Jan 10, 2020

For that one:

xtrabackup_stripes: 6
backup_storage_block_size: 250000 (default)
backup_storage_number_blocks: 6
CPU limit available to the vttablet container: 6

Unfortunately it doesn't look like I have the original dataset for that particular flamegraph. The runtime.Memstats for another heap profile I took for a tablet with the same configuration had:

# runtime.MemStats
# Alloc = 33034256
# TotalAlloc = 50845079400
# Sys = 1598191864
# Lookups = 0
# Mallocs = 186161662
# Frees = 186017549
# HeapAlloc = 33034256
# HeapSys = 1538981888
# HeapIdle = 1501822976
# HeapInuse = 37158912
# HeapReleased = 1427701760
# HeapObjects = 144113
# Stack = 4521984 / 4521984
# MSpan = 364344 / 1425408
# MCache = 12096 / 16384
# BuckHashSys = 1631199
# GCSys = 50413568
# OtherSys = 1201433
# NextGC = 52332416
# LastGC = 1574472023506697383
# PauseNs = [41928 219746 57021 74638 48606 54464 53100 105071 46524 50256 75897 75989 82120 69213 53516 56021 56539 59561 59536 44207 71136 70527 53860 71671 98397 77927 83243 68294 53262 62361 37154 75530 51476 56809 36191 55622 66242 52555 58215 61597 52101 35786 55041 57462 140730 53144 80121 45777 59278 77986 63235 74173 43519 60681 58208 58100 72939 41386 58819 55749 43588 47021 51543 61073 50496 63813 37998687 31089433 160585 234971 556087 554559 813087 116851 168849 30028401 178236 513240 91509 92482 8296131 248172 221647 132349 27922876 216584 246076 530352 7938011 261465 192565 7360664 232284 269826 146394 3991675 295095 8135008 171347 59378 51669 56071 53453 67189 41612 72225 66018 52339 66547 89580 62420 57923 43673 43878 52250 49427 58781 73841 69693 72881 77828 48380 50768 74521 48341 98080 50682 63601 47551 68250 51623 71675 57784 59317 48306 52762 100271 80529 59811 51879 65629 50222 54875 59038 88965 66218 73696 40007 62134 61160 53833 43927 48634 47882 60609 37865 53453 46281 43615 91949 41756 51074 48934 62479 65023 69111 51359 54280 57490 50980 37626 60979 71835 51755 59191 51489 67329 49135 52654 71479 62732 43966 52931 51340 54345 45880 43956 44496 46865 62386 58260 51760 77188 38753 62118 70365 199266 86807 72610 72541 51052 51325 69030 51481 48960 66143 63817 1294965 45935 71176 57393 52384 94340 45654 62170 59087 115320 78799 68465 65751 52693 70318 46899 58750 69468 49841 48625 51374 41500 59702 52242 55489 62194 96653 37638 70371 66405 49379 58334 69797 69963 55410 51126 54586 73746 67594 69815 66076 50083 48987 106815 58280 62347 74055 68665 58720]
# PauseEnd
# NumGC = 2153
# NumForcedGC = 0
# GCCPUFraction = 5.668797717612457e-06
# DebugGC = false

@pH14
Contributor Author

pH14 commented Jan 10, 2020

But, I'll see if I can pull a flamegraph w/ actual sizes from a tablet running into the same issues shortly

@enisoc
Member

enisoc commented Jan 11, 2020

Hm, it definitely seems like a lot of the mem usage is not accounted for. Almost all of the heap space is released or idle. If the space usage is idle heap space, it might even be that the flamegraph won't show it. The graph might only be showing us objects that are still allocated.

This might be a case of Go not releasing unused memory back to the OS. Usually that becomes a problem if you have tons of churn in objects being allocated and then collected, so it's possible that #5666 will help more than I thought it would. We'll have to wait and see.

@enisoc
Member

enisoc commented Jan 22, 2020

If we suspect pgzip is using too much memory, we could try https://godoc.org/golang.org/x/build/pargzip.

@pH14
Contributor Author

pH14 commented Jan 22, 2020

We rolled out #5666 and it hasn't changed overall memory usage, though the flamegraph looks pretty different now.

With:

xtrabackup_stripes: 4
backup_storage_block_size: 250000 (default)
backup_storage_number_blocks: 4
CPU limit available to the vttablet container: 10
Memory request for vttablet container: 2GB
Memory limit for vttablet container: 3GB

We get this flamegraph:

Screen Shot 2020-01-22 at 8 02 18 AM

Total vttablet process RSS was 1.845388GB. The total heap usage in the flamegraph is 444MB, and the whole tree from s3backupstorage.(*S3BackupHandle).AddFile.func1 --> s3manager.(*uploader).init.func1 is 393MB.

@acharis pointed out that our S3 uploader runs with a patch (!): https://gist.github.com/pH14/caee0c2be14e5db09c69e480be9f8a42 -- I don't think it affects this particular backup, since the file size is calculable + we're doing a striped backup, but including it for full disclosure.

@enisoc
Member

enisoc commented Jan 22, 2020

Given that the heap usage according to Go is only 444MB while the RSS is 1.8GB, it does feel like this might be a case of unused memory not being released to the OS. Since this memory is unused, it won't show up in a heap snapshot.

A quick-and-dirty way to test whether this is the problem would be to patch vttablet to periodically call debug.FreeOSMemory(). This will degrade performance/latency but will make the RSS more representative of actual allocated objects, which will give us evidence as to whether we're on the right track.

If you don't have a safe way to run that experimental patch, we may be able to find another way to test this but I can't think of one off the top of my head.

@pH14
Contributor Author

pH14 commented Jan 22, 2020

We'll be able to run that safely, I'll try hacking that in -- any suggested interval? Every 30s?

@enisoc
Member

enisoc commented Jan 22, 2020

If you don't care about the performance of this instance, you could do it every 1s just to be really sure that we're getting clean experimental results. If you're worried about this instance, 30s is a good compromise.

@pH14
Contributor Author

pH14 commented Feb 5, 2020

Haven't forgotten about this... hoping to test this out later this week 🤞

@deepthi
Member

deepthi commented Feb 5, 2020

@pH14 can you record the size of the database and the size of the largest table when you test this?

@pH14
Contributor Author

pH14 commented Feb 18, 2020

Phew, finally got this one running.

Settings were (accidentally) slightly different than earlier runs, but they were the same between running with the per-second execution of debug.FreeOSMemory() vs without.

xtrabackup_stripes: 8
backup_storage_block_size: 250000
backup_storage_number_blocks: 8
CPU limit available to the vttablet container: 10
Memory request for vttablet container: 1GB
Memory limit for vttablet container: 2GB
DB Size / Largest Table: 90GB (all the data on this db is in a single table)

On the default run (without the FreeOSMemory calls) we can see the RSS of the container spike when the backup kicks off:

withoutpatch-rss

And heap flamegraph:

withoutpatch-flamegraph

Notably, the backup also only took ~7 minutes, but it took twice as long for the vttablet memory to come back down. e.g. we can see the CPU came down 8 minutes before the memory usage did:

withoutpatch-cpu


With the patch to run FreeOSMemory every second, our RSS usage is quite different:

withpatch-rss

Note that the absolute memory usage is about halved from before.

The flamegraph is of a similar shape, but with ~20% less overall usage (this is also sampling though):

withpatch-flamegraph

@pH14
Contributor Author

pH14 commented Feb 18, 2020

Looks like forcing debug.FreeOSMemory() pretty significantly affects the vttablet memory footprint!

@enisoc
Member

enisoc commented Feb 26, 2020

That's definitely interesting! I looked up how to see more about those not-in-use objects, and it looks like it should be as simple as grabbing pprof/allocs instead of pprof/heap. Can you try to grab both of those on a run without FreeOSMemory so we can compare in-use objects vs. all past objects?
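
For reference, a small sketch of capturing both profiles programmatically with the standard runtime/pprof package. The helper name `captureProfile` is an assumption; in practice the same profiles are usually fetched over HTTP from the process's pprof endpoints. Per the Go documentation, the "allocs" profile carries the same data as "heap" but defaults to showing all allocations since program start rather than only live objects.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// captureProfile is a hypothetical helper that writes the named runtime
// profile (for example "heap" or "allocs") into memory in the compressed
// protobuf format that `go tool pprof` reads.
func captureProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	for _, name := range []string{"heap", "allocs"} {
		data, err := captureProfile(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s profile: %d bytes\n", name, len(data))
	}
}
```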

@pH14
Contributor Author

pH14 commented May 18, 2020

^ I don't have all the data for that one handy at this point, but for posterity we discussed that offline. It looked like pgzip was making a lot of short-lived allocations.

Interestingly, we recently deployed our vttablets with go1.13 up from go1.11 and saw memory usage degrade even further.

vttablet container memory usage, after go1.13:

Screen Shot 2020-05-15 at 6 29 06 PM

We can see it spike instantaneously, and Go never fully released memory back to the OS, even hours after the backup finished. It remained at nearly full usage (despite very low active heap usage) until we restarted the container.

To work around this, we set GODEBUG=madvdontneed=1 when running vttablet (from https://golang.org/doc/go1.12#runtime) and were able to restore the previous behavior:

Screen Shot 2020-05-15 at 6 27 51 PM

Curious if any others have seen such behavior going from go1.11 to 1.12 or 1.13
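
When comparing deployments, it can help to log at startup whether the setting is actually present in the process environment. The Go 1.12 release notes give the form as `GODEBUG=madvdontneed=1`. The sketch below is a hypothetical diagnostic helper (`godebugSetting` is not a Go runtime or Vitess function) and does not reproduce how the runtime itself parses GODEBUG.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// godebugSetting is a hypothetical helper that looks up key in a
// GODEBUG-style string of comma-separated key=value pairs.
// It returns the value and whether a key=value entry was found.
func godebugSetting(godebug, key string) (string, bool) {
	for _, kv := range strings.Split(godebug, ",") {
		parts := strings.SplitN(strings.TrimSpace(kv), "=", 2)
		if len(parts) == 2 && parts[0] == key {
			return parts[1], true
		}
	}
	return "", false
}

func main() {
	v, ok := godebugSetting(os.Getenv("GODEBUG"), "madvdontneed")
	if !ok {
		fmt.Println("madvdontneed is not set in GODEBUG")
		return
	}
	fmt.Println("madvdontneed =", v)
}
```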

@enisoc
Member

enisoc commented May 18, 2020

We've been running on go1.13 and haven't seen this ourselves, but maybe our shards are small enough to not hit it?

At this point, I think the most promising route is to try swapping in https://godoc.org/golang.org/x/build/pargzip.

@pH14 If I made a branch with that swap, would you be willing to test it on a shard that exhibits this problem?

@pH14
Contributor Author

pH14 commented May 18, 2020

Yep! We could give that a whirl

@enisoc
Member

enisoc commented May 18, 2020

@pH14 Here's an experimental branch you could try:

master...planetscale:pargzip

In addition to memory usage, it would also be interesting to compare how long the backup takes with pargzip vs. pgzip.
