
dagger: magicache memory pressure #26769

Closed
alafanechere opened this issue May 30, 2023 · 14 comments

alafanechere commented May 30, 2023

The magicache pod's memory usage can grow very fast when multiple pipelines are running at the same time. It can put nodes under memory pressure and lead to failed jobs.
To work around this problem, the Dagger team will implement automatic cache-pruning logic. In the meantime, when we face this problem we should ask the Dagger team to prune the cache manually.

We'd also like to know:

  • whether the Dagger engine can fall back to local caching if the magicache pod is unavailable.
  • whether we could temporarily disable remote caching and magicache on our critical pipelines.

sipsma commented May 30, 2023

There's a new image for magicache w/ automatic pruning available here: 125635003186.dkr.ecr.us-east-1.amazonaws.com/magicache:latest@sha256:5835bf44e8c72ad56f944623b9f9e9b4040d15d9749ac228453d43bab437b1bf

By default it triggers pruning at 250MB of cache metadata, which, based on our past experience, is a lot of metadata but not enough to cause memory usage problems. That number is configurable if needed via the MAX_CACHE_METADATA_BYTES env var on the magicache container, though I'd suggest starting with the default and seeing how that goes.
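For anyone wiring this up, here is a minimal sketch of how such a threshold override is typically read and applied. Only the MAX_CACHE_METADATA_BYTES name comes from this thread; every other identifier is invented for illustration and is not taken from magicache's code:

```go
// Hypothetical sketch of a MAX_CACHE_METADATA_BYTES-style prune threshold.
package main

import (
	"fmt"
	"os"
	"strconv"
)

const defaultMaxMetadataBytes int64 = 250 * 1024 * 1024 // 250MB default

// maxMetadataBytes reads the env var override, falling back to the default
// when it is unset or malformed.
func maxMetadataBytes() int64 {
	if v := os.Getenv("MAX_CACHE_METADATA_BYTES"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	return defaultMaxMetadataBytes
}

func main() {
	// Pretend the tracked cache metadata has grown to 300MB.
	currentMetadataBytes := int64(300 * 1024 * 1024)
	if currentMetadataBytes >= maxMetadataBytes() {
		fmt.Printf("metadata at %d bytes exceeds %d, would prune now\n",
			currentMetadataBytes, maxMetadataBytes())
	}
}
```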

Also, we're making quick progress on moving magicache to run in our cloud, so that will be available soon and should eliminate this problem.

cc @cpdeethree @alafanechere

> whether the Dagger engine can fall back to local caching if the magicache pod is unavailable.

Yes, that's the behavior today, with the exception that if magicache is not available when the engine starts, it will currently get an error and exit. That's a bug, though; we have an issue in our backlog for fixing it. The engine will already rely on just the local cache if the remote cache becomes unavailable while it's running, so we only need to fix the startup bug to get the behavior you want here. I will prioritize implementing that fix.

> whether we could temporarily disable remote caching and magicache on our critical pipelines.

You certainly can; it's just a matter of dropping the env var set on the engine container. But if we fix the aforementioned bug where the engine errors out when magicache isn't available at startup, I think there's no harm in leaving it on. The only critical magicache-related bug that has caused problems for engines is the memory usage issue; everything else is unrelated and would affect engines even without magicache enabled.

@alafanechere

Thank you @sipsma, your explanations are 💯 clear!
I've updated the magicache image ref in our dev-magicache deployment. I'll ask @cpdeethree to review and apply it.
I've opened a draft PR to try out the kind of jobs that were previously putting the cluster under memory pressure.

c-p-b commented May 31, 2023

Reviewed and applied

@alafanechere

@sipsma I can confirm I observed a cache pruning operation at ~250MB. We re-ran the same type of jobs that were previously causing memory pressure, without any problem. 🎉
[screenshot attached]

sipsma commented May 31, 2023

Awesome, great to hear! I can confirm on our side that the bucket's metadata is pruned, so it looks like it's working as intended.

@alafanechere

We deployed the new magicache image to our production runners. I'll close this issue as the problem is likely solved now, and will reopen it if I observe magicache-related memory problems.

Thank you so much @sipsma for the quick fix.

github-project-automation bot moved this from In Progress to Done in Dagger Support on May 31, 2023
alafanechere moved this from Done to In Progress in Dagger Support on Jun 1, 2023
alafanechere reopened this on Jun 1, 2023

alafanechere commented Jun 1, 2023

@sipsma @mircubed I'm reopening this because the memory usage of our production magicache pod (using 125635003186.dkr.ecr.us-east-1.amazonaws.com/magicache:latest@sha256:5835bf44e8c72ad56f944623b9f9e9b4040d15d9749ac228453d43bab437b1bf) is continuously growing. I don't see aggressive pruning that would keep the memory usage low: a regular pruning operation of ~250MB happens, but the underlying memory usage still grows.

[screenshot attached]

@alafanechere

When the pod reaches ~5GB of memory usage it's evicted and a new one starts.

[screenshot attached]

sipsma commented Jun 1, 2023

@alafanechere Sorry this came back. Your cache metadata is currently at 232MB, so the pruning hasn't kicked in yet. It also hasn't grown a ton over the last 24 hours, so I suspect something else may be at play here.

I will look into this. I'll start by trying to run magicache locally with your cacheState.json to see if something else is happening and go from there.

In the meantime, it may be worth a shot to deploy magicache with the env var MAX_CACHE_METADATA_BYTES set to 104857600 (to cap it at 100MB), just in case that helps.

c-p-b commented Jun 1, 2023

Happy to inject that env var; I just want to double-check what performance implications, if any, will arise from capping it there.

sipsma commented Jun 1, 2023

> Happy to inject that env var; I just want to double-check what performance implications, if any, will arise from capping it there.

You will get cache misses more often: there will be an initial burst of misses whenever that 100MB threshold is hit, but it should recover quickly. Based on the growth rate of the metadata I've observed, this should only happen every few days.

sipsma commented Jun 1, 2023

I've been running with the cache metadata from your bucket on my own cluster and have not been able to reproduce this behavior yet. But I do have a somewhat promising lead.

I found that most of the CPU time during import calls was spent on memory allocations and GC. The heap showed no memory leaks, but almost all of the space was taken up by GC-able memory. I don't know for sure how plausible this is (I need to read up on the Go GC details again), but I'm wondering if the increase could be related to the GC simply not being able to keep up with allocations.
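As a side note, findings like this typically come from Go's built-in pprof tooling; here is a minimal, generic sketch of that setup (not necessarily the exact steps used in this investigation):

```go
// Generic pprof setup for a Go service: expose the profiling endpoints, then
// pull CPU and heap profiles to see where time and memory go.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Serve the profiling endpoints on a local port alongside the service.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

// Then, from a shell:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30   # CPU: time spent in allocation/GC paths
//   go tool pprof http://localhost:6060/debug/pprof/heap                 # live heap: what is actually retained
```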

I tracked the allocations down to the S3 download manager client we use in the implementation. It turns out that if you don't give it a pre-allocated buffer at least as large as the object, it devolves into performing an absurd number of allocations.

Creating a buffer pre-allocated to the size of the metadata resulted in lower and more steady memory usage and much faster operations (~6s instead of ~30s previously). So we'll want this fix for the CPU usage improvement alone (it may help with the long start/shutdown times during engine rollouts), but there's a chance it also helps with the memory increases.
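To illustrate the technique, here is a generic aws-sdk-go sketch with placeholder bucket/key names; it is not magicache's actual code:

```go
// Pre-allocate the destination buffer for the S3 download manager so it
// doesn't repeatedly grow (and re-allocate) as chunks are written.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	bucket, key := "example-cache-bucket", "cacheState.json" // placeholders

	// Ask S3 for the object size first so the buffer can be sized once.
	head, err := s3.New(sess).HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		log.Fatal(err)
	}

	// aws.NewWriteAtBuffer wraps a caller-provided slice; sizing it to the
	// object length up front avoids the growth-driven allocation churn.
	buf := aws.NewWriteAtBuffer(make([]byte, *head.ContentLength))

	if _, err := s3manager.NewDownloader(sess).Download(buf, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	}); err != nil {
		log.Fatal(err)
	}
	log.Printf("downloaded %d bytes", len(buf.Bytes()))
}
```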

I'm gonna finish fixing this up on the download and upload side, do a quick verification test with Airbyte workflows on my cluster, and then send it over to you all. After that I'll continue trying to repro the behavior you are getting with some long-running tests.

sipsma commented Jun 1, 2023

@cpdeethree @alafanechere I have a new magicache image for you at 125635003186.dkr.ecr.us-east-1.amazonaws.com/magicache:latest@sha256:b49b9a308c32603dbe07bcd495f7d93d3afec422240cf966277511d01277f896

It has the fixes from my previous comment plus a few more easy optimizations I found along the way (sketched after the list below):

  1. Forces a full GC cycle at the end of API calls that are known to allocate a lot of transient memory
  2. Switches to a JSON encode/decode library that makes far fewer allocations than the stdlib one
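A rough sketch of what these two changes can look like in Go; the thread doesn't name the JSON library that was adopted, so json-iterator is used here purely as an example of a stdlib-compatible, lower-allocation drop-in, and the other identifiers are illustrative:

```go
// Illustration only: force a GC cycle after an allocation-heavy call and use
// a lower-allocation, stdlib-compatible JSON library for the decode.
package main

import (
	"log"
	"runtime"

	jsoniter "github.com/json-iterator/go"
)

// Drop-in replacement exposing the same Marshal/Unmarshal API as encoding/json.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

type cacheState struct {
	Entries map[string]string `json:"entries"`
}

// importCacheState stands in for an allocation-heavy API call.
func importCacheState(payload []byte) (*cacheState, error) {
	// Force a full GC cycle on the way out so the transient garbage from the
	// decode is reclaimed promptly instead of piling up between calls.
	defer runtime.GC()

	var state cacheState
	if err := json.Unmarshal(payload, &state); err != nil {
		return nil, err
	}
	return &state, nil
}

func main() {
	state, err := importCacheState([]byte(`{"entries":{"a":"b"}}`))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("imported %d entries", len(state.Entries))
}
```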

Altogether, when I was testing import calls with your ~230MB cache state, it went from >30s per call to ~4s. So this has a nice side effect of possibly improving the time it takes to roll out new engines, since they will spend less time blocked on long-running ops.

Memory usage is still high when the cache state size goes up, but the hope is that fewer allocations will improve overall memory stability. That being said, I still can't fully explain the behavior you were seeing, where memory increased despite little change in cache metadata size, so I'm letting some engines run in a loop on my cluster to see if I can reproduce it.

@alafanechere

@sipsma the new magicache image is deployed to our cluster. I'm closing the issue and will monitor memory over the next couple of days; I'll reopen it if the memory usage surges again.

github-project-automation bot moved this from In Progress to Done in Dagger Support on Jun 2, 2023