dagger: magicache memory pressure #26769
Comments
There's a new image for magicache w/ automatic pruning available here: By default it triggers pruning at 250MB of cache metadata, which based on our past experience was a lot but not enough to trigger memory usage problems. That number is configurable if needed via an env var you can set on the magicache container. Also, we're making quick progress on moving magicache to run in our cloud, so that will be available soon and should eliminate this problem. cc @cpdeethree @alafanechere
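For illustration only, a minimal Go sketch of what a size-based prune trigger like the one described above could look like. The env var name `MAGICACHE_PRUNE_THRESHOLD_BYTES` and the helper functions are hypothetical stand-ins; the actual magicache implementation and its configuration knob aren't shown in this thread.

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

// 250MB default, matching the threshold mentioned above.
const defaultPruneThresholdBytes = int64(250) << 20

// pruneThreshold reads a hypothetical env var and falls back to the default.
func pruneThreshold() int64 {
	if v := os.Getenv("MAGICACHE_PRUNE_THRESHOLD_BYTES"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	return defaultPruneThresholdBytes
}

// metadataSizeBytes and pruneMetadata are stand-ins for magicache internals
// that aren't visible in this thread (e.g. stat'ing the cacheState.json object).
func metadataSizeBytes() int64 { return 0 }
func pruneMetadata()           { log.Println("pruning cache metadata") }

func main() {
	threshold := pruneThreshold()
	log.Printf("prune threshold: %d bytes", threshold)
	// Periodically check the metadata size and prune once the threshold is hit.
	for range time.Tick(time.Minute) {
		if metadataSizeBytes() >= threshold {
			pruneMetadata()
		}
	}
}
```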
Yes that's the behavior today, with the exception that if magicache is not available when the engine starts, it will currently get an error and exit. That's a bug though, we have an issue in our backlog for fixing it. The engine already will just rely on the local cache if the remote cache isn't available while it's running, so we just need to fix the startup bug to get the behavior you want here. I will prioritize implementing that fix.
You certainly can, it's just a matter of dropping the env var set on the engine container, but I think if we fix the aforementioned bug where the engine errors if magicache isn't available at startup, then there's no harm in leaving it on. The only critical bug that's caused problems for engines related to magicache has been the memory usage issue; everything else is unrelated and would affect engines without magicache enabled.
Thank you @sipsma, your explanations are 💯 clear!
Reviewed and applied
@sipsma I confirmed I observed a cache pruning operation at ~250MB. We re-ran the same type of jobs that were previously causing memory pressure without problems. 🎉
Awesome, great to hear! I can confirm on our side that the bucket's metadata is pruned, so looks like it's working as intended.
We deployed the new magicache image to our production runners. I'll close this issue as the problem is likely solved now. Will reopen if I observe magicache-related memory problems. Thank you so much @sipsma for the quick fix.
@sipsma @mircubed I'm reopening because our production magicache pod (using
@alafanechere Sorry this came back. Your cache metadata is currently at 232MB, so the pruning hasn't kicked in yet. It also hasn't grown a ton over the last 24 hours, so I suspect something else may be at play here. I will look into this. I'll start by trying to run magicache locally with your cacheState.json to see if something else is happening and go from there. In the meantime, it may be worth a shot to deploy magicache with the env var
Happy to inject that env var, just want to double-check what performance implications, if any, will arise from capping it here.
You will get cache misses more often: there will be an initial burst of misses whenever that 100MB threshold is hit, but it should recover quickly. Based on the growth rate of the metadata I've observed, this should only happen every few days.
Been running with the cache metadata from your bucket on my own cluster, have not been able to reproduce this behavior yet. But I do have a somewhat promising lead.

I found that most of the CPU time during import calls was spent on memory allocations and GC. In the heap there were no memory leaks, but almost all of the space was taken by GC-able memory. I don't know for sure how plausible this is, need to read up on the Go GC details again, but I'm wondering if the increase could be related to the GC just not being able to keep up with allocations.

I tracked down the allocations to the S3 download manager client we use in the implementation. It turns out that if you don't give it a pre-allocated buffer of the size of the object or greater, it devolves into performing an absurd number of allocations. Creating a buffer pre-allocated to the size of the metadata resulted in lower and more steady memory usage and much faster time to complete operations (~6s instead of ~30s previously).

So we'll want this fix for the CPU usage improvement alone (it may help with the long start/shutdown times during engine rollouts), but there's a chance it may also help with the memory increases. I'm gonna finish fixing this up on the download and upload side, do a quick verification test w/ Airbyte workflows on my cluster, and then send it over to you all. After that I'll still continue to see if I can repro the behavior you are getting with some long-running tests.
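To make the buffer fix concrete, here's a minimal sketch assuming the AWS SDK for Go v1 `s3manager` download manager (the actual magicache code isn't part of this thread, and the bucket/key names are placeholders): a HeadObject call learns the object size so the WriteAtBuffer can be pre-allocated once, instead of growing and re-copying while concurrent part downloads write into it.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// downloadCacheState fetches an object into a buffer pre-sized to the
// object's length, so the download manager never has to grow (and re-copy)
// the buffer while concurrent part downloads write into it.
func downloadCacheState(bucket, key string) ([]byte, error) {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// Learn the object size up front.
	head, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}

	// Pre-allocate the full object size; without this the WriteAtBuffer
	// grows incrementally and allocation/GC work dominates the import.
	buf := aws.NewWriteAtBuffer(make([]byte, 0, *head.ContentLength))

	downloader := s3manager.NewDownloaderWithClient(svc)
	if _, err := downloader.Download(buf, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := downloadCacheState("example-bucket", "cacheState.json")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("downloaded %d bytes", len(data))
}
```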
@cpdeethree @alafanechere have a new magicache image for you at It has the fixes in my previous comment plus a few more easy optimizations I found along the way:
Altogether, when I was testing import calls w/ your ~230MB cache state, it went from >30s per call to ~4s. So this has a nice side effect of possibly improving the time it takes to roll out new engines, since they will spend less time being blocked on long-running ops. Memory usage is still high when the cache state size goes up, but the hope is that fewer allocations improve the overall memory stability. That being said, I still can't fully explain the behavior you were seeing with the memory increases despite little change in cache metadata size, so I'm still letting some engines run in my cluster in a loop to see if I can reproduce it.
@sipsma the new magicache image is deployed to our cluster. I'm closing the issue and will monitor memory in the next couple of days. Will reopen if the memory usage surges.
The magicache pod memory usage can grow very fast when multiple pipelines are running at the same time. It can put nodes under memory pressure and lead to failed jobs.
To work around this problem, the Dagger team will implement automatic cache pruning logic. In the meantime, when we face this problem we should ask the Dagger team to manually prune the cache.
We'd also like to know: