Memory leak issue with Benthos memory cache #2404
Comments
Hey @youngjosh, thanks, that should be all the info I need for now. I'll try and prioritise this.
Hey @youngjosh, I'm struggling to reproduce this with a naive set-up. I'm generating a large volume of data, but I suspect this may be related to spikes in traffic or some other throughput-related behaviour. Comment from @mihaitodor:
^ This issue (maps not shrinking from deletes) could potentially be the culprit, but I would only expect this to result in continuous growth of the map size if the pipeline is seeing persistent spikes that trigger map growth before keys are deleted. An alternative might be to use https://github.com/dolthub/swiss
@Jeffail this could definitely be the culprit; the throughput is a little spiky. I've attached a graph of the last 24 hours below, and a graph of a much shorter period shows fairly significant spikes in the short term. With regards to the alternative, is that something we should implement, or do you mean swapping out the cache in Benthos?
Hey, I wanted to contribute and saw this open issue. I hope the solution is good enough :)
@youngjosh have you given https://docs.redpanda.com/redpanda-connect/components/caches/lru/ or https://docs.redpanda.com/redpanda-connect/components/caches/ttlru/ a try? I forgot to mention this earlier, but it would be good to have a chart showing the differences between these caches.
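A swap along those lines would only touch the `cache_resources` block. The snippet below is an illustrative sketch rather than a definitive config: the `cap` and `default_ttl` values are placeholders, and the exact field names should be confirmed against the lru/ttlru docs linked above.

```yaml
cache_resources:
  # Bounded LRU cache: evicts the least recently used keys once `cap` is
  # reached, so memory stays bounded regardless of throughput spikes.
  - label: dedupe_cache
    lru:
      cap: 100000            # placeholder; size to the expected key cardinality

  # Or a TTL-aware LRU, closer to the original memory cache semantics:
  # - label: dedupe_cache
  #   ttlru:
  #     cap: 100000          # placeholder capacity
  #     default_ttl: 15m     # placeholder TTL
```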
Benthos Version: 4.24.0 (as pulled from the jeffail/benthos Docker repo)
We have had an issue recently where our high-volume Benthos dedupe pods (in the realm of ~2-3k messages per second) slowly use more and more memory until eventually crashing. The growth is slow: for example, the graph below shows memory usage over a 12-hour period for a pair of pods that each dedupe ~1.5-2k messages per second, and as such each make 1.5-2k dedupe checks against the cache.
These pods have very little running on them besides a dedupe against the Kafka key. I have given an example of our Benthos configs below, removing specifics about our input/output and two (simple) Bloblang processors. We have several very similar pipelines with slightly different Bloblang processors, but all use this style of cache and all suffer the same slow growth of memory usage until an OOM occurs.
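The original config attachment is not reproduced here, so the following is only a minimal sketch of what a dedupe pipeline of this shape might look like: the broker addresses, topics, key expression, and TTL are placeholder assumptions, not the reporter's actual values.

```yaml
input:
  kafka:
    addresses: [ "broker:9092" ]      # placeholder broker
    topics: [ "example_topic" ]       # placeholder topic
    consumer_group: "example_group"

cache_resources:
  - label: dedupe_cache
    memory:
      default_ttl: 15m                # the ~15 minute TTL described in the report
      # compaction_interval: 60s      # setting this explicitly reportedly made no difference

pipeline:
  processors:
    - dedupe:
        cache: dedupe_cache
        key: ${! meta("kafka_key") }  # dedupe against the Kafka key

output:
  kafka:
    addresses: [ "broker:9092" ]      # placeholder broker
    topic: "example_output_topic"
```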
In trying to diagnose why this was happening, we spun the pods up in debug mode and ran pprof against the debug endpoints to track memory usage over time. The culprit was the memory cache, as shown below: it sits at 147.75MB in this screenshot, but after being left overnight it grew by well over a hundred megabytes, despite the TTL on the cache being ~15 minutes. (Unfortunately I did not screenshot the later, higher memory usage; I can spin the pods back up in debug mode if that would help.)
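For anyone trying to reproduce the measurement, the debug endpoints referred to here are presumably the ones enabled via the `http` config section; a minimal sketch, assuming the default listen address, is shown below, after which something like `go tool pprof http://localhost:4195/debug/pprof/heap` can be pointed at the pod to snapshot the heap.

```yaml
http:
  address: 0.0.0.0:4195   # default listen address
  debug_endpoints: true   # exposes the Go pprof handlers under /debug/pprof/
```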
We tried tweaking the memory cache resource settings (e.g. explicitly setting a compaction interval), but we did not see any change in the growth of the cache size. I also looked at one of our other Benthos pods, which performs a much larger set of operations and uses a small memory cache (at around 40 messages/s throughput); after tracking it for a day I noticed that it grew too, although at a much, much slower rate.
As such, I believe (unless I've missed something stupid, in which case please correct me) that there is some form of memory leak in the Benthos memory cache, growing in direct proportion to the throughput utilising the cache.