Memory leak issue with Benthos memory cache #2404
Comments
Hey @youngjosh, thanks, that should be all the info I need for now. I'll try and prioritise this.
Hey @youngjosh, I'm struggling to reproduce this with a naive set-up. I'm generating a large volume of data, but I suspect this may be related to spikes in traffic or some other throughput-related behaviour. Comment from @mihaitodor:
^ This issue (maps not shrinking from deletes) could potentially be the culprit, but I would only expect this to result in continuous growth of the map size if the pipeline is seeing persistent spikes that trigger map growth before keys are deleted. An alternative might be to use https://github.com/dolthub/swiss
@Jeffail this could definitely be the culprit; the throughput is a little spiky. I've attached a graph of the last 24 hours below, and a graph of a much shorter period shows fairly significant spikes in the short term. With regards to the alternative, is that something we should implement, or do you mean swapping out the cache in Benthos?
Hey, I wanted to contribute and saw this open issue. I hope the solution is good enough :)
@youngjosh have you given https://docs.redpanda.com/redpanda-connect/components/caches/lru/ or https://docs.redpanda.com/redpanda-connect/components/caches/ttlru/ a try? I forgot to mention this earlier, but it would be good to have a chart showing the differences between these caches.
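A swap along those lines would only touch the `cache_resources` block. The snippet below is an illustrative sketch rather than a definitive config: the `cap` and `default_ttl` values are placeholders, and the exact field names should be confirmed against the lru/ttlru docs linked above.

```yaml
cache_resources:
  # Bounded LRU cache: evicts the least recently used keys once `cap` is
  # reached, so memory stays bounded regardless of throughput spikes.
  - label: dedupe_cache
    lru:
      cap: 100000            # placeholder; size to the expected key cardinality

  # Or a TTL-aware LRU, closer to the original memory cache semantics:
  # - label: dedupe_cache
  #   ttlru:
  #     cap: 100000          # placeholder capacity
  #     default_ttl: 15m     # placeholder TTL
```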
Benthos Version: 4.24.0 (as pulled from the jeffail/benthos Docker repo)
We have had an issue recently where our high-volume Benthos dedupe pods (in the realm of ~2-3k messages per second) slowly use more and more memory until eventually crashing. The growth is slow: for example, the graph below shows memory usage over a 12-hour period for a pair of pods that each dedupe ~1.5-2k messages per second, and as such each make 1.5-2k dedupe checks against the cache.
These pods have very little running on them besides a dedupe against the Kafka key. I have given an example of our Benthos configs below, removing specifics about our input/output and two (simple) Bloblang processors. We have several very similar pipelines with slightly different Bloblang processors, but all use this style of cache and all suffer the same slow growth of memory usage until an OOM occurs.
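The original config attachment is not reproduced here, so the following is only a minimal sketch of what a dedupe pipeline of this shape might look like: the broker addresses, topics, key expression, and TTL are placeholder assumptions, not the reporter's actual values.

```yaml
input:
  kafka:
    addresses: [ "broker:9092" ]      # placeholder broker
    topics: [ "example_topic" ]       # placeholder topic
    consumer_group: "example_group"

cache_resources:
  - label: dedupe_cache
    memory:
      default_ttl: 15m                # the ~15 minute TTL described in the report
      # compaction_interval: 60s      # setting this explicitly reportedly made no difference

pipeline:
  processors:
    - dedupe:
        cache: dedupe_cache
        key: ${! meta("kafka_key") }  # dedupe against the Kafka key

output:
  kafka:
    addresses: [ "broker:9092" ]      # placeholder broker
    topic: "example_output_topic"
```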
In trying to diagnose why this was happening, we spun the pods up in debug mode and ran pprof against the debug endpoints to track memory usage over time. The culprit was the memory cache, as shown below: it sits at 147.75MB in this screenshot, but after being left overnight it grew by well over a hundred megabytes, despite the TTL on the cache being ~15 minutes. (Unfortunately I did not screenshot the later, higher memory usage; I can spin the pods back up in debug mode if that would help.)
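For anyone trying to reproduce the measurement, the debug endpoints referred to here are presumably the ones enabled via the `http` config section; a minimal sketch, assuming the default listen address, is shown below, after which something like `go tool pprof http://localhost:4195/debug/pprof/heap` can be pointed at the pod to snapshot the heap.

```yaml
http:
  address: 0.0.0.0:4195   # default listen address
  debug_endpoints: true   # exposes the Go pprof handlers under /debug/pprof/
```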
We tried tweaking the memory cache resource settings (e.g. explicitly setting a compaction interval), but we did not see any change in the growth of the cache size. I also looked at one of our other Benthos pods, which performs a much larger set of operations and uses a small memory cache (at around 40 messages/s throughput); after tracking it for a day I noticed that it grew too, although at a much, much slower rate.
As such, I believe (unless I've missed something stupid, in which case please correct me) that there is some form of memory leak in the Benthos memory cache, growing in direct proportion to the throughput utilising the cache.