Memory leak issue with Benthos memory cache #2404

Open
youngjosh opened this issue Feb 29, 2024 · 5 comments
Labels
bug, caches (any tasks or issues relating to cache resources), needs investigation (it looks as though we have all the information needed but investigation is required)

Comments

@youngjosh
Contributor

Benthos Version: 4.24.0 (As pulled from the jeffail/benthos docker repo)

We have recently had an issue where our high-volume Benthos dedupe pods (in the realm of ~2-3k messages per second) slowly use more and more memory until they eventually crash. The growth is gradual; for example, below is a graph of memory usage over a 12-hour period for a pair of pods that dedupe ~1.5-2k messages per second, and as such are each making 1.5-2k dedupe checks per second against the cache.
[Screenshot 2024-02-29 08:16:01: memory usage over a 12-hour period for the two dedupe pods]

These pods have very little running on them besides a dedupe against the Kafka key. I have given an example of our Benthos config below, with the specifics of our input/output and two (simple) Bloblang processors removed. We have several very similar pipelines with slightly different Bloblang processors, but all use this style of cache and all suffer the same slow growth in memory usage until an OOM occurs.

cache_resources:
  - label: dedupe
    memory:
      default_ttl: 800s

input:
  kafka:
    # A standard kafka input is used

pipeline:
  threads: ${THREADS:8}
  processors:
    - for_each:
        - dedupe:
            cache: dedupe
            drop_on_err: false
            key: ${!meta("kafka_key")}
        - switch:
            # Two simple bloblang processors just for adjusting simple fields are contained here

output:
  kafka:
    # A standard kafka output is used

To diagnose why this was happening, we spun the pods up in debug mode and ran pprof against the debug endpoints to track memory usage over time. The culprit was the memory cache, as shown below. It sits at 147.75MB in this screenshot, but after being left overnight it grew by well over a hundred megabytes, despite the cache TTL being only ~13 minutes (800s). (Unfortunately I did not screenshot the later, higher memory usage; I can spin the pods back up in debug mode if that would help.)
[Screenshot 2024-02-29 08:29:41: pprof heap profile showing the memory cache at 147.75MB]
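For anyone wanting to reproduce the measurement, the heap profile can be pulled with standard Go tooling from the Benthos HTTP server, assuming debug endpoints are enabled and the default port of 4195 is in use (the pod address below is a placeholder):

go tool pprof http://<pod-address>:4195/debug/pprof/heap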

We tried tweaking the memory cache resource settings (e.g. explicitly setting a compaction interval) but did not see any change in the growth of the cache size. I also looked at one of our other Benthos pods, which performs a much larger set of operations and uses a small memory cache at around 40 messages per second of throughput; after tracking it for a day I noticed that it grew too, although at a much, much slower rate.
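For reference, a sketch of the kind of explicit compaction setting mentioned above, assuming the memory cache's compaction_interval field; the value shown is illustrative rather than a recommendation:

cache_resources:
  - label: dedupe
    memory:
      default_ttl: 800s
      # How often expired entries are swept from the cache (illustrative value).
      compaction_interval: 60s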

Given all this, I believe (unless I've missed something stupid, in which case please correct me) that there is some form of memory leak in the Benthos memory cache, with the rate of growth roughly proportional to the throughput utilising the cache.

@Jeffail added the bug, caches, and needs investigation labels on Feb 29, 2024
@Jeffail
Collaborator

Jeffail commented Feb 29, 2024

Hey @youngjosh, thanks, that should be all the info I need for now. I'll try and prioritise this.

@Jeffail
Collaborator

Jeffail commented Mar 1, 2024

Hey @youngjosh, I'm struggling to reproduce this with a naive set-up. I'm generating a large volume of data, but I suspect this may be related to spikes in traffic or some other throughput-related behaviour.

Comment from @mihaitodor:

looks like golang/go#20135. See here: https://github.com/benthosdev/benthos/blob/d7050096061cd78db0a51278986edfe20082c8e6/internal/impl/pure/cache_memory.go#L130

^ This issue (Go maps not shrinking after deletes) could potentially be the culprit, but I would only expect it to result in continuous growth of the map size if the pipeline is seeing persistent traffic spikes that push the map to a new high-water mark before expired entries are deleted.

An alternative might be to use https://github.com/dolthub/swiss
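For illustration, here is a minimal standalone Go sketch (not Benthos code, and not tied to any particular fix) of the map behaviour referenced above: once a spike has filled a map, deleting every entry leaves the live heap near the map's high-water mark rather than returning it to the baseline.

// Minimal sketch of golang/go#20135: a Go map does not release its bucket
// memory when entries are deleted, so heap usage tracks the map's peak size.
package main

import (
	"fmt"
	"runtime"
)

// heapAlloc forces a GC and returns the live heap size in bytes.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	fmt.Printf("baseline:           %6d KiB\n", heapAlloc()/1024)

	// Simulate a traffic spike filling the cache with a million keys.
	cache := make(map[string][]byte)
	for i := 0; i < 1_000_000; i++ {
		cache[fmt.Sprintf("key-%d", i)] = []byte("value")
	}
	fmt.Printf("after 1M inserts:   %6d KiB\n", heapAlloc()/1024)

	// Expire everything, as a TTL compaction pass would.
	for k := range cache {
		delete(cache, k)
	}
	fmt.Printf("after deleting all: %6d KiB\n", heapAlloc()/1024)
	// The final figure stays far above the baseline because the map's bucket
	// array is still allocated; only replacing the map itself releases it.

	runtime.KeepAlive(cache)
}

In a cache backed by a plain map, this would mean a single large spike can permanently raise the pod's memory floor even after all of the entries from that spike have expired.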

@youngjosh
Contributor Author

@Jeffail this could definitely be the culprit; the throughput is a little spiky. I've attached a graph of the last 24 hours below:

[Screenshot 2024-03-01 17:36:46: throughput over the last 24 hours]

And a graph covering a much shorter period shows fairly significant short-term spikes:

[Screenshot 2024-03-01 17:48:44: throughput over a much shorter period]

With regard to the alternative, is that something we should implement on our side, or do you mean swapping out the cache implementation in Benthos?

@Mizaro
Contributor

Mizaro commented May 11, 2024

Hey, I wanted to contribute and saw this open issue. I hope the solution is good enough :)

@Jeffail
Collaborator

Jeffail commented Jun 12, 2024

@youngjosh have you given https://docs.redpanda.com/redpanda-connect/components/caches/lru/ or https://docs.redpanda.com/redpanda-connect/components/caches/ttlru/ a try? I forgot to mention this earlier, but it would be good to have a chart showing the differences between these caches.
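For anyone landing here later, a sketch of what that swap might look like for the resource in the original config, assuming the ttlru cache's cap and default_ttl fields as per the linked docs; the cap value is a placeholder and would need sizing for the expected key volume:

cache_resources:
  - label: dedupe
    ttlru:
      # Hard upper bound on tracked keys; least recently used entries are
      # evicted once the cap is reached (placeholder value).
      cap: 1048576
      default_ttl: 800s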
