Memory leak with kafka source, http sink and 429 code responses #17276
Comments
👋 Thanks for the detailed report @Ilmarii! I'm working on reproducing this locally. I'm curious whether you still see this behavior if acknowledgements are disabled, or if you've tried switching from a memory buffer to a disk buffer.
Noting I wasn't able to reproduce this locally when only returning 429 responses. EDIT: with some …
@Ilmarii it looks to me, based on my local testing, that if a disk buffer is used, or acknowledgements are disabled, this memory use isn't observed. I'll continue to track down the cause, but in the interim you could use that to work around the issue in your deployment.
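For anyone else hitting this, a minimal sketch of that workaround in the sink's buffer settings (the sink name and size are illustrative, not taken from the reporter's config):

```toml
[sinks.http_out.buffer]   # "http_out" is a placeholder sink name
type = "disk"
max_size = 1073741824     # bytes; disk buffers require an explicit max_size
```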
Fairly certain I've addressed this. @Ilmarii, would you be able to run a nightly version to see if my changes help enough in your environment?
@spencergilbert Thanks for your fix! I will try it and report back with the results.
As for the disk buffer, we switched away from it, since with the disk buffer we saw significant CPU underutilization (only 1-2 cores out of 8 were used). Now we use a memory buffer + acknowledgements for some protection against data loss.
@spencergilbert I built and tested vector from …
Thanks! Curious, was there any improvement at all on your end?
I found no improvement; the leak size is still the same. But I tested a273ff7, without your latest two commits.
Now I am testing with acknowledgements disabled; maybe this will be helpful.
Well, with acknowledgements disabled there is no memory leak :-)
Another thought we had is that there's an issue with how acknowledgements are handled by the old style sink(s), and rewriting …
@Ilmarii any chance you could try some variations of your configuration to try to isolate the issue? We could try swapping out the …
@jszwedko Yes, I will try it.
Hi @Ilmarii, did you have a chance to try any of the combinations suggested above?
I believe we may also be hitting this issue with the kinesis sink when requests are throttled by AWS. Once the throttling subsides, so does the OOMing. I can also remove the kinesis sink (but keep the k8s logs source), and the problem disappears (which points to the issue being on the sink side).
That would be consistent with the sink building up a backlog of requests to process due to the throttling 🤔
I could also add: the explosion in memory happens quite quickly. In steady state, when things are good, per-pod memory usage is a couple of hundred Mi. We have the pod limit set to 3Gi. Monitoring …
Observing this via a worker node, once the journal entry for kubernetes_logs is accessed, it dies in a matter of seconds. It happens so quickly, it's hard to debug. Right before the OOM, I did notice about ~50 additional sockets under …
Interesting, yeah, if you are using a concurrency of … Could you share your configuration as well?
Will enable debug logs and see if we can ascertain the concurrency level. I also think the rate-limiting is a red herring, as after further testing, this behavior is also observed when we are not rate-limited by AWS. Also, apologies for hijacking this thread. I've opened a separate ticket for this issue (I should have done this initially), and uploaded the configuration as well: #18397
I suspect the …
Reading up on this issue, it sounds like we are still waiting to hear back from @Ilmarii on trying out the adjustments to the config in order to help isolate the issue.
Hi @Ilmarii, I believe this should be resolved in the latest version (v0.34.0). We fixed a memory leak in the http sink (#18637) that would be triggered when the downstream service returns 429, and also refactored the Kafka source to better handle acknowledgements (#17497). Let us know if you still experience issues after upgrading.
Problem
We have a configuration with a kafka source and an http sink, with acknowledgements enabled. It works fine when the http sink receives only successful responses, but when the sink receives `429` responses, memory starts growing until Vector is killed by the OOM killer. I also tried turning on `allocation-tracing`, and the graphs look like there is a memory leak in the kafka source. Some graphs:
- Memory usage from vector node
- Internal allocation tracing, all components
- Internal allocation tracing, only kafka source
- Rate of http sink requests with response code
Configuration
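The original configuration was not preserved in this copy of the issue; below is a hedged sketch reconstructed from the description above (kafka source → http sink, acknowledgements enabled, memory buffer). The broker, consumer group, topic, and endpoint values are placeholders.

```toml
[sources.kafka_in]
type = "kafka"
bootstrap_servers = "kafka-1:9092"     # placeholder broker
group_id = "vector"                    # placeholder consumer group
topics = ["logs"]                      # placeholder topic

[sinks.http_out]
type = "http"
inputs = ["kafka_in"]
uri = "http://downstream:8080/ingest"  # placeholder endpoint
encoding.codec = "json"
acknowledgements.enabled = true        # end-to-end acknowledgements, as described
buffer.type = "memory"                 # memory buffer, per the discussion above
```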
Version
0.29.1
Debug Output
Example Data
No response
Additional Context
For an MRE, I created an http server for the http sink with simple logic: while there are fewer than 20 concurrent requests, it sleeps for 40 seconds and returns success; otherwise it returns 429 immediately (a sketch of such a server follows below).
The content of the kafka topic is not important; there just needs to be a sufficient volume of data.
Vector is running in k8s.
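A minimal sketch of a mock server with the behavior described above (Python standard library only; the port, path handling, and handler names are arbitrary choices, not taken from the original report):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_CONCURRENT = 20   # up to 20 concurrent requests are held open
SLEEP_SECONDS = 40    # each held request "processes" for 40 seconds

semaphore = threading.Semaphore(MAX_CONCURRENT)


class ThrottlingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body so the connection stays well-behaved.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        if semaphore.acquire(blocking=False):
            try:
                time.sleep(SLEEP_SECONDS)  # simulate a slow downstream
                self.send_response(200)
            finally:
                semaphore.release()
        else:
            self.send_response(429)        # too many concurrent requests
        self.send_header("Content-Length", "0")
        self.end_headers()


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), ThrottlingHandler).serve_forever()
```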
References
No response