Whiplash effect from stalled ingestion #837
Comments
The Prometheus client RAM explosion is something we've investigated a bit, and it's extremely consistent in our experiments. Even configuring the single remote-write queue down to 1 shard with ~10000 sample capacity yields a massive memory footprint once ingestion is blocked downstream. Without having grokked the remote-write queue manager code quite yet, it seems that either the discarding of samples is insufficiently responsive or the discarded samples cannot be GC'd quickly enough to avoid this ballooning behavior.

There's been talk in the past of changing the remote writer to use the WAL, which seems like it could mitigate this behavior pretty effectively: when downstream ingestion breaks or slows down, the client could simply stop trying for a time and then pick up reading forward on the WAL where it left off. Not a trivial amount of work, but if it's acceptable in the Prometheus community this could be a good thing all 'round. We have a story to try our hand at this work, but haven't pulled it yet. I might give it another shot in the next few days while waiting for various slow things to run... :-)
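For reference, the experiment described above corresponds roughly to the following Prometheus remote-write settings (a minimal sketch: the `queue_config` field names are from the Prometheus documentation, and the Cortex URL is a placeholder):

```yaml
remote_write:
  - url: http://cortex.example:9009/api/prom/push   # placeholder Cortex push endpoint
    queue_config:
      min_shards: 1     # pin the queue to a single shard
      max_shards: 1
      capacity: 10000   # ~10000 samples buffered per shard before discarding
```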
Situation: you have a Cortex installation accepting data via `remote_write` from a number of Prometheus clients. This Cortex stalls for several minutes, so each Prometheus client starts to queue samples in memory. When the Cortex service resumes, the queued data is sent as fast as possible by the Prometheus clients.
Problem 1: decoding and ingesting all this data chews up CPU, so Cortex bogs down.
Problem 2: the per-user rate limiter will trigger, returning 500s (see the limits sketch below).
Problem 3: Prometheus re-sends on a 500 (by default after 30-100 milliseconds), exacerbating Problem 1 (see the backoff sketch below).
Problem 4: the Prometheus clients seem to blow up in RAM. This is anecdotal; there is code to discard excess data and I couldn't repeat the problem in a brief experiment.
Problem 5: if the stall lasts longer than the idle time, all ingesters are flushing all chunks to storage at this point, exacerbating Problem 1.
#836 is one approach to improve problems 2 and 3.
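For Problem 2, the per-user limit in question is the distributor's ingestion rate limiter. A minimal sketch of the corresponding Cortex limits configuration (field names assumed from the Cortex docs; the values are illustrative):

```yaml
limits:
  ingestion_rate: 25000        # sustained samples/sec allowed per user
  ingestion_burst_size: 50000  # short bursts tolerated above the sustained rate
```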
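The 30-100 millisecond window in Problem 3 comes from the remote-write queue's retry backoff, which is tunable per queue. A sketch, again with a placeholder URL, using the `min_backoff`/`max_backoff` fields whose defaults are the 30 ms and 100 ms cited above:

```yaml
remote_write:
  - url: http://cortex.example:9009/api/prom/push   # placeholder Cortex push endpoint
    queue_config:
      min_backoff: 30ms    # first retry after a failed (500) send
      max_backoff: 100ms   # backoff doubles per retry, capped here
```

Raising `max_backoff` is one blunt way to soften the re-send storm, at the cost of more client-side buffering.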