Whiplash effect from stalled ingestion #837

Closed
bboreham opened this issue Jun 4, 2018 · 3 comments
bboreham (Contributor) commented Jun 4, 2018

Situation: you have a Cortex installation accepting data via remote_write from a number of Prometheus clients. Cortex stalls for several minutes, so each Prometheus client starts queueing samples in memory.

When the Cortex service resumes, the Prometheus clients send the queued data as fast as possible.

Problem 1: decoding and ingesting all this data chews up CPU, so Cortex bogs down.
Problem 2: the per-user rate limiter triggers, returning 500s.
Problem 3: Prometheus re-sends after a 500 (by default after 30-100 milliseconds), exacerbating Problem 1.
Problem 4: the Prometheus clients seem to blow up in RAM. This is anecdotal; there is code to discard excess data, and I couldn't repeat the problem in a brief experiment.
Problem 5: if the stall lasts longer than the chunk idle timeout, all ingesters will be flushing all of their chunks to storage at this point, exacerbating Problem 1.

#836 is one approach to improving Problems 2 and 3.
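To illustrate the status-code distinction behind Problems 2 and 3: a 5xx response makes Prometheus retry the batch after its backoff, while a 4xx makes it drop the batch. Below is a minimal sketch of a push handler making that choice; the per-tenant limiter, the limit values, and the handler itself are illustrative assumptions, not Cortex's actual implementation.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// Hypothetical per-tenant limiter registry; real Cortex keeps this state
// in its distributor and configures limits per user.
var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

func limiterFor(tenant string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[tenant]
	if !ok {
		// Illustrative limit: 25k samples/s with a 50k burst.
		l = rate.NewLimiter(rate.Limit(25000), 50000)
		limiters[tenant] = l
	}
	return l
}

// pushHandler sketches the choice of status code when the limit trips.
func pushHandler(w http.ResponseWriter, r *http.Request) {
	tenant := r.Header.Get("X-Scope-OrgID") // Cortex's tenant ID header

	// A real handler would decode the remote_write request and charge the
	// limiter per sample; one token per request keeps the sketch short.
	if !limiterFor(tenant).Allow() {
		// 4xx: the Prometheus of that era treats the write as non-retryable
		// and drops the batch. Returning 500 here instead would trigger a
		// retry after the 30-100ms default backoff, amplifying the overload.
		http.Error(w, "per-tenant rate limit exceeded", http.StatusTooManyRequests)
		return
	}

	// ... forward samples to ingesters ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/api/prom/push", pushHandler)
	http.ListenAndServe(":9009", nil)
}
```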

cboggs (Contributor) commented Jun 5, 2018

The Prometheus client RAM explosion is something we've investigated a bit, and it's extremely consistent in our experiments. Even configuring the single remote-write queue down to 1 shard with ~10000 sample capacity yields a massive memory footprint once ingestion is blocked downstream. Without having fully grokked the remote-write queue manager code yet, it seems that either the discarding of samples is insufficiently responsive, or the discarded samples cannot be GC'd quickly enough to avoid this ballooning behavior.

There's been talk in the past of changing the remote writer to use the WAL, which seems like it could mitigate this behavior pretty effectively: when downstream ingestion breaks or slows down, the client could simply stop trying for a time and then pick up reading forward from the WAL where it left off. Not a trivial amount of work, but if it's acceptable to the Prometheus community this could be a good thing all 'round.

We have a story to try our hand at this work, but haven't pulled it yet. I might give it another shot in the next few days while waiting for various slow things to run... :-)
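For context, the remote-write queue tuning described above lives under queue_config in prometheus.yml. The snippet below is an illustrative sketch, with an example URL and values chosen to roughly match the "1 shard, ~10000 sample capacity" experiment and the 30-100ms default retry backoff mentioned earlier; it is not a recommended configuration.

```yaml
remote_write:
  - url: http://cortex.example:9009/api/prom/push   # example Cortex endpoint
    queue_config:
      max_shards: 1             # pin the queue to a single shard
      capacity: 10000           # ~10000 samples buffered per shard
      max_samples_per_send: 500
      batch_send_deadline: 5s
      min_backoff: 30ms         # default retry backoff range on 5xx responses
      max_backoff: 100ms
```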

bboreham (Contributor, Author) commented

Note that the per-user rate limiter was changed in #836 to return a 4xx instead of a 5xx, so the data is dropped rather than retried.

Prometheus remote write was changed to use the WAL, which means it behaves much better.

Possibly also related: #879

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Feb 3, 2020, and closed this issue as completed on Feb 18, 2020.