Whiplash effect from stalled ingestion #837

Closed
bboreham opened this issue Jun 4, 2018 · 3 comments
bboreham (Contributor) commented Jun 4, 2018

Situation: you have a Cortex installation accepting data via remote_write from a number of Prometheus clients. Cortex stalls for several minutes, so each Prometheus client starts queueing samples in memory.

When the Cortex service resumes, the Prometheus clients send the queued data as fast as possible.

Problem 1: decoding and ingesting all this data chews up CPU, so Cortex bogs down.
Problem 2: the per-user rate limiter triggers, returning 500s.
Problem 3: Prometheus re-sends after a 500 (by default after 30-100 milliseconds), exacerbating Problem 1.
Problem 4: the Prometheus clients seem to blow up in RAM. This is anecdotal; there is code to discard excess data, and I couldn't repeat the problem in a brief experiment.
Problem 5: if the stall lasts longer than the chunk idle timeout, all ingesters will be flushing all of their chunks to storage at this point, exacerbating Problem 1.

#836 is one approach to improving Problems 2 and 3.
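To illustrate the status-code distinction behind Problems 2 and 3: a 5xx response makes Prometheus retry the batch after its backoff, while a 4xx makes it drop the batch. Below is a minimal sketch of a push handler making that choice; the per-tenant limiter, the limit values, and the handler itself are illustrative assumptions, not Cortex's actual implementation.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// Hypothetical per-tenant limiter registry; real Cortex keeps this state
// in its distributor and configures limits per user.
var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

func limiterFor(tenant string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[tenant]
	if !ok {
		// Illustrative limit: 25k samples/s with a 50k burst.
		l = rate.NewLimiter(rate.Limit(25000), 50000)
		limiters[tenant] = l
	}
	return l
}

// pushHandler sketches the choice of status code when the limit trips.
func pushHandler(w http.ResponseWriter, r *http.Request) {
	tenant := r.Header.Get("X-Scope-OrgID") // Cortex's tenant ID header

	// A real handler would decode the remote_write request and charge the
	// limiter per sample; one token per request keeps the sketch short.
	if !limiterFor(tenant).Allow() {
		// 4xx: the Prometheus of that era treats the write as non-retryable
		// and drops the batch. Returning 500 here instead would trigger a
		// retry after the 30-100ms default backoff, amplifying the overload.
		http.Error(w, "per-tenant rate limit exceeded", http.StatusTooManyRequests)
		return
	}

	// ... forward samples to ingesters ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/api/prom/push", pushHandler)
	http.ListenAndServe(":9009", nil)
}
```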

cboggs (Contributor) commented Jun 5, 2018

The Prometheus client RAM explosion is something we've investigated a bit, and it's extremely consistent in our experiments. Even configuring the single remote-write queue down to 1 shard with ~10000 sample capacity yields a massive memory footprint once ingestion is blocked downstream. Without having fully grokked the remote-write queue manager code yet, it seems that either the discarding of samples is insufficiently responsive, or the discarded samples cannot be GC'd quickly enough to avoid this ballooning behavior.

There's been talk in the past of changing the remote writer to use the WAL, which seems like it could mitigate this behavior pretty effectively: when downstream ingestion breaks or slows down, the client could simply stop trying for a time and then pick up reading forward from the WAL where it left off. Not a trivial amount of work, but if it's acceptable to the Prometheus community this could be a good thing all 'round.

We have a story to try our hand at this work, but haven't pulled it yet. I might give it another shot in the next few days while waiting for various slow things to run... :-)
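For context, the remote-write queue tuning described above lives under queue_config in prometheus.yml. The snippet below is an illustrative sketch, with an example URL and values chosen to roughly match the "1 shard, ~10000 sample capacity" experiment and the 30-100ms default retry backoff mentioned earlier; it is not a recommended configuration.

```yaml
remote_write:
  - url: http://cortex.example:9009/api/prom/push   # example Cortex endpoint
    queue_config:
      max_shards: 1             # pin the queue to a single shard
      capacity: 10000           # ~10000 samples buffered per shard
      max_samples_per_send: 500
      batch_send_deadline: 5s
      min_backoff: 30ms         # default retry backoff range on 5xx responses
      max_backoff: 100ms
```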

bboreham (Contributor, Author) commented

Note that the per-user rate limiter was changed in #836 to return a 4xx instead of a 5xx, so the data is dropped rather than retried.

Prometheus remote write was changed to use the WAL, which means it behaves much better.

Possibly also related: #879

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Feb 3, 2020, and closed this issue as completed on Feb 18, 2020.