admission: epoch based LIFO to prevent throughput collapse #71882
Conversation
@ajwerner @RaduBerinde I have cleaned up this PR, and would like to get this reviewed and merged prior to the start of the stability period.
- There are a few "TODO:" log statements in the code. These are temporary, for some experiments I will be running concurrent with the review. I'll remove them before merging.
- There is a TODO to make epochLengthNanos and epochLengthDeltaNanos configurable using a cluster setting. I plan to do this in the following PR.
- The new behavior is gated behind an admission.epoch_lifo.enabled cluster setting that defaults to false, since enabling this requires an operator to understand the tradeoffs. I plan to leave it false until we have some real-world experience with this being enabled.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
A 3-node kv50 run shows similar improvement. I ran …
Ack, I'll try to get to it later today or tomorrow.
One thing to note is that newOrder is something like 9 statements and delivery is 6. I wouldn't be surprised if they approach 100ms just during execution when there's very little load. I'd be interested to know what happens at a 500ms epoch.
I couldn't find anything other than superficial comments to leave. This is very cool. It does seem like the epoch length is an important parameter here. I know knobs come with risks, but it seems like providing one more knob here for that duration could be worth it.
I spent a little while puzzling over the tryCloseEpoch loop with the ticker, and eventually came to like it. I tried to analyze what happens in various heap operations in weird ordering scenarios, but didn't reach any sort of enlightenment.
// enabled, and we don't want to carry forward unnecessarily granular
// settings.
var EpochLIFOEnabled = settings.RegisterBoolSetting(
	settings.TenantWritable,
Should this be TenantReadOnly? Seems like the other settings here should also be changed to TenantReadOnly.
pkg/util/admission/work_queue.go
@@ -164,10 +178,15 @@ type WorkQueue struct {
	tenantHeap tenantHeap
	// All tenants, including those without waiting work. Periodically cleaned.
	tenants map[uint64]*tenantInfo
	// The highest epoch that is closed.
	closedEpochThreshold int64
	lastLogStatementWithThreshold time.Time
Do you know about util.Every? I think it's intended for precisely this pattern.
pkg/util/admission/work_queue.go
}()
q.tryCloseEpoch()
if !opts.disableEpochClosingGoroutine {
	go func() {
nit: pull this loop out of this constructor into either a function or method? It closes over effectively nothing but q.
pkg/util/admission/work_queue.go
done := false
for !done {
nit: is this done dance better than just returning in the one place where you currently set done = true?
pkg/util/admission/work_queue.go
// tenant. However, currently we share metrics across WorkQueues --
// specifically all the store WorkQueues share the same metric. We
// should eliminate that sharing and make those per store metrics.
log.Infof(context.Background(), "%s: FIFO threshold for tenant %d %s %d",
It might be nice for this to have an annotated context. I'd suggest you construct one in makeWorkQueue. You have a log.AmbientContext which could be used to annotate such a context.Context just one call frame up from makeWorkQueue.
How rare is this? Other commentary indicates that this should not change often.
pkg/util/admission/work_queue.go
ps.ps = append(ps.ps, state)
if i == n {
	return
}
for j := n; j > i; j-- {
	ps.ps[j] = ps.ps[j-1]
}
ps.ps[i] = state
this dance seems like:
if i == n {
	ps.ps = append(ps.ps, state)
} else {
	ps.ps = append(ps.ps[:i+1], ps.ps[i:]...)
	ps.ps[i] = state
}
pkg/util/admission/work_queue.go
// priorityStates tracks information about admission requests and admission
// grants at various priorities. It is used to set a priority threshold for
// LIFO queuing.
type priorityStates struct {
Note that there's one of these per tenant?
TFTR!
It does seem like the epoch length is an important parameter here. I know knobs come with risks, but it seems like providing one more knob here for that duration could be worth it.
Will definitely do that in a followup PR.
I spent a little while puzzling over the tryCloseEpoch loop with the ticker, and eventually came to like it.
As mentioned in the other comment, this ticker seems to cause problems for TestLogic. Will try to improve in a later PR, but it may not be critical since I didn't see any overhead when running the roachtests.
I tried to analyze what happens in various heap operations in weird ordering scenarios, but didn't reach any sort of enlightenment.
:-) That is indeed tricky. Other than that code comment about not having strict weak ordering, and one of the test cases, I haven't spent any real time thinking about it.
One thing to note is that newOrder is something like 9 statements and delivery is 6. I wouldn't be surprised if they approach 100ms just during execution when there's very little load. I'd be interested to know what happens at a 500ms epoch
Thanks. I'll test some more. Are you ok if it doesn't block merging this PR?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
pkg/util/admission/work_queue.go, line 74 at r4 (raw file):
Previously, ajwerner wrote…
Should this be TenantReadOnly? Seems like the other settings here should also be changed to TenantReadOnly.

Glad you asked. I was initially surprised by the TenantWritable nature of the existing settings, but had convinced myself that it was mostly correct. I've now added a comment earlier in the file reflecting my current mental model (also changed the KV setting to SystemOnly). Let me know if you think this is correct.
pkg/util/admission/work_queue.go, line 183 at r4 (raw file):
Previously, ajwerner wrote…
Do you know about util.Every? I think it's intended for precisely this pattern.
Thanks for the pointer. Done
pkg/util/admission/work_queue.go, line 260 at r4 (raw file):
Previously, ajwerner wrote…
nit: pull this loop out of this constructor into either a function or method? It closes over effectively nothing but q.
Done
pkg/util/admission/work_queue.go, line 277 at r4 (raw file):
Previously, ajwerner wrote…
nit: is this done dance better than just returning in the one place where you currently set done = true?
Definitely not better. Simplified. Also made the change in the other goroutine that ticks every 1s.
pkg/util/admission/work_queue.go, line 349 at r4 (raw file):
Done.
How rare is this? Other commentary indicates that this should not change often.
I'm unsure. It is possible that it will be spammy (every 100ms), if we see fluctuations. I've seen some bursts of changes in TPCC. On the other hand, not seeing all the transitions is going to make it hard to understand how this is behaving in a production environment.
I wouldn't be surprised if we end up delaying the state transitions that reduce the FIFO threshold (i.e., stay in LIFO mode for a priority for longer, once we have transitioned there), which would also make this less spammy. I think it is a bit premature to make such tweaks.
pkg/util/admission/work_queue.go, line 741 at r4 (raw file):
Previously, ajwerner wrote…
Note that there's one of these per tenant?
Done
pkg/util/admission/work_queue.go, line 799 at r4 (raw file):
Previously, ajwerner wrote…
this dance seems like:
if i == n {
	ps.ps = append(ps.ps, state)
} else {
	ps.ps = append(ps.ps[:i+1], ps.ps[i:]...)
	ps.ps[i] = state
}
Ah yes, much better. Done.
pkg/util/admission/work_queue.go, line 272 at r5 (raw file):
// will be doing 1ms ticks, which is fine since there are no idle
// processors.
tickerDurShort := time.Millisecond
I suspect this 1ms ticking is causing TestLogic to timeout. Maybe too much overhead when running lots of tiny tests, or something.
I'll explore a better solution, but meanwhile I've set it up such that a prerequisite to 1ms ticks is that epoch-LIFO is enabled. Either the timer facility in golang is not very efficient (I couldn't quite tell based on glancing at the code in https://go.dev/src/runtime/time.go) or the fact that each tick requires a goroutine to be scheduled (versus a single thread calling back for expiry of many timers) is causing too much overhead.
Maybe I'll move it to 1 timer per GrantCoordinators, so that a CockroachDB node has only 1 of these (in a subsequent PR).
Are you ok if it doesn't block merging this PR?
Absolutely.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @sumeerbhola)
pkg/util/admission/work_queue.go, line 74 at r4 (raw file):
Previously, sumeerbhola wrote…
Glad you asked. I was initially surprised by the TenantWritable nature of the existing settings, but had convinced myself that it was mostly correct. I've now added a comment earlier in the file reflecting my current mental model (also changed the KV setting to SystemOnly). Let me know if you think this is correct.
Yeah, it makes sense now.
pkg/util/admission/work_queue.go, line 272 at r5 (raw file):
Previously, sumeerbhola wrote…
I suspect this 1ms ticking is causing TestLogic to timeout. Maybe too much overhead when running lots of tiny tests, or something.
I'll explore a better solution, but meanwhile I've set it up such that a prerequisite to 1ms ticks is that epoch-LIFO is enabled. Either the timer facility in golang is not very efficient (I couldn't quite tell based on glancing at the code in https://go.dev/src/runtime/time.go) or the fact that each tick requires a goroutine to be scheduled (versus a single thread calling back for expiry of many timers) is causing too much overhead.
Maybe I'll move it to 1 timer per GrantCoordinators, so that a CockroachDB node has only 1 of these (in a subsequent PR).
👍
If I'm being honest, when I first read it, I was thinking that instead of a ticker, you'd use a timer and just change how you schedule it. As I pondered more, I came to see the logic of this approach. You really want an execution that happens right around when the epoch ends. If there's fear that you might miss it due to bad scheduling/overload, then you may as well go high frequency to try to get a scheduling that hits right around when the epoch ends; and once you're hitting it closely, you may as well give the runtime an opportunity to get it right again.
That being said, say you're 4ms late, then you ask the runtime to tell you to run 100ms later; it feels like if it is just off by 1ms, then you'll miss your goal. It feels like with a timer set to exactly the next epoch boundary (which you do know), you could be more robust than having set the ticker to run 4ms past the epoch.
Again, thanks for the review -- I realize the long inactivity on my part was non-ideal.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @sumeerbhola)
pkg/util/admission/work_queue.go, line 272 at r5 (raw file):
Previously, ajwerner wrote…
👍
If I'm being honest, when I first read it, I was thinking that instead of a ticker, you'd use a timer and just change how you schedule it. As I pondered more, I came to see the logic of this approach. You really want an execution that happens right around when the epoch ends. If there's fear that you might miss it due to bad scheduling/overload, then you may as well go high frequency to try to get a scheduling that hits right around when the epoch ends; and once you're hitting it closely, you may as well give the runtime an opportunity to get it right again.
That being said, say you're 4ms late, then you ask the runtime to tell you to run 100ms later; it feels like if it is just off by 1ms, then you'll miss your goal. It feels like with a timer set to exactly the next epoch boundary (which you do know), you could be more robust than having set the ticker to run 4ms past the epoch.
You have a good point. I have added a TODO to try with a Timer instead.
bors r=ajwerner
Build succeeded.
The epoch-LIFO scheme monitors the queueing delay for each (tenant, priority)
pair and switches between FIFO and LIFO queueing based on the maximum
observed delay. Lower percentile latency can be reduced under LIFO, at
the expense of increasing higher percentile latency. This behavior can
help when it is important to finish some transactions in a timely manner,
for scenarios which have external deadlines. Under FIFO, one could
experience throughput collapse in the presence of such deadlines and
an open loop workload, since when the first work item for a transaction
reaches the front of the queue, the transaction is close to exceeding
its deadline.
The epoch aspect of this scheme relies on clock synchronization (which
we have in CockroachDB deployments) and the expectation that
transaction/query deadlines will be significantly higher than execution
time under low load. A standard LIFO scheme suffers from a severe problem
when a single user transaction can result in multiple units of lower-level
work that get distributed to many nodes, and work execution can result in
new work being submitted for admission: the later work for a transaction
may no longer be the latest seen by the system (since "latest" is defined
based on transaction start time), so will not be preferred. This means
LIFO would do some work items from each transaction and starve the
remaining work, so nothing would complete. This can be as bad or worse
than FIFO which at least prefers the same transactions until they are
complete (both FIFO and LIFO are using the transaction start time, and
not the individual work arrival time).
Consider a case where transaction deadlines are 1s (note this may not
necessarily be an actual deadline, and could be a time duration after which
the user impact is extremely negative), and typical transaction execution
times (under low load) of 10ms. A 100ms epoch will increase transaction
latency to at most 100ms + 5ms + 10ms, since execution will not start until
the epoch of the transaction's start time is closed (5ms is the grace
period before we "close" an epoch). At that time, due to clock
synchronization, all nodes will start executing that epoch and will
implicitly have the same set of competing transactions, which are ordered
in the same manner. This set of competing transactions will stay unchanged
until the next epoch close. And by the time the next epoch closes and
the current epoch's transactions are deprioritized, 100ms will have
elapsed, which is enough time for most of these transactions that got
admitted to have finished all their work. The clock synchronization
expected here is stronger than the default 500ms value of --max-offset,
but that value is deliberately set to be extremely conservative to avoid
stale reads, while the use here has no effect on correctness.
Note that LIFO queueing will only happen at bottleneck nodes, and decided
on a (tenant, priority) basis. So if there is even a single bottleneck node
for a (tenant, priority), the above delay will occur. When the epoch closes
at the bottleneck node, the creation time for this transaction will be
sufficiently in the past, so the non-bottleneck nodes (using FIFO) will
prioritize it over recent transactions. There is a queue ordering
inversion in that the non-bottleneck nodes are ordering in the opposite
way for such closed epochs, but since they are not bottlenecked, the
queueing delay should be minimal.
Preliminary experiments with kv50/enc=false/nodes=1/conc=8192 are
promising in reducing p50 and p75 latency. See attached screenshots
showing the latency change when admission.epoch_lifo.enabled is set
to true.
Release note (ops change): The admission.epoch_lifo.enabled cluster
setting, disabled by default, enables the use of epoch-LIFO adaptive
queueing behavior in admission control.