admission: epoch based LIFO to prevent throughput collapse #71882
Conversation
@ajwerner @RaduBerinde I have cleaned up this PR, and would like to get this reviewed and merged prior to the start of the stability period.
- There are a few "TODO:" log statements in the code. These are temporary, for some experiments I will be running concurrent with the review. I'll remove them before merging.
- There is a TODO to make epochLengthNanos and epochLengthDeltaNanos configurable using a cluster setting. I plan to do this in the following PR.
- The new behavior is gated behind an admission.epoch_lifo.enabled cluster setting that defaults to false, since enabling this requires an operator to understand the tradeoffs. I plan to leave it false until we have some real-world experience with this being enabled.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
A 3-node kv50 run shows similar improvement. I ran …
Ack, I'll try to get to it later today or tomorrow.
One thing to note is that newOrder is something like 9 statements and delivery is 6. I wouldn't be surprised if they approach 100ms just during execution when there's very little load. I'd be interested to know what happens at a 500ms epoch.
I couldn't find anything other than superficial comments to leave. This is very cool. It does seem like the epoch length is an important parameter here. I know knobs come with risks, but it seems like providing one more knob here for that duration could be worth it.
I spent a little while puzzling over the tryCloseEpoch loop with the ticker, and eventually came to like it. I tried to analyze what happens in various heap operations in weird ordering scenarios, but didn't reach any sort of enlightenment.
// enabled, and we don't want to carry forward unnecessarily granular
// settings.
var EpochLIFOEnabled = settings.RegisterBoolSetting(
	settings.TenantWritable,
Should this be TenantReadOnly? Seems like the other settings here should also be changed to TenantReadOnly.
pkg/util/admission/work_queue.go
@@ -164,10 +178,15 @@ type WorkQueue struct {
	tenantHeap tenantHeap
	// All tenants, including those without waiting work. Periodically cleaned.
	tenants map[uint64]*tenantInfo
	// The highest epoch that is closed.
	closedEpochThreshold int64
	lastLogStatementWithThreshold time.Time
Do you know about util.Every? I think it's intended for precisely this pattern.
pkg/util/admission/work_queue.go
}()
q.tryCloseEpoch()
if !opts.disableEpochClosingGoroutine {
	go func() {
nit: pull this loop out of this constructor into either a function or method? It closes over effectively nothing but q.
pkg/util/admission/work_queue.go
done := false
for !done {
nit: is this done dance better than just returning in the one place where you currently set done = true?
pkg/util/admission/work_queue.go
// tenant. However, currently we share metrics across WorkQueues --
// specifically all the store WorkQueues share the same metric. We
// should eliminate that sharing and make those per store metrics.
log.Infof(context.Background(), "%s: FIFO threshold for tenant %d %s %d",
It might be nice for this to have an annotated context. I'd suggest you construct one in makeWorkQueue. You have a log.AmbientContext which could be used to annotate such a context.Context just one call frame up from makeWorkQueue.
How rare is this? Other commentary indicates that this should not change often.
pkg/util/admission/work_queue.go
ps.ps = append(ps.ps, state)
if i == n {
	return
}
for j := n; j > i; j-- {
	ps.ps[j] = ps.ps[j-1]
}
ps.ps[i] = state
this dance seems like:
if i == n {
	ps.ps = append(ps.ps, state)
} else {
	ps.ps = append(ps.ps[:i+1], ps.ps[i:]...)
	ps.ps[i] = state
}
pkg/util/admission/work_queue.go
// priorityStates tracks information about admission requests and admission
// grants at various priorities. It is used to set a priority threshold for
// LIFO queuing.
type priorityStates struct {
Note that there's one of these per tenant?
TFTR!
It does seem like the epoch length is an important parameter here. I know knobs come with risks, but it seems like providing one more knob here for that duration could be worth it.
Will definitely do that in a followup PR.
I spent a little while puzzling over the tryCloseEpoch loop with the ticker, and eventually came to like it.
As mentioned in the other comment, this ticker seems to cause problems for TestLogic. Will try to improve in a later PR, but it may not be critical since I didn't see any overhead when running the roachtests.
I tried to analyze what happens in various heap operations in weird ordering scenarios, but didn't reach any sort of enlightenment.
:-) That is indeed tricky. Other than that code comment about not having strict weak ordering, and one of the test cases, I haven't spent any real time thinking about it.
One thing to note is that newOrder is something like 9 statements and delivery is 6. I wouldn't be surprised if they approach 100ms just during execution when there's very little load. I'd be interested to know what happens at a 500ms epoch
Thanks. I'll test some more. Are you ok if it doesn't block merging this PR?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
pkg/util/admission/work_queue.go, line 74 at r4 (raw file):
Previously, ajwerner wrote…
Should this be TenantReadOnly? Seems like the other settings here should also be changed to TenantReadOnly.

Glad you asked. I was initially surprised by the TenantWritable nature of the existing settings, but had convinced myself that it was mostly correct. I've now added a comment earlier in the file reflecting my current mental model (also changed the KV setting to SystemOnly). Let me know if you think this is correct.
pkg/util/admission/work_queue.go, line 183 at r4 (raw file):
Previously, ajwerner wrote…
Do you know about util.Every? I think it's intended for precisely this pattern.
Thanks for the pointer. Done
pkg/util/admission/work_queue.go, line 260 at r4 (raw file):
Previously, ajwerner wrote…
nit: pull this loop out of this constructor into either a function or method? It closes over effectively nothing but q.
Done
pkg/util/admission/work_queue.go, line 277 at r4 (raw file):
Previously, ajwerner wrote…
nit: is this done dance better than just returning in the one place where you currently set done = true?
Definitely not better. Simplified. Also made the change in the other goroutine that ticks every 1s.
pkg/util/admission/work_queue.go, line 349 at r4 (raw file):
Done.
How rare is this? Other commentary indicates that this should not change often.
I'm unsure. It is possible that it will be spammy (every 100ms), if we see fluctuations. I've seen some bursts of changes in TPCC. On the other hand, not seeing all the transitions is going to make it hard to understand how this is behaving in a production environment.
I wouldn't be surprised if we end up delaying the state transitions that reduce the FIFO threshold (i.e., stay in LIFO mode for a priority for longer, once we have transitioned there), which would also make this less spammy. I think it is a bit premature to make such tweaks.
pkg/util/admission/work_queue.go, line 741 at r4 (raw file):
Previously, ajwerner wrote…
Note that there's one of these per tenant?
Done
pkg/util/admission/work_queue.go, line 799 at r4 (raw file):
Previously, ajwerner wrote…
this dance seems like:
if i == n {
	ps.ps = append(ps.ps, state)
} else {
	ps.ps = append(ps.ps[:i+1], ps.ps[i:]...)
	ps.ps[i] = state
}
Ah yes, much better. Done.
pkg/util/admission/work_queue.go, line 272 at r5 (raw file):
// will be doing 1ms ticks, which is fine since there are no idle
// processors.
tickerDurShort := time.Millisecond
I suspect this 1ms ticking is causing TestLogic to timeout. Maybe too much overhead when running lots of tiny tests, or something.
I'll explore a better solution, but meanwhile I've set it up such that a prerequisite to 1ms ticks is that epoch-LIFO is enabled. Either the timer facility in golang is not very efficient (I couldn't quite tell based on glancing at the code in https://go.dev/src/runtime/time.go) or the fact that each tick requires a goroutine to be scheduled (versus a single thread calling back for expiry of many timers) is causing too much overhead.
Maybe I'll move it to 1 timer per GrantCoordinators, so that a CockroachDB node has only 1 of these (in a subsequent PR).
Are you ok if it doesn't block merging this PR?
Absolutely.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @sumeerbhola)
pkg/util/admission/work_queue.go, line 74 at r4 (raw file):
Previously, sumeerbhola wrote…
Glad you asked. I was initially surprised by the TenantWritable nature of the existing settings, but had convinced myself that it was mostly correct. I've now added a comment earlier in the file reflecting my current mental model (also changed the KV setting to SystemOnly). Let me know if you think this is correct.
Yeah, it makes sense now.
pkg/util/admission/work_queue.go, line 272 at r5 (raw file):
Previously, sumeerbhola wrote…
I suspect this 1ms ticking is causing TestLogic to timeout. Maybe too much overhead when running lots of tiny tests, or something.
I'll explore a better solution, but meanwhile I've set it up such that a prerequisite to 1ms ticks is that epoch-LIFO is enabled. Either the timer facility in golang is not very efficient (I couldn't quite tell based on glancing at the code in https://go.dev/src/runtime/time.go) or the fact that each tick requires a goroutine to be scheduled (versus a single thread calling back for expiry of many timers) is causing too much overhead.
Maybe I'll move it to 1 timer per GrantCoordinators, so that a CockroachDB node has only 1 of these (in a subsequent PR).
👍
If I'm being honest, when I first read it, I was thinking that instead of a ticker, you'd use a timer and just change how you schedule it. As I pondered more, I came to see the logic of this approach. You really want an execution that happens right around when the epoch ends. If there's fear that you might miss it due to bad scheduling/overload, then you may as well go high frequency to try to get a scheduling that hits right around when the epoch ends; and once you're hitting it closely, you may as well give the runtime an opportunity to get it right again.
That being said, say you're 4ms late, then you ask the runtime to tell you to run 100ms later; it feels like if it is just off by 1ms, then you'll miss your goal. It feels like with a timer set to exactly the next epoch boundary (which you do know), you could be more robust than having set the ticker to run 4ms past the epoch.
Again, thanks for the review -- I realize the long inactivity on my part was non-ideal.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @sumeerbhola)
pkg/util/admission/work_queue.go, line 272 at r5 (raw file):
Previously, ajwerner wrote…
👍
If I'm being honest, when I first read it, I was thinking that instead of a ticker, you'd use a timer and just change how you schedule it. As I pondered more, I came to see the logic of this approach. You really want an execution that happens right around when the epoch ends. If there's fear that you might miss it due to bad scheduling/overload, then you may as well go high frequency to try to get a scheduling that hits right around when the epoch ends; and once you're hitting it closely, you may as well give the runtime an opportunity to get it right again.
That being said, say you're 4ms late, then you ask the runtime to tell you to run 100ms later; it feels like if it is just off by 1ms, then you'll miss your goal. It feels like with a timer set to exactly the next epoch boundary (which you do know), you could be more robust than having set the ticker to run 4ms past the epoch.
You have a good point. I have added a TODO to try with a Timer instead.
bors r=ajwerner
Build succeeded.
The epoch-LIFO scheme monitors the queueing delay for each (tenant, priority)
pair and switches between FIFO and LIFO queueing based on the maximum
observed delay. Lower percentile latency can be reduced under LIFO, at
the expense of increasing higher percentile latency. This behavior can
help when it is important to finish some transactions in a timely manner,
for scenarios which have external deadlines. Under FIFO, one could
experience throughput collapse in the presence of such deadlines and
an open loop workload, since when the first work item for a transaction
reaches the front of the queue, the transaction is close to exceeding
its deadline.
The epoch aspect of this scheme relies on clock synchronization (which
we have in CockroachDB deployments) and the expectation that
transaction/query deadlines will be significantly higher than execution
time under low load. A standard LIFO scheme suffers from a severe problem
when a single user transaction can result in multiple units of lower-level
work that get distributed to many nodes, and work execution can result in
new work being submitted for admission: the later work for a transaction
may no longer be the latest seen by the system (since "latest" is defined
based on transaction start time), so will not be preferred. This means
LIFO would do some work items from each transaction and starve the
remaining work, so nothing would complete. This can be as bad or worse
than FIFO which at least prefers the same transactions until they are
complete (both FIFO and LIFO are using the transaction start time, and
not the individual work arrival time).
Consider a case where transaction deadlines are 1s (note this may not
necessarily be an actual deadline, and could be a time duration after which
the user impact is extremely negative), and typical transaction execution
times (under low load) of 10ms. A 100ms epoch will increase transaction
latency to at most 100ms + 5ms + 10ms, since execution will not start until
the epoch of the transaction's start time is closed (5ms is the grace
period before we "close" an epoch). At that time, due to clock
synchronization, all nodes will start executing that epoch and will
implicitly have the same set of competing transactions, which are ordered
in the same manner. This set of competing transactions will stay unchanged
until the next epoch close. And by the time the next epoch closes and
the current epoch's transactions are deprioritized, 100ms will have
elapsed, which is enough time for most of these transactions that got
admitted to have finished all their work. The clock synchronization
expected here is stronger than the default 500ms value of --max-offset,
but that value is deliberately set to be extremely conservative to avoid
stale reads, while the use here has no effect on correctness.
Note that LIFO queueing will only happen at bottleneck nodes, and decided
on a (tenant, priority) basis. So if there is even a single bottleneck node
for a (tenant, priority), the above delay will occur. When the epoch closes
at the bottleneck node, the creation time for this transaction will be
sufficiently in the past, so the non-bottleneck nodes (using FIFO) will
prioritize it over recent transactions. There is a queue ordering
inversion in that the non-bottleneck nodes are ordering in the opposite
way for such closed epochs, but since they are not bottlenecked, the
queueing delay should be minimal.
Preliminary experiments with kv50/enc=false/nodes=1/conc=8192 are
promising in reducing p50 and p75 latency. See attached screenshots
showing the latency change when admission.epoch_lifo.enabled is set
to true.
Release note (ops change): The admission.epoch_lifo.enabled cluster
setting, disabled by default, enables the use of epoch-LIFO adaptive
queueing behavior in admission control.