admission: epoch based LIFO to prevent throughput collapse #71882

Merged
merged 1 commit into from
Feb 16, 2022

Conversation

sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Oct 22, 2021

The epoch-LIFO scheme monitors the queueing delay for each (tenant, priority)
pair and switches between FIFO and LIFO queueing based on the maximum
observed delay. Lower percentile latency can be reduced under LIFO, at
the expense of increasing higher percentile latency. This behavior can
help when it is important to finish some transactions in a timely manner,
for scenarios which have external deadlines. Under FIFO, one could
experience throughput collapse in the presence of such deadlines and
an open loop workload, since when the first work item for a transaction
reaches the front of the queue, the transaction is close to exceeding
its deadline.
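
To make the switching criterion concrete, here is a minimal Go sketch of the idea described above; the type and field names, the per-interval bookkeeping, and the choice of threshold are illustrative assumptions, not the actual WorkQueue code:

package sketch

import "time"

type queueMode int

const (
	fifoMode queueMode = iota
	lifoMode
)

// delayTracker is a hypothetical per-(tenant, priority) tracker that
// switches the queueing mode based on the maximum observed queueing delay.
type delayTracker struct {
	maxDelay  time.Duration // max queueing delay seen in the current interval
	threshold time.Duration // delay beyond which FIFO is abandoned (assumed)
	mode      queueMode
}

// admitted records the queueing delay of a work item that was just granted.
func (t *delayTracker) admitted(queueDelay time.Duration) {
	if queueDelay > t.maxDelay {
		t.maxDelay = queueDelay
	}
}

// intervalDone is called periodically (say once per epoch) to pick the
// queueing mode for the next interval and reset the max.
func (t *delayTracker) intervalDone() {
	if t.maxDelay > t.threshold {
		t.mode = lifoMode // delays approach deadlines: prefer recent work
	} else {
		t.mode = fifoMode // delays are tolerable: preserve arrival order
	}
	t.maxDelay = 0
}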

The epoch aspect of this scheme relies on clock synchronization (which
we have in CockroachDB deployments) and the expectation that
transaction/query deadlines will be significantly higher than execution
time under low load. A standard LIFO scheme suffers from a severe problem
when a single user transaction can result in multiple units of lower-level
work that get distributed to many nodes, and work execution can result in
new work being submitted for admission: the later work for a transaction
may no longer be the latest seen by the system (since "latest" is defined
based on transaction start time), so will not be preferred. This means
LIFO would do some work items from each transaction and starve the
remaining work, so nothing would complete. This can be as bad as or worse
than FIFO, which at least prefers the same transactions until they are
complete (both FIFO and LIFO are using the transaction start time, and
not the individual work arrival time).

Consider a case where transaction deadlines are 1s (note this may not
necessarily be an actual deadline, and could be a time duration after which
the user impact is extremely negative), and typical transaction execution
times (under low load) of 10ms. A 100ms epoch will increase transaction
latency to at most 100ms + 5ms + 10ms, since execution will not start until
the epoch of the transaction's start time is closed (5ms is the grace
period before we "close" an epoch). At that time, due to clock
synchronization, all nodes will start executing that epoch and will
implicitly have the same set of competing transactions, which are ordered
in the same manner. This set of competing transactions will stay unchanged
until the next epoch close. And by the time the next epoch closes and
the current epoch's transactions are deprioritized, 100ms will have
elapsed, which is enough time for most of these transactions that got
admitted to have finished all their work. The clock synchronization
expected here is stronger than the default 500ms value of --max-offset,
but that value is deliberately set to be extremely conservative to avoid
stale reads, while the use here has no effect on correctness.
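
To make the arithmetic in this example concrete, here is a small Go sketch of mapping a transaction's start time to an epoch and computing when that epoch closes. The 100ms epoch and 5ms grace period are taken from the example above; the function names are illustrative, not the actual work_queue.go identifiers:

package sketch

import "time"

const (
	epochLength = 100 * time.Millisecond // epoch length from the example
	epochGrace  = 5 * time.Millisecond   // grace period before an epoch is "closed"
)

// epochForTime maps a transaction start (creation) time to its epoch index.
func epochForTime(createTime time.Time) int64 {
	return createTime.UnixNano() / int64(epochLength)
}

// epochCloseTime is the wall-clock time at which the given epoch is closed
// and its transactions become eligible to run under epoch-LIFO: the end of
// the epoch plus the grace period. A transaction starting at the very
// beginning of an epoch therefore waits at most 100ms + 5ms before its
// ~10ms of execution, matching the bound in the text.
func epochCloseTime(epoch int64) time.Time {
	return time.Unix(0, (epoch+1)*int64(epochLength)).Add(epochGrace)
}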

Note that LIFO queueing will only happen at bottleneck nodes, and is decided
on a (tenant, priority) basis. So if there is even a single bottleneck node
for a (tenant, priority), the above delay will occur. When the epoch closes
at the bottleneck node, the creation time for this transaction will be
sufficiently in the past, so the non-bottleneck nodes (using FIFO) will
prioritize it over recent transactions. There is a queue ordering
inversion in that the non-bottleneck nodes are ordering in the opposite
way for such closed epochs, but since they are not bottlenecked, the
queueing delay should be minimal.

Preliminary experiments with kv50/enc=false/nodes=1/conc=8192 are
promising in reducing p50 and p75 latency. See attached screenshots
showing the latency change when admission.epoch_lifo.enabled is set
to true.
[Screenshots: latency graphs before and after enabling admission.epoch_lifo.enabled]

Release note (ops change): The admission.epoch_lifo.enabled cluster
setting, disabled by default, enables the use of epoch-LIFO adaptive
queueing behavior in admission control.

@cockroach-teamcity
Member

This change is Reviewable

@sumeerbhola sumeerbhola force-pushed the lifo branch 3 times, most recently from cca9c09 to f4a7ba3 Compare October 26, 2021 19:30
@sumeerbhola
Collaborator Author

made some improvements, though they need some cleaning up. The latency drop is more significant:
[Screenshots: latency graphs showing the larger latency drop]

@sumeerbhola sumeerbhola requested a review from a team as a code owner February 10, 2022 18:59
@sumeerbhola sumeerbhola force-pushed the lifo branch 2 times, most recently from 923a7a5 to eb37576 Compare February 11, 2022 16:46
@sumeerbhola sumeerbhola changed the title from "[DNM] admission: epoch based LIFO to prevent throughput collapse" to "admission: epoch based LIFO to prevent throughput collapse" on Feb 11, 2022
Collaborator Author

@sumeerbhola sumeerbhola left a comment


@ajwerner @RaduBerinde I have cleaned up this PR, and would like to get this reviewed and merged prior to the start of the stability period.

  • There are a few "TODO:" log statements in the code. These are temporary, for some experiments I will be running concurrent with the review. I'll remove them before merging.
  • There is a TODO to make epochLengthNanos and epochLengthDeltaNanos configurable using a cluster setting. I plan to do this in the following PR.
  • The new behavior is gated behind the admission.epoch_lifo.enabled cluster setting, which defaults to false, since enabling it requires an operator to understand the tradeoffs. I plan to leave the default as false until we have some real-world experience with this being enabled.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

@sumeerbhola
Collaborator Author

Latest before and after latencies from kv50/enc=false/nodes=1/conc=8192. The p50 and p75 latencies drop to less than 100ms and 200ms respectively (from 670ms and 1s).
[Screenshots: before and after latency graphs]

@sumeerbhola
Collaborator Author

A 3 node kv50 shows similar improvement.

I ran tpcc-overload/nodes=3/cpu=16/w=3800 -- the high warehouse count of 3800 was to force overload. This did not show any latency improvement at p50 when turning on epoch-LIFO. I am wondering whether this is because some of the txns are expensive enough that the 100ms epoch is not working well. When looking at a successful tpccbench run with 3136 warehouses (much lower than the 3800), the p50 for newOrder and delivery are > 240ms -- I'll need to understand how much of that latency is admission control itself, and how much is "real" work (including contention queueing).

@ajwerner
Contributor

@ajwerner @RaduBerinde I have cleaned up this PR, and would like to get this reviewed and merged prior to the start of the stability period.

Ack, I'll try to get to it later today or tomorrow.

the p50 for newOrder and delivery are > 240ms -- I'll need to understand how much of that latency is admission control itself, and how much is "real" work (including contention queueing).

One thing to note is that newOrder is something like 9 statements and delivery is 6. I wouldn't be surprised if they approach 100ms just during execution when there's very little load. I'd be interested to know what happens at a 500ms epoch

Contributor

@ajwerner ajwerner left a comment


I couldn't find anything other than superficial comments to leave. This is very cool. It does seem like the epoch length is an important parameter here. I know knobs come with risks, but it seems like providing one more knob here for that duration could be worth it.

I spent a little while puzzling over the tryCloseEpoch loop with the ticker, and eventually came to like it. I tried to analyze what happens in various heap operations in weird ordering scenarios, but didn't reach any sort of enlightenment.

// enabled, and we don't want to carry forward unnecessarily granular
// settings.
var EpochLIFOEnabled = settings.RegisterBoolSetting(
settings.TenantWritable,
Contributor


Should this be TenantReadOnly? Seems like the other settings here should also be changed to TenantReadOnly.

@@ -164,10 +178,15 @@ type WorkQueue struct {
tenantHeap tenantHeap
// All tenants, including those without waiting work. Periodically cleaned.
tenants map[uint64]*tenantInfo
// The highest epoch that is closed.
closedEpochThreshold int64
lastLogStatementWithThreshold time.Time
Contributor


do you know about util.Every? I think it's intended for precisely this pattern.
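
For context, the pattern being referred to is roughly the following; this is a self-contained sketch of a "do this at most once per interval" helper, not the actual util.Every API, whose names and signature may differ:

package sketch

import (
	"sync"
	"time"
)

// everyN allows an action at most once per interval; a stand-in for the
// kind of helper the comment refers to.
type everyN struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
}

func newEveryN(interval time.Duration) *everyN {
	return &everyN{interval: interval}
}

// shouldProcess reports whether enough time has elapsed since the last
// allowed action, and if so records now as the new last-allowed time.
func (e *everyN) shouldProcess(now time.Time) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if now.Sub(e.last) >= e.interval {
		e.last = now
		return true
	}
	return false
}

Such a helper would replace the ad hoc lastLogStatementWithThreshold timestamp comparison with a single shouldProcess check at the log site.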

}()
q.tryCloseEpoch()
if !opts.disableEpochClosingGoroutine {
go func() {
Contributor


nit: pull this loop out of this constructor into either a function or method? It closes over effectively nothing but q.

Comment on lines 276 to 277
done := false
for !done {
Contributor


nit: is this done dance better than just returning in the one place where you currently set done = true?

// tenant. However, currently we share metrics across WorkQueues --
// specifically all the store WorkQueues share the same metric. We
// should eliminate that sharing and make those per store metrics.
log.Infof(context.Background(), "%s: FIFO threshold for tenant %d %s %d",
Contributor


It might be nice for this to have annotated context. I'd suggest you construct one in makeWorkQueue. You have a log.AmbientContext which could be used to annotate such a context.Context just one call frame up from makeWorkQueue.

How rare is this? Other commentary indicates that this should not change often.

Comment on lines 792 to 799
ps.ps = append(ps.ps, state)
if i == n {
return
}
for j := n; j > i; j-- {
ps.ps[j] = ps.ps[j-1]
}
ps.ps[i] = state
Contributor


this dance seems like:

if i == n {
    ps.ps = append(ps.ps, state)
} else {
    ps.ps = append(ps.ps[:i+1], ps.ps[i:]...)
    ps.ps[i] = state
}
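
As a standalone illustration of why that works, here is the same idiom wrapped in a hypothetical insertAt helper (not the PR's code); Go's append/copy semantics handle the overlapping shift correctly:

package sketch

// insertAt inserts v into s at index i (0 <= i <= len(s)) using the
// append-then-overwrite idiom from the suggestion above.
func insertAt(s []int, i int, v int) []int {
	if i == len(s) {
		return append(s, v)
	}
	// append(s[:i+1], s[i:]...) grows the slice by one element and shifts
	// s[i:] one position to the right (the old s[i] is briefly duplicated
	// at indices i and i+1); the duplicate at index i is then overwritten.
	s = append(s[:i+1], s[i:]...)
	s[i] = v
	return s
}

For example, insertAt([]int{1, 2, 4}, 2, 3) returns [1 2 3 4].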

Comment on lines 738 to 767
// priorityStates tracks information about admission requests and admission
// grants at various priorities. It is used to set a priority threshold for
// LIFO queuing.
type priorityStates struct {
Contributor


Note that there's one of these per tenant?

@sumeerbhola sumeerbhola force-pushed the lifo branch 2 times, most recently from 4731069 to 29a8691 Compare February 16, 2022 03:28
Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

It does seem like the epoch length is an important parameter here. I know knobs come with risks, but it seems like providing one more knob here for that duration could be worth it.

Will definitely do that in a followup PR.

I spent a little while puzzling over the tryCloseEpoch loop with the ticker, and eventually came to like it.

As mentioned in the other comment, this ticker seems to cause problems for TestLogic. Will try to improve in a later PR, but it may not be critical since I didn't see any overhead when running the roachtests.

I tried to analyze what happens in various heap operations in weird ordering scenarios, but didn't reach any sort of enlightenment.

:-) That is indeed tricky. Other than that code comment about not having strict weak ordering, and one of the test cases, I haven't spent any real time thinking about it.

One thing to note is that newOrder is something like 9 statements and delivery is 6. I wouldn't be surprised if they approach 100ms just during execution when there's very little load. I'd be interested to know what happens at a 500ms epoch

Thanks. I'll test some more. Are you ok if it doesn't block merging this PR?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/util/admission/work_queue.go, line 74 at r4 (raw file):

Previously, ajwerner wrote…

Should this be TenantReadOnly? Seems like the other settings here should also be changed to TenantReadOnly.

Glad you asked. I was initially surprised by the TenantWritable nature of the existing settings, but had convinced myself that it was mostly correct. I've now added a comment earlier in the file reflecting my current mental model (also changed the KV setting to SystemOnly). Let me know if you think this is correct.


pkg/util/admission/work_queue.go, line 183 at r4 (raw file):

Previously, ajwerner wrote…

do you know about util.Every? I think it's intended for precisely this pattern.

Thanks for the pointer. Done


pkg/util/admission/work_queue.go, line 260 at r4 (raw file):

Previously, ajwerner wrote…

nit: pull this loop out of this constructor into either a function or method? It closes over effectively nothing but q.

Done


pkg/util/admission/work_queue.go, line 277 at r4 (raw file):

Previously, ajwerner wrote…

nit: is this done dance better than just returning in the one place where you currently set done = true?

Definitely not better. Simplified. Also made the change in the other goroutine that ticks every 1s.


pkg/util/admission/work_queue.go, line 349 at r4 (raw file):
Done.

Previously, ajwerner wrote…

How rare is this? Other commentary indicates that this should not change often.

I'm unsure. It is possible that it will be spammy (every 100ms), if we see fluctuations. I've seen some bursts of changes in TPCC. On the other hand, not seeing all the transitions is going to make it hard to understand how this is behaving in a production environment.
I wouldn't be surprised if we end up delaying the state transitions that reduce the FIFO threshold (i.e., stay in LIFO mode for a priority for longer, once we have transitioned there), which would also make this less spammy. I think it is a bit premature to make such tweaks.


pkg/util/admission/work_queue.go, line 741 at r4 (raw file):

Previously, ajwerner wrote…

Note that there's one of these per tenant?

Done


pkg/util/admission/work_queue.go, line 799 at r4 (raw file):

Previously, ajwerner wrote…

this dance seems like:

if i == n {
    ps.ps = append(ps.ps, state)
} else {
    ps.ps = append(ps.ps[:i+1], ps.ps[i:]...)
    ps.ps[i] = state
}

Ah yes, much better. Done.


pkg/util/admission/work_queue.go, line 272 at r5 (raw file):

			// will be doing 1ms ticks, which is fine since there are no idle
			// processors.
			tickerDurShort := time.Millisecond

I suspect this 1ms ticking is causing TestLogic to timeout. Maybe too much overhead when running lots of tiny tests, or something.
I'll explore a better solution, but meanwhile I've set it up such that a prerequisite to 1ms ticks is that epoch-LIFO is enabled. Either the timer facility in golang is not very efficient (I couldn't quite tell based on glancing at the code in https://go.dev/src/runtime/time.go) or the fact that each tick requires a goroutine to be scheduled (versus a single thread calling back for expiry of many timers) is causing too much overhead.
Maybe I'll move it to 1 timer per 1 GrantCoordinators, so that a CockroachDB node has only 1 of these (in a subsequent PR).

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Are you ok if it doesn't block merging this PR?

Absolutely.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @sumeerbhola)


pkg/util/admission/work_queue.go, line 74 at r4 (raw file):

Previously, sumeerbhola wrote…

Glad you asked. I was initially surprised by the TenantWritable nature of the existing settings, but had convinced myself that it was mostly correct. I've now added a comment earlier in the file reflecting my current mental model (also changed the KV setting to SystemOnly). Let me know if you think this is correct.

Yeah, it makes sense now.


pkg/util/admission/work_queue.go, line 272 at r5 (raw file):

Previously, sumeerbhola wrote…

I suspect this 1ms ticking is causing TestLogic to timeout. Maybe too much overhead when running lots of tiny tests, or something.
I'll explore a better solution, but meanwhile I've set it up such that a prerequisite to 1ms ticks is that epoch-LIFO is enabled. Either the timer facility in golang is not very efficient (I couldn't quite tell based on glancing at the code in https://go.dev/src/runtime/time.go) or the fact that each tick requires a goroutine to be scheduled (versus a single thread calling back for expiry of many timers) is causing too much overhead.
Maybe I'll move it to 1 timer per 1 GrantCoordinators, so that a CockroachDB node has only 1 of these (in a subsequent PR).

👍

If I'm being honest, when I first read it, I was thinking that instead of a ticker, you'd use a timer and just change how you schedule it. As I pondered more, I came to see the logic of this approach. You really want an execution that happens right around when the epoch ends. If there's fear that you might miss it due to bad scheduling/overload, then you may as well go high frequency to try to get a scheduling that hits right around when the epoch ends, and once you've hit it closely, you may as well give the runtime an opportunity to get it right again.

That being said, say you're 4ms late and then ask the runtime to tell you to run 100ms later; if it is just off by 1ms, you'll miss your goal again. It feels like a timer set to exactly the next epoch boundary (which you do know) would be more robust than a ticker that keeps firing 4ms past the epoch.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Again, thanks for the review -- I realize the long inactivity on my part was not ideal.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @sumeerbhola)


pkg/util/admission/work_queue.go, line 272 at r5 (raw file):

Previously, ajwerner wrote…

👍

If I'm being honest, when I first read it, I was thinking that instead of a ticker, you'd use a timer and just change how you schedule it. As I pondered more, I came to see the logic of this approach. You really want an execution that happens right around when the epoch ends. If there's fear that you might miss it due to bad scheduling/overload, then you may as well go high frequency to try to get a scheduling that hits right around when the epoch ends, and once you've hit it closely, you may as well give the runtime an opportunity to get it right again.

That being said, say you're 4ms late and then ask the runtime to tell you to run 100ms later; if it is just off by 1ms, you'll miss your goal again. It feels like a timer set to exactly the next epoch boundary (which you do know) would be more robust than a ticker that keeps firing 4ms past the epoch.

You have a good point. I have added a TODO to try with a Timer instead.
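
For reference, the Timer-based variant discussed above could look roughly like this; an illustrative sketch only (the actual epoch-closing loop, its stopper handling, and the 1ms-tick fallback live in work_queue.go and differ in detail):

package sketch

import "time"

// closeEpochsWithTimer re-arms a Timer for the exact next epoch-closing
// instant on every iteration, instead of ticking at a fixed frequency, so
// one late wakeup does not shift every subsequent wakeup past the boundary.
func closeEpochsWithTimer(
	stop <-chan struct{}, epochLength, grace time.Duration, tryCloseEpoch func(),
) {
	// next returns the first epoch-closing time (epoch boundary + grace)
	// strictly after now.
	next := func(now time.Time) time.Time {
		epoch := now.Add(-grace).UnixNano() / int64(epochLength)
		return time.Unix(0, (epoch+1)*int64(epochLength)).Add(grace)
	}
	timer := time.NewTimer(time.Until(next(time.Now())))
	defer timer.Stop()
	for {
		select {
		case <-timer.C:
			tryCloseEpoch()
			timer.Reset(time.Until(next(time.Now())))
		case <-stop:
			return
		}
	}
}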

@sumeerbhola
Collaborator Author

bors r=ajwerner

@craig
Contributor

craig bot commented Feb 16, 2022

Build succeeded:
