admission: epoch-based LIFO to prevent throughput collapse
The epoch-LIFO scheme monitors the queueing delay for each (tenant, priority)
pair and switches between FIFO and LIFO queueing based on the maximum
observed delay. LIFO can reduce lower-percentile latency at the expense of
increased higher-percentile latency. This tradeoff helps when it is
important to finish some transactions in a timely manner, in scenarios
with external deadlines. Under FIFO, one could see throughput collapse in
the presence of such deadlines and an open-loop workload, since by the
time the first work item for a transaction reaches the front of the
queue, the transaction is already close to exceeding its deadline. A
minimal Go sketch of the mode switch follows; the mode names and delay
thresholds are assumptions for illustration, not the constants used by
the actual implementation in pkg/util/admission.
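
package main

import (
	"fmt"
	"time"
)

type queueMode int

const (
	fifo queueMode = iota
	lifo
)

func (m queueMode) String() string {
	if m == fifo {
		return "FIFO"
	}
	return "LIFO"
}

// Assumed thresholds for this sketch; the real constants differ.
const (
	switchToLIFODelay = 105 * time.Millisecond
	returnToFIFODelay = 10 * time.Millisecond
)

// nextMode applies hysteresis to the per-(tenant, priority) queueing mode:
// switch to LIFO when the maximum observed queueing delay over the last
// interval is high, and return to FIFO only once it has dropped to a much
// lower level.
func nextMode(cur queueMode, maxDelay time.Duration) queueMode {
	switch cur {
	case fifo:
		if maxDelay > switchToLIFODelay {
			return lifo
		}
	case lifo:
		if maxDelay <= returnToFIFODelay {
			return fifo
		}
	}
	return cur
}

func main() {
	mode := fifo
	for _, d := range []time.Duration{
		20 * time.Millisecond,  // low delay: stay FIFO
		110 * time.Millisecond, // high delay: switch to LIFO
		50 * time.Millisecond,  // better, but not low enough: stay LIFO
		5 * time.Millisecond,   // low again: return to FIFO
	} {
		mode = nextMode(mode, d)
		fmt.Printf("maxDelay=%v -> %v\n", d, mode)
	}
}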

The epoch aspect of this scheme relies on clock synchronization (which
we have in CockroachDB deployments) and the expectation that
transaction/query deadlines will be significantly higher than execution
time under low load. A standard LIFO scheme suffers from a severe problem
when a single user transaction can result in multiple units of lower-level
work that get distributed to many nodes, and work execution can result in
new work being submitted for admission: the later work for a transaction
may no longer be the latest seen by the system (since "latest" is defined
based on transaction start time), and so will not be preferred. This means
LIFO would do some work items from each transaction and starve the
remaining work, so nothing would complete. This can be as bad as or worse
than FIFO, which at least prefers the same transactions until they are
complete (both FIFO and LIFO here use the transaction start time, not
the individual work arrival time). The small Go comparator below makes
this starvation concrete; the workItem type and its fields are invented
for the sketch and are not the real admission.WorkInfo.
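
package main

import (
	"fmt"
	"time"
)

// workItem is an invented type for this sketch.
type workItem struct {
	txnStart time.Time // transaction start time, shared by all work of the txn
	arrival  time.Time // when this particular work item was queued
}

// lifoPrefers reports whether plain LIFO would admit a before b: the
// transaction that started later always wins, regardless of how much work
// the older transaction has already completed.
func lifoPrefers(a, b workItem) bool {
	return a.txnStart.After(b.txnStart)
}

func main() {
	t0 := time.Now()
	// Follow-up work of an old transaction vs. the first work item of a
	// newer transaction.
	oldTxnFollowUp := workItem{txnStart: t0, arrival: t0.Add(30 * time.Millisecond)}
	newTxnFirst := workItem{txnStart: t0.Add(20 * time.Millisecond), arrival: t0.Add(25 * time.Millisecond)}
	// The old transaction's follow-up work loses even though it arrived
	// later, so under a steady stream of new transactions it starves and
	// the old transaction never completes.
	fmt.Println("new txn preferred:", lifoPrefers(newTxnFirst, oldTxnFollowUp)) // true
}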

Consider a case where transaction deadlines are 1s (this may not
necessarily be an actual deadline, but rather a duration after which the
user impact is extremely negative), and typical transaction execution
times (under low load) are 10ms. A 100ms epoch will increase transaction
latency to at most 100ms + 5ms + 10ms, since execution will not start until
the epoch of the transaction's start time is closed (5ms is the grace
period before we "close" an epoch). At that time, due to clock
synchronization, all nodes will start executing that epoch and will
implicitly have the same set of competing transactions, ordered in the
same manner. This set of competing transactions will stay unchanged
until the next epoch close. And by the time the next epoch closes and
the current epoch's transactions are deprioritized, 100ms will have
elapsed, which is enough time for most of the admitted transactions to
have finished all their work. The clock synchronization expected here is
stronger than the default 500ms value of --max-offset, but that value is
deliberately set to be extremely conservative to avoid stale reads, while
the use here has no effect on correctness. A sketch of this epoch
arithmetic follows, using the 100ms epoch length and 5ms grace period
from this example; the function names are illustrative.
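
package main

import (
	"fmt"
	"time"
)

const (
	epochLength  = 100 * time.Millisecond
	closingGrace = 5 * time.Millisecond
)

// epochIndex maps a transaction start time to its epoch number.
func epochIndex(t time.Time) int64 {
	return t.UnixNano() / int64(epochLength)
}

// epochClosed reports whether the epoch containing t is closed at time
// now: the epoch has ended and the grace period has elapsed. With clocks
// synchronized well within the grace period, all nodes agree on this and
// therefore see the same set of competing transactions, in the same order.
func epochClosed(t, now time.Time) bool {
	epochEnd := time.Unix(0, (epochIndex(t)+1)*int64(epochLength))
	return !now.Before(epochEnd.Add(closingGrace))
}

func main() {
	start := time.Now()
	fmt.Println("closed at start:", epochClosed(start, start)) // false
	// Worst case: a transaction starting right after an epoch boundary
	// waits almost the full epoch plus the grace period before its epoch
	// closes; add ~10ms of execution and you get the "at most
	// 100ms + 5ms + 10ms" bound above.
	fmt.Println("closed 105ms later:", epochClosed(start, start.Add(epochLength+closingGrace))) // true
}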

Note that LIFO queueing will only happen at bottleneck nodes, and is
decided on a per-(tenant, priority) basis. So if there is even a single
bottleneck node for a (tenant, priority), the above delay will occur. When
the epoch closes at the bottleneck node, the creation time of this
transaction will be sufficiently far in the past, so the non-bottleneck
nodes (using FIFO) will prioritize it over more recent transactions. There
is a queue ordering inversion, in that the non-bottleneck nodes order such
closed epochs the opposite way, but since they are not bottlenecked, the
queueing delay there should be minimal.

Preliminary experiments with kv50/enc=false/nodes=1/conc=8192 show
promising reductions in p50 and p75 latency.

Release note (ops change): The admission.epoch_lifo.enabled cluster
setting, disabled by default, enables epoch-LIFO adaptive queueing
behavior in admission control.
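For reference, the setting can be enabled with the usual cluster-setting
SQL, e.g.: SET CLUSTER SETTING admission.epoch_lifo.enabled = true;
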
sumeerbhola committed Feb 11, 2022
1 parent d10188f commit 7f4bf73
Showing 11 changed files with 1,226 additions and 96 deletions.
1 change: 1 addition & 0 deletions docs/generated/settings/settings-for-tenants.txt
@@ -1,4 +1,5 @@
Setting Type Default Description
admission.epoch_lifo.enabled boolean false when true, epoch-LIFO behavior is enabled when there is significant delay in admission
admission.kv.enabled boolean true when true, work performed by the KV layer is subject to admission control
admission.sql_kv_response.enabled boolean true when true, work performed by the SQL layer when receiving a KV response is subject to admission control
admission.sql_sql_response.enabled boolean true when true, work performed by the SQL layer when receiving a DistSQL response is subject to admission control
1 change: 1 addition & 0 deletions docs/generated/settings/settings.html
@@ -1,6 +1,7 @@
<table>
<thead><tr><th>Setting</th><th>Type</th><th>Default</th><th>Description</th></tr></thead>
<tbody>
<tr><td><code>admission.epoch_lifo.enabled</code></td><td>boolean</td><td><code>false</code></td><td>when true, epoch-LIFO behavior is enabled when there is significant delay in admission</td></tr>
<tr><td><code>admission.kv.enabled</code></td><td>boolean</td><td><code>true</code></td><td>when true, work performed by the KV layer is subject to admission control</td></tr>
<tr><td><code>admission.sql_kv_response.enabled</code></td><td>boolean</td><td><code>true</code></td><td>when true, work performed by the SQL layer when receiving a KV response is subject to admission control</td></tr>
<tr><td><code>admission.sql_sql_response.enabled</code></td><td>boolean</td><td><code>true</code></td><td>when true, work performed by the SQL layer when receiving a DistSQL response is subject to admission control</td></tr>
1 change: 1 addition & 0 deletions pkg/kv/kvclient/kvstreamer/BUILD.bazel
@@ -19,6 +19,7 @@ go_library(
"//pkg/settings/cluster",
"//pkg/util/admission",
"//pkg/util/buildutil",
"//pkg/util/log",
"//pkg/util/mon",
"//pkg/util/quotapool",
"//pkg/util/stop",
4 changes: 4 additions & 0 deletions pkg/kv/kvclient/kvstreamer/streamer.go
@@ -26,6 +26,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/settings/cluster"
"github.com/cockroachdb/cockroach/pkg/util/admission"
"github.com/cockroachdb/cockroach/pkg/util/buildutil"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/mon"
"github.com/cockroachdb/cockroach/pkg/util/quotapool"
"github.com/cockroachdb/cockroach/pkg/util/stop"
@@ -1101,6 +1102,9 @@ func (w *workerCoordinator) performRequestAsync(
}

// Do admission control after we've finalized the memory accounting.
// TODO: remove
log.Infof(ctx, "sql-kv-response (streamer): br %t, read admission q %t",
br != nil, w.responseAdmissionQ != nil)
if br != nil && w.responseAdmissionQ != nil {
responseAdmission := admission.WorkInfo{
TenantID: roachpb.SystemTenantID,
3 changes: 3 additions & 0 deletions pkg/sql/row/kv_batch_fetcher.go
@@ -381,6 +381,9 @@ func (f *txnKVFetcher) fetch(ctx context.Context) error {
f.batchResponseAccountedFor = returnedBytes
}
// Do admission control after we've accounted for the response bytes.
// TODO: remove
log.Infof(ctx, "sql-kv-response: br %t, read admission q %t",
br != nil, f.responseAdmissionQ != nil)
if br != nil && f.responseAdmissionQ != nil {
responseAdmission := admission.WorkInfo{
TenantID: roachpb.SystemTenantID,
2 changes: 2 additions & 0 deletions pkg/sql/tablewriter.go
@@ -235,6 +235,8 @@ func (tb *tableWriterBase) tryDoResponseAdmission(ctx context.Context) error {
// Do admission control for response processing. This is the shared write
// path for most SQL mutations.
responseAdmissionQ := tb.txn.DB().SQLKVResponseAdmissionQ
// TODO: remove
log.Infof(ctx, "sql-kv-response: write admission q %t", responseAdmissionQ != nil)
if responseAdmissionQ != nil {
requestAdmissionHeader := tb.txn.AdmissionHeader()
responseAdmission := admission.WorkInfo{
1 change: 1 addition & 0 deletions pkg/util/admission/BUILD.bazel
@@ -39,6 +39,7 @@ go_test(
"//pkg/util/leaktest",
"//pkg/util/log",
"//pkg/util/syncutil",
"//pkg/util/timeutil",
"@com_github_cockroachdb_datadriven//:datadriven",
"@com_github_cockroachdb_pebble//:pebble",
"@com_github_stretchr_testify//require",
231 changes: 231 additions & 0 deletions pkg/util/admission/testdata/priority_states
@@ -0,0 +1,231 @@
init
----

# One request at priority=-128 sees high latency. Requests at priority 0, 127
# do not see high latency. So FIFO priority is set >= -127.
request-received priority=127
----
lowest-priority: 127

update priority=127 delay-millis=10
----
lowest-priority: 127 (pri: 127, delay-millis: 10, admitted: 1)

request-received priority=-128
----
lowest-priority: -128 (pri: 127, delay-millis: 10, admitted: 1)

update priority=-128 delay-millis=106
----
lowest-priority: -128 (pri: -128, delay-millis: 106, admitted: 1) (pri: 127, delay-millis: 10, admitted: 1)

request-received priority=0
----
lowest-priority: -128 (pri: -128, delay-millis: 106, admitted: 1) (pri: 127, delay-millis: 10, admitted: 1)

update priority=0 delay-millis=20
----
lowest-priority: -128 (pri: -128, delay-millis: 106, admitted: 1) (pri: 0, delay-millis: 20, admitted: 1) (pri: 127, delay-millis: 10, admitted: 1)

get-threshold
----
threshold: -127

# The latency seen by priority=-128 decreases but not below the threshold
# needed to return to FIFO. So FIFO priority continues to be >= -127.
request-received priority=-128
----
lowest-priority: -128

update priority=-128 delay-millis=11
----
lowest-priority: -128 (pri: -128, delay-millis: 11, admitted: 1)

get-threshold
----
threshold: -127

# The latency seen by priority=-128 is low enough to return to FIFO.
request-received priority=-128
----
lowest-priority: -128

update priority=-128 delay-millis=10
----
lowest-priority: -128 (pri: -128, delay-millis: 10, admitted: 1)

get-threshold
----
threshold: -128

# Priority=127 sees high latency. FIFO priority is now >= 128.
request-received priority=127
----
lowest-priority: 127

update priority=127 delay-millis=106
----
lowest-priority: 127 (pri: 127, delay-millis: 106, admitted: 1)

get-threshold
----
threshold: 128

# Both priority 24 and 127 see high latency. FIFO priority stays at >=128.
request-received priority=127
----
lowest-priority: 127

update priority=127 delay-millis=106
----
lowest-priority: 127 (pri: 127, delay-millis: 106, admitted: 1)

request-received priority=24
----
lowest-priority: 24 (pri: 127, delay-millis: 106, admitted: 1)

update priority=24 delay-millis=107
----
lowest-priority: 24 (pri: 24, delay-millis: 107, admitted: 1) (pri: 127, delay-millis: 106, admitted: 1)

get-threshold
----
threshold: 128

# Priority -5 and 20 see high latency. There are no requests at any other
# priority. The FIFO priority threshold reduces to >= 21.
request-received priority=20
----
lowest-priority: 20

update priority=20 delay-millis=111
----
lowest-priority: 20 (pri: 20, delay-millis: 111, admitted: 1)

request-received priority=-5
----
lowest-priority: -5 (pri: 20, delay-millis: 111, admitted: 1)

update priority=-5 delay-millis=110
----
lowest-priority: -5 (pri: -5, delay-millis: 110, admitted: 1) (pri: 20, delay-millis: 111, admitted: 1)

get-threshold
----
threshold: 21

# Priority 0 is LIFO and sees latency that is not low enough to return it to
# FIFO. The FIFO priority threshold reduces to >= 1.
request-received priority=0
----
lowest-priority: 0

update priority=0 delay-millis=11
----
lowest-priority: 0 (pri: 0, delay-millis: 11, admitted: 1)

get-threshold
----
threshold: 1

# Priority -128 is LIFO and sees latency that is not low enough to return it
# to FIFO. The FIFO priority threshold reduces to >= -127.
request-received priority=-128
----
lowest-priority: -128

update priority=-128 delay-millis=11
----
lowest-priority: -128 (pri: -128, delay-millis: 11, admitted: 1)

get-threshold
----
threshold: -127

# Priority -128 is LIFO and sees very low latency and switches back to FIFO.
request-received priority=-128
----
lowest-priority: -128

update priority=-128 delay-millis=9
----
lowest-priority: -128 (pri: -128, delay-millis: 9, admitted: 1)

get-threshold
----
threshold: -128

# Priority 0 is FIFO and sees a canceled request that does not meet the
# latency threshold to switch to LIFO. It stays as FIFO.
request-received priority=0
----
lowest-priority: 0

update priority=0 delay-millis=20 canceled=true
----
lowest-priority: 0 (pri: 0, delay-millis: 20, admitted: 0)

get-threshold
----
threshold: -128

# Priority 0 is FIFO and sees a canceled request with very high latency, so
# it switches to LIFO.
request-received priority=0
----
lowest-priority: 0

update priority=0 delay-millis=120 canceled=true
----
lowest-priority: 0 (pri: 0, delay-millis: 120, admitted: 0)

get-threshold
----
threshold: 1

# Priority 0 receives a request, but nothing exits admission control, so it
# stays as LIFO.
request-received priority=0
----
lowest-priority: 0

get-threshold
----
threshold: 1

# Priority 10 sees a request with low latency. Priority 0 has a request that
# does not exit admission control. Priority 0 stays as LIFO.
request-received priority=10
----
lowest-priority: 10

update priority=10 delay-millis=5
----
lowest-priority: 10 (pri: 10, delay-millis: 5, admitted: 1)

request-received priority=0
----
lowest-priority: 0 (pri: 10, delay-millis: 5, admitted: 1)

get-threshold
----
threshold: 1

# Priority -10 sees a request with low enough latency to switch back to FIFO.
# Priority 0 has a request that does not exit admission control. Because of
# the observation at priority=-10 we switch everything back to FIFO.
request-received priority=-10
----
lowest-priority: -10

update priority=-10 delay-millis=5
----
lowest-priority: -10 (pri: -10, delay-millis: 5, admitted: 1)

request-received priority=0
----
lowest-priority: -10 (pri: -10, delay-millis: 5, admitted: 1)

get-threshold
----
threshold: -128
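
Below is a simplified Go reconstruction of the threshold logic this test
exercises. The delay cutoffs are inferred from the expectations above, and
the real priorityStates code handles situations this sketch ignores
(canceled requests, priorities that saw requests but no delay updates).

package main

import (
	"fmt"
	"time"
)

const (
	switchToLIFODelay = 105 * time.Millisecond // above this, a priority turns LIFO
	returnToFIFODelay = 10 * time.Millisecond  // at or below this, it returns to FIFO
)

type observation struct {
	priority int8
	maxDelay time.Duration
	wasLIFO  bool // queueing mode in the previous interval
}

// fifoThreshold returns the lowest priority that still uses FIFO: one more
// than the highest priority judged LIFO this interval, or -128 when
// everything is FIFO. Everything below a starving priority is assumed to
// starve too, which is why a single LIFO priority drags all lower
// priorities with it.
func fifoThreshold(obs []observation) int {
	threshold := -128 // all FIFO
	for _, o := range obs {
		isLIFO := o.maxDelay > switchToLIFODelay ||
			(o.wasLIFO && o.maxDelay > returnToFIFODelay)
		if isLIFO && int(o.priority)+1 > threshold {
			threshold = int(o.priority) + 1
		}
	}
	return threshold
}

func main() {
	// Mirrors the first block of the test: only priority -128 is slow.
	fmt.Println(fifoThreshold([]observation{
		{priority: 127, maxDelay: 10 * time.Millisecond},
		{priority: -128, maxDelay: 106 * time.Millisecond},
		{priority: 0, maxDelay: 20 * time.Millisecond},
	})) // -127
	// Mirrors the "priority 20 and -5 see high latency" block.
	fmt.Println(fifoThreshold([]observation{
		{priority: 20, maxDelay: 111 * time.Millisecond},
		{priority: -5, maxDelay: 110 * time.Millisecond},
	})) // 21
}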
