[WIP] admission: add support for disk bandwidth as a bottleneck resource #82813

Closed
wants to merge 3 commits into from


sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Jun 13, 2022

The first commit is from 82440

We assume that:

  • There is a provisioned known limit on the sum of read and write
    bandwidth. This limit is allowed to change.
  • Admission control can only shape the rate of admission of writes. Writes
    also cause reads, since compactions do reads and writes.

There are multiple challenges:

  • We are unable to precisely track the causes of disk read bandwidth, since
    we do not have observability into what reads missed the OS page cache.
    That is, we don't know how much of the reads were due to incoming reads
    (that we don't shape) and how much due to compaction read bandwidth.
  • We don't shape incoming reads.
  • There can be a large time lag between the shaping of incoming writes, and when
    it affects actual writes in the system, since compaction backlog can
    build up in various levels of the LSM store.
  • Signals of overload are coarse, since we cannot view all the internal
    queues that can build up due to resource overload. For instance,
    different examples of bandwidth saturation exhibit wildly different
    latency effects, presumably because the queue buildup is different. So it
    is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.

  • The disk load is abstracted using an enum. The diskLoadWatcher can be
    evolved independently.
  • The approach uses easy-to-understand additive increase and multiplicative
    decrease (unlike what we do for flush and compaction tokens, where we
    try to calculate the sustainable rates more precisely).
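
As a rough illustration of the two bullets above (not the code in this PR), the following sketch shows an enum-valued load level driving an additive-increase/multiplicative-decrease adjustment of an elastic token limit. The names `loadLevel` and `adjustElasticTokens`, the four level names, the 1 MiB step, and the halving factor are all assumptions chosen for the example.

```go
package main

import "fmt"

// loadLevel is a hypothetical enum abstracting disk load. The level names
// are assumptions made for this sketch.
type loadLevel int

const (
	loadLow loadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

// adjustElasticTokens applies additive increase / multiplicative decrease to
// the elastic token limit. The step size, floor, and decrease factor are
// illustrative values, not the ones used in the PR.
func adjustElasticTokens(limit int64, level loadLevel) int64 {
	const additiveStep = 1 << 20 // grow by 1 MiB per adjustment interval
	const minLimit = 1 << 20     // never shrink below 1 MiB
	switch level {
	case loadLow:
		limit += additiveStep
	case loadModerate:
		// Hold steady: stop probing for more bandwidth.
	case loadHigh, loadOverload:
		limit /= 2
	}
	if limit < minLimit {
		limit = minLimit
	}
	return limit
}

func main() {
	limit := int64(8 << 20)
	for _, l := range []loadLevel{loadLow, loadLow, loadModerate, loadOverload} {
		limit = adjustElasticTokens(limit, l)
		fmt.Println(limit)
	}
}
```

Because the adjustment only sees the enum, the logic that maps raw disk statistics to a level can change without touching the token logic, which is the abstraction boundary argued for above.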

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:

  • Incoming writes that are deemed "elastic": This can be done by
    introducing a work-class (in addition to admissionpb.WorkPriority), or by
    implying a work-class from the priority (e.g. priorities < NormalPri are
    deemed elastic). This prototype does the latter.
  • Optional compactions: We assume that the LSM store is configured with a
    ceiling on number of regular concurrent compactions, and if it needs more
    it can request resources for additional (optional) compactions. These
    latter compactions can be limited by this approach. See
    db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp pebble#1329 for motivation.
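
The "imply a work-class from the priority" option can be sketched as below. This is a stand-alone illustration: `workPriority` stands in for admissionpb.WorkPriority, and the numeric priority values and the `workClass` names are assumptions, not definitions taken from the PR.

```go
package main

import "fmt"

// workPriority stands in for admissionpb.WorkPriority. The numeric values
// below are assumptions for this sketch.
type workPriority int8

const (
	lowPri    workPriority = -128
	normalPri workPriority = 0
	highPri   workPriority = 127
)

// workClass is a hypothetical two-valued classification of work.
type workClass int8

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

// workClassFromPri treats every priority strictly below normalPri as elastic.
func workClassFromPri(pri workPriority) workClass {
	if pri < normalPri {
		return elasticWorkClass
	}
	return regularWorkClass
}

func main() {
	fmt.Println(workClassFromPri(lowPri) == elasticWorkClass)
	fmt.Println(workClassFromPri(normalPri) == regularWorkClass)
	fmt.Println(workClassFromPri(highPri) == regularWorkClass)
}
```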

The reader should start with disk_bandwidth.go, consisting of

  • diskLoadWatcher: which computes load levels.
  • compactionLimiter: which tracks all compaction slots and limits
    optional compactions.
  • diskBandwidthLimiter: It composes the previous two objects and
    uses load information to limit write tokens for elastic writes
    and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:

  • Previously the tokens were for L0 and now we need to support tokens for
    bytes into L0 and tokens for bytes into the LSM (the former being a subset
    of the latter).
  • Elastic work is in a different WorkQueue than regular work, but they
    are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we can request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, can tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. There was
also the (unrelated) realization that we can use the information
of the size of the batch in the call to AdmittedWorkDone and fix
estimation that we had to make pre-evaluation. This resulted in a
bunch of changes to how we do estimation to adjust the tokens consumed:
we now estimate how much we need to compensate what is being asked
for at (a) admission time, (b) work done time, for the bytes added
to the LSM, (c) work done time, for the bytes added to L0. Since we
are asking for tokens at admission time based on the full incoming
bytes, the estimation for what fraction of an ingest goes into L0 is
eliminated. This had the consequence of simplifying some of the
estimation logic that was distinguishing writes from ingests.
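
A minimal sketch of the accounting idea described above, assuming a hypothetical `tokenBuckets` type with one bucket for bytes into L0 and one for bytes into the LSM. It ignores the estimation and compensation logic entirely and is not the granter code in this PR: it only shows deducting the full request size from both buckets at admission and correcting both once the actual byte counts are known.

```go
package main

import "fmt"

// tokenBuckets is a hypothetical two-bucket accountant. Bytes into L0 are a
// subset of bytes into the LSM, so both buckets are charged at admission.
type tokenBuckets struct {
	l0Tokens  int64 // budget for bytes written into L0
	lsmTokens int64 // budget for bytes written anywhere in the LSM
}

// admit deducts the full incoming byte count from both buckets, since at
// admission time we do not know how many bytes will land in L0.
func (b *tokenBuckets) admit(requestBytes int64) {
	b.l0Tokens -= requestBytes
	b.lsmTokens -= requestBytes
}

// workDone corrects the admission-time deduction using the actual byte
// counts. Tokens are returned when the actual value is smaller than what was
// requested, and additional tokens are taken when it is larger.
func (b *tokenBuckets) workDone(requestBytes, actualBytes, actualBytesIntoL0 int64) {
	b.lsmTokens += requestBytes - actualBytes
	b.l0Tokens += requestBytes - actualBytesIntoL0
}

func main() {
	b := tokenBuckets{l0Tokens: 100, lsmTokens: 200}
	b.admit(40)
	fmt.Println(b.l0Tokens, b.lsmTokens)
	// The work wrote 50 bytes in total, of which only 10 went into L0.
	b.workDone(40, 50, 10)
	fmt.Println(b.l0Tokens, b.lsmTokens)
}
```

After `workDone`, the net charge is 10 tokens against L0 and 50 against the LSM, regardless of what was requested at admission time.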

There are no tests (and this change breaks existing tests), so this code is probably littered with bugs.

Next steps:

  • Unit tests
  • Pebble changes for IntervalCompactionInfo
  • CockroachDB changes for IntervalDiskLoadInfo
  • Experimental evaluation and tuning
  • Separate into multiple PRs for review
  • KV and storage package plumbing for properly populating
    StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and
    StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and
    ingestions.

Some experimental results with an artificially set provisioned bandwidth limit of 95 MiB/s and a kv0 workload with 4KB writes, all of which are considered elastic traffic. There were 4 runs: the first has no provisioned bandwidth limit and the subsequent ones iterate over heuristics. The last one is the latest code: it is tuned to not increase load once we have reached 70% of provisioned bandwidth.
Screen Shot 2022-07-12 at 2 48 35 PM

The challenge in doing better is the sharp transitions from < 0.7 fraction bandwidth utilization to > 0.95, due to the lag in compactions. For example:

I220712 18:17:34.770083 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 543  diskLoadWatcher: rb: 0 B, wb: 3.0 MiB, pb: 95 MiB, util: 0.03
I220712 18:17:49.770806 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 555  diskLoadWatcher: rb: 0 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:18:04.770748 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 566  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:18:19.770290 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 578  diskLoadWatcher: rb: 0 B, wb: 67 MiB, pb: 95 MiB, util: 0.70
I220712 18:18:34.770280 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 589  diskLoadWatcher: rb: 0 B, wb: 104 MiB, pb: 95 MiB, util: 1.10
I220712 18:18:49.769979 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 600  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:19:04.770342 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 612  diskLoadWatcher: rb: 0 B, wb: 17 MiB, pb: 95 MiB, util: 0.18
I220712 18:19:19.771061 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 623  diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:19:34.770318 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 636  diskLoadWatcher: rb: 0 B, wb: 96 MiB, pb: 95 MiB, util: 1.01
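
The `util` values in these log lines appear consistent with (rb + wb) / pb, matching the assumption that the provisioned limit applies to the sum of read and write bandwidth. The sketch below reproduces that arithmetic for three of the lines; the function name `utilization` is hypothetical, and because the logged MiB values are rounded, a recomputed value can differ from the logged one in the last digit for other lines.

```go
package main

import "fmt"

const mib = 1 << 20

// utilization returns the fraction of provisioned bandwidth consumed by
// reads plus writes. All arguments are in bytes per second.
func utilization(readBW, writeBW, provisionedBW int64) float64 {
	if provisionedBW <= 0 {
		return 0
	}
	return float64(readBW+writeBW) / float64(provisionedBW)
}

func main() {
	fmt.Printf("%.2f\n", utilization(0, 54*mib, 95*mib)) // prints 0.57
	fmt.Printf("%.2f\n", utilization(0, 53*mib, 95*mib)) // prints 0.56
	fmt.Printf("%.2f\n", utilization(0, 96*mib, 95*mib)) // prints 1.01
}
```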

Release note: None

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush results in write
stalls due to memtables, which create a latency hiccup for all write
traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate that the flush loop can operate at.
  This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread,
  and that is potentially too high to prevent write stalls (depending
  on how long it takes to do a single flush). So we multiply the
  smoothed peak rate by a utilization-target-fraction which is
  dynamically adjusted and by default is constrained to the interval
  [0.5, 1.5]. There is additive increase and decrease of this
  fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used
    if there are multiple write stalls, so that the probing falls
    more in the region where there are no write stalls.
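
The probing scheme in the list above can be sketched as follows. Only the [0.5, 1.5] bounds come from this description; the step size `delta`, the cap `maxStallMultiplier`, and the function names are assumptions made for illustration, and the actual constants are in the code comments referenced above.

```go
package main

import "fmt"

const (
	minFraction        = 0.5   // default lower bound from the description
	maxFraction        = 1.5   // default upper bound from the description
	delta              = 0.025 // assumed additive step
	maxStallMultiplier = 3     // assumed cap on the multi-stall multiplier
)

// adjustFlushUtilTarget nudges the utilization-target-fraction: down when
// write stalls were observed (scaled by a small, capped multiplier), up when
// tokens were heavily used without stalls, and otherwise leaves it alone.
func adjustFlushUtilTarget(fraction float64, writeStalls int, highTokenUsage bool) float64 {
	switch {
	case writeStalls > 0:
		mult := writeStalls
		if mult > maxStallMultiplier {
			mult = maxStallMultiplier
		}
		fraction -= delta * float64(mult)
	case highTokenUsage:
		fraction += delta
	}
	if fraction < minFraction {
		fraction = minFraction
	}
	if fraction > maxFraction {
		fraction = maxFraction
	}
	return fraction
}

// flushTokens scales the smoothed peak flush rate (bytes per interval) by
// the current target fraction.
func flushTokens(smoothedPeakBytes float64, fraction float64) int64 {
	return int64(smoothedPeakBytes * fraction)
}

func main() {
	f := 1.0
	f = adjustFlushUtilTarget(f, 0, true)  // no stalls, high usage: increase
	f = adjustFlushUtilTarget(f, 2, false) // two stalls: decrease by 2*delta
	fmt.Printf("%.3f %d\n", f, flushTokens(100<<20, f))
}
```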

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
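
To make the interval arithmetic concrete: with tokens computed every 15s and handed out every 250ms, each computation is spread over 60 ticks. The sketch below takes the minimum of the flush and compaction tokens and splits it across those ticks. The function name `tokensForTick` and the choice to give any remainder to the earliest ticks are assumptions for the example, not necessarily how the PR distributes tokens.

```go
package main

import (
	"fmt"
	"time"
)

const (
	adjustmentInterval = 15 * time.Second
	tickInterval       = 250 * time.Millisecond
	ticksPerInterval   = int64(adjustmentInterval / tickInterval) // 60
)

// tokensForTick returns the L0 write tokens to hand out at the given tick
// (0-based) within one adjustment interval. The interval budget is the
// minimum of the flush-based and compaction-based tokens.
func tokensForTick(flushTokens, compactionTokens int64, tick int) int64 {
	total := flushTokens
	if compactionTokens < total {
		total = compactionTokens
	}
	perTick := total / ticksPerInterval
	// Assumed policy: the earliest ticks absorb the remainder so that the
	// per-interval sum equals the budget exactly.
	if int64(tick) < total%ticksPerInterval {
		perTick++
	}
	return perTick
}

func main() {
	var sum int64
	for tick := 0; tick < int(ticksPerInterval); tick++ {
		sum += tokensForTick(1000, 601, tick)
	}
	fmt.Println(ticksPerInterval, sum)
}
```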

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_percent, which defaults to 50% (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
1000%.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data, when the compaction
  backlog was not very heavy, so there was less queueing for the
  limited disk bandwidth (it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to reorder
work).

Fixes cockroachdb#77357

Release note (ops change): The cluster setting
admission.min_flush_util_percent can be used to disable or tune flush
throughput based admission tokens, for writes to a store. Tokens
based on flush throughput attempt to reduce storage layer write stalls.
@sumeerbhola sumeerbhola requested review from tbg, irfansharif, bananabrick and a team June 13, 2022 12:25
@cockroach-teamcity
Member

This change is Reviewable

@sumeerbhola
Collaborator Author

Interestingly, we fare better with the provisioned disk bandwidth set to the actual provisioned value of 250MiB/s. See the graph below, where the red line marks when we switched from an outrageously high configuration of hack.provisioned_bandwidth to a value of 250MiB/s. Compactions (which were set to a max of 8) had been falling behind earlier (because of the actual disk bandwidth limit). We see some high fluctuations, and then, because there is spare disk bandwidth for compactions, they eventually catch up. We then settle into a stable regime of 80+% of disk bandwidth used. My theory on why this one is more stable is that because this limit is representative of the actual limit, we do not have compactions bursting significantly over the limit to complete their work -- i.e. the rate shaping done by EBS is keeping compactions in check, which means our utilization doesn't blow over into overload territory.
Screen Shot 2022-07-13 at 12 56 34 PM

At a finer granularity, this behavior can be observed in the following log statements:

I220713 16:26:09.084355 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1742  diskLoadWatcher: rb: 0 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:26:24.110345 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1754  diskLoadWatcher: rb: 546 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:26:39.093034 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1766  diskLoadWatcher: rb: 546 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:26:54.091511 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1778  diskLoadWatcher: rb: 819 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:27:09.083969 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1789  diskLoadWatcher: rb: 546 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:27:24.084199 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1800  diskLoadWatcher: rb: 273 B, wb: 97 MiB, pb: 250 MiB, util: 0.39
I220713 16:27:39.084573 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1812  diskLoadWatcher: rb: 0 B, wb: 1.4 MiB, pb: 250 MiB, util: 0.01
I220713 16:27:54.084110 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1823  diskLoadWatcher: rb: 0 B, wb: 1.4 MiB, pb: 250 MiB, util: 0.01
I220713 16:28:09.084457 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1835  diskLoadWatcher: rb: 546 B, wb: 266 MiB, pb: 250 MiB, util: 1.07
I220713 16:28:24.084627 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1846  diskLoadWatcher: rb: 819 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:28:39.084221 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1858  diskLoadWatcher: rb: 273 B, wb: 103 MiB, pb: 250 MiB, util: 0.41
I220713 16:28:54.085021 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1869  diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 250 MiB, util: 0.19
I220713 16:29:09.084317 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1880  diskLoadWatcher: rb: 273 B, wb: 81 MiB, pb: 250 MiB, util: 0.32
I220713 16:29:24.084402 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1891  diskLoadWatcher: rb: 273 B, wb: 128 MiB, pb: 250 MiB, util: 0.51
I220713 16:29:39.084058 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1905  diskLoadWatcher: rb: 273 B, wb: 157 MiB, pb: 250 MiB, util: 0.63
I220713 16:29:54.084239 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1919  diskLoadWatcher: rb: 546 B, wb: 206 MiB, pb: 250 MiB, util: 0.83
I220713 16:30:09.084490 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1931  diskLoadWatcher: rb: 546 B, wb: 235 MiB, pb: 250 MiB, util: 0.94
I220713 16:30:24.084371 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1943  diskLoadWatcher: rb: 0 B, wb: 240 MiB, pb: 250 MiB, util: 0.96
I220713 16:30:39.084111 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1954  diskLoadWatcher: rb: 0 B, wb: 130 MiB, pb: 250 MiB, util: 0.52
I220713 16:30:54.084203 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1966  diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 250 MiB, util: 0.26
I220713 16:31:09.083723 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1977  diskLoadWatcher: rb: 0 B, wb: 106 MiB, pb: 250 MiB, util: 0.42
I220713 16:31:24.084763 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 1988  diskLoadWatcher: rb: 0 B, wb: 153 MiB, pb: 250 MiB, util: 0.61
I220713 16:31:39.083880 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2000  diskLoadWatcher: rb: 0 B, wb: 205 MiB, pb: 250 MiB, util: 0.82
I220713 16:31:54.083780 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2015  diskLoadWatcher: rb: 0 B, wb: 198 MiB, pb: 250 MiB, util: 0.79
I220713 16:32:09.084175 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2026  diskLoadWatcher: rb: 0 B, wb: 207 MiB, pb: 250 MiB, util: 0.83
I220713 16:32:24.084038 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2038  diskLoadWatcher: rb: 0 B, wb: 211 MiB, pb: 250 MiB, util: 0.84
I220713 16:32:39.084352 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2049  diskLoadWatcher: rb: 0 B, wb: 209 MiB, pb: 250 MiB, util: 0.84
I220713 16:32:54.083970 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2061  diskLoadWatcher: rb: 0 B, wb: 251 MiB, pb: 250 MiB, util: 1.01
I220713 16:33:09.083874 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2073  diskLoadWatcher: rb: 34 KiB, wb: 149 MiB, pb: 250 MiB, util: 0.59
I220713 16:33:24.112694 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2085  diskLoadWatcher: rb: 0 B, wb: 73 MiB, pb: 250 MiB, util: 0.29
I220713 16:33:39.084302 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2096  diskLoadWatcher: rb: 0 B, wb: 210 MiB, pb: 250 MiB, util: 0.84
I220713 16:33:54.083833 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2107  diskLoadWatcher: rb: 0 B, wb: 250 MiB, pb: 250 MiB, util: 1.00
I220713 16:34:09.084128 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2118  diskLoadWatcher: rb: 0 B, wb: 205 MiB, pb: 250 MiB, util: 0.82
I220713 16:34:24.084762 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2130  diskLoadWatcher: rb: 0 B, wb: 129 MiB, pb: 250 MiB, util: 0.51
I220713 16:34:39.084357 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2143  diskLoadWatcher: rb: 0 B, wb: 163 MiB, pb: 250 MiB, util: 0.65
I220713 16:34:54.084647 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2156  diskLoadWatcher: rb: 0 B, wb: 200 MiB, pb: 250 MiB, util: 0.80
I220713 16:35:09.084342 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2169  diskLoadWatcher: rb: 0 B, wb: 211 MiB, pb: 250 MiB, util: 0.84
I220713 16:35:24.084554 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2181  diskLoadWatcher: rb: 0 B, wb: 183 MiB, pb: 250 MiB, util: 0.73
I220713 16:35:39.084211 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2192  diskLoadWatcher: rb: 0 B, wb: 200 MiB, pb: 250 MiB, util: 0.80
I220713 16:35:54.083938 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2203  diskLoadWatcher: rb: 0 B, wb: 191 MiB, pb: 250 MiB, util: 0.76
I220713 16:36:09.083808 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2215  diskLoadWatcher: rb: 0 B, wb: 206 MiB, pb: 250 MiB, util: 0.82
I220713 16:36:24.084340 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2226  diskLoadWatcher: rb: 0 B, wb: 184 MiB, pb: 250 MiB, util: 0.74
I220713 16:36:39.084443 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2237  diskLoadWatcher: rb: 546 B, wb: 186 MiB, pb: 250 MiB, util: 0.74
I220713 16:36:54.084441 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2248  diskLoadWatcher: rb: 546 B, wb: 197 MiB, pb: 250 MiB, util: 0.79
I220713 16:37:09.084365 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2260  diskLoadWatcher: rb: 546 B, wb: 211 MiB, pb: 250 MiB, util: 0.84
I220713 16:37:24.084351 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2273  diskLoadWatcher: rb: 273 B, wb: 197 MiB, pb: 250 MiB, util: 0.79
I220713 16:37:39.084338 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2284  diskLoadWatcher: rb: 546 B, wb: 197 MiB, pb: 250 MiB, util: 0.79
I220713 16:37:54.084637 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2295  diskLoadWatcher: rb: 546 B, wb: 204 MiB, pb: 250 MiB, util: 0.82
I220713 16:38:09.084089 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2306  diskLoadWatcher: rb: 546 B, wb: 206 MiB, pb: 250 MiB, util: 0.82
I220713 16:38:24.083903 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2318  diskLoadWatcher: rb: 546 B, wb: 198 MiB, pb: 250 MiB, util: 0.79
I220713 16:38:39.084134 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2329  diskLoadWatcher: rb: 546 B, wb: 208 MiB, pb: 250 MiB, util: 0.83
I220713 16:38:54.084425 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2340  diskLoadWatcher: rb: 546 B, wb: 207 MiB, pb: 250 MiB, util: 0.83
I220713 16:39:09.084089 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2351  diskLoadWatcher: rb: 273 B, wb: 200 MiB, pb: 250 MiB, util: 0.80
I220713 16:39:24.084313 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2363  diskLoadWatcher: rb: 546 B, wb: 209 MiB, pb: 250 MiB, util: 0.83
I220713 16:39:39.083771 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2376  diskLoadWatcher: rb: 546 B, wb: 204 MiB, pb: 250 MiB, util: 0.82
I220713 16:39:54.084614 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2390  diskLoadWatcher: rb: 546 B, wb: 201 MiB, pb: 250 MiB, util: 0.81
I220713 16:40:09.084715 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2403  diskLoadWatcher: rb: 546 B, wb: 205 MiB, pb: 250 MiB, util: 0.82
I220713 16:40:24.084363 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2415  diskLoadWatcher: rb: 546 B, wb: 200 MiB, pb: 250 MiB, util: 0.80
I220713 16:40:39.084409 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2426  diskLoadWatcher: rb: 273 B, wb: 206 MiB, pb: 250 MiB, util: 0.83
I220713 16:40:54.084186 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2437  diskLoadWatcher: rb: 546 B, wb: 197 MiB, pb: 250 MiB, util: 0.79
I220713 16:41:09.084607 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2448  diskLoadWatcher: rb: 546 B, wb: 195 MiB, pb: 250 MiB, util: 0.78
I220713 16:41:24.083822 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2459  diskLoadWatcher: rb: 546 B, wb: 188 MiB, pb: 250 MiB, util: 0.75
I220713 16:41:39.083878 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2471  diskLoadWatcher: rb: 546 B, wb: 224 MiB, pb: 250 MiB, util: 0.89
I220713 16:41:54.084186 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2482  diskLoadWatcher: rb: 0 B, wb: 200 MiB, pb: 250 MiB, util: 0.80
I220713 16:42:09.084315 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2493  diskLoadWatcher: rb: 0 B, wb: 188 MiB, pb: 250 MiB, util: 0.75
I220713 16:42:24.083722 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2505  diskLoadWatcher: rb: 0 B, wb: 233 MiB, pb: 250 MiB, util: 0.93
I220713 16:42:39.084307 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2518  diskLoadWatcher: rb: 0 B, wb: 190 MiB, pb: 250 MiB, util: 0.76
I220713 16:42:54.084574 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2529  diskLoadWatcher: rb: 0 B, wb: 210 MiB, pb: 250 MiB, util: 0.84
I220713 16:43:09.084527 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2540  diskLoadWatcher: rb: 0 B, wb: 205 MiB, pb: 250 MiB, util: 0.82
I220713 16:43:24.084559 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2551  diskLoadWatcher: rb: 0 B, wb: 192 MiB, pb: 250 MiB, util: 0.77
I220713 16:43:39.085508 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2562  diskLoadWatcher: rb: 0 B, wb: 218 MiB, pb: 250 MiB, util: 0.87
I220713 16:43:54.083781 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2574  diskLoadWatcher: rb: 0 B, wb: 202 MiB, pb: 250 MiB, util: 0.81
I220713 16:44:09.084171 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2585  diskLoadWatcher: rb: 0 B, wb: 228 MiB, pb: 250 MiB, util: 0.91
I220713 16:44:24.084059 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2596  diskLoadWatcher: rb: 0 B, wb: 204 MiB, pb: 250 MiB, util: 0.81
I220713 16:44:39.084728 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2609  diskLoadWatcher: rb: 0 B, wb: 204 MiB, pb: 250 MiB, util: 0.82
I220713 16:44:54.084473 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2623  diskLoadWatcher: rb: 0 B, wb: 209 MiB, pb: 250 MiB, util: 0.84
I220713 16:45:09.084400 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2634  diskLoadWatcher: rb: 0 B, wb: 208 MiB, pb: 250 MiB, util: 0.83
I220713 16:45:24.084551 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2645  diskLoadWatcher: rb: 0 B, wb: 228 MiB, pb: 250 MiB, util: 0.91
I220713 16:45:39.084682 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2656  diskLoadWatcher: rb: 0 B, wb: 208 MiB, pb: 250 MiB, util: 0.83
I220713 16:45:54.084178 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2668  diskLoadWatcher: rb: 0 B, wb: 201 MiB, pb: 250 MiB, util: 0.80
I220713 16:46:09.084265 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2679  diskLoadWatcher: rb: 0 B, wb: 234 MiB, pb: 250 MiB, util: 0.94
I220713 16:46:24.083794 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2690  diskLoadWatcher: rb: 0 B, wb: 219 MiB, pb: 250 MiB, util: 0.88
I220713 16:46:39.084311 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2701  diskLoadWatcher: rb: 0 B, wb: 190 MiB, pb: 250 MiB, util: 0.76
I220713 16:46:54.084156 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2712  diskLoadWatcher: rb: 0 B, wb: 220 MiB, pb: 250 MiB, util: 0.88
I220713 16:47:09.084472 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2724  diskLoadWatcher: rb: 0 B, wb: 206 MiB, pb: 250 MiB, util: 0.82
I220713 16:47:24.085097 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2735  diskLoadWatcher: rb: 0 B, wb: 228 MiB, pb: 250 MiB, util: 0.91
I220713 16:47:39.083773 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2746  diskLoadWatcher: rb: 0 B, wb: 207 MiB, pb: 250 MiB, util: 0.83
I220713 16:47:54.084655 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2757  diskLoadWatcher: rb: 0 B, wb: 208 MiB, pb: 250 MiB, util: 0.83
I220713 16:48:09.084053 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2768  diskLoadWatcher: rb: 0 B, wb: 228 MiB, pb: 250 MiB, util: 0.91
I220713 16:48:24.084351 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2780  diskLoadWatcher: rb: 0 B, wb: 227 MiB, pb: 250 MiB, util: 0.91
I220713 16:48:39.084335 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2791  diskLoadWatcher: rb: 0 B, wb: 201 MiB, pb: 250 MiB, util: 0.80
I220713 16:48:54.084414 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2802  diskLoadWatcher: rb: 0 B, wb: 208 MiB, pb: 250 MiB, util: 0.83
I220713 16:49:09.084379 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2813  diskLoadWatcher: rb: 0 B, wb: 206 MiB, pb: 250 MiB, util: 0.82
I220713 16:49:24.083817 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2825  diskLoadWatcher: rb: 0 B, wb: 221 MiB, pb: 250 MiB, util: 0.88
I220713 16:49:39.084453 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 2838  diskLoadWatcher: rb: 0 B, wb: 223 MiB, pb: 250 MiB, util: 0.89

@sumeerbhola
Collaborator Author

Now running a mix of regular and elastic traffic.
regular: consumes 40-50% of the disk bandwidth (note the low concurrency=2: since regular traffic does not cause any disk bandwidth controls to be activated, we've explicitly set it up to leave significant unused bw)

roachprod run sumeer-io:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=2 --splits=1000 --duration=30m0s --read-percent=0 --min-block-bytes=4096 --max-block-bytes=4096  {pgurl:1-1}

Then added elastic traffic with a high concurrency=1024 (this is more than enough to blow past the provisioned limit if there were no disk bw control). The throughput of regular traffic stays stable.

roachprod run sumeer-io:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=1024 --splits=1000 --duration=30m0s --read-percent=0 --min-block-bytes=4096 --max-block-bytes=4096 --background-qos=true {pgurl:1-1}

Logs before adding elastic traffic

I220713 18:27:24.083678 382 util/admission/granter.go:2091 ⋮ [-] 7598  Incoming LSM 105 MiB, tokens (regular, elastic): 92 MiB, 0 B, per-req: (6.9 KiB,6.9 KiB), compaction-w: 1.6 GiB
I220713 18:27:24.083688 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7599  diskLoadWatcher: rb: 0 B, wb: 130 MiB, pb: 250 MiB, util: 0.52
I220713 18:27:24.083707 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7600  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
I220713 18:27:39.084211 382 util/admission/granter.go:2091 ⋮ [-] 7609  Incoming LSM 106 MiB, tokens (regular, elastic): 99 MiB, 0 B, per-req: (7.4 KiB,7.4 KiB), compaction-w: 1.1 GiB
I220713 18:27:39.084220 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7610  diskLoadWatcher: rb: 0 B, wb: 94 MiB, pb: 250 MiB, util: 0.38
I220713 18:27:39.084227 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7611  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
I220713 18:27:54.083990 382 util/admission/granter.go:2091 ⋮ [-] 7620  Incoming LSM 53 MiB, tokens (regular, elastic): 105 MiB, 0 B, per-req: (7.7 KiB,7.7 KiB), compaction-w: 1.4 GiB
I220713 18:27:54.084013 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7621  diskLoadWatcher: rb: 0 B, wb: 113 MiB, pb: 250 MiB, util: 0.45
I220713 18:27:54.084052 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7622  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
I220713 18:28:09.083881 382 util/admission/granter.go:2091 ⋮ [-] 7632  Incoming LSM 105 MiB, tokens (regular, elastic): 78 MiB, 0 B, per-req: (5.8 KiB,5.8 KiB), compaction-w: 1.5 GiB
I220713 18:28:09.083891 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7633  diskLoadWatcher: rb: 0 B, wb: 121 MiB, pb: 250 MiB, util: 0.48
I220713 18:28:09.083910 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7634  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
I220713 18:28:24.084186 382 util/admission/granter.go:2091 ⋮ [-] 7643  Incoming LSM 105 MiB, tokens (regular, elastic): 90 MiB, 0 B, per-req: (6.8 KiB,6.8 KiB), compaction-w: 1.7 GiB
I220713 18:28:24.084195 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7644  diskLoadWatcher: rb: 0 B, wb: 137 MiB, pb: 250 MiB, util: 0.55
I220713 18:28:24.084212 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7645  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
I220713 18:28:39.084139 382 util/admission/granter.go:2091 ⋮ [-] 7654  Incoming LSM 106 MiB, tokens (regular, elastic): 98 MiB, 0 B, per-req: (7.4 KiB,7.4 KiB), compaction-w: 1.2 GiB
I220713 18:28:39.084148 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7655  diskLoadWatcher: rb: 0 B, wb: 111 MiB, pb: 250 MiB, util: 0.44
I220713 18:28:39.084155 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7656  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
I220713 18:28:54.084309 382 util/admission/granter.go:2091 ⋮ [-] 7666  Incoming LSM 52 MiB, tokens (regular, elastic): 102 MiB, 0 B, per-req: (7.7 KiB,7.7 KiB), compaction-w: 1.2 GiB
I220713 18:28:54.084331 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7667  diskLoadWatcher: rb: 0 B, wb: 89 MiB, pb: 250 MiB, util: 0.36
I220713 18:28:54.084369 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7668  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 0
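
Reading the diskLoadWatcher lines above, util appears to be the sum of read and write bandwidth divided by the provisioned bandwidth (e.g. wb: 130 MiB against pb: 250 MiB gives 0.52). A minimal sketch of that arithmetic follows. The function name and signature are assumptions made for illustration and are not taken from disk_bandwidth.go.

```go
package main

import "fmt"

const mib = 1 << 20

// utilization returns (readBytes + writeBytes) / provisionedBytes for one
// measurement interval. Hypothetical helper: the name and signature are
// illustrative only.
func utilization(readBytes, writeBytes, provisionedBytes int64) float64 {
	if provisionedBytes <= 0 {
		// Assumption: without a known provisioned limit we report 0.
		return 0
	}
	return float64(readBytes+writeBytes) / float64(provisionedBytes)
}

func main() {
	// Values from the first diskLoadWatcher line above: rb 0 B, wb 130 MiB,
	// pb 250 MiB.
	fmt.Printf("util: %.2f\n", utilization(0, 130*mib, 250*mib))
}
```

The logged values are rounded for display, so recomputing from them only matches to about two decimal places.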

Then after adding elastic traffic, we first start increasing the elastic tokens:

I220713 18:29:09.083725 382 util/admission/granter.go:2091 ⋮ [-] 7679  Incoming LSM 105 MiB, tokens (regular, elastic): 77 MiB, 5.4 MiB, per-req: (5.8 KiB,5.8 KiB), compaction-w: 1.7 GiB
I220713 18:29:09.083736 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7680  diskLoadWatcher: rb: 0 B, wb: 139 MiB, pb: 250 MiB, util: 0.56
I220713 18:29:09.083756 382 util/admission/disk_bandwidth.go:344 ⋮ [-] 7681  diskBandwidthLimiter: moderate elasticTokens (limit, used): 11706698, 5663679
I220713 18:29:24.084020 382 util/admission/granter.go:2091 ⋮ [-] 7690  Incoming LSM 148 MiB, tokens (regular, elastic): 86 MiB, 11 MiB, per-req: (6.6 KiB,6.6 KiB), compaction-w: 1.7 GiB
I220713 18:29:24.084030 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7691  diskLoadWatcher: rb: 0 B, wb: 139 MiB, pb: 250 MiB, util: 0.56
I220713 18:29:24.084039 382 util/admission/disk_bandwidth.go:338 ⋮ [-] 7692  diskBandwidthLimiter: moderate fr: 0.07, smoothed-incoming: 120 MiB, unusedBW: 111 MiB, elasticBytes/Tokens: 26 MiB
I220713 18:29:39.083879 382 util/admission/granter.go:2091 ⋮ [-] 7701  Incoming LSM 99 MiB, tokens (regular, elastic): 110 MiB, 26 MiB, per-req: (8.4 KiB,8.4 KiB), compaction-w: 2.1 GiB
I220713 18:29:39.083889 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7702  diskLoadWatcher: rb: 0 B, wb: 165 MiB, pb: 250 MiB, util: 0.66
I220713 18:29:39.083897 382 util/admission/disk_bandwidth.go:338 ⋮ [-] 7703  diskBandwidthLimiter: moderate fr: 0.13, smoothed-incoming: 109 MiB, unusedBW: 85 MiB, elasticBytes/Tokens: 28 MiB
I220713 18:29:54.083861 382 util/admission/granter.go:2091 ⋮ [-] 7717  Incoming LSM 100 MiB, tokens (regular, elastic): 98 MiB, 28 MiB, per-req: (7.2 KiB,7.2 KiB), compaction-w: 2.0 GiB
I220713 18:29:54.083871 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7718  diskLoadWatcher: rb: 0 B, wb: 163 MiB, pb: 250 MiB, util: 0.65
I220713 18:29:54.083891 382 util/admission/disk_bandwidth.go:338 ⋮ [-] 7719  diskBandwidthLimiter: moderate fr: 0.18, smoothed-incoming: 104 MiB, unusedBW: 87 MiB, elasticBytes/Tokens: 31 MiB
I220713 18:30:09.083951 382 util/admission/granter.go:2091 ⋮ [-] 7730  Incoming LSM 149 MiB, tokens (regular, elastic): 87 MiB, 31 MiB, per-req: (6.4 KiB,6.4 KiB), compaction-w: 2.1 GiB
I220713 18:30:09.083961 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7731  diskLoadWatcher: rb: 0 B, wb: 173 MiB, pb: 250 MiB, util: 0.69
I220713 18:30:09.083970 382 util/admission/disk_bandwidth.go:338 ⋮ [-] 7732  diskBandwidthLimiter: moderate fr: 0.22, smoothed-incoming: 127 MiB, unusedBW: 77 MiB, elasticBytes/Tokens: 48 MiB

We then stabilize

I220713 18:30:24.084233 382 util/admission/granter.go:2091 ⋮ [-] 7741  Incoming LSM 150 MiB, tokens (regular, elastic): 97 MiB, 48 MiB, per-req: (7.3 KiB,7.3 KiB), compaction-w: 2.2 GiB
I220713 18:30:24.084258 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7742  diskLoadWatcher: rb: 0 B, wb: 176 MiB, pb: 250 MiB, util: 0.71
I220713 18:30:24.084273 382 util/admission/disk_bandwidth.go:359 ⋮ [-] 7743  diskBandwidthLimiter: high elastic fr: 0.28, smoothed-incoming: 145111778, elasticTokens: 50186744
I220713 18:30:39.083901 382 util/admission/granter.go:2091 ⋮ [-] 7752  Incoming LSM 149 MiB, tokens (regular, elastic): 98 MiB, 48 MiB, per-req: (7.4 KiB,7.4 KiB), compaction-w: 2.2 GiB
I220713 18:30:39.083912 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7753  diskLoadWatcher: rb: 0 B, wb: 179 MiB, pb: 250 MiB, util: 0.71
I220713 18:30:39.083919 382 util/admission/disk_bandwidth.go:359 ⋮ [-] 7754  diskBandwidthLimiter: high elastic fr: 0.30, smoothed-incoming: 150528601, elasticTokens: 50186744
I220713 18:30:54.084006 382 util/admission/granter.go:2091 ⋮ [-] 7763  Incoming LSM 99 MiB, tokens (regular, elastic): 101 MiB, 48 MiB, per-req: (7.5 KiB,7.5 KiB), compaction-w: 2.1 GiB
I220713 18:30:54.084028 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7764  diskLoadWatcher: rb: 0 B, wb: 170 MiB, pb: 250 MiB, util: 0.68
I220713 18:30:54.084071 382 util/admission/disk_bandwidth.go:338 ⋮ [-] 7765  diskBandwidthLimiter: moderate fr: 0.31, smoothed-incoming: 121 MiB, unusedBW: 80 MiB, elasticBytes/Tokens: 53 MiB
I220713 18:31:09.083909 382 util/admission/granter.go:2091 ⋮ [-] 7775  Incoming LSM 152 MiB, tokens (regular, elastic): 84 MiB, 53 MiB, per-req: (6.3 KiB,6.3 KiB), compaction-w: 2.4 GiB
I220713 18:31:09.083920 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7776  diskLoadWatcher: rb: 0 B, wb: 194 MiB, pb: 250 MiB, util: 0.78
I220713 18:31:09.083938 382 util/admission/disk_bandwidth.go:359 ⋮ [-] 7777  diskBandwidthLimiter: high elastic fr: 0.35, smoothed-incoming: 143081287, elasticTokens: 55205990
I220713 18:31:24.084075 382 util/admission/granter.go:2091 ⋮ [-] 7786  Incoming LSM 152 MiB, tokens (regular, elastic): 87 MiB, 53 MiB, per-req: (6.6 KiB,6.6 KiB), compaction-w: 2.6 GiB
I220713 18:31:24.084086 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7787  diskLoadWatcher: rb: 0 B, wb: 207 MiB, pb: 250 MiB, util: 0.83
I220713 18:31:24.084103 382 util/admission/disk_bandwidth.go:359 ⋮ [-] 7788  diskBandwidthLimiter: high elastic fr: 0.36, smoothed-incoming: 151084905, elasticTokens: 55205990
I220713 18:31:39.083915 382 util/admission/granter.go:2091 ⋮ [-] 7797  Incoming LSM 152 MiB, tokens (regular, elastic): 92 MiB, 53 MiB, per-req: (6.9 KiB,6.9 KiB), compaction-w: 2.4 GiB
I220713 18:31:39.083927 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7798  diskLoadWatcher: rb: 0 B, wb: 194 MiB, pb: 250 MiB, util: 0.78
I220713 18:31:39.083945 382 util/admission/disk_bandwidth.go:359 ⋮ [-] 7799  diskBandwidthLimiter: high elastic fr: 0.36, smoothed-incoming: 155080500, elasticTokens: 55205990
I220713 18:31:54.084142 382 util/admission/granter.go:2091 ⋮ [-] 7808  Incoming LSM 151 MiB, tokens (regular, elastic): 93 MiB, 53 MiB, per-req: (7.0 KiB,7.0 KiB), compaction-w: 2.5 GiB
I220713 18:31:54.084164 382 util/admission/disk_bandwidth.go:110 ⋮ [-] 7809  diskLoadWatcher: rb: 0 B, wb: 197 MiB, pb: 250 MiB, util: 0.79
I220713 18:31:54.084204 382 util/admission/disk_bandwidth.go:359 ⋮ [-] 7810  diskBandwidthLimiter: high elastic fr: 0.36, smoothed-incoming: 156691037, elasticTokens: 55205990

The challenge is the sharp transition in utilization from 0.7 or less to > 0.95.
This is all because of compactions: there is a lag between admitting
writes and seeing their full effect in terms of write amp. Also, when we
start cutting tokens there is a sharp fall from > 0.95. That is
partly because of our multiplicative decrease, but we've tried
to dampen the multiplicative decrease and start growing quickly
again, since otherwise we would fall too much.
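
To make the additive increase and dampened multiplicative decrease concrete, here is a minimal sketch. The type name, the step size, the decrease factor and the floor are all assumptions chosen for illustration. It is not the logic in disk_bandwidth.go.

```go
package main

import "fmt"

// elasticTokenLimit is a hypothetical controller that adjusts a token limit
// using additive increase and multiplicative decrease.
type elasticTokenLimit struct {
	tokens   int64   // current limit for the next interval
	step     int64   // additive increase applied when there is headroom
	decrease float64 // multiplicative factor in (0, 1) applied on overload
	floor    int64   // lower bound, so the limit can grow back quickly
}

// adjust updates the limit for the next interval and returns it.
func (e *elasticTokenLimit) adjust(overloaded bool) int64 {
	if overloaded {
		e.tokens = int64(float64(e.tokens) * e.decrease)
		if e.tokens < e.floor {
			e.tokens = e.floor
		}
		return e.tokens
	}
	e.tokens += e.step
	return e.tokens
}

func main() {
	limit := elasticTokenLimit{tokens: 100, step: 10, decrease: 0.5, floor: 20}
	for _, overloaded := range []bool{false, true, true, true} {
		fmt.Println(limit.adjust(overloaded))
	}
}
```

With these made-up parameters the limit goes 110, 55, 27, 20. The floor is one simple way to keep repeated decreases from driving the limit to zero.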

I220712 18:12:04.770141 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 289  diskLoadWatcher: rb: 0 B, wb: 80 MiB, pb: 95 MiB, util: 0.84
I220712 18:12:19.770363 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 300  diskLoadWatcher: rb: 273 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:12:34.770694 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 312  diskLoadWatcher: rb: 0 B, wb: 115 MiB, pb: 95 MiB, util: 1.21
I220712 18:12:49.770632 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 324  diskLoadWatcher: rb: 0 B, wb: 102 MiB, pb: 95 MiB, util: 1.07
I220712 18:13:04.769926 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 335  diskLoadWatcher: rb: 0 B, wb: 80 MiB, pb: 95 MiB, util: 0.84
I220712 18:13:19.770618 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 347  diskLoadWatcher: rb: 0 B, wb: 33 MiB, pb: 95 MiB, util: 0.35
I220712 18:13:34.770323 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 358  diskLoadWatcher: rb: 0 B, wb: 11 MiB, pb: 95 MiB, util: 0.11
I220712 18:13:49.770645 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 370  diskLoadWatcher: rb: 0 B, wb: 2.6 MiB, pb: 95 MiB, util: 0.03
I220712 18:14:04.769960 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 383  diskLoadWatcher: rb: 0 B, wb: 266 MiB, pb: 95 MiB, util: 2.79
I220712 18:14:19.770059 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 394  diskLoadWatcher: rb: 819 B, wb: 250 MiB, pb: 95 MiB, util: 2.63
I220712 18:14:34.769914 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 406  diskLoadWatcher: rb: 546 B, wb: 243 MiB, pb: 95 MiB, util: 2.54
I220712 18:14:49.770237 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 419  diskLoadWatcher: rb: 0 B, wb: 76 MiB, pb: 95 MiB, util: 0.80
I220712 18:15:04.770697 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 431  diskLoadWatcher: rb: 0 B, wb: 2.1 MiB, pb: 95 MiB, util: 0.02
I220712 18:15:19.770365 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 442  diskLoadWatcher: rb: 273 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:15:34.770506 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 453  diskLoadWatcher: rb: 0 B, wb: 39 MiB, pb: 95 MiB, util: 0.41
I220712 18:15:49.771073 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 465  diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:16:04.770788 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 476  diskLoadWatcher: rb: 0 B, wb: 105 MiB, pb: 95 MiB, util: 1.10
I220712 18:16:19.769824 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 487  diskLoadWatcher: rb: 0 B, wb: 42 MiB, pb: 95 MiB, util: 0.44
I220712 18:16:34.770666 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 498  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:16:49.770379 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 510  diskLoadWatcher: rb: 0 B, wb: 70 MiB, pb: 95 MiB, util: 0.73
I220712 18:17:04.770687 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 521  diskLoadWatcher: rb: 0 B, wb: 77 MiB, pb: 95 MiB, util: 0.80
I220712 18:17:19.770664 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 532  diskLoadWatcher: rb: 0 B, wb: 118 MiB, pb: 95 MiB, util: 1.24
I220712 18:17:34.770083 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 543  diskLoadWatcher: rb: 0 B, wb: 3.0 MiB, pb: 95 MiB, util: 0.03
I220712 18:17:49.770806 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 555  diskLoadWatcher: rb: 0 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:18:04.770748 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 566  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:18:19.770290 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 578  diskLoadWatcher: rb: 0 B, wb: 67 MiB, pb: 95 MiB, util: 0.70
I220712 18:18:34.770280 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 589  diskLoadWatcher: rb: 0 B, wb: 104 MiB, pb: 95 MiB, util: 1.10
I220712 18:18:49.769979 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 600  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:19:04.770342 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 612  diskLoadWatcher: rb: 0 B, wb: 17 MiB, pb: 95 MiB, util: 0.18
I220712 18:19:19.771061 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 623  diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:19:34.770318 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 636  diskLoadWatcher: rb: 0 B, wb: 96 MiB, pb: 95 MiB, util: 1.01
I220712 18:19:49.769739 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 650  diskLoadWatcher: rb: 0 B, wb: 13 MiB, pb: 95 MiB, util: 0.14
I220712 18:20:04.769936 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 663  diskLoadWatcher: rb: 0 B, wb: 42 MiB, pb: 95 MiB, util: 0.44
I220712 18:20:19.770775 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 674  diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.54
I220712 18:20:34.775699 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 685  diskLoadWatcher: rb: 273 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:20:49.770837 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 696  diskLoadWatcher: rb: 273 B, wb: 103 MiB, pb: 95 MiB, util: 1.08
I220712 18:21:04.770360 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 708  diskLoadWatcher: rb: 0 B, wb: 9.4 MiB, pb: 95 MiB, util: 0.10
I220712 18:21:19.771030 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 719  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:21:34.769898 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 730  diskLoadWatcher: rb: 273 B, wb: 59 MiB, pb: 95 MiB, util: 0.62
I220712 18:21:49.770729 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 742  diskLoadWatcher: rb: 0 B, wb: 40 MiB, pb: 95 MiB, util: 0.42
I220712 18:22:04.769814 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 753  diskLoadWatcher: rb: 273 B, wb: 62 MiB, pb: 95 MiB, util: 0.65
I220712 18:22:19.770621 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 764  diskLoadWatcher: rb: 0 B, wb: 71 MiB, pb: 95 MiB, util: 0.75
I220712 18:22:34.769902 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 775  diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:22:49.769792 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 787  diskLoadWatcher: rb: 0 B, wb: 84 MiB, pb: 95 MiB, util: 0.88
I220712 18:23:04.770131 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 798  diskLoadWatcher: rb: 273 B, wb: 74 MiB, pb: 95 MiB, util: 0.78
I220712 18:23:19.770370 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 809  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:23:34.770599 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 820  diskLoadWatcher: rb: 273 B, wb: 121 MiB, pb: 95 MiB, util: 1.27
I220712 18:23:49.771022 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 831  diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.51
I220712 18:24:04.770034 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 843  diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.49
I220712 18:24:19.770685 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 854  diskLoadWatcher: rb: 273 B, wb: 90 MiB, pb: 95 MiB, util: 0.95
I220712 18:24:34.770236 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 867  diskLoadWatcher: rb: 273 B, wb: 96 MiB, pb: 95 MiB, util: 1.00
I220712 18:24:49.770619 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 881  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:25:04.769913 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 892  diskLoadWatcher: rb: 273 B, wb: 51 MiB, pb: 95 MiB, util: 0.53
I220712 18:25:19.770673 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 903  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.55
I220712 18:25:34.770651 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 914  diskLoadWatcher: rb: 273 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:25:49.770871 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 926  diskLoadWatcher: rb: 273 B, wb: 122 MiB, pb: 95 MiB, util: 1.28
I220712 18:26:04.770125 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 938  diskLoadWatcher: rb: 0 B, wb: 69 MiB, pb: 95 MiB, util: 0.72
I220712 18:26:19.770632 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 949  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:26:34.770592 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 960  diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.58
I220712 18:26:49.770778 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 972  diskLoadWatcher: rb: 0 B, wb: 62 MiB, pb: 95 MiB, util: 0.65
I220712 18:27:04.770242 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 983  diskLoadWatcher: rb: 0 B, wb: 117 MiB, pb: 95 MiB, util: 1.23
I220712 18:27:19.770137 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 994  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:27:34.770353 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1005  diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.49
I220712 18:27:49.770114 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1017  diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:28:04.769837 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1029  diskLoadWatcher: rb: 0 B, wb: 116 MiB, pb: 95 MiB, util: 1.22
I220712 18:28:19.770701 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1040  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.55
I220712 18:28:34.770075 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1051  diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.50
I220712 18:28:49.769819 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1062  diskLoadWatcher: rb: 0 B, wb: 131 MiB, pb: 95 MiB, util: 1.37
I220712 18:29:04.770145 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1074  diskLoadWatcher: rb: 0 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:29:19.770112 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1085  diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.52
I220712 18:29:34.770037 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1098  diskLoadWatcher: rb: 0 B, wb: 69 MiB, pb: 95 MiB, util: 0.72
I220712 18:29:49.770141 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1112  diskLoadWatcher: rb: 0 B, wb: 116 MiB, pb: 95 MiB, util: 1.22
I220712 18:30:04.770340 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1125  diskLoadWatcher: rb: 0 B, wb: 58 MiB, pb: 95 MiB, util: 0.60
I220712 18:30:19.770347 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1136  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:30:34.770577 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1147  diskLoadWatcher: rb: 0 B, wb: 128 MiB, pb: 95 MiB, util: 1.34
I220712 18:30:49.770405 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1158  diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.54
I220712 18:31:04.770181 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1170  diskLoadWatcher: rb: 0 B, wb: 61 MiB, pb: 95 MiB, util: 0.64
I220712 18:31:19.770070 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1181  diskLoadWatcher: rb: 0 B, wb: 59 MiB, pb: 95 MiB, util: 0.61
I220712 18:31:34.770327 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1192  diskLoadWatcher: rb: 0 B, wb: 121 MiB, pb: 95 MiB, util: 1.27
I220712 18:31:49.771027 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1203  diskLoadWatcher: rb: 0 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:32:04.770572 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1214  diskLoadWatcher: rb: 0 B, wb: 91 MiB, pb: 95 MiB, util: 0.96
I220712 18:32:19.770161 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1226  diskLoadWatcher: rb: 0 B, wb: 31 MiB, pb: 95 MiB, util: 0.32
I220712 18:32:34.770428 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1237  diskLoadWatcher: rb: 0 B, wb: 57 MiB, pb: 95 MiB, util: 0.60
I220712 18:32:49.770396 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1248  diskLoadWatcher: rb: 0 B, wb: 56 MiB, pb: 95 MiB, util: 0.58
I220712 18:33:04.770595 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1260  diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:33:19.770179 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1271  diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.51
I220712 18:33:34.770001 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1282  diskLoadWatcher: rb: 0 B, wb: 77 MiB, pb: 95 MiB, util: 0.81
I220712 18:33:49.770413 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1293  diskLoadWatcher: rb: 0 B, wb: 94 MiB, pb: 95 MiB, util: 0.98
I220712 18:34:04.770672 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1304  diskLoadWatcher: rb: 0 B, wb: 2.6 MiB, pb: 95 MiB, util: 0.03
I220712 18:34:19.770153 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1316  diskLoadWatcher: rb: 0 B, wb: 51 MiB, pb: 95 MiB, util: 0.53
I220712 18:34:34.770660 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1329  diskLoadWatcher: rb: 0 B, wb: 58 MiB, pb: 95 MiB, util: 0.61
I220712 18:34:49.770319 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1342  diskLoadWatcher: rb: 0 B, wb: 50 MiB, pb: 95 MiB, util: 0.53
I220712 18:35:04.770335 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1354  diskLoadWatcher: rb: 273 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:35:19.771075 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1365  diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.57
I220712 18:35:34.769749 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1376  diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:35:49.769878 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1387  diskLoadWatcher: rb: 273 B, wb: 95 MiB, pb: 95 MiB, util: 1.00
I220712 18:36:04.770526 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1399  diskLoadWatcher: rb: 0 B, wb: 18 MiB, pb: 95 MiB, util: 0.19
I220712 18:36:19.769831 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1410  diskLoadWatcher: rb: 0 B, wb: 45 MiB, pb: 95 MiB, util: 0.47
I220712 18:36:34.769896 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1421  diskLoadWatcher: rb: 273 B, wb: 50 MiB, pb: 95 MiB, util: 0.52
I220712 18:36:49.770154 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1432  diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.58
I220712 18:37:04.770233 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1443  diskLoadWatcher: rb: 273 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:37:19.770668 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1455  diskLoadWatcher: rb: 0 B, wb: 57 MiB, pb: 95 MiB, util: 0.60
I220712 18:37:34.770360 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1466  diskLoadWatcher: rb: 273 B, wb: 114 MiB, pb: 95 MiB, util: 1.20
I220712 18:37:49.770815 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1478  diskLoadWatcher: rb: 273 B, wb: 40 MiB, pb: 95 MiB, util: 0.42
I220712 18:38:04.769966 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1490  diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:38:19.770167 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1501  diskLoadWatcher: rb: 273 B, wb: 89 MiB, pb: 95 MiB, util: 0.94
I220712 18:38:34.770339 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1512  diskLoadWatcher: rb: 0 B, wb: 82 MiB, pb: 95 MiB, util: 0.86
I220712 18:38:49.770747 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1523  diskLoadWatcher: rb: 273 B, wb: 98 MiB, pb: 95 MiB, util: 1.03
I220712 18:39:04.769737 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1534  diskLoadWatcher: rb: 0 B, wb: 2.9 MiB, pb: 95 MiB, util: 0.03
I220712 18:39:19.770772 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1546  diskLoadWatcher: rb: 273 B, wb: 61 MiB, pb: 95 MiB, util: 0.64
I220712 18:39:34.769759 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1559  diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.62
I220712 18:39:49.770282 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1578  diskLoadWatcher: rb: 273 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:40:04.770348 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1590  diskLoadWatcher: rb: 273 B, wb: 119 MiB, pb: 95 MiB, util: 1.25
I220712 18:40:19.770883 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1602  diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:40:34.770312 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1613  diskLoadWatcher: rb: 0 B, wb: 20 MiB, pb: 95 MiB, util: 0.21
I220712 18:40:49.770157 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1624  diskLoadWatcher: rb: 273 B, wb: 70 MiB, pb: 95 MiB, util: 0.73

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 7, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative
  decrease (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).
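
A minimal sketch of abstracting disk load as an enum is shown below. The enum names and the utilization thresholds are assumptions made for illustration. A real watcher may also take earlier intervals into account.

```go
package main

import "fmt"

// diskLoadLevel is a coarse, enum-style summary of disk load.
// The names are hypothetical.
type diskLoadLevel int8

const (
	diskLoadLow diskLoadLevel = iota
	diskLoadModerate
	diskLoadHigh
	diskLoadOverload
)

func (l diskLoadLevel) String() string {
	switch l {
	case diskLoadLow:
		return "low"
	case diskLoadModerate:
		return "moderate"
	case diskLoadHigh:
		return "high"
	default:
		return "overload"
	}
}

// classify maps a utilization fraction to a load level. The thresholds
// (0.3, 0.7, 0.95) are illustrative assumptions.
func classify(util float64) diskLoadLevel {
	switch {
	case util < 0.3:
		return diskLoadLow
	case util < 0.7:
		return diskLoadModerate
	case util < 0.95:
		return diskLoadHigh
	default:
		return diskLoadOverload
	}
}

func main() {
	for _, u := range []float64{0.03, 0.52, 0.71, 1.21} {
		fmt.Printf("util %.2f -> %s\n", u, classify(u))
	}
}
```

Because callers only see the enum, the way the level is computed can change without touching the code that consumes it.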

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).
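
Implying a work-class from the priority can be sketched as follows. The types and constant values below are local stand-ins defined for this example and are assumptions. They are not the definitions in admissionpb.

```go
package main

import "fmt"

// workPriority is a stand-in for a priority type such as
// admissionpb.WorkPriority. The numeric values are assumptions.
type workPriority int8

const (
	lowPri    workPriority = -128
	normalPri workPriority = 0
	highPri   workPriority = 127
)

// workClass distinguishes regular work from elastic work.
type workClass int8

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

// workClassFromPri treats any priority below normalPri as elastic.
func workClassFromPri(pri workPriority) workClass {
	if pri < normalPri {
		return elasticWorkClass
	}
	return regularWorkClass
}

func main() {
	fmt.Println(workClassFromPri(lowPri) == elasticWorkClass)    // true
	fmt.Println(workClassFromPri(normalPri) == regularWorkClass) // true
}
```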

The reader should start with disk_bandwidth.go, consisting of:
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It composes the diskLoadWatcher and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).
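
The accounting described above can be sketched roughly as follows. The struct, the method names, and the rule that a bucket may go negative once work is admitted are assumptions made for illustration. They are not the granter interfaces in granter.go.

```go
package main

import "fmt"

// tokenState is a hypothetical holder for the two kinds of tokens.
type tokenState struct {
	l0Tokens     int64 // tokens for bytes going into L0
	diskBWTokens int64 // tokens for all incoming bytes into the LSM
}

// tryAdmit deducts the request's total incoming bytes from both buckets.
// Assumption: admission only requires both buckets to be positive, so a
// bucket may go negative after the deduction.
func (s *tokenState) tryAdmit(incomingBytes int64) bool {
	if s.l0Tokens <= 0 || s.diskBWTokens <= 0 {
		return false
	}
	s.l0Tokens -= incomingBytes
	s.diskBWTokens -= incomingBytes
	return true
}

// workDone is called when the request completes and the number of bytes
// that actually went into L0 is known. The part of the earlier deduction
// that did not go into L0 is returned to the L0 bucket.
func (s *tokenState) workDone(incomingBytes, l0Bytes int64) {
	s.l0Tokens += incomingBytes - l0Bytes
}

func main() {
	s := tokenState{l0Tokens: 100, diskBWTokens: 100}
	if s.tryAdmit(40) {
		s.workDone(40, 10) // only 10 of the 40 bytes went into L0
	}
	fmt.Println(s.l0Tokens, s.diskBWTokens) // 90 60
}
```

A single request therefore needs only one size estimate at admission time, and the L0 bucket is corrected once the split is known.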

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).
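
As a sketch, the two conditions combine into a single predicate. The DiskStats struct below is a local stand-in with only the field named above, and the function name is hypothetical.

```go
package main

import "fmt"

// DiskStats is a stand-in containing only the field mentioned above.
type DiskStats struct {
	ProvisionedBandwidth int64 // bytes/s, 0 when unknown
}

// considerDiskBandwidth reports whether disk bandwidth should be treated as
// a potential bottleneck. elasticEnabled stands for the value of the cluster
// setting admission.disk_bandwidth_tokens.elastic.enabled.
func considerDiskBandwidth(stats DiskStats, elasticEnabled bool) bool {
	return stats.ProvisionedBandwidth != 0 && elasticEnabled
}

func main() {
	fmt.Println(considerDiskBandwidth(DiskStats{ProvisionedBandwidth: 250 << 20}, true)) // true
	fmt.Println(considerDiskBandwidth(DiskStats{}, true))                                // false
}
```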

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 8, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes and
  the point at which it affects actual writes in the system, since a
  compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative
  decrease (unlike what we do for flush and compaction tokens, where we
  try to calculate the sustainable rates more precisely).
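
A minimal sketch of the additive-increase/multiplicative-decrease idea, keyed
on a load enum. All names (diskLoadLevel, adjustElasticTokens) and all
constants (the 1 MiB step, the halving factor, the floor) are assumptions made
for illustration and are not taken from disk_bandwidth.go.

```go
package main

import "fmt"

// diskLoadLevel is a hypothetical enum; the real load abstraction may use
// different names and a different number of levels.
type diskLoadLevel int

const (
	loadLow diskLoadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

const (
	additiveStep int64 = 1 << 20 // illustrative: add 1 MiB of tokens per interval
	minTokens    int64 = 1 << 20 // illustrative floor on the token count
)

// adjustElasticTokens applies additive increase when load is low and
// multiplicative decrease (halving) when load is high or overloaded.
func adjustElasticTokens(tokens int64, level diskLoadLevel) int64 {
	switch level {
	case loadLow:
		return tokens + additiveStep
	case loadHigh, loadOverload:
		if next := tokens / 2; next > minTokens {
			return next
		}
		return minTokens
	default: // loadModerate: hold steady
		return tokens
	}
}

func main() {
	tokens := int64(8 << 20)
	for _, level := range []diskLoadLevel{loadLow, loadLow, loadOverload, loadModerate} {
		tokens = adjustElasticTokens(tokens, level)
		fmt.Printf("level=%d tokens=%d\n", level, tokens)
	}
}
```

The point of the enum is that the adjustment logic only sees a coarse level,
so how that level is computed can change without touching this function.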

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on the number of regular concurrent compactions, and if it needs
  more it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for it).
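
Implying a work-class from the priority could look roughly like the sketch
below. The types and constant values are hypothetical stand-ins (the real
priority type is admissionpb.WorkPriority); only the rule "priorities below
NormalPri are elastic" comes from the text above.

```go
package main

import "fmt"

// workPriority is a hypothetical stand-in for admissionpb.WorkPriority.
type workPriority int8

// Illustrative priority values; only their ordering matters here.
const (
	lowPri    workPriority = -128
	normalPri workPriority = 0
	highPri   workPriority = 127
)

// workClass is a hypothetical two-valued classification of work.
type workClass int

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

// workClassFromPri treats every priority strictly below normalPri as elastic.
func workClassFromPri(pri workPriority) workClass {
	if pri < normalPri {
		return elasticWorkClass
	}
	return regularWorkClass
}

func main() {
	for _, pri := range []workPriority{lowPri, normalPri, highPri} {
		fmt.Printf("pri=%d elastic=%t\n", pri, workClassFromPri(pri) == elasticWorkClass)
	}
}
```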

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It composes the previous two objects and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).
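
The accounting described above can be sketched as follows. The struct and
method names are hypothetical, and the admission rule (both buckets must be
positive, and a bucket may go negative after a deduction) is an assumption
made to keep the example short; it is not a description of granter.go.

```go
package main

import "fmt"

// tokenBuckets is a hypothetical pair of byte-token counters.
type tokenBuckets struct {
	l0Tokens     int64 // tokens for bytes that land in L0
	diskBWTokens int64 // tokens for all incoming bytes (disk bandwidth tokens)
}

// admit deducts the request's total incoming bytes from both buckets, since
// at admission time we do not yet know how many bytes will go into L0.
func (b *tokenBuckets) admit(totalBytes int64) bool {
	if b.l0Tokens <= 0 || b.diskBWTokens <= 0 {
		return false
	}
	b.l0Tokens -= totalBytes
	b.diskBWTokens -= totalBytes
	return true
}

// done is called when the request completes and reports how many bytes
// actually went into L0. The difference is returned to the L0 bucket; the
// disk bandwidth bucket keeps the full deduction.
func (b *tokenBuckets) done(totalBytes, l0Bytes int64) {
	b.l0Tokens += totalBytes - l0Bytes
}

func main() {
	b := &tokenBuckets{l0Tokens: 100, diskBWTokens: 100}
	if b.admit(40) {
		b.done(40, 10) // only 10 of the 40 bytes went into L0
	}
	fmt.Printf("l0=%d diskBW=%d\n", b.l0Tokens, b.diskBWTokens)
}
```

With a single requested quantity and a correction at completion, the
granter-requester interaction keeps one resource dimension while the two
kinds of tokens are still accounted for separately.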

This is a cleaned-up version of the prototype in
cockroachdb#82813, which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).
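
As a sketch, the two conditions combine as a simple conjunction. The struct
and function below are hypothetical; only the field name
ProvisionedBandwidth and the cluster setting name come from the text above,
and the setting is passed in as a plain bool rather than read from a real
settings registry.

```go
package main

import "fmt"

// diskStats is a hypothetical mirror of the one DiskStats field that matters here.
type diskStats struct {
	ProvisionedBandwidth int64 // bytes/sec; zero means "not provisioned/unknown"
}

// diskBandwidthLimitingEnabled reports whether disk bandwidth should be
// considered as a bottleneck. elasticEnabled stands in for the value of
// admission.disk_bandwidth_tokens.elastic.enabled.
func diskBandwidthLimitingEnabled(stats diskStats, elasticEnabled bool) bool {
	return stats.ProvisionedBandwidth != 0 && elasticEnabled
}

func main() {
	fmt.Println(diskBandwidthLimitingEnabled(diskStats{ProvisionedBandwidth: 0}, true))
	fmt.Println(diskBandwidthLimitingEnabled(diskStats{ProvisionedBandwidth: 125 << 20}, true))
}
```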

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
@tbg tbg removed their request for review August 9, 2022 09:06
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 10, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes and
  the point at which it affects actual writes in the system, since a
  compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative
  decrease (unlike what we do for flush and compaction tokens, where we
  try to calculate the sustainable rates more precisely).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on the number of regular concurrent compactions, and if it needs
  more it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It uses the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.
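
A rough sketch of the multiplexing idea: one parent holds the shared token
state, and each requester talks to its own thin child granter that tags calls
with a work class. The type and method names are hypothetical and the token
rules are simplified assumptions; this is not the interface in granter.go.

```go
package main

import "fmt"

type workClass int

const (
	regularWork workClass = iota
	elasticWork
)

// parentGranter owns the token state shared by all work classes.
type parentGranter struct {
	tokens        int64 // consumed by both regular and elastic work
	elasticTokens int64 // additionally required by elastic work
}

// tryGet grants n tokens if the class's constraints are satisfied.
func (p *parentGranter) tryGet(class workClass, n int64) bool {
	if p.tokens <= 0 {
		return false
	}
	if class == elasticWork {
		if p.elasticTokens <= 0 {
			return false
		}
		p.elasticTokens -= n
	}
	p.tokens -= n
	return true
}

// childGranter is the handle given to one requester (for example, one
// WorkQueue). It holds no state of its own and forwards to the parent.
type childGranter struct {
	parent *parentGranter
	class  workClass
}

func (c childGranter) tryGet(n int64) bool { return c.parent.tryGet(c.class, n) }

func main() {
	p := &parentGranter{tokens: 10, elasticTokens: 0}
	regular := childGranter{parent: p, class: regularWork}
	elastic := childGranter{parent: p, class: elasticWork}
	fmt.Println(regular.tryGet(5), elastic.tryGet(1), p.tokens)
}
```

Keeping the decision in the parent is one way to read "some of that logic is
moved into the granters": the requesters stay unaware of each other while
still drawing from the same tokens.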

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned-up version of the prototype in
cockroachdb#82813, which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 11, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes and
  the point at which it affects actual writes in the system, since a
  compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses an easy-to-understand small multiplicative increase and
  large multiplicative decrease (unlike what we do for flush and compaction
  tokens, where we try to calculate the sustainable rates more precisely).
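
For illustration, a small multiplicative increase paired with a large
multiplicative decrease could be written as below. The factors (1.25x up,
0.5x down) and the function name are assumptions chosen for the example; the
text above only says "small" and "large".

```go
package main

import "fmt"

// Illustrative factors expressed as integer ratios to avoid floating point.
const (
	increaseNum, increaseDen int64 = 5, 4 // small increase: 1.25x
	decreaseNum, decreaseDen int64 = 1, 2 // large decrease: 0.5x
)

// nextTokens scales the token count down sharply when the disk is overloaded
// and up gently otherwise.
func nextTokens(tokens int64, overloaded bool) int64 {
	if overloaded {
		return tokens * decreaseNum / decreaseDen
	}
	return tokens * increaseNum / increaseDen
}

func main() {
	tokens := int64(1000)
	for _, overloaded := range []bool{false, false, true} {
		tokens = nextTokens(tokens, overloaded)
		fmt.Println(tokens)
	}
}
```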

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on the number of regular concurrent compactions, and if it needs
  more it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It uses the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens. A different WorkQueue is needed to
  prevent a situation where elastic work for one tenant is queued ahead
  of regular work from another tenant, and stops the latter from making
  progress due to lack of elastic tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.
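
To illustrate why separate queues matter, here is a simplified sketch of one
granter multiplexing across a regular and an elastic requester. All type and
field names are hypothetical and do not reflect the actual interfaces in
granter.go or work_queue.go.

```go
package main

import "fmt"

// requester is a hypothetical queue of waiting work; each entry is the
// number of bytes (tokens) the queued item needs.
type requester struct {
	name    string
	waiting []int64
}

// parentGranter is a hypothetical granter that owns the shared tokens and
// multiplexes across a regular and an elastic requester.
type parentGranter struct {
	tokens        int64 // shared by regular and elastic work
	elasticTokens int64 // additional limit that only elastic work consumes
	regular       *requester
	elastic       *requester
}

// grantAll admits queued work while tokens remain and returns the names of
// the requesters that were granted, in order. Regular work is served first.
func (g *parentGranter) grantAll() []string {
	var granted []string
	for len(g.regular.waiting) > 0 && g.tokens > 0 {
		g.tokens -= g.regular.waiting[0]
		g.regular.waiting = g.regular.waiting[1:]
		granted = append(granted, g.regular.name)
	}
	for len(g.elastic.waiting) > 0 && g.tokens > 0 && g.elasticTokens > 0 {
		n := g.elastic.waiting[0]
		g.tokens -= n
		g.elasticTokens -= n
		g.elastic.waiting = g.elastic.waiting[1:]
		granted = append(granted, g.elastic.name)
	}
	return granted
}

func main() {
	g := &parentGranter{
		tokens:        100,
		elasticTokens: 0, // elastic work is currently throttled
		regular:       &requester{name: "regular", waiting: []int64{10}},
		elastic:       &requester{name: "elastic", waiting: []int64{10}},
	}
	fmt.Println(g.grantAll(), len(g.elastic.waiting))
}
```

With no elastic tokens available, the regular item is still admitted and the
elastic item stays queued. In a single shared queue, an elastic item at the
head could have blocked the regular item behind it.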

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).
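
The accounting described above can be sketched as follows. This is a
simplified model under assumed names (tokenBuckets, admit, done), not the
granter's actual API.

```go
package main

import "fmt"

// tokenBuckets is a hypothetical stand-in for the granter's token state.
type tokenBuckets struct {
	l0Tokens            int64 // tokens for bytes into L0
	diskBandwidthTokens int64 // tokens for all bytes into the LSM
}

// admit deducts the request's total incoming bytes from both buckets, since
// at admission time we do not know how many bytes will land in L0.
func (b *tokenBuckets) admit(totalBytes int64) {
	b.l0Tokens -= totalBytes
	b.diskBandwidthTokens -= totalBytes
}

// done is called when the request completes and reports how many bytes went
// into L0. The L0 tokens taken for bytes that bypassed L0 are returned; the
// disk bandwidth tokens stay consumed, since all bytes entered the LSM.
func (b *tokenBuckets) done(totalBytes, l0Bytes int64) {
	b.l0Tokens += totalBytes - l0Bytes
}

func main() {
	b := &tokenBuckets{l0Tokens: 1000, diskBandwidthTokens: 1000}
	b.admit(300)     // request writes 300 bytes in total
	b.done(300, 100) // only 100 of those bytes went into L0
	fmt.Println(b.l0Tokens, b.diskBandwidthTokens)
}
```

After the request completes, the L0 bucket has been charged 100 bytes and the
disk bandwidth bucket 300 bytes, without the requester ever asking for two
kinds of tokens.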

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).
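
Expressed as code, the gating condition is a simple conjunction. The struct
below only mirrors the DiskStats.ProvisionedBandwidth field named above; the
function name and the boolean parameter standing in for the cluster setting
are assumptions.

```go
package main

import "fmt"

// DiskStats mirrors only the field mentioned in the text; the unit
// (bytes/s) is an assumption for this sketch.
type DiskStats struct {
	ProvisionedBandwidth int64
}

// diskBandwidthLimitingActive is a hypothetical helper. settingEnabled stands
// in for the value of admission.disk_bandwidth_tokens.elastic.enabled.
func diskBandwidthLimitingActive(stats DiskStats, settingEnabled bool) bool {
	return stats.ProvisionedBandwidth != 0 && settingEnabled
}

func main() {
	fmt.Println(diskBandwidthLimitingActive(DiskStats{}, true))
	fmt.Println(diskBandwidthLimitingActive(DiskStats{ProvisionedBandwidth: 1 << 30}, true))
}
```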

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier has no effect yet,
since the integration with CockroachDB will be in a future PR).
craig bot pushed a commit that referenced this pull request Aug 12, 2022
85722: admission: add support for disk bandwidth as a bottleneck resource r=tbg,irfansharif a=sumeerbhola

85786: sql: support UDFs with named args, strictness, and volatility r=mgartner a=mgartner

#### sql: UDF with empty result should evaluate to NULL

If the last statement in a UDF returns no rows, the UDF will evaluate to
NULL. Prior to this commit the evaluation of the UDF would panic.

Release note: None

#### sql: support UDFs with named arguments

UDFs with named arguments can now be evaluated.

During query planning, statements in the function body are built with a
scope that includes the named arguments for the function as columns.
This allows references to arguments to be resolved as variables.

During evaluation, the input expressions are first evaluated into
datums. When a plan is built for each statement in the UDF, the argument
columns in the expression are replaced with the input datums before the
expression is optimized.

Note that anonymous arguments and integer references to arguments (e.g.,
`$1`) are not yet supported.

Also, the formatting of `UDFExpr`s has been improved to show argument
columns and input expressions.

Release note: None

#### sql: do not evaluate strict UDFs if any input values are NULL

A UDF can have one of two behaviors when it is invoked with NULL inputs:

  1. If the UDF is `CALLED ON NULL INPUT` (the default) then the
     function is evaluated regardless of whether or not any of the input
     values are NULL.
  2. If the UDF `RETURNS NULL ON NULL INPUT` or is `STRICT` then the
     function is not evaluated if any of the input values are NULL.
     Instead, the function directly results in NULL.

This commit implements these two behaviors.
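
The two behaviors can be modeled with a short sketch. The Datum and udf types
here are simplified stand-ins invented for illustration (nil represents SQL
NULL) and are not the types used in the SQL layer.

```go
package main

import "fmt"

// Datum is a stand-in for a SQL value in this sketch; nil represents NULL.
type Datum interface{}

// udf is a hypothetical, simplified model of a user-defined function.
type udf struct {
	calledOnNullInput bool // false models STRICT / RETURNS NULL ON NULL INPUT
	body              func(args []Datum) Datum
}

// eval short-circuits to NULL for a strict function with any NULL input and
// otherwise evaluates the body.
func (f udf) eval(args []Datum) Datum {
	if !f.calledOnNullInput {
		for _, a := range args {
			if a == nil {
				return nil
			}
		}
	}
	return f.body(args)
}

// countArgs is an example body that returns the number of arguments.
func countArgs(args []Datum) Datum {
	return len(args)
}

func main() {
	args := []Datum{1, nil}
	strict := udf{calledOnNullInput: false, body: countArgs}
	lenient := udf{calledOnNullInput: true, body: countArgs}
	fmt.Println(strict.eval(args) == nil, lenient.eval(args))
}
```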

In the future, we can add a normalization rule that folds a strict UDF
if any of its inputs are constant NULL values.

Release note: None

#### sql: make mutations visible to volatile UDFs

The volatility of a UDF affects the visibility of mutations made by the
statement calling the function. A volatile function will see these
mutations. Also, statements within a volatile function's body will see
changes made by previous statements in the function body (note that this is
left untested in this commit because we do not currently support
mutations within UDF bodies). In contrast, a stable, immutable, or
leakproof function will see a snapshot of the data as of the start of
the statement calling the function.

Release note: None


Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@irfansharif added the X-noremind label (Bots won't notify about PRs with X-noremind) on Oct 3, 2022
@irfansharif removed their request for review on October 3, 2022 16:08
@irfansharif
Contributor

This prototype was merged in #85722. I don't see any additional code here. I've linked the experimental notes above from #86857 since we want to re-run + run more experiments with this machinery.

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 2, 2023
Integration test for disk bandwidth tokens, copying over what we ran in
cockroachdb#82813. Part of cockroachdb#86857

Release note: None
Labels
A-admission-control X-noremind Bots won't notify about PRs with X-noremind