Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admission: async compaction cause disk bandwidth spikes #132332

Closed
aadityasondhi opened this issue Oct 10, 2024 · 1 comment · Fixed by #133310
Closed

admission: async compaction cause disk bandwidth spikes #132332

aadityasondhi opened this issue Oct 10, 2024 · 1 comment · Fixed by #133310
Assignees
Labels
A-admission-control branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. P-1 Issues/test failures with a fix SLA of 1 month T-admission-control Admission Control

Comments

@aadityasondhi
Copy link
Collaborator

aadityasondhi commented Oct 10, 2024

We noticed in #131484 (comment) that async compactions in pebble cause spikes in disk bandwidth, ultimately leading to the assertion failure since the disk bandwidth limiter cannot account for these reads and writes until after the 15s adjustment interval.

We can attribute these writes as error term in our write amp linear model and should account for them at a higher frequency. We discussed the implementation of such accounting internally.

Jira issue: CRDB-42982

Epic CRDB-37479

@aadityasondhi aadityasondhi added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-admission-control T-admission-control Admission Control labels Oct 10, 2024
Copy link

blathers-crl bot commented Oct 10, 2024

Hi @aadityasondhi, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@aadityasondhi aadityasondhi added branch-master Failures and bugs on the master branch. P-1 Issues/test failures with a fix SLA of 1 month labels Oct 10, 2024
@aadityasondhi aadityasondhi self-assigned this Oct 10, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 7, 2024
Previously, we would ignore errors in disk token accouting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwdith for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occured. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 7, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 7, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 8, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 8, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 8, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 8, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 13, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 18, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 18, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 18, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 19, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 21, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 21, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Dec 12, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Dec 12, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Dec 12, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Dec 12, 2024
Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes cockroachdb#132332.

Release note: None
craig bot pushed a commit that referenced this issue Dec 16, 2024
133310: admission: error accounting for disk reads and writes r=sumeerbhola a=aadityasondhi

Previously, we would ignore errors in disk token accounting hoping that
the next adjustment interval would capture them. This caused issues when
async compactions ran within an adjustment interval. These could spike
the bandwidth for a section of the 15 second interval without moving the
average too much (write-amp estimation).

This patch introduces a higher frequency error accounting mechanism. It
runs every 1s to account for any additional reads and writes that could
have occurred. Errors are only deducted if the read and/or write
bandwidth in the interval is greater than the number of tokens already
deducted in that interval.

Fixes #132332.

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
@craig craig bot closed this as completed in 460d4b0 Dec 16, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Dec 17, 2024
With cockroachdb#132332 resolved, we
can turn this test back on.

Informs: cockroachdb#132332.

Release note: None
craig bot pushed a commit that referenced this issue Dec 18, 2024
137619: roachtest: run disk bandwidth roachtest nightly r=sumeerbhola a=aadityasondhi

With #132332 resolved, we can turn this test back on.

Informs: #132332.

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. P-1 Issues/test failures with a fix SLA of 1 month T-admission-control Admission Control
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants