roachtest: admission-control/disk-bandwidth-limiter failed #136064

Closed
cockroach-teamcity opened this issue Nov 23, 2024 · 15 comments · Fixed by #137288
Labels
  • A-storage: Relating to our storage engine (Pebble) on-disk storage.
  • branch-master: Failures and bugs on the master branch.
  • branch-release-24.3.1-rc
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-2: Issues/test failures with a fix SLA of 3 months
  • T-storage: Storage Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 23, 2024

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ cea3ff5562160a3bf2802da052da2aaa40e1ccc1:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 78.917422 exceeded threshold of 78.750000, read bandwidth: 17.366953, total bandwidth: 96.284375
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/run_1

Parameters:

  • arch=amd64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

Same failure on other branches

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-44847
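
The failure message above carries four numbers: the measured write bandwidth, the threshold it exceeded, the read bandwidth, and the total. As a minimal sketch (not part of the roachtest itself), the following parses a line in that format and reports how far over the threshold the write bandwidth was. The regular expression and the names `FAILURE_RE` and `parse_failure` are assumptions made for illustration; the log does not state units, so none are assumed here.

```python
import re

# Hypothetical pattern matching the failure message format seen in this issue.
FAILURE_RE = re.compile(
    r"write bandwidth (?P<write>[\d.]+) exceeded threshold of (?P<threshold>[\d.]+), "
    r"read bandwidth: (?P<read>[\d.]+), total bandwidth: (?P<total>[\d.]+)"
)


def parse_failure(line):
    """Return the four reported values plus the overshoot as a percentage
    of the threshold, or None if the line does not match the format."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    values = {name: float(text) for name, text in match.groupdict().items()}
    values["overshoot_pct"] = (
        100.0 * (values["write"] - values["threshold"]) / values["threshold"]
    )
    return values


# The first failure reported in this issue.
first_failure = (
    "(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth "
    "78.917422 exceeded threshold of 78.750000, read bandwidth: 17.366953, "
    "total bandwidth: 96.284375"
)
report = parse_failure(first_failure)
print(f"overshoot: {report['overshoot_pct']:.2f}% of threshold")
```

For this first report the write bandwidth is roughly 0.2% over the threshold, and the reported total equals the sum of the write and read values.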

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-storage labels on Nov 23, 2024
blathers-crl bot added the A-storage label on Nov 23, 2024
@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ f717f6bd218121bb5e3376af658545f6bff30c22:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 78.983281 exceeded threshold of 78.750000, read bandwidth: 15.783750, total bandwidth: 94.767031
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 67caf19d3998bb3ca1ada7e3c14486d505b68012:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 81.545547 exceeded threshold of 78.750000, read bandwidth: 14.171250, total bandwidth: 95.716797
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/run_1

Parameters:

  • arch=amd64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

aadityasondhi removed the release-blocker label on Nov 26, 2024
@aadityasondhi
Collaborator

Known limitation being fixed in #133310

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:

(test_runner.go:2175).func1: monitorForPreemptedVMs: Preempted VMs detected: [{i-0a25b19bff3f59f3d 0001-01-01 00:00:00 +0000 UTC}]
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(admission_control_disk_bandwidth_overload.go:142).func1: failed to set kvadmission.store.provisioned_bandwidth: context canceled
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 79.809464 exceeded threshold of 78.750000, read bandwidth: 16.464888, total bandwidth: 96.274352
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=false
  • ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ b3fec61ca90095c664f6432af864f18e9946f8bb:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 78.781719 exceeded threshold of 78.750000, read bandwidth: 117.453594, total bandwidth: 196.235312
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 44de2d379610067e14a7ebfbc92e64311f13a232:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 79.187734 exceeded threshold of 78.750000, read bandwidth: 30.096172, total bandwidth: 109.283906
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=leader
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 21d1123dce5401b1ec580d4156db5db075074506:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 79.313437 exceeded threshold of 78.750000, read bandwidth: 26.561250, total bandwidth: 105.874687
(cluster.go:2456).Run: context canceled
(cluster.go:2456).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 9744e5f1676a752d5b200fe7bce84ca8b44afca0:

(admission_control_disk_bandwidth_overload.go:142).func1: failed to set kvadmission.store.provisioned_bandwidth: dial tcp 18.226.187.80:26257: connect: connection refused
(cluster.go:2467).Run: context canceled
(cluster.go:2467).Run: context canceled
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ d7ea85402dc35e36c6cc35520fa91f25fd5c999d:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 78.761797 exceeded threshold of 78.750000, read bandwidth: 15.141641, total bandwidth: 93.903437
(cluster.go:2467).Run: context canceled
(cluster.go:2467).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=leader
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 79.297812 exceeded threshold of 78.750000, read bandwidth: 16.240234, total bandwidth: 95.538047
(cluster.go:2467).Run: context canceled
(cluster.go:2467).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/run_1

Parameters:

  • arch=amd64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

aadityasondhi added the P-2 label on Dec 10, 2024
@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ a3957e4bee7b2b77632e811203a86ecaa95422ca:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 79.747969 exceeded threshold of 78.750000, read bandwidth: 15.065937, total bandwidth: 94.813906
(cluster.go:2467).Run: context canceled
(cluster.go:2467).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ d18eb683b2759fd8814dacf0baa913f596074a17:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 79.441172 exceeded threshold of 78.750000, read bandwidth: 64.641172, total bandwidth: 144.082344
(cluster.go:2467).Run: context canceled
(cluster.go:2467).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=leader
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ e8ee6f574ddf1fce1a4cb53f392c5a9baf633b76:

(admission_control_disk_bandwidth_overload.go:196).3: write bandwidth 82.867734 exceeded threshold of 78.750000, read bandwidth: 17.543672, total bandwidth: 100.411406
(cluster.go:2481).Run: context canceled
(cluster.go:2481).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

craig bot pushed a commit that referenced this issue Dec 12, 2024
135924: sql: label latency histograms with statement fingerprint r=MattWhelan a=MattWhelan

Previously, we collected a single histogram for all statements. While
this sort of metric may be useful in overall database operations,
application developers need more detail, to know how their code changes
impact their query latencies.

By switching to HistogramVec, we can now offer per-statement-fingerprint
latency metrics.

This feature is disabled by default. To know whether it's safe to enable,
we also track a cardinality estimate of the unique statement
fingerprints that we see.

TREQ: https://cockroachlabs.atlassian.net/browse/TREQ-703

Epic: none

Release note (ops change):
Introduces a new `sql.exec.latency.detail` histogram metric. This metric
is labeled with its statement fingerprint. Enable this feature using the
`sql.stats.detailed_latency_metrics.enabled` application setting.

The `sql.query.unique.count` metric is a new count metric that estimates
the cardinality of the set of all statement fingerprints. For most
workloads, this ranges from dozens to hundreds. For workloads with over
a couple thousand fingerprints, we advise caution in enabling
`sql.stats.detailed_latency_metrics.enabled`.

Benchmark results:

```
# Baseline (master branch)
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 1800.0s     6314.1  98.2%     31.2     21.0     52.4     88.1    209.7    570.4

# PR, setting disabled
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 1800.0s     6315.6  98.2%     24.2     19.9     33.6     48.2    117.4    805.3

# PR, setting enabled
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 1800.0s     6309.1  98.1%     24.2     19.9     33.6     48.2    113.2    671.1
```

136995: kvserver/rangefeed: metamorphically enable RangefeedUseBufferedSender r=stevendanna a=wenyihu6

**kvserver/rangefeed: rename .buffered_stream_sender. to .buffered_sender.**

This patch renames kv.rangefeed.buffered_stream_sender.enabled to
kv.rangefeed.buffered_sender.enabled to align with variable and
struct names better.

Epic: none
Release: none

---

**kvserver/rangefeed: metamorphically enable RangefeedUseBufferedSender**

This patch metamorphically enables the cluster setting
RangefeedUseBufferedSender.

Release note: none
Part of: #135332

137288: roachtest: make disk bandwidth test manual only r=aadityasondhi a=aadityasondhi

This test is a little too noisy until we make further improvements to the disk bandwidth limiter.

Fixes: #136064.

Release note: None

Co-authored-by: Matt Whelan
Co-authored-by: Wenyi Hu
Co-authored-by: Aaditya Sondhi
craig bot closed this as completed in 9952005 on Dec 12, 2024
github-project-automation bot moved this from Incoming to Done in [Deprecated] Storage on Dec 12, 2024

blathers-crl bot commented Dec 12, 2024

Based on the specified backports for linked PR #137288, I applied the following new label(s) to this issue: branch-release-24.3.1-rc. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Dec 12, 2024
This test is a little too noisy until we make further improvements to
the disk bandwidth limiter.

Fixes: cockroachdb#136064.

Release note: None
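
The commit message describes the test as "a little too noisy". One way to see what that means in practice is to look at how far each reported write bandwidth exceeded the 78.75 threshold. The sketch below does that using only the twelve write-bandwidth values quoted in the failure reports in this thread; the variable names and the 1% cutoff are assumptions chosen for illustration, not anything the test itself uses.

```python
# Threshold and write-bandwidth values copied from the failure messages
# in this issue (reports that failed for other reasons are excluded).
THRESHOLD = 78.75
WRITE_BANDWIDTHS = [
    78.917422, 78.983281, 81.545547, 79.809464, 78.781719, 79.187734,
    79.313437, 78.761797, 79.297812, 79.747969, 79.441172, 82.867734,
]

# Overshoot of each report, as a percentage of the threshold.
overshoots = [100.0 * (w - THRESHOLD) / THRESHOLD for w in WRITE_BANDWIDTHS]

# Hypothetical cutoff: count reports that exceeded the threshold by under 1%.
within_one_pct = sum(1 for o in overshoots if o < 1.0)

print(f"reports: {len(overshoots)}")
print(f"smallest overshoot: {min(overshoots):.3f}%")
print(f"largest overshoot: {max(overshoots):.3f}%")
print(f"under 1% over threshold: {within_one_pct} of {len(overshoots)}")
```

For the values in this thread, 8 of the 12 overshoots are under 1% of the threshold and the largest is about 5.2%, which is consistent with marginal rather than gross violations of the limit.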