roachtest: perturbation/metamorphic/backfill failed #131713

Closed
cockroach-teamcity opened this issue Oct 1, 2024 · 16 comments · Fixed by #133115
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-perturbation Bugs found by the perturbation framework O-roachtest O-robot Originated from a bot. P-1 Issues/test failures with a fix SLA of 1 month T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 1, 2024

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 74333311616b937fea6a995462215a1cb5962686:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:739
	            				main/pkg/cmd/roachtest/test_runner.go:1284
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/backfill
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=32
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-42656

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Oct 1, 2024
@kvoli
Collaborator

kvoli commented Oct 2, 2024

Unacceptable change in latency from the backfill caused the failure.

A few odd things stuck out, such as the disk read bandwidth and IO overload at the start, prior to the workload:

image

I don't think these caused the failure, but they seem unexpected to me: the fill absolutely decimates L0. It'd probably be a good test in and of itself for RAC v2 with a send queue.

Didn't look too much further for the actual failure -- The metamorphic vars might be playing a part here.

We should consider disabling the pass-fail criteria for the metamorphic variants until after we enable RACv2 fully on master, post branch-cut in a month or so. cc @andrewbaptist

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 2, 2024
@andrewbaptist
Collaborator

The IO overload during the fill is expected without RAC v2 since it is non-elastic writes that are sent to a non-leaseholder. The test is set up to wait until the overload dissipates before running the rest of it.

In terms of this failure, I had left this as "non-Infinity" passing criteria as we had thought that RAC v1 should have addressed this type of issue.

In terms of disabling for metamorphic, I'm adding notes so we can easily recreate it later:

COCKROACH_RANDOM_SEED=1705968366363561838

2024/10/01 19:52:49 admission_control_latency.go:628: test variations are: seed: 1705968366363561838, fillDuration: 10m0s, maxBlockBytes: 4096, perturbationDuration: 10m0s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 10000, numNodes: 30, numWorkloadNodes: 2, partitionSite: true, vcpu: 32, disks: 2, leaseType: epoch, cloud: gce

2024/10/01 20:47:42 admission_control_latency.go:725: Baseline stats
follower-read  : 2024-10-01 20:33:50 +0000 UTC: score: 8.432262ms, qps: 22952, p50: 1ms, p99: 2.4ms, pMax: 14.2ms
read           : 2024-10-01 20:32:26 +0000 UTC: score: 7.962909ms, qps: 23467, p50: 1ms, p99: 2.1ms, pMax: 26.2ms
write          : 2024-10-01 20:33:27 +0000 UTC: score: 9.930093ms, qps: 46244, p50: 1.9ms, p99: 7.6ms, pMax: 26.2ms

2024/10/01 20:47:42 admission_control_latency.go:726: Perturbation stats
follower-read  : 2024-10-01 20:36:31 +0000 UTC: score: 102.991568ms, qps: 10684, p50: 1.2ms, p99: 234.9ms, pMax: 637.5ms
read           : 2024-10-01 20:36:31 +0000 UTC: score: 101.17895ms, qps: 10667, p50: 1ms, p99: 226.5ms, pMax: 335.5ms
write          : 2024-10-01 20:36:31 +0000 UTC: score: 73.401832ms, qps: 21853, p50: 2ms, p99: 243.3ms, pMax: 352.3ms

2024/10/01 20:47:42 admission_control_latency.go:727: Recovery stats
follower-read  : 2024-10-01 20:47:00 +0000 UTC: score: 320.608238ms, qps: 5333, p50: 1.2ms, p99: 1.1409s, pMax: 1.4093s
read           : 2024-10-01 20:47:00 +0000 UTC: score: 316.662676ms, qps: 5144, p50: 1ms, p99: 1.0737s, pMax: 1.2751s
write          : 2024-10-01 20:47:00 +0000 UTC: score: 250.24621ms, qps: 10816, p50: 1.9ms, p99: 1.4093s, pMax: 1.6106s

2024/10/01 20:48:06 admission_control_latency.go:735: validating stats during the perturbation
2024/10/01 20:48:06 admission_control_latency.go:794: PASSED : follower-read  : Increase 12.2140 <= 20.0000 BASE: 8.432262ms SCORE: 102.991568ms
2024/10/01 20:48:06 admission_control_latency.go:794: PASSED : read           : Increase 12.7063 <= 20.0000 BASE: 7.962909ms SCORE: 101.17895ms
2024/10/01 20:48:06 admission_control_latency.go:794: PASSED : write          : Increase 7.3919 <= 20.0000 BASE: 9.930093ms SCORE: 73.401832ms
2024/10/01 20:48:06 admission_control_latency.go:737: validating stats after the perturbation
2024/10/01 20:48:06 admission_control_latency.go:791: FAILURE: follower-read  : Increase 38.0216 > 20.0000 BASE: 8.432262ms SCORE: 320.608238ms
2024/10/01 20:48:06 admission_control_latency.go:791: FAILURE: read           : Increase 39.7672 > 20.0000 BASE: 7.962909ms SCORE: 316.662676ms
2024/10/01 20:48:06 admission_control_latency.go:791: FAILURE: write          : Increase 25.2008 > 20.0000 BASE: 9.930093ms SCORE: 250.24621ms

I'll also submit a PR to bump the passing criteria to be 40 rather than 20. I want to keep failures for metamorphic tests in place; otherwise it will be hard to notice when they are failing. However, we can leave them as P3 until we decide to tackle them.

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 0c0af9540ed3f9d63eba523bc870eeb6c7eebe90:

(admission_control_latency.go:855).waitForIOOverloadToEnd: dial tcp 104.196.20.150:26257: connect: connection refused
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@andrewbaptist
Collaborator

This failure is a process crash due to out of memory. The target node, n30, was killed by the kernel at 12:23:57 according to the logs and the nodes.json file.

CRDB runtime stats
I241003 12:19:15.013352 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 90 runtime stats: 522 MiB RSS, 1546 goroutines (stacks: 26 MiB), 367 MiB/422 MiB Go alloc/total (heap fragmentation: 17 MiB, heap reserved: 168 KiB, heap released: 1.8 MiB), 10 MiB/18 MiB CGO alloc/total (0.0 CGO/sec), 0.0/0.0 %(u/s)time, 0.0 %gc (7x), 3.5 MiB/2.3 MiB (r/w)net
I241003 12:19:25.013029 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 100 runtime stats: 626 MiB RSS, 1652 goroutines (stacks: 24 MiB), 119 MiB/527 MiB Go alloc/total (heap fragmentation: 90 MiB, heap reserved: 279 MiB, heap released: 3.8 MiB), 14 MiB/22 MiB CGO alloc/total (265.5 CGO/sec), 18.8/5.5 %(u/s)time, 0.0 %gc (8x), 1.2 MiB/1.5 MiB (r/w)net
I241003 12:19:35.012879 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 103 runtime stats: 628 MiB RSS, 1643 goroutines (stacks: 25 MiB), 174 MiB/527 MiB Go alloc/total (heap fragmentation: 60 MiB, heap reserved: 254 MiB, heap released: 3.8 MiB), 14 MiB/22 MiB CGO alloc/total (4.2 CGO/sec), 9.5/4.3 %(u/s)time, 0.0 %gc (8x), 693 KiB/1.0 MiB (r/w)net
I241003 12:19:45.013201 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 105 runtime stats: 2.4 GiB RSS, 1659 goroutines (stacks: 23 MiB), 1.5 GiB/1.8 GiB Go alloc/total (heap fragmentation: 66 MiB, heap reserved: 176 MiB, heap released: 40 MiB), 528 MiB/561 MiB CGO alloc/total (674.7 CGO/sec), 129.4/133.5 %(u/s)time, 0.1 %gc (16x), 973 MiB/4.1 MiB (r/w)net
I241003 12:19:55.014263 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 107 runtime stats: 2.3 GiB RSS, 1655 goroutines (stacks: 23 MiB), 1.3 GiB/1.9 GiB Go alloc/total (heap fragmentation: 58 MiB, heap reserved: 437 MiB, heap released: 115 MiB), 330 MiB/373 MiB CGO alloc/total (1800.5 CGO/sec), 202.7/180.6 %(u/s)time, 0.1 %gc (24x), 1.0 GiB/4.8 MiB (r/w)net
I241003 12:20:05.015152 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 110 runtime stats: 2.5 GiB RSS, 1665 goroutines (stacks: 20 MiB), 671 MiB/2.0 GiB Go alloc/total (heap fragmentation: 70 MiB, heap reserved: 1.2 GiB, heap released: 3.7 MiB), 394 MiB/434 MiB CGO alloc/total (1814.2 CGO/sec), 200.6/169.5 %(u/s)time, 0.0 %gc (31x), 954 MiB/5.9 MiB (r/w)net
I241003 12:20:15.016002 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 114 runtime stats: 2.5 GiB RSS, 1671 goroutines (stacks: 24 MiB), 1.6 GiB/2.0 GiB Go alloc/total (heap fragmentation: 57 MiB, heap reserved: 243 MiB, heap released: 50 MiB), 392 MiB/434 MiB CGO alloc/total (1800.8 CGO/sec), 196.1/170.3 %(u/s)time, 0.0 %gc (36x), 869 MiB/4.9 MiB (r/w)net
I241003 12:20:25.014170 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 118 runtime stats: 2.5 GiB RSS, 1667 goroutines (stacks: 24 MiB), 1.3 GiB/1.9 GiB Go alloc/total (heap fragmentation: 63 MiB, heap reserved: 554 MiB, heap released: 224 MiB), 394 MiB/435 MiB CGO alloc/total (1741.8 CGO/sec), 189.1/172.8 %(u/s)time, 0.0 %gc (42x), 927 MiB/5.0 MiB (r/w)net
I241003 12:20:35.014458 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 120 runtime stats: 2.7 GiB RSS, 1673 goroutines (stacks: 17 MiB), 474 MiB/2.1 GiB Go alloc/total (heap fragmentation: 114 MiB, heap reserved: 1.5 GiB, heap released: 256 MiB), 398 MiB/440 MiB CGO alloc/total (1636.5 CGO/sec), 178.6/166.9 %(u/s)time, 0.1 %gc (48x), 886 MiB/4.5 MiB (r/w)net
I241003 12:20:45.016764 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 122 runtime stats: 2.5 GiB RSS, 1674 goroutines (stacks: 22 MiB), 1.6 GiB/1.9 GiB Go alloc/total (heap fragmentation: 70 MiB, heap reserved: 265 MiB, heap released: 442 MiB), 464 MiB/509 MiB CGO alloc/total (1653.8 CGO/sec), 179.2/171.0 %(u/s)time, 0.1 %gc (54x), 976 MiB/4.9 MiB (r/w)net
I241003 12:20:55.025204 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 130 runtime stats: 3.2 GiB RSS, 1717 goroutines (stacks: 25 MiB), 2.0 GiB/2.5 GiB Go alloc/total (heap fragmentation: 62 MiB, heap reserved: 346 MiB, heap released: 72 MiB), 645 MiB/704 MiB CGO alloc/total (2426.1 CGO/sec), 162.0/164.5 %(u/s)time, 0.0 %gc (59x), 809 MiB/4.4 MiB (r/w)net
I241003 12:21:05.019299 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 133 runtime stats: 4.0 GiB RSS, 1716 goroutines (stacks: 25 MiB), 1.8 GiB/2.5 GiB Go alloc/total (heap fragmentation: 60 MiB, heap reserved: 531 MiB, heap released: 70 MiB), 1.3 GiB/1.4 GiB CGO alloc/total (6643.1 CGO/sec), 185.2/177.5 %(u/s)time, 0.0 %gc (65x), 817 MiB/6.8 MiB (r/w)net
I241003 12:21:15.022883 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 135 runtime stats: 4.1 GiB RSS, 1711 goroutines (stacks: 24 MiB), 1.1 GiB/2.1 GiB Go alloc/total (heap fragmentation: 71 MiB, heap reserved: 863 MiB, heap released: 638 MiB), 1.6 GiB/1.9 GiB CGO alloc/total (6204.3 CGO/sec), 184.8/176.3 %(u/s)time, 0.0 %gc (71x), 802 MiB/15 MiB (r/w)net
I241003 12:21:25.024510 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 138 runtime stats: 4.0 GiB RSS, 1714 goroutines (stacks: 25 MiB), 1.4 GiB/1.9 GiB Go alloc/total (heap fragmentation: 63 MiB, heap reserved: 402 MiB, heap released: 820 MiB), 1.7 GiB/2.1 GiB CGO alloc/total (6264.9 CGO/sec), 146.4/151.8 %(u/s)time, 0.0 %gc (77x), 716 MiB/8.1 MiB (r/w)net
I241003 12:21:35.028061 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 143 runtime stats: 5.4 GiB RSS, 1749 goroutines (stacks: 26 MiB), 2.6 GiB/2.7 GiB Go alloc/total (heap fragmentation: 58 MiB, heap reserved: 28 MiB, heap released: 1.0 MiB), 1.9 GiB/2.7 GiB CGO alloc/total (8533.7 CGO/sec), 179.0/183.2 %(u/s)time, 0.0 %gc (83x), 1.1 GiB/5.0 MiB (r/w)net
I241003 12:21:45.030811 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 145 runtime stats: 5.8 GiB RSS, 1768 goroutines (stacks: 26 MiB), 1.5 GiB/3.0 GiB Go alloc/total (heap fragmentation: 79 MiB, heap reserved: 1.4 GiB, heap released: 34 MiB), 2.0 GiB/2.8 GiB CGO alloc/total (12846.7 CGO/sec), 201.2/200.8 %(u/s)time, 0.0 %gc (89x), 1.2 GiB/4.1 MiB (r/w)net
I241003 12:21:55.039574 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 147 runtime stats: 5.0 GiB RSS, 1766 goroutines (stacks: 21 MiB), 884 MiB/2.6 GiB Go alloc/total (heap fragmentation: 78 MiB, heap reserved: 1.7 GiB, heap released: 605 MiB), 2.1 GiB/2.3 GiB CGO alloc/total (12203.3 CGO/sec), 192.7/198.2 %(u/s)time, 0.1 %gc (96x), 1.3 GiB/4.0 MiB (r/w)net
I241003 12:22:05.036282 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 150 runtime stats: 4.8 GiB RSS, 1768 goroutines (stacks: 27 MiB), 1.7 GiB/2.5 GiB Go alloc/total (heap fragmentation: 67 MiB, heap reserved: 781 MiB, heap released: 708 MiB), 2.1 GiB/2.2 GiB CGO alloc/total (13158.7 CGO/sec), 186.9/197.2 %(u/s)time, 0.1 %gc (101x), 1.0 GiB/3.9 MiB (r/w)net
I241003 12:22:15.044622 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 152 runtime stats: 5.2 GiB RSS, 1765 goroutines (stacks: 19 MiB), 590 MiB/3.0 GiB Go alloc/total (heap fragmentation: 112 MiB, heap reserved: 2.2 GiB, heap released: 279 MiB), 2.0 GiB/2.1 GiB CGO alloc/total (17985.9 CGO/sec), 188.2/205.8 %(u/s)time, 0.1 %gc (107x), 1.1 GiB/3.7 MiB (r/w)net
I241003 12:22:25.051695 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 154 runtime stats: 5.1 GiB RSS, 1766 goroutines (stacks: 27 MiB), 2.9 GiB/3.0 GiB Go alloc/total (heap fragmentation: 53 MiB, heap reserved: 25 MiB, heap released: 204 MiB), 1.9 GiB/2.0 GiB CGO alloc/total (31842.0 CGO/sec), 205.4/219.8 %(u/s)time, 0.0 %gc (111x), 1.1 GiB/4.3 MiB (r/w)net
I241003 12:22:35.058734 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 159 runtime stats: 5.3 GiB RSS, 1771 goroutines (stacks: 24 MiB), 1.1 GiB/3.2 GiB Go alloc/total (heap fragmentation: 79 MiB, heap reserved: 2.0 GiB, heap released: 349 MiB), 2.0 GiB/2.1 GiB CGO alloc/total (17486.3 CGO/sec), 218.0/219.0 %(u/s)time, 0.0 %gc (117x), 1.2 GiB/6.7 MiB (r/w)net
I241003 12:22:45.076458 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 161 runtime stats: 5.1 GiB RSS, 1774 goroutines (stacks: 21 MiB), 773 MiB/2.8 GiB Go alloc/total (heap fragmentation: 88 MiB, heap reserved: 1.9 GiB, heap released: 724 MiB), 2.1 GiB/2.2 GiB CGO alloc/total (13290.0 CGO/sec), 188.3/200.7 %(u/s)time, 0.0 %gc (123x), 1.0 GiB/6.6 MiB (r/w)net
I241003 12:22:55.074632 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 163 runtime stats: 5.3 GiB RSS, 1774 goroutines (stacks: 27 MiB), 1.6 GiB/3.1 GiB Go alloc/total (heap fragmentation: 78 MiB, heap reserved: 1.4 GiB, heap released: 399 MiB), 2.0 GiB/2.2 GiB CGO alloc/total (36311.6 CGO/sec), 201.7/205.5 %(u/s)time, 0.0 %gc (127x), 942 MiB/6.2 MiB (r/w)net
I241003 12:23:05.046807 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 169 runtime stats: 5.5 GiB RSS, 1770 goroutines (stacks: 27 MiB), 2.0 GiB/3.2 GiB Go alloc/total (heap fragmentation: 68 MiB, heap reserved: 1.1 GiB, heap released: 327 MiB), 2.1 GiB/2.2 GiB CGO alloc/total (35743.1 CGO/sec), 193.6/193.9 %(u/s)time, 0.0 %gc (131x), 847 MiB/5.5 MiB (r/w)net
I241003 12:23:15.098309 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 173 runtime stats: 5.1 GiB RSS, 1767 goroutines (stacks: 27 MiB), 1.9 GiB/2.9 GiB Go alloc/total (heap fragmentation: 68 MiB, heap reserved: 844 MiB, heap released: 702 MiB), 2.1 GiB/2.2 GiB CGO alloc/total (32728.1 CGO/sec), 187.0/192.3 %(u/s)time, 0.0 %gc (135x), 899 MiB/5.8 MiB (r/w)net
I241003 12:23:25.082783 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 175 runtime stats: 5.3 GiB RSS, 1780 goroutines (stacks: 28 MiB), 2.3 GiB/2.8 GiB Go alloc/total (heap fragmentation: 65 MiB, heap reserved: 481 MiB, heap released: 720 MiB), 2.3 GiB/2.4 GiB CGO alloc/total (20698.2 CGO/sec), 177.5/189.3 %(u/s)time, 0.0 %gc (139x), 838 MiB/4.5 MiB (r/w)net
I241003 12:23:35.109772 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 177 runtime stats: 5.5 GiB RSS, 1780 goroutines (stacks: 28 MiB), 2.6 GiB/2.9 GiB Go alloc/total (heap fragmentation: 64 MiB, heap reserved: 165 MiB, heap released: 642 MiB), 2.4 GiB/2.5 GiB CGO alloc/total (64439.7 CGO/sec), 241.2/240.0 %(u/s)time, 0.0 %gc (144x), 1.0 GiB/6.6 MiB (r/w)net
I241003 12:23:45.093370 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 181 runtime stats: 6.7 GiB RSS, 1810 goroutines (stacks: 28 MiB), 2.9 GiB/4.0 GiB Go alloc/total (heap fragmentation: 56 MiB, heap reserved: 933 MiB, heap released: 4.0 MiB), 2.7 GiB/2.9 GiB CGO alloc/total (25762.5 CGO/sec), 201.3/213.6 %(u/s)time, 0.0 %gc (149x), 1.1 GiB/4.3 MiB (r/w)net
I241003 12:23:55.203032 318 2@server/status/runtime_log.go:43 ⋮ [T1,Vsystem,n30] 185 runtime stats: 6.8 GiB RSS, 1810 goroutines (stacks: 28 MiB), 2.4 GiB/3.6 GiB Go alloc/total (heap fragmentation: 64 MiB, heap reserved: 1.1 GiB, heap released: 411 MiB), 3.1 GiB/3.3 GiB CGO alloc/total (74758.0 CGO/sec), 241.8/264.3 %(u/s)time, 0.0 %gc (153x), 1.1 GiB/4.0 MiB (r/w)net
dmesg log
[Thu Oct 3 12:23:57 2024] cockroach invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
[Thu Oct 3 12:23:57 2024] CPU: 6 PID: 16433 Comm: cockroach Not tainted 6.5.0-1016-gcp #16~22.04.1-Ubuntu
[Thu Oct 3 12:23:57 2024] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
[Thu Oct 3 12:23:57 2024] Call Trace:
[Thu Oct 3 12:23:57 2024]
[Thu Oct 3 12:23:57 2024] dump_stack_lvl+0x48/0x70
[Thu Oct 3 12:23:57 2024] dump_stack+0x10/0x20
[Thu Oct 3 12:23:57 2024] dump_header+0x50/0x270
[Thu Oct 3 12:23:57 2024] oom_kill_process+0x10d/0x1c0
[Thu Oct 3 12:23:57 2024] out_of_memory+0x103/0x340
[Thu Oct 3 12:23:57 2024] __alloc_pages_may_oom+0x112/0x1e0
[Thu Oct 3 12:23:57 2024] __alloc_pages_slowpath.constprop.0+0x462/0x9d0
[Thu Oct 3 12:23:57 2024] __alloc_pages+0x304/0x330
[Thu Oct 3 12:23:57 2024] __folio_alloc+0x1d/0x60
[Thu Oct 3 12:23:57 2024] ? policy_node+0x69/0x80
[Thu Oct 3 12:23:57 2024] vma_alloc_folio+0x9f/0x3d0
[Thu Oct 3 12:23:57 2024] ? task_tick_fair+0x87/0x690
[Thu Oct 3 12:23:57 2024] do_anonymous_page+0x76/0x350
[Thu Oct 3 12:23:57 2024] handle_pte_fault+0x16e/0x170
[Thu Oct 3 12:23:57 2024] __handle_mm_fault+0x666/0x730
[Thu Oct 3 12:23:57 2024] handle_mm_fault+0x14e/0x360
[Thu Oct 3 12:23:57 2024] do_user_addr_fault+0x14b/0x670
[Thu Oct 3 12:23:57 2024] exc_page_fault+0x83/0x190
[Thu Oct 3 12:23:57 2024] asm_exc_page_fault+0x27/0x30
[Thu Oct 3 12:23:57 2024] RIP: 0033:0x4d9a6b
[Thu Oct 3 12:23:57 2024] Code: 1f f8 c3 f3 44 0f 7f 3f f3 44 0f 7f 7c 1f f0 c3 f3 44 0f 7f 3f f3 44 0f 7f 7f 10 f3 44 0f 7f 7c 1f e0 f3 44 0f 7f 7c 1f f0 c3 44 0f 7f 3f f3 44 0f 7f 7f 10 f3 44 0f 7f 7f 20 f3 44 0f 7f 7f
[Thu Oct 3 12:23:57 2024] RSP: 002b:000000c01535b630 EFLAGS: 00010287
[Thu Oct 3 12:23:57 2024] RAX: 0000000000000000 RBX: 0000000000000076 RCX: 0000000000000000
[Thu Oct 3 12:23:57 2024] RDX: 000000c0f5002f8a RSI: 0000000000002800 RDI: 000000c0f5002f8a
[Thu Oct 3 12:23:57 2024] RBP: 000000c01535b698 R08: 0000000000000000 R09: 0000000000000001
[Thu Oct 3 12:23:57 2024] R10: 00007873c768d910 R11: 0000000000000000 R12: 000000c0f5000800
[Thu Oct 3 12:23:57 2024] R13: 0000000000000000 R14: 000000c014240c40 R15: 3fffffffffffffff
[Thu Oct 3 12:23:57 2024]
[Thu Oct 3 12:23:57 2024] Mem-Info:
[Thu Oct 3 12:23:57 2024] active_anon:257360 inactive_anon:1598281 isolated_anon:0
[Thu Oct 3 12:23:57 2024] Node 0 active_anon:1029440kB inactive_anon:6393124kB active_file:8424kB inactive_file:32592kB unevictable:27680kB isolated(anon):0kB isolated(file):0kB mapped:30676kB dirty:8016kB writeback:0kB shmem:1324kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB kernel_stack:4304kB pagetables:22404kB sec_pagetables:0kB all_unreclaimable? no
[Thu Oct 3 12:23:57 2024] Node 0 DMA free:14336kB boost:0kB min:124kB low:152kB high:180kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15920kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Thu Oct 3 12:23:57 2024] lowmem_reserve[]: 0 2988 7905 7905 7905
[Thu Oct 3 12:23:57 2024] Node 0 DMA32 free:105436kB boost:57364kB min:82864kB low:89236kB high:95608kB reserved_highatomic:30720KB active_anon:312396kB inactive_anon:2528988kB active_file:588kB inactive_file:9588kB unevictable:0kB writepending:3236kB present:3126072kB managed:3060504kB mlocked:0kB bounce:0kB free_pcp:192kB local_pcp:0kB free_cma:0kB
[Thu Oct 3 12:23:57 2024] lowmem_reserve[]: 0 0 4916 4916 4916
[Thu Oct 3 12:23:57 2024] Node 0 Normal free:96324kB boost:94392kB min:136344kB low:146832kB high:157320kB reserved_highatomic:6144KB active_anon:715196kB inactive_anon:3864952kB active_file:6060kB inactive_file:21940kB unevictable:27680kB writepending:4936kB present:5242880kB managed:5042836kB mlocked:27680kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Thu Oct 3 12:23:57 2024] lowmem_reserve[]: 0 0 0 0 0
[Thu Oct 3 12:23:57 2024] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
[Thu Oct 3 12:23:57 2024] Node 0 DMA32: 471*4kB (UMEH) 269*8kB (UMEH) 6202*16kB (UMH) 70*32kB (UMH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 105508kB
[Thu Oct 3 12:23:57 2024] Node 0 Normal: 1020*4kB (UME) 2132*8kB (UME) 4599*16kB (UMEH) 25*32kB (UMH) 6*64kB (MH) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 96032kB
[Thu Oct 3 12:23:57 2024] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Thu Oct 3 12:23:57 2024] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Thu Oct 3 12:23:57 2024] 12916 total pagecache pages
[Thu Oct 3 12:23:57 2024] 0 pages in swap cache
[Thu Oct 3 12:23:57 2024] Free swap = 0kB
[Thu Oct 3 12:23:57 2024] Total swap = 0kB
[Thu Oct 3 12:23:57 2024] 2096218 pages RAM
[Thu Oct 3 12:23:57 2024] 0 pages HighMem/MovableOnly
[Thu Oct 3 12:23:57 2024] 66543 pages reserved
[Thu Oct 3 12:23:57 2024] 0 pages hwpoisoned
[Thu Oct 3 12:23:57 2024] Tasks state (memory values in pages):
[Thu Oct 3 12:23:57 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Thu Oct 3 12:23:57 2024] [ 166] 0 166 12297 1312 106496 0 -250 systemd-journal
[Thu Oct 3 12:23:57 2024] [ 207] 0 207 72337 6848 110592 0 -1000 multipathd
[Thu Oct 3 12:23:57 2024] [ 217] 0 217 2904 1147 65536 0 -1000 systemd-udevd
[Thu Oct 3 12:23:57 2024] [ 442] 100 442 4065 992 73728 0 0 systemd-network
[Thu Oct 3 12:23:57 2024] [ 445] 101 445 6385 1856 90112 0 0 systemd-resolve
[Thu Oct 3 12:23:57 2024] [ 560] 102 560 2151 960 57344 0 -900 dbus-daemon
[Thu Oct 3 12:23:57 2024] [ 577] 0 577 402671 2466 249856 0 0 google_osconfig
[Thu Oct 3 12:23:57 2024] [ 591] 0 591 8271 3104 102400 0 0 networkd-dispat
[Thu Oct 3 12:23:57 2024] [ 604] 104 604 55601 1376 81920 0 0 rsyslogd
[Thu Oct 3 12:23:57 2024] [ 618] 0 618 440722 3596 299008 0 -900 snapd
[Thu Oct 3 12:23:57 2024] [ 757] 0 757 569403 3034 266240 0 -999 google_guest_ag
[Thu Oct 3 12:23:57 2024] [ 767] 0 767 1555 544 53248 0 0 agetty
[Thu Oct 3 12:23:57 2024] [ 776] 0 776 1544 544 49152 0 0 agetty
[Thu Oct 3 12:23:57 2024] [ 784] 0 784 58861 1193 94208 0 0 polkitd
[Thu Oct 3 12:23:57 2024] [ 1016] 0 1016 4114 1248 77824 0 0 systemd-logind
[Thu Oct 3 12:23:57 2024] [ 9925] 0 9925 74021 1664 163840 0 0 packagekitd
[Thu Oct 3 12:23:57 2024] [ 13930] 113 13930 4730 807 57344 0 0 chronyd
[Thu Oct 3 12:23:57 2024] [ 13931] 113 13931 2648 433 57344 0 0 chronyd
[Thu Oct 3 12:23:57 2024] [ 14799] 0 14799 3859 1280 65536 0 -1000 sshd
[Thu Oct 3 12:23:57 2024] [ 15997] 0 15997 177348 4573 196608 0 0 side-eye-agent
[Thu Oct 3 12:23:57 2024] [ 16348] 1000 16348 1941 544 57344 0 0 bash
[Thu Oct 3 12:23:57 2024] [ 16353] 1000 16353 3282142 1839794 20742144 0 0 cockroach
[Thu Oct 3 12:23:57 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=cockroach-system.service,mems_allowed=0,global_oom,task_memcg=/system.slice/cockroach-system.service,task=cockroach,pid=16353,uid=1000
[Thu Oct 3 12:23:57 2024] Out of memory: Killed process 16353 (cockroach) total-vm:13128568kB, anon-rss:7335368kB, file-rss:23680kB, shmem-rss:0kB, UID:1000 pgtables:20256kB oom_score_adj:0
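To put the page counts from the OOM report into more familiar units, here is a small Go sketch that converts them to GiB. It assumes 4 KiB pages, which is consistent with the `kB` totals in the report but is not stated there explicitly; the helper name is hypothetical.

```go
package main

import "fmt"

// pageSize is an assumption: 4 KiB pages, the usual x86-64 default.
const pageSize = 4096

// pagesToGiB converts a page count from the OOM report into GiB.
func pagesToGiB(pages int64) float64 {
	return float64(pages*pageSize) / float64(1<<30)
}

func main() {
	// "2096218 pages RAM" from the Mem-Info section.
	fmt.Printf("total RAM:     %.2f GiB\n", pagesToGiB(2096218))
	// rss column of the cockroach process (pid 16353) in the task table.
	fmt.Printf("cockroach RSS: %.2f GiB\n", pagesToGiB(1839794))
}
```

Under that assumption the node had roughly 8 GiB of RAM and the cockroach process held roughly 7 GiB of it when it was killed, in line with the 6.8 GiB RSS in the last runtime stats entry.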

@andrewbaptist andrewbaptist added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Oct 3, 2024
@blathers-crl blathers-crl bot added the T-storage Storage Team label Oct 3, 2024
@andrewbaptist
Collaborator

andrewbaptist commented Oct 3, 2024

Adding storage and bumping to a P-1 since this is a process crash during a normal operation. Attaching memory profiles and stats.
memmonitoring.2024-10-03T12_23_45.288.3162234480.txt
memstats.2024-10-03T12_23_55.207.7270998016.txt
memprof.2024-10-03T12_23_45.093.3162234480.pprof.zip

@andrewbaptist andrewbaptist added P-1 Issues/test failures with a fix SLA of 1 month and removed P-3 Issues/test failures with no fix SLA labels Oct 3, 2024
@andrewbaptist
Collaborator

Reproduction steps using roachprod (this doesn't quite reproduce the OOM, but it shows a large memory spike).

roachprod create -n 32 $CLUSTER
roachprod put $CLUSTER artifacts/cockroach
roachprod start $CLUSTER:1-30
roachprod ssh $CLUSTER:1 "./cockroach workload init kv --splits 240 {pgurl:1}"

roachprod sql $CLUSTER:1
ALTER DATABASE kv CONFIGURE ZONE USING constraints='{"+node30":1}', lease_preferences='[[-node30]]', num_replicas=3;
SET CLUSTER SETTING bulkio.index_backfill.batch_size = 5000;

roachprod ssh $CLUSTER:31-32 "./cockroach workload run kv --max-block-bytes=10000 --min-block-bytes=10000 --concurrency=100 {pgurl:1-30}"

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ f842c3b4b5adc040d411bd17d7d10005273fc1b6:

(cluster.go:2478).Run: full command output in run_121620.643905128_n31-32_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ dcce4cafa234525fc859d32745c11ed87890dc7b:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:657
	            				main/pkg/cmd/roachtest/test_runner.go:1279
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	full command output in run_122746.603454352_n1_cockroach-workload-i.log: COMMAND_PROBLEM: exit status 1
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | main.(*clusterImpl).RunE
	            	  | 	main/pkg/cmd/roachtest/cluster.go:2522
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.kvWorkload.initWorkload
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:898
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.variations.runTest
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:657
	            	  | main.(*testRunner).runTest.func2
	            	  | 	main/pkg/cmd/roachtest/test_runner.go:1279
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1695
	            	Wraps: (2) full command output in run_122746.603454352_n1_cockroach-workload-i.log
	            	Wraps: (3) Node 1. Command with error:
	            	  | ``````
	            	  | ./cockroach workload init kv --db target --splits 1 {pgurl:1}
	            	  | ``````
	            	  | stdout: <empty>
	            	  | stderr:I241006 12:27:48.156007 1 workload/cli/run.go:665  [-] 1  random seed: -1802708348205723833
	            	  | I241006 12:27:48.318399 1 workload/workloadsql/workloadsql.go:120  [-] 2  starting 1 splits
	            	  | Error: executing ALTER TABLE kv SPLIT AT VALUES (-1): dial tcp 10.142.2.14:26257: connect: connection refused
	            	Wraps: (4) COMMAND_PROBLEM
	            	Wraps: (5) exit status 1
	            	Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *hintdetail.withDetail (4) errors.Cmd (5) *exec.ExitError
	Test:       	perturbation/metamorphic/backfill
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@itsbilal itsbilal moved this from Incoming to Tests (failures, skipped, flakes) in [Deprecated] Storage Oct 8, 2024
@itsbilal itsbilal moved this from Tests (failures, skipped, flakes) to Incoming in [Deprecated] Storage Oct 8, 2024
@sumeerbhola
Collaborator

@andrewbaptist what is the expectation regarding #131713 (comment)?

AC/Storage doesn't in general investigate OOMs (AC has no awareness of memory) unless there is a clear sign that cgo memory usage was too high.

The profile only shows 1GB of memory. How much memory were these nodes provisioned with?

@andrewbaptist
Collaborator

I'll have to run again to get a memory profile, but I believe the memory is all in Raft / Storage and encoding/decoding of Batch and Raft protobufs. This was a simple KV workload with relatively low concurrency, so it shouldn't have used much memory above.

This is a metamorphic test, so it runs with a different number of CPUs (and therefore a different amount of memory) on each run. The reproduction steps above used only 4 vCPUs.

I'm not sure who should own this, but it was concerning that we had an OOM with this test.

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ fd4b1464dbd6e385c6e51af26fe294fd2023a259:

(cluster.go:2478).Run: full command output in run_113409.776231910_n31-32_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 645eb8c99796b3b88f5631aa0fc92a011010ce64:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:763
	            				main/pkg/cmd/roachtest/test_runner.go:1279
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	full command output in run_120703.119972637_n1_cockroach-workload-i.log: COMMAND_PROBLEM: exit status 1
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | main.(*clusterImpl).RunE
	            	  | 	main/pkg/cmd/roachtest/cluster.go:2493
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.kvWorkload.initWorkload
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:1007
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.variations.runTest
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:763
	            	  | main.(*testRunner).runTest.func2
	            	  | 	main/pkg/cmd/roachtest/test_runner.go:1279
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1695
	            	Wraps: (2) full command output in run_120703.119972637_n1_cockroach-workload-i.log
	            	Wraps: (3) Node 1. Command with error:
	            	  | ``````
	            	  | ./cockroach workload init kv --db target --splits 10000 {pgurl:1}
	            	  | ``````
	            	  | <truncated> ... ":0,"next":90,"state":"StateSnapshot"}},"leadtransferee":"0"}: have been waiting 61.00s for slow proposal RequestLease [/Table/109/1/771919047579977796]
	            	  | Error: executing ALTER TABLE kv SPLIT AT VALUES (775608027496728028): pq: replica unavailable: (n4,s4):2 unable to serve request to r5961:/Table/109/1/77{3763537538352912-7452517455103144} [(n5,s5):1VOTER_DEMOTING_LEARNER, (n4,s4):2, (n2,s2):3, (n1,s1):4VOTER_INCOMING, next=5, gen=78, sticky=9223372036.854775807,2147483647]: closed timestamp: 1728648776.836585107,0 (2024-10-11 12:12:56); raft status: {"id":"2","term":10,"vote":"2","commit":17,"lead":"2","raftState":"StateLeader","applied":17,"progress":{"1":{"match":50,"next":53,"state":"StateReplicate"},"2":{"match":50,"next":53,"state":"StateReplicate"},"3":{"match":0,"next":18,"state":"StateSnapshot"},"4":{"match":0,"next":44,"state":"StateProbe"}},"leadtransferee":"0"}: have been waiting 60.50s for slow proposal ConditionalPut [/Local/Range/Table/109/1/773763537538352912/RangeDescriptor], [txn: 47171405]
	            	Wraps: (4) COMMAND_PROBLEM
	            	Wraps: (5) exit status 1
	            	Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *hintdetail.withDetail (4) errors.Cmd (5) *exec.ExitError
	Test:       	perturbation/metamorphic/backfill
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=1
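
The raft status embedded in the `replica unavailable` error above is plain JSON, so it can be decoded to see which replicas the leader is not actively replicating to. The following is a minimal sketch, not part of roachtest: the type and function names (`raftStatus`, `progress`, `laggingFollowers`) are hypothetical, and only the JSON field names are taken from the log line.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// progress mirrors one entry of the "progress" map in the logged raft status.
type progress struct {
	Match uint64 `json:"match"`
	Next  uint64 `json:"next"`
	State string `json:"state"`
}

// raftStatus holds the subset of logged fields this sketch needs.
// Fields not listed here (term, vote, applied, ...) are ignored when decoding.
type raftStatus struct {
	ID        string              `json:"id"`
	Lead      string              `json:"lead"`
	RaftState string              `json:"raftState"`
	Progress  map[string]progress `json:"progress"`
}

// laggingFollowers returns "id:state" for every replica whose progress
// state is anything other than StateReplicate, sorted for stable output.
func laggingFollowers(raw string) ([]string, error) {
	var st raftStatus
	if err := json.Unmarshal([]byte(raw), &st); err != nil {
		return nil, err
	}
	var out []string
	for id, p := range st.Progress {
		if p.State != "StateReplicate" {
			out = append(out, id+":"+p.State)
		}
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	// Raft status copied from the error message in this comment.
	const raw = `{"id":"2","term":10,"vote":"2","commit":17,"lead":"2","raftState":"StateLeader","applied":17,"progress":{"1":{"match":50,"next":53,"state":"StateReplicate"},"2":{"match":50,"next":53,"state":"StateReplicate"},"3":{"match":0,"next":18,"state":"StateSnapshot"},"4":{"match":0,"next":44,"state":"StateProbe"}},"leadtransferee":"0"}`
	lagging, err := laggingFollowers(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(lagging)
}
```

For the status in this failure, the sketch reports replicas 3 and 4, which the log shows in `StateSnapshot` and `StateProbe` respectively.
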

@nicktrav nicktrav moved this from Incoming to Tests (failures, skipped, flakes) in [Deprecated] Storage Oct 15, 2024
@nicktrav nicktrav removed the T-storage Storage Team label Oct 15, 2024
@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 5be5b0b52ff79b98689b2282a8b25cf9eb50ec40:

(cluster.go:2449).Run: full command output in run_123302.743302309_n13_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 42f40f59cae3c0fd8842e194d6991c951ab4382f:

(cluster.go:2449).Run: full command output in run_123154.776023523_n31-32_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=1

@cockroach-teamcity
Member Author

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 833dadd212fa4b12b1442ae8e00e85ee80a8cdce:

(admission_control_latency.go:970).waitForIOOverloadToEnd: read tcp 172.17.0.3:44834 -> 34.74.166.13:26257: read: connection reset by peer
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

craig bot pushed a commit that referenced this issue Oct 21, 2024
133115: roachtest: change to use standard memory configuration r=arulajmani a=andrewbaptist

Previously the perturbation/* roachtests were configured with low memory configurations. This resulted in OOMs for backfill tests. This change makes the memory configuration a metamorphic parameter but excludes low memory configurations. The perturbation/full tests are run with standard memory.

Informs: #133114
Fixes: #133086
Fixes: #131713
Epic: none

Release note: None

Co-authored-by: Andrew Baptist
@craig craig bot closed this as completed in 217bd6c Oct 21, 2024

blathers-crl bot commented Oct 28, 2024

Based on the specified backports for linked PR #133115, I applied the following new label(s) to this issue: branch-release-24.3. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 label Oct 28, 2024
blathers-crl bot pushed a commit that referenced this issue Oct 28, 2024
Previously the perturbation/* roachtests were configured with low memory
configurations. This resulted in OOMs for backfill tests. This change
makes the memory configuration a metamorphic parameter but excludes low
memory configurations. The perturbation/full tests are run with standard
memory.

Informs: #133114
Fixes: #133086
Fixes: #131713
Epic: none

Release note: None
@andrewbaptist andrewbaptist added the O-perturbation Bugs found by the perturbation framework label Nov 22, 2024