
roachtest: kv0/enc=false/nodes=3/cpu=32/seq failed #97100

Closed
cockroach-teamcity opened this issue Feb 14, 2023 · 6 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 14, 2023

roachtest.kv0/enc=false/nodes=3/cpu=32/seq failed with artifacts on master @ 31365e21dc606cdc1e4302c86192ffc5a6cf1255:

test artifacts and logs in: /artifacts/kv0/enc=false/nodes=3/cpu=32/seq/run_1
(cluster.go:1937).Run: output in run_090905.266115320_n4_workload-run-kv-tole: ./workload run kv --tolerate-errors --init --histograms=perf/stats.json --concurrency=192 --duration=30m0s --read-percent=0 --sequential {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-24509

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 14, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Feb 14, 2023
renatolabs added a commit to renatolabs/cockroach that referenced this issue Feb 14, 2023
The roachtest logic to fetch the debug zip after a failure iterates
over each node in the cluster spec for that test and attempts to run
the `debug zip` command, returning once it succeeds on any node. The
idea is that at that stage in the test, there's no way to know which
node is alive so every node is attempted.

However, the actual `Run` command was executed against `c.All()`,
meaning that, in practice, it would fail if _any_ node failed to run
the `debug zip` command. One common scenario where this bug would come
to the surface is during tests that don't upload the `cockroach`
binary to every node, maybe because one of the nodes is there to run a
`./workload` command exclusively. In those cases, we would fail to
fetch the `debug.zip` just because the workload node didn't have the
cockroach binary (see cockroachdb#97100, for example).

This commit fixes the bug/typo by running the `./cockroach debug zip`
command on each node individually.

Epic: none

Release note: None
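The failure mode in the commit message above can be illustrated with a small self-contained sketch. All names here (`node`, `runDebugZip`, the fetch helpers) are hypothetical stand-ins, not the real roachtest API; the point is the difference between running one command against all nodes at once versus trying nodes one by one:

```go
package main

import (
	"errors"
	"fmt"
)

// node represents a cluster node; hasBinary is false for, e.g., a
// workload-only node that never received the cockroach binary.
type node struct {
	id        int
	hasBinary bool
}

// runDebugZip simulates running `./cockroach debug zip` on one node.
func runDebugZip(n node) error {
	if !n.hasBinary {
		return fmt.Errorf("n%d: cockroach binary not found", n.id)
	}
	return nil
}

// fetchDebugZipBroken mirrors the buggy behavior: treating the whole
// cluster as one target fails the command if any single node fails.
func fetchDebugZipBroken(nodes []node) error {
	for _, n := range nodes {
		if err := runDebugZip(n); err != nil {
			return err // one failing node fails the whole fetch
		}
	}
	return nil
}

// fetchDebugZipFixed mirrors the fix: try nodes individually and
// return as soon as the command succeeds on any node.
func fetchDebugZipFixed(nodes []node) error {
	for _, n := range nodes {
		if err := runDebugZip(n); err == nil {
			return nil
		}
	}
	return errors.New("debug zip failed on all nodes")
}

func main() {
	// n4 plays the workload-only node without the cockroach binary.
	nodes := []node{{1, true}, {2, true}, {3, true}, {4, false}}
	fmt.Println("broken:", fetchDebugZipBroken(nodes))
	fmt.Println("fixed:", fetchDebugZipFixed(nodes))
}
```

With a workload-only node present, the broken variant always errors while the fixed one succeeds as long as at least one node can run the command.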
@renatolabs
Contributor

For some reason, dmesg is not indicating an OOM kill on node 3, but that's very likely what happened: we got exit status 137 and Prometheus records node_vmstat_oom_kill. Unfortunately, that's all the data we have on Grafana, all the other graphs seem to be empty.

craig bot pushed a commit that referenced this issue Feb 15, 2023
96261: cli: loss of quorum recovery int-test for half online mode r=erikgrinaker a=aliher1911

This commit adds an integration test for loss of quorum recovery in
half-online mode. Its functionality largely mirrors the behaviour
of the offline quorum recovery test.

Release note: None

Fixes #93052

97118: roachtest: fix fetching of debug zip and add dmesg flag r=herkolategan a=renatolabs

**roachtest: fix fetching of debug zip**

**roachtest: pass -T flag to dmesg**
By default, the timestamps displayed on the `dmesg` output are
relative to the kernel's boot time. This makes it harder to correlate
events in there with test events. This changes artifact collection to
run `dmesg -T` instead, which makes `dmesg` use human-readable
timestamps.

97170: ccl/kvccl/kvfollowerreadsccl: skip TestSecondaryTenantFollowerReadsRouting r=arulajmani a=pavelkalinnikov

Refs: #95338

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Epic: None
Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
@renatolabs
Contributor

@cockroachdb/kv this seems to be an OOM failure; mind taking a look to see if there's anything actionable here?

@erikgrinaker erikgrinaker added the T-kv KV Team label Feb 27, 2023
@erikgrinaker
Contributor

erikgrinaker commented Feb 28, 2023

I'm not sure about this. The node exited at 09:22:26:

cockroach exited with code 137: Tue Feb 14 09:22:26 UTC 2023

But the health log 1 second before claims it's only using 1.6 GB RSS:

I230214 09:22:25.279194 558 2@server/status/runtime_log.go:47 ⋮ [T1,n3] 183  runtime stats: 1.6 GiB RSS, 730 goroutines (stacks: 12 MiB), 309 MiB/657 MiB Go alloc/total(stale) (heap fragmentation: 114 MiB, heap reserved: 178 MiB, heap released: 7.4 MiB), 604 MiB/1.0 GiB CGO alloc/total (1317.6 CGO/sec), 575.0/59.2 %(u/s)time, 0.1 %gc (1755x), 40 MiB/33 MiB (r/w)net

This is also confirmed by graphs:

Screenshot 2023-02-28 at 17 53 45

The node clearly got killed by something though:

Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[20167]: ./cockroach.sh: line 68: 20174 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[21256]: cockroach exited with code 137: Tue Feb 14 09:22:26 UTC 2023
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'.

Out of time, will dig in further later. It's of course possible that we got a very sudden memory spike, but I'm not seeing any indications of it.

@renatolabs
Contributor

I agree, I was also not 100% convinced this was an OOM. It's a weird failure mode; we also saw the exact same type of failure happen in another test [1] a couple of weeks ago (ultimately that one was (mis?) labeled as an SSH flake), but I haven't seen this come up again since then.

[1] #90695 (comment)

@erikgrinaker
Contributor

Relabeling as GA-blocker. Unclear if there's a real problem here.

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 1, 2023
@srosenberg
Member

srosenberg commented Mar 2, 2023

This was yet another instance of the roachprod "cache incoherence" bug [1], coupled with reuse of public IPs:

Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 sshd[21163]: Accepted publickey for ubuntu from x.x.x.x port 52643 ssh2: RSA SHA256:N70qo8MGrMWxHPJVWlTAeavfAqsL7sjHIpMIJc3qoTI
Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 sshd[21163]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd-logind[1737]: New session 84 of user ubuntu.
Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: Started Session 84 of user ubuntu.
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[20167]: ./cockroach.sh: line 68: 20174 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[21256]: cockroach exited with code 137: Tue Feb 14 09:22:26 UTC 2023
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'

Note the elided IP on the first line; it corresponds to someone executing `roachprod stop` against their own cluster. However, the locally cached public IPs of their cluster were stale and matched the reused IP of n3, so their stop killed this test's node and failed this test.

[1] #89437
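One way to guard against this class of bug is to verify the remote host's identity against the locally cached cluster metadata before issuing a destructive command. The sketch below uses hypothetical names and a simple hostname check; it is not roachprod's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// hostInfo is what the remote host reports about itself, e.g. its
// hostname as read over SSH. (Hypothetical type for this sketch.)
type hostInfo struct {
	hostname string
}

// cachedNode is a locally cached (possibly stale) record mapping one
// of a cluster's nodes to a public IP.
type cachedNode struct {
	cluster string
	ip      string
}

// safeStop refuses to stop a node whose reported hostname does not
// contain the expected cluster name, catching the case where the
// cached IP has been reused by a different cluster.
func safeStop(cached cachedNode, remote hostInfo, stop func(ip string) error) error {
	if !strings.Contains(remote.hostname, cached.cluster) {
		return fmt.Errorf("refusing to stop %s: host %q does not belong to cluster %q (stale cache?)",
			cached.ip, remote.hostname, cached.cluster)
	}
	return stop(cached.ip)
}

func main() {
	// The cache claims this IP belongs to "alice-test", but the IP was
	// reused by a teamcity roachtest cluster in the meantime.
	cached := cachedNode{cluster: "alice-test", ip: "203.0.113.7"}
	remote := hostInfo{hostname: "teamcity-8695907-1676355379-27-n4cpu32-0003"}
	err := safeStop(cached, remote, func(ip string) error { return nil })
	fmt.Println(err)
}
```

With a check like this, the stale-cache `roachprod stop` in the logs above would have been refused instead of killing n3 of an unrelated cluster.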

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Mar 2, 2023
@srosenberg srosenberg added the X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue label Mar 2, 2023