
roachtest: kv0/enc=false/nodes=3/cpu=32/seq failed #97100

Closed
cockroach-teamcity opened this issue Feb 14, 2023 · 6 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 14, 2023

roachtest.kv0/enc=false/nodes=3/cpu=32/seq failed with artifacts on master @ 31365e21dc606cdc1e4302c86192ffc5a6cf1255:

test artifacts and logs in: /artifacts/kv0/enc=false/nodes=3/cpu=32/seq/run_1
(cluster.go:1937).Run: output in run_090905.266115320_n4_workload-run-kv-tole: ./workload run kv --tolerate-errors --init --histograms=perf/stats.json --concurrency=192 --duration=30m0s --read-percent=0 --sequential {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-24509

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 14, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Feb 14, 2023
renatolabs added a commit to renatolabs/cockroach that referenced this issue Feb 14, 2023
The roachtest logic to fetch the debug zip after a failure iterates
over each node in the cluster spec for that test and attempts to run
the `debug zip` command, returning once it succeeds on any node. The
idea is that at that stage in the test, there's no way to know which
node is alive so every node is attempted.

However, the actual `Run` command was executed against `c.All()`,
meaning that, in practice, it would fail if _any_ node failed to run
the `debug zip` command. One common scenario where this bug would come
to the surface is during tests that don't upload the `cockroach`
binary to every node, maybe because one of the nodes is there to run a
`./workload` command exclusively. In those cases, we would fail to
fetch the `debug.zip` just because the workload node didn't have the
cockroach binary (see cockroachdb#97100, for example).

This commit fixes the bug/typo by running the `./cockroach debug zip`
command on each node individually.

Epic: none

Release note: None
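The failure mode in the commit message above can be illustrated with a small self-contained sketch. All names here (`node`, `runDebugZip`, the fetch helpers) are hypothetical stand-ins, not the real roachtest API; the point is the difference between running one command against all nodes at once versus trying nodes one by one:

```go
package main

import (
	"errors"
	"fmt"
)

// node represents a cluster node; hasBinary is false for, e.g., a
// workload-only node that never received the cockroach binary.
type node struct {
	id        int
	hasBinary bool
}

// runDebugZip simulates running `./cockroach debug zip` on one node.
func runDebugZip(n node) error {
	if !n.hasBinary {
		return fmt.Errorf("n%d: cockroach binary not found", n.id)
	}
	return nil
}

// fetchDebugZipBroken mirrors the buggy behavior: treating the whole
// cluster as one target fails the command if any single node fails.
func fetchDebugZipBroken(nodes []node) error {
	for _, n := range nodes {
		if err := runDebugZip(n); err != nil {
			return err // one failing node fails the whole fetch
		}
	}
	return nil
}

// fetchDebugZipFixed mirrors the fix: try nodes individually and
// return as soon as the command succeeds on any node.
func fetchDebugZipFixed(nodes []node) error {
	for _, n := range nodes {
		if err := runDebugZip(n); err == nil {
			return nil
		}
	}
	return errors.New("debug zip failed on all nodes")
}

func main() {
	// n4 plays the workload-only node without the cockroach binary.
	nodes := []node{{1, true}, {2, true}, {3, true}, {4, false}}
	fmt.Println("broken:", fetchDebugZipBroken(nodes))
	fmt.Println("fixed:", fetchDebugZipFixed(nodes))
}
```

With a workload-only node present, the broken variant always errors while the fixed one succeeds as long as at least one node can run the command.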
@renatolabs
Contributor

For some reason, dmesg is not indicating an OOM kill on node 3, but that's very likely what happened: we got exit status 137 and Prometheus records node_vmstat_oom_kill. Unfortunately, that's all the data we have on Grafana, all the other graphs seem to be empty.

craig bot pushed a commit that referenced this issue Feb 15, 2023
96261: cli: loss of quorum recovery int-test for half online mode r=erikgrinaker a=aliher1911

This commit adds an integration test for loss of quorum recovery in
half-online mode. Its functionality largely mirrors the behaviour
of the offline quorum recovery test.

Release note: None

Fixes #93052

97118: roachtest: fix fetching of debug zip and add dmesg flag r=herkolategan a=renatolabs

**roachtest: fix fetching of debug zip**

**roachtest: pass -T flag to dmesg**
By default, the timestamps displayed on the `dmesg` output are
relative to the kernel's boot time. This makes it harder to correlate
events in there with test events. This changes artifact collection to
run `dmesg -T` instead, which makes `dmesg` use human-readable
timestamps.

97170: ccl/kvccl/kvfollowerreadsccl: skip TestSecondaryTenantFollowerReadsRouting r=arulajmani a=pavelkalinnikov

Refs: #95338

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Epic: None
Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
@renatolabs
Contributor

@cockroachdb/kv this seems to be an OOM failure; mind taking a look to see if there's anything actionable here?

@erikgrinaker erikgrinaker added the T-kv KV Team label Feb 27, 2023
@erikgrinaker
Contributor

erikgrinaker commented Feb 28, 2023

I'm not sure about this. The node exited at 09:22:26:

cockroach exited with code 137: Tue Feb 14 09:22:26 UTC 2023

But the health log 1 second before claims it's only using 1.6 GB RSS:

I230214 09:22:25.279194 558 2@server/status/runtime_log.go:47 ⋮ [T1,n3] 183  runtime stats: 1.6 GiB RSS, 730 goroutines (stacks: 12 MiB), 309 MiB/657 MiB Go alloc/total(stale) (heap fragmentation: 114 MiB, heap reserved: 178 MiB, heap released: 7.4 MiB), 604 MiB/1.0 GiB CGO alloc/total (1317.6 CGO/sec), 575.0/59.2 %(u/s)time, 0.1 %gc (1755x), 40 MiB/33 MiB (r/w)net

This is also confirmed by graphs:

Screenshot 2023-02-28 at 17 53 45

The node clearly got killed by something though:

Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[20167]: ./cockroach.sh: line 68: 20174 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[21256]: cockroach exited with code 137: Tue Feb 14 09:22:26 UTC 2023
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'.

Out of time, will dig in further later. It's of course possible that we got a very sudden memory spike, but I'm not seeing any indications of it.

@renatolabs
Contributor

I agree, I was also not 100% convinced this was an OOM. It's a weird failure mode; we also saw the exact same type of failure happen in another test [1] a couple of weeks ago (ultimately that one was (mis?) labeled as an SSH flake), but I haven't seen this come up again since then.

[1] #90695 (comment)

@erikgrinaker
Contributor

Relabeling as GA-blocker. Unclear if there's a real problem here.

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 1, 2023
@srosenberg
Member

srosenberg commented Mar 2, 2023

This was yet another instance of the roachprod "cache incoherence" bug [1], coupled with reuse of public IPs:

Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 sshd[21163]: Accepted publickey for ubuntu from x.x.x.x port 52643 ssh2: RSA SHA256:N70qo8MGrMWxHPJVWlTAeavfAqsL7sjHIpMIJc3qoTI
Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 sshd[21163]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd-logind[1737]: New session 84 of user ubuntu.
Feb 14 09:22:25 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: Started Session 84 of user ubuntu.
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[20167]: ./cockroach.sh: line 68: 20174 Killed                  "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log"
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 bash[21256]: cockroach exited with code 137: Tue Feb 14 09:22:26 UTC 2023
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a
Feb 14 09:22:26 teamcity-8695907-1676355379-27-n4cpu32-0003 systemd[1]: cockroach.service: Failed with result 'exit-code'

Note the elided IP on the first line; it corresponds to someone executing `roachprod stop` against their own cluster. However, the locally cached public IPs of their cluster were stale and matched the reused IP of n3, so their stop killed this test's node and failed this test.

[1] #89437
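One way to guard against this class of bug is to verify the remote host's identity against the locally cached cluster metadata before issuing a destructive command. The sketch below uses hypothetical names and a simple hostname check; it is not roachprod's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// hostInfo is what the remote host reports about itself, e.g. its
// hostname as read over SSH. (Hypothetical type for this sketch.)
type hostInfo struct {
	hostname string
}

// cachedNode is a locally cached (possibly stale) record mapping one
// of a cluster's nodes to a public IP.
type cachedNode struct {
	cluster string
	ip      string
}

// safeStop refuses to stop a node whose reported hostname does not
// contain the expected cluster name, catching the case where the
// cached IP has been reused by a different cluster.
func safeStop(cached cachedNode, remote hostInfo, stop func(ip string) error) error {
	if !strings.Contains(remote.hostname, cached.cluster) {
		return fmt.Errorf("refusing to stop %s: host %q does not belong to cluster %q (stale cache?)",
			cached.ip, remote.hostname, cached.cluster)
	}
	return stop(cached.ip)
}

func main() {
	// The cache claims this IP belongs to "alice-test", but the IP was
	// reused by a teamcity roachtest cluster in the meantime.
	cached := cachedNode{cluster: "alice-test", ip: "203.0.113.7"}
	remote := hostInfo{hostname: "teamcity-8695907-1676355379-27-n4cpu32-0003"}
	err := safeStop(cached, remote, func(ip string) error { return nil })
	fmt.Println(err)
}
```

With a check like this, the stale-cache `roachprod stop` in the logs above would have been refused instead of killing n3 of an unrelated cluster.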

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Mar 2, 2023
@srosenberg srosenberg added the X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue label Mar 2, 2023