roachtest: kv0/enc=false/nodes=3/cpu=32/seq failed #97100
The roachtest logic to fetch the debug zip after a failure iterates over each node in the cluster spec for that test and attempts to run the `debug zip` command, returning once it succeeds on any node. The idea is that, at that stage in the test, there's no way to know which node is alive, so every node is attempted.

However, the actual `Run` command executed used to use `c.All()`, meaning that, in practice, it would fail if _any_ node failed to run the `debug zip` command. One common scenario where this bug would come to the surface is during tests that don't upload the `cockroach` binary to every node, maybe because one of the nodes is there to run a `./workload` command exclusively. In those cases, we would fail to fetch the `debug.zip` just because the workload node didn't have the cockroach binary (see cockroachdb#97100, for example).

This commit fixes the bug/typo by running the `./cockroach debug zip` command on each node individually.

Epic: none

Release note: None
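For illustration, here is a minimal sketch of the per-node approach described above, using an assumed, simplified stand-in for the roachtest cluster API; the interface and helper names are illustrative, not the actual roachtest code:

```go
package roachtestutil

import (
	"context"
	"fmt"
)

// cluster is an assumed stand-in for the roachtest cluster abstraction:
// All() lists the node IDs in the test's spec, and RunE runs a shell
// command on the given nodes.
type cluster interface {
	All() []int
	RunE(ctx context.Context, nodes []int, cmd string) error
}

// fetchDebugZip tries `cockroach debug zip` on one node at a time and
// returns as soon as any node succeeds. Running a single command against
// the whole node set would fail if any one node failed, for example a
// workload-only node that never had the cockroach binary uploaded.
func fetchDebugZip(ctx context.Context, c cluster, dest string) error {
	var lastErr error
	for _, node := range c.All() {
		cmd := fmt.Sprintf("./cockroach debug zip %s", dest)
		if err := c.RunE(ctx, []int{node}, cmd); err != nil {
			lastErr = err // node may be dead or missing the binary; try the next one
			continue
		}
		return nil // succeeded on a live node
	}
	return lastErr
}
```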
For some reason,
96261: cli: loss of quorum recovery int-test for half online mode r=erikgrinaker a=aliher1911

This commit adds an integration test for loss of quorum recovery in half online mode. Its functionality largely mirrors the behaviour of the offline quorum recovery test.

Release note: None

Fixes #93052

97118: roachtest: fix fetching of debug zip and add dmesg flag r=herkolategan a=renatolabs

**roachtest: fix fetching of debug zip**

The roachtest logic to fetch the debug zip after a failure iterates over each node in the cluster spec for that test and attempts to run the `debug zip` command, returning once it succeeds on any node. The idea is that, at that stage in the test, there's no way to know which node is alive, so every node is attempted.

However, the actual `Run` command executed used to use `c.All()`, meaning that, in practice, it would fail if _any_ node failed to run the `debug zip` command. One common scenario where this bug would come to the surface is during tests that don't upload the `cockroach` binary to every node, maybe because one of the nodes is there to run a `./workload` command exclusively. In those cases, we would fail to fetch the `debug.zip` just because the workload node didn't have the cockroach binary (see #97100, for example).

This commit fixes the bug/typo by running the `./cockroach debug zip` command on each node individually.

**roachtest: pass -T flag to dmesg**

By default, the timestamps displayed in the `dmesg` output are relative to the kernel's boot time. This makes it harder to correlate events there with test events. This changes artifact collection to run `dmesg -T` instead, which makes dmesg use human-readable timestamps.

97170: ccl/kvccl/kvfollowerreadsccl: skip TestSecondaryTenantFollowerReadsRouting r=arulajmani a=pavelkalinnikov

Refs: #95338

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Epic: None

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
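As a rough sketch of what the dmesg change amounts to (assuming a local exec-based collector rather than the real remote-run helper; only the `-T` flag is the point here):

```go
package roachtestutil

import (
	"context"
	"os/exec"
)

// collectDmesg captures the kernel log with wall-clock timestamps. Without
// -T, dmesg prints offsets relative to kernel boot time, which are hard to
// correlate with test events; -T (supported by util-linux dmesg) switches
// to human-readable timestamps.
func collectDmesg(ctx context.Context) ([]byte, error) {
	return exec.CommandContext(ctx, "dmesg", "-T").CombinedOutput()
}
```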
@cockroachdb/kv this seems to be an OOM failure, mind taking a look to see if there's something actionable to be done here, please?
I'm not sure about this. The node exited at 09:22:26:
But the health log 1 second before claims it's only using 1.6 GB RSS:
This is also confirmed by graphs:

The node clearly got killed by something, though:

Out of time, will dig in further later. It's of course possible that we got a very sudden memory spike, but I'm not seeing any indications of it.
I agree; I was also not 100% convinced this was an OOM. It's a weird failure mode; we also saw the exact same type of failure happen in another test [1] a couple of weeks ago (ultimately that one was (mis?)labeled as an SSH flake), but I haven't seen this come up again since then.

[1] #90695 (comment)
Relabeling as GA-blocker. Unclear if there's a real problem here. |
This was yet another instance of the roachprod "cache incoherence" bug [1], coupled with reuse of public IPs,

Note the elided IP on the first line; it corresponds to someone executing

[1] #89437
roachtest.kv0/enc=false/nodes=3/cpu=32/seq failed with artifacts on master @ 31365e21dc606cdc1e4302c86192ffc5a6cf1255:
Parameters:
- ROACHTEST_cloud=gce
- ROACHTEST_cpu=32
- ROACHTEST_encrypted=false
- ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-24509