WARN [2017-05-06 02:50:02,414] jepsen nemesis - jepsen.core Nemesis crashed evaluating {:type :info, :f [startstop2 :stop], :process :nemesis, :time 15016483887}
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -c "cd /; killall -s CONT cockroach" returned non-zero exit status 1 on 35.190.137.205. STDOUT:
STDERR:
cockroach: no process found
at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_121]
at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_121]
at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]
at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
at clojure.core$map$fn__4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na]
at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na]
at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
at clojure.core$map$fn__4789.invoke(core.clj:2648) ~[clojure-1.8.0.jar:na]
at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na]
at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na]
at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
at jepsen.nemesis$node_start_stopper$reify__2090.invoke_BANG_(nemesis.clj:219) ~[classes/:na]
at jepsen.nemesis$compose$reify__2073.invoke_BANG_(nemesis.clj:161) ~[classes/:na]
at jepsen.core$nemesis_worker$fn__1210$fn__1215.invoke(core.clj:231) ~[classes/:na]
at jepsen.core$nemesis_worker$fn__1210.invoke(core.clj:229) [classes/:na]
at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.lang.RuntimeException: sudo -S -u root bash -c "cd /; killall -s CONT cockroach" returned non-zero exit status 1 on 35.190.137.205. STDOUT:
If the server crashes, sending SIGCONT is not sufficient; the node must be restarted.
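A minimal sketch of that fallback, written in shell to match the `killall -s CONT cockroach` command in the log above. The function name `resume_or_restart` and the restart command passed to it are hypothetical; the actual nemesis is implemented in Clojure inside Jepsen and is not reproduced here.

```shell
#!/usr/bin/env bash

# resume_or_restart NAME RESTART_CMD [ARGS...]
# Hypothetical helper: try to resume a stopped process by name. If no such
# process exists (killall exits non-zero, as in the log above), run the
# caller-supplied restart command instead of failing.
resume_or_restart() {
  local name="$1"
  shift
  if killall -s CONT "$name" 2>/dev/null; then
    # At least one matching process received SIGCONT.
    echo "resumed"
  else
    # No process found: SIGCONT has nothing to act on, so restart the node.
    "$@"
    echo "restarted"
  fi
}
```

For example, `resume_or_restart cockroach ./start-node.sh` would resume a suspended server if one is present and otherwise run `./start-node.sh`, which stands in for whatever command actually starts the node in a given setup.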
I'm not sure what might have been happening here - the process must have existed and received a SIGSTOP. Only a SIGKILL could remove the process while it was stopped. Maybe the VM was running out of memory and the OOM killer chose the suspended cockroach process?
In any event, this nemesis has not been flaky in practice. I haven't seen this error recur since this issue was filed. The start-stop-2 nemesis has been passing ever since we got the nightly jepsen tests running reliably.
From https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_Jepsen/243251:id/Sequential_start-stop-2/failure-logs.tbz
Detected using the new scripts from #15717: