jepsen: the start-stop nemesis is flaky/incorrect #15736

Closed
knz opened this issue May 6, 2017 · 1 comment

knz commented May 6, 2017

From https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_Jepsen/243251:id/Sequential_start-stop-2/failure-logs.tbz

Detected using the new scripts from #15717:

WARN [2017-05-06 02:50:02,414] jepsen nemesis - jepsen.core Nemesis crashed evaluating {:type :info, :f [startstop2 :stop], :process :nemesis, :time 15016483887}
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -c "cd /; killall -s CONT cockroach" returned non-zero exit status 1 on 35.190.137.205. STDOUT:


STDERR:
cockroach: no process found

        at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_121]
        at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_121]
        at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
        at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
        at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]
        at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
        at clojure.core$map$fn__4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na]
        at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
        at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
        at clojure.core$map$fn__4789.invoke(core.clj:2648) ~[clojure-1.8.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na]
        at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
        at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
        at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na]
        at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
        at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
        at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
        at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
        at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
        at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
        at jepsen.nemesis$node_start_stopper$reify__2090.invoke_BANG_(nemesis.clj:219) ~[classes/:na]
        at jepsen.nemesis$compose$reify__2073.invoke_BANG_(nemesis.clj:161) ~[classes/:na]
        at jepsen.core$nemesis_worker$fn__1210$fn__1215.invoke(core.clj:231) ~[classes/:na]
        at jepsen.core$nemesis_worker$fn__1210.invoke(core.clj:229) [classes/:na]
        at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
        at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.lang.RuntimeException: sudo -S -u root bash -c "cd /; killall -s CONT cockroach" returned non-zero exit status 1 on 35.190.137.205. STDOUT:

If the server crashes, sending SIGCONT is not sufficient; the node must be restarted.
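
A minimal sketch of that recovery logic, written in Python for illustration rather than in the Clojure that the Jepsen nemesis actually uses. The names `resume_or_restart`, `run_local`, and the `restart` callback are hypothetical and not part of Jepsen; the only things taken from the log above are the `killall -s CONT cockroach` command and the `no process found` message on stderr.

```python
import subprocess
from typing import Callable, Tuple

# (exit status, stderr) of a shell command; a hypothetical shape for this sketch.
RunResult = Tuple[int, str]


def run_local(cmd: str) -> RunResult:
    """Run a command through bash on the local machine and return (status, stderr)."""
    proc = subprocess.run(["bash", "-c", cmd], capture_output=True, text=True)
    return proc.returncode, proc.stderr


def resume_or_restart(
    run: Callable[[str], RunResult],
    restart: Callable[[], None],
    process_name: str = "cockroach",
) -> str:
    """Send SIGCONT to the process; if it no longer exists, restart the node.

    `run` executes a shell command on the target node and `restart` brings the
    server back up. Both are supplied by the caller (assumptions of this sketch).
    """
    status, stderr = run(f"killall -s CONT {process_name}")
    if status == 0:
        return "resumed"
    if "no process found" in stderr:
        # The process died while it was stopped, so SIGCONT cannot help.
        restart()
        return "restarted"
    # Any other failure is unexpected and should still surface to the caller.
    raise RuntimeError(f"killall failed with status {status}: {stderr.strip()}")
```

With this shape, the failure in the log above would take the `"restarted"` branch instead of crashing the nemesis, while unrelated `killall` failures would still raise.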

@petermattis petermattis modified the milestone: 1.1 Jun 1, 2017
@bdarnell bdarnell modified the milestones: 1.1, 1.2 Sep 20, 2017
@bdarnell

I'm not sure what might have been happening here - the process must have existed and received a SIGSTOP. Only a SIGKILL could remove the process while it was stopped. Maybe the VM was running out of memory and the OOM killer chose the suspended cockroach process?

In any event, this nemesis has not been flaky in practice. I haven't seen this error recur since this issue was filed. The start-stop-2 nemesis has been passing ever since we got the nightly jepsen tests running reliably.
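
If the error does recur, the OOM-killer hypothesis above could be checked against the node's kernel log. The following is a rough Python sketch, assuming the kernel reports the kill with a line containing `Killed process <pid> (<name>)`, which is the wording Linux uses for OOM kills but may vary between kernel versions. The function name `oom_killed_pids` is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the "Killed process <pid> (<name>)" fragment of a kernel OOM report.
# The exact surrounding text differs between kernel versions (assumption).
_OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_killed_pids(log_lines: Iterable[str], process_name: str = "cockroach") -> List[int]:
    """Return the PIDs of `process_name` that the kernel log reports as killed."""
    pids = []
    for line in log_lines:
        match = _OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            pids.append(int(match.group(1)))
    return pids
```

The input could be the lines of `dmesg` output or of the kernel log file collected from the affected node; an empty result would not rule out other sources of SIGKILL.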
