nightlies: Jepsen tests intermittently failing #14171

Closed

jordanlewis opened this issue Mar 15, 2017 · 18 comments

@jordanlewis
Member

The nightly Jepsen tests have been failing intermittently. They have been hard to diagnose because they were generating logs larger than TeamCity's default maximum artifact size, which I've since increased.

Here's an example failing run: https://teamcity.cockroachdb.com/viewLog.html?buildId=179909&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_Jepsen

The problem seems to be that during some tests the Jepsen controller fails to contact one or more of the Cockroach nodes after a short period of successful testing. This condition continues for a while until the test throws a NullPointerException and quits. In jepsen.log, which is the controller log, the affected nodes throw errors like this:

2017-03-14 06:22:11,427{GMT}    WARN    [jepsen worker 2] jepsen.cockroach.bank: Encountered error with conn [:cockroach "35.185.88.6"]; reopening
2017-03-14 06:22:11,437{GMT}    WARN    [jepsen worker 2] jepsen.cockroach.bank: Error reopening [:cockroach "35.185.88.6"]
org.postgresql.util.PSQLException: Connection to 35.185.88.6:26257 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.Driver$ConnectThread.getResult(Driver.java:358) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver.connect(Driver.java:282) ~[postgresql-9.4.1211.jar:9.4.1211]
        at java.sql.DriverManager.getConnection(DriverManager.java:664) ~[na:1.8.0_121]
        at java.sql.DriverManager.getConnection(DriverManager.java:208) ~[na:1.8.0_121]
        at clojure.java.jdbc$get_connection.invokeStatic(jdbc.clj:272) ~[classes/:na]
        at clojure.java.jdbc$get_connection.invoke(jdbc.clj:193) ~[classes/:na]
        at jepsen.cockroach.client$client$open__2668$fn__2669.invoke(client.clj:90) ~[classes/:na]
        at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
        at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[na:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[na:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[na:1.8.0_121]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_121]
        at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_121]
        at org.postgresql.core.PGStream.<init>(PGStream.java:64) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:147) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:55) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:219) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver.makeConnection(Driver.java:407) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver.access$100(Driver.java:55) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver$ConnectThread.run(Driver.java:317) ~[postgresql-9.4.1211.jar:9.4.1211]
        ... 1 common frames omitted
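
(For reference, the reopen path in that stack trace corresponds roughly to the sketch below; the db-spec keys and function name are assumptions for illustration, not the actual jepsen.cockroach.client code.)

(require '[clojure.java.jdbc :as j])

(def db-spec
  ;; Assumed connection parameters; the real test builds these per node.
  {:subprotocol "postgresql"
   :subname     "//35.185.88.6:26257/jepsen"
   :user        "root"})

(defn reopen!
  "Close the old connection (if any) and open a fresh one against the same
  node. While the node is down, get-connection throws the
  PSQLException: Connection refused seen in jepsen.log above."
  [old-conn]
  (when old-conn
    (try (.close old-conn) (catch Exception _)))
  (j/get-connection db-spec))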

Investigating the stderr log of the unhappy nodes, I see that they get restarted repeatedly. I'm not sure why that's happening, but it does coincide with the periods in the jepsen log during which the test controller can't connect to the nodes.

There are also a few error messages like the following:

E170314 06:23:35.273281 7527 storage/replica_range_lease.go:194  [n3,s2,r2/3:/Table/{0-11},@c4201e4000] heartbeat failed on epoch increment
E170314 06:23:35.275851 7954 storage/replica_range_lease.go:199  [n3,s2,r2/3:/Table/{0-11},@c4201e4000] mismatch incrementing epoch for node_id:1 epoch:6 expiration:<wall_time:1489472612725375001 logical:0 > ; actual is {NodeID:1 Epoch:6 Expiration:1489472623.490574575,0 Draining:false}

I've uploaded the full logs for this test (13M compressed) to Google Drive for further forensics: https://drive.google.com/a/cockroachlabs.com/file/d/0BxxYvfwgwim6UDYxc0UydGhYcTg/view?usp=sharing

cc @knz

@knz
Contributor

knz commented Mar 16, 2017

The stack trace here is not a symptom of the test failing.
The nemesis purposefully kills random nodes in the test cluster, then later respawns them, to see how the rest of the cluster reacts. During that time clients cannot connect to the nodes that are down, as expected.
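
(For readers unfamiliar with the test, here is a minimal sketch of such a start/kill nemesis built on Jepsen's nemesis/node-start-stopper helper; the pkill command and restart script path are assumptions, not the actual jepsen.cockroach code.)

(require '[jepsen.control :as c]
         '[jepsen.nemesis :as nemesis])

(defn startkill-nemesis
  "On :start, forcefully kill cockroach on one randomly chosen node; on :stop,
  bring it back up again so the cluster can recover."
  []
  (nemesis/node-start-stopper
    rand-nth                                     ; pick one victim node
    (fn kill [test node]
      (c/su (c/exec :pkill :-9 :cockroach))      ; SIGKILL, as the test does today
      [:killed node])
    (fn restart [test node]
      (c/su (c/exec "/opt/cockroach/start.sh"))  ; hypothetical restart helper
      [:restarted node])))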

There are two things in the TC short log which point in different directions:

  1. the test has indicated it has failed, which needs to be investigated;
  2. the artifact upload has failed, which ultimately would cause TC to report the test as failed too.

I can have a brief look at (1); however, for (2) I would suggest changing the test runner to upload the artifacts to S3 instead of TC, and merely providing TC with a link to the S3 directory where the artifacts are.
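
(A rough sketch of option (2), assuming the runner can shell out to the AWS CLI; the bucket name and destination layout are hypothetical.)

(require '[clojure.java.shell :refer [sh]])

(defn upload-artifacts!
  "Sync Jepsen's local store/ directory to S3 and print the resulting URL so
  TeamCity only has to record a link instead of the multi-GB artifacts."
  [build-id]
  (let [dest (str "s3://cockroach-jepsen-artifacts/" build-id "/")]
    (sh "aws" "s3" "sync" "store/" dest)
    (println "Jepsen artifacts uploaded to" dest)))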

@knz
Contributor

knz commented Mar 16, 2017

So the log file also says there was a connection error when retrieving the final bank account states. This is less expected because by that time all the nodes should have been restarted. Investigating further now.

@knz
Contributor

knz commented Mar 16, 2017

@jordanlewis is it always the same test that fails?

@jordanlewis
Member Author

After fixing the archive issues via #14213 and #14229, the following run demonstrates that it's not always the same test that fails, and that the failing tests don't always include startkill2, the nemesis we suspected was causing the problem.

https://teamcity.cockroachdb.com/viewLog.html?buildId=185141&buildTypeId=Cockroach_Nightlies_Jepsen&tab=buildLog

The artifacts are available for forensics.

@bdarnell bdarnell added this to the 1.0 milestone Apr 11, 2017
@bdarnell bdarnell self-assigned this Apr 11, 2017
@bdarnell
Contributor

I'm looking at this and I'm having trouble interpreting the logs. Where exactly do the failures show up?

In the run from the last comment, 11 of the 19 configurations failed, with no error messages that I can find in the logs (and they run quickly: just a few seconds from "setup complete" to "tearing down nemesis"). It seems significant that the failing configurations are lexicographically consecutive: all runs of register and sequential (5 configurations each) and the first configuration of sets. It looks like there's some external factor messing up the machines or the network during this period.

@bdarnell
Contributor

It seems significant that the failing configurations are lexicographically consecutive

Nope, that's a red herring. We don't run the tests in lexicographic order, so the monotonic tests are running (and passing) in between register and sets.

@cuongdo
Contributor

cuongdo commented May 1, 2017

I have a hard time making sense of these logs, but I've got some on my azworker. I looked at one failure, and here are the suspicious logs I see after the restart of a node (the only one with incriminating logs for that test):

E170501 19:44:03.376992 134 storage/queue.go:634  [raftsnapshot,n2,s2,r21/3:/{Table/50-Max}] snapshot failed: (n3,s3):2: remote couldn't accept snapshot with error: [n3,s3],r21: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r10/2:/{Table/15-Max}]
I170501 19:44:04.376396 134 storage/replica_raftstorage.go:415  [raftsnapshot,n2,s2,r21/3:/{Table/50-Max}] generated Raft snapshot ce822a8f at index 13
E170501 19:44:04.377277 134 storage/queue.go:634  [raftsnapshot,n2,s2,r21/3:/{Table/50-Max}] snapshot failed: (n3,s3):2: remote couldn't accept snapshot with error: [n3,s3],r21: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r10/2:/{Table/15-Max}]
W170501 19:44:04.648977 351 storage/replica.go:1730  [n2,s2,r10/3:/Table/{15-50}] context deadline exceeded while in command queue: PushTxn [/Local/Range/"\x97"/RangeDescriptor,/Min)
W170501 19:44:04.650485 432 storage/replica.go:1730  [n2,s2,r10/3:/Table/{15-50}] context deadline exceeded while in command queue: PushTxn [/Local/Range/"\x97"/RangeDescriptor,/Min)
W170501 19:44:04.765874 384 storage/intent_resolver.go:313  [n2,s2,r1/2:/{Min-System/}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: [NotLeaseHolderError] range 10: replica (n2,s2):3 not lease holder; lease holder unknown
I170501 19:44:05.237539 130 storage/split_queue.go:92  [split,n2,s2,r21/3:/{Table/50-Max}] splitting at key /Table/51/0
I170501 19:44:05.266566 130 storage/replica_command.go:2630  [split,n2,s2,r21/3:/{Table/50-Max}] initiating a split of this range at key /Table/51 [r31]
I170501 19:44:05.376124 134 storage/replica_raftstorage.go:415  [raftsnapshot,n2,s2,r21/3:/Table/5{0-1}] generated Raft snapshot 625b3c6c at index 17
I170501 19:44:05.376993 134 storage/store.go:3341  [raftsnapshot,n2,s2,r21/3:/Table/5{0-1}] streamed snapshot to (n3,s3):2: kv pairs: 13, log entries: 7, rate-limit: 8.0 MiB/sec, 1ms

Full logs for a failing test are here:

failure-logs.zip

Someone from the core team should look at them.

@cuongdo
Contributor

cuongdo commented May 1, 2017

Actually, one other node in the aforementioned test had this warning:

W170501 19:44:04.650896 503 storage/intent_resolver.go:313  [n1,s1,r1/1:/{Min-System/}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

@a-robinson
Contributor

Full logs for a failing test are here:

failure-logs.zip

What's the actual failure in these logs? I don't see anything that looks like an error message in jepsen.log.

@cuongdo
Contributor

cuongdo commented May 1, 2017 via email

@a-robinson
Contributor

Ping - can anyone who's looked at our jepsen logs before tell what's actually failing? It'd be really nice to have these tests working on a release candidate.

@cuongdo @knz @bdarnell @jordanlewis

@bdarnell
Contributor

bdarnell commented May 2, 2017

I've looked at jepsen logs before, but only when running the tests by hand, not as run in TeamCity. In these instances I haven't found any errors I can recognize (and I don't remember the details of running the tests well enough to know where else to look).

@jordanlewis
Member Author

I'll take another look tomorrow. Note that even running the tests manually produces inscrutable results. See the Google Drive link I posted above for a full archive of the artifacts of a run I did by hand.

@jordanlewis
Member Author

Err, that link might be to a TC run, but it was no worse than the manual run IIRC.

@knz
Contributor

knz commented May 3, 2017

There are two problems here.

  1. apparently TC does not gather jepsen.log in the artifacts, only the sub-directories containing each node's data. We need to scrutinize the test runner. Jordan, perhaps we can sit together and do that.
  2. in the specific archive you linked in your comment, @cuongdo, the lack of errors indicates that jepsen was unable to initialize the test properly; you can see this in jepsen.log:
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.cockroach.auto: 35.185.67.241 Cockroach started
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.cockroach.auto: 104.196.104.156 Cockroach started
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.cockroach.auto: 35.185.107.220 Cockroach started
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.cockroach.auto: 104.196.216.177 Cockroach started
2017-05-01 19:44:03,078{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.cockroach: 104.196.104.156 Restarted to work around balancing bug
2017-05-01 19:44:03,078{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.cockroach: 35.185.67.241 Restarted to work around balancing bug
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.cockroach: 35.185.107.220 Restarted to work around balancing bug
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.cockroach: 104.196.104.156 Setup complete
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.cockroach: 104.196.216.177 Restarted to work around balancing bug
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.cockroach: 35.185.67.241 Setup complete
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.cockroach: 35.185.107.220 Setup complete
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.cockroach: 104.196.216.177 Setup complete
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-4] jepsen.cockroach.register: 35.185.85.207 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-2] jepsen.cockroach.register: 35.185.107.220 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-0] jepsen.cockroach.register: 35.185.67.241 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-1] jepsen.cockroach.register: 104.196.216.177 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-3] jepsen.cockroach.register: 104.196.104.156 Connected
2017-05-01 19:44:05,165{GMT}    INFO    [clojure-agent-send-off-pool-4] jepsen.cockroach.register: 35.185.85.207 Creating table
2017-05-01 19:44:05,448{GMT}    INFO    [jepsen nemesis] jepsen.core: Nemesis starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 0] jepsen.core: Worker 0 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 1] jepsen.core: Worker 1 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 2] jepsen.core: Worker 2 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 3] jepsen.core: Worker 3 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 4] jepsen.core: Worker 4 starting
2017-05-01 19:44:05,453{GMT}    INFO    [jepsen test runner] jepsen.core: Tearing down nemesis
2017-05-01 19:44:05,656{GMT}    INFO    [jepsen test runner] jepsen.core: Nemesis torn down
2017-05-01 19:44:05,659{GMT}    INFO    [jepsen test runner] jepsen.core: Snarfing log files
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 35.185.85.207] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr

(jepsen shuts down the test immediately after initialization)

Here we could extend the jepsen code to be a bit more verbose about what went wrong.

The reason for this failure is to be found in the node log files: the CREATE statement fails to complete (the server is killed forcefully before the CREATE finishes processing), presumably because of the lease holder problem also reported earlier in the log files.

The root cause is that jepsen is simply too forceful when killing the servers. Now that cockroach quit works better, we should teach the cockroachdb code in jepsen to use it. Again, I can probably help with this, but so could anyone with minimal Clojure exposure.
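
(A minimal sketch of what a graceful stop could look like, assuming an insecure cluster and the binary under /opt/cockroach; it falls back to the old kill -9 only if quit fails. This is not the actual jepsen.cockroach code.)

(require '[jepsen.control :as c])

(defn graceful-stop!
  "Drain and stop cockroach on the current node via `cockroach quit` instead
  of kill -9. Assumes a control session to the node is already open, as it is
  during db teardown."
  []
  (c/su
    (try
      (c/exec "/opt/cockroach/cockroach" :quit :--insecure)
      (catch Exception e
        ;; quit failed or hung; fall back to the old forceful kill
        (c/exec :pkill :-9 :cockroach)))))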

@bdarnell
Contributor

Fixed by #16874

@andreimatei
Contributor

Well, I've done another run manually and it failed... But it looks to me like it failed differently than before. I don't understand much from the log at the moment but... looking...

https://teamcity.cockroachdb.com/viewLog.html?buildId=304925&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_Jepsen#testNameId2988281092275661567

@bdarnell
Contributor

Yeah, I don't think I've seen that one before:

ERROR [2017-07-25 16:01:32,197] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
org.postgresql.util.PSQLException: This connection has been closed.
