nightlies: Jepsen tests intermittently failing #14171

Closed

jordanlewis opened this issue Mar 15, 2017 · 18 comments

@jordanlewis
Member

The nightly Jepsen tests have been failing intermittently. They have been hard to diagnose because they were generating logs larger than TeamCity's default maximum artifact size, which I've since increased.

Here's an example failing run: https://teamcity.cockroachdb.com/viewLog.html?buildId=179909&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_Jepsen

The problem seems to be that during some tests the Jepsen controller fails to contact one or more of the Cockroach nodes after a short period of successful testing. This condition continues for a while until the test throws a NullPointerException and quits. In jepsen.log, which is the controller log, the affected nodes throw errors like this:

2017-03-14 06:22:11,427{GMT}    WARN    [jepsen worker 2] jepsen.cockroach.bank: Encountered error with conn [:cockroach "35.185.88.6"]; reopening
2017-03-14 06:22:11,437{GMT}    WARN    [jepsen worker 2] jepsen.cockroach.bank: Error reopening [:cockroach "35.185.88.6"]
org.postgresql.util.PSQLException: Connection to 35.185.88.6:26257 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.Driver$ConnectThread.getResult(Driver.java:358) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver.connect(Driver.java:282) ~[postgresql-9.4.1211.jar:9.4.1211]
        at java.sql.DriverManager.getConnection(DriverManager.java:664) ~[na:1.8.0_121]
        at java.sql.DriverManager.getConnection(DriverManager.java:208) ~[na:1.8.0_121]
        at clojure.java.jdbc$get_connection.invokeStatic(jdbc.clj:272) ~[classes/:na]
        at clojure.java.jdbc$get_connection.invoke(jdbc.clj:193) ~[classes/:na]
        at jepsen.cockroach.client$client$open__2668$fn__2669.invoke(client.clj:90) ~[classes/:na]
        at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
        at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[na:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[na:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[na:1.8.0_121]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_121]
        at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_121]
        at org.postgresql.core.PGStream.<init>(PGStream.java:64) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:147) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:55) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:219) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver.makeConnection(Driver.java:407) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver.access$100(Driver.java:55) ~[postgresql-9.4.1211.jar:9.4.1211]
        at org.postgresql.Driver$ConnectThread.run(Driver.java:317) ~[postgresql-9.4.1211.jar:9.4.1211]
        ... 1 common frames omitted
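
(For reference, the reopen path in that stack trace corresponds roughly to the sketch below; the db-spec keys and function name are assumptions for illustration, not the actual jepsen.cockroach.client code.)

(require '[clojure.java.jdbc :as j])

(def db-spec
  ;; Assumed connection parameters; the real test builds these per node.
  {:subprotocol "postgresql"
   :subname     "//35.185.88.6:26257/jepsen"
   :user        "root"})

(defn reopen!
  "Close the old connection (if any) and open a fresh one against the same
  node. While the node is down, get-connection throws the
  PSQLException: Connection refused seen in jepsen.log above."
  [old-conn]
  (when old-conn
    (try (.close old-conn) (catch Exception _)))
  (j/get-connection db-spec))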

Investigating the stderr log of the unhappy nodes, I see that they get restarted repeatedly. I'm not sure why that's happening, but it does coincide with the periods in the jepsen log during which the test controller can't connect to the nodes.

There are also a few error messages like the following:

E170314 06:23:35.273281 7527 storage/replica_range_lease.go:194  [n3,s2,r2/3:/Table/{0-11},@c4201e4000] heartbeat failed on epoch increment
E170314 06:23:35.275851 7954 storage/replica_range_lease.go:199  [n3,s2,r2/3:/Table/{0-11},@c4201e4000] mismatch incrementing epoch for node_id:1 epoch:6 expiration:<wall_time:1489472612725375001 logical:0 > ; actual is {NodeID:1 Epoch:6 Expiration:1489472623.490574575,0 Draining:false}

I've uploaded the full logs for this test (13M compressed) to Google Drive for further forensics: https://drive.google.com/a/cockroachlabs.com/file/d/0BxxYvfwgwim6UDYxc0UydGhYcTg/view?usp=sharing

cc @knz

@knz
Contributor

knz commented Mar 16, 2017

The stack trace here is not a symptom of the test failing.
The nemesis purposefully kills random nodes in the test cluster, then later respawns them, to see how the rest of the cluster reacts. During that time clients cannot connect to the nodes that are down, as expected.
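
(For readers unfamiliar with the test, here is a minimal sketch of such a start/kill nemesis built on Jepsen's nemesis/node-start-stopper helper; the pkill command and restart script path are assumptions, not the actual jepsen.cockroach code.)

(require '[jepsen.control :as c]
         '[jepsen.nemesis :as nemesis])

(defn startkill-nemesis
  "On :start, forcefully kill cockroach on one randomly chosen node; on :stop,
  bring it back up again so the cluster can recover."
  []
  (nemesis/node-start-stopper
    rand-nth                                     ; pick one victim node
    (fn kill [test node]
      (c/su (c/exec :pkill :-9 :cockroach))      ; SIGKILL, as the test does today
      [:killed node])
    (fn restart [test node]
      (c/su (c/exec "/opt/cockroach/start.sh"))  ; hypothetical restart helper
      [:restarted node])))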

There are two things in the TC short log which point in different directions:

  1. the test has indicated it has failed, which needs to be investigated;
  2. the artifact upload has failed, which ultimately would cause TC to report the test as failed too.

I can have a brief look at (1); however, for (2) I would suggest changing the test runner to upload the artifacts to S3 instead of TC, and merely providing TC with a link to the S3 directory where the artifacts are.
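
(A rough sketch of option (2), assuming the runner can shell out to the AWS CLI; the bucket name and destination layout are hypothetical.)

(require '[clojure.java.shell :refer [sh]])

(defn upload-artifacts!
  "Sync Jepsen's local store/ directory to S3 and print the resulting URL so
  TeamCity only has to record a link instead of the multi-GB artifacts."
  [build-id]
  (let [dest (str "s3://cockroach-jepsen-artifacts/" build-id "/")]
    (sh "aws" "s3" "sync" "store/" dest)
    (println "Jepsen artifacts uploaded to" dest)))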

@knz
Contributor

knz commented Mar 16, 2017

So the log file also says there was a connection error when retrieving the final bank account states. This is less expected because by that time all the nodes should have been restarted. Investigating further now.

@knz
Contributor

knz commented Mar 16, 2017

@jordanlewis is it always the same test that fails?

@jordanlewis
Member Author

After fixing the archive issues via #14213 and #14229, the following run demonstrates that it's not always the same test that fails, and that the failing tests don't always include startkill2, the nemesis we suspected was causing the problem.

https://teamcity.cockroachdb.com/viewLog.html?buildId=185141&buildTypeId=Cockroach_Nightlies_Jepsen&tab=buildLog

The artifacts are available for forensics.

@bdarnell bdarnell added this to the 1.0 milestone Apr 11, 2017
@bdarnell bdarnell self-assigned this Apr 11, 2017
@bdarnell
Contributor

I'm looking at this and I'm having trouble interpreting the logs. Where exactly do the failures show up?

In the run from the last comment, 11 of the 19 configurations failed, with no error messages that I can find in the logs (and they run quickly: just a few seconds from "setup complete" to "tearing down nemesis"). It seems significant that the failing configurations are lexicographically consecutive: all runs of register and sequential (5 configurations each) and the first configuration of sets. It looks like there's some external factor messing up the machines or the network during this period.

@bdarnell
Contributor

It seems significant that the failing configurations are lexicographically consecutive

Nope, that's a red herring. We don't run the tests in lexicographic order, so the monotonic tests are running (and passing) in between register and sets.

@cuongdo
Contributor

cuongdo commented May 1, 2017

I have a hard time making sense of these logs, but I've got some on my azworker. I looked at one failure, and here are the suspicious logs I see after the restart of a node (the only one with incriminating logs for that test):

E170501 19:44:03.376992 134 storage/queue.go:634  [raftsnapshot,n2,s2,r21/3:/{Table/50-Max}] snapshot failed: (n3,s3):2: remote couldn't accept snapshot with error: [n3,s3],r21: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r10/2:/{Table/15-Max}]
I170501 19:44:04.376396 134 storage/replica_raftstorage.go:415  [raftsnapshot,n2,s2,r21/3:/{Table/50-Max}] generated Raft snapshot ce822a8f at index 13
E170501 19:44:04.377277 134 storage/queue.go:634  [raftsnapshot,n2,s2,r21/3:/{Table/50-Max}] snapshot failed: (n3,s3):2: remote couldn't accept snapshot with error: [n3,s3],r21: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r10/2:/{Table/15-Max}]
W170501 19:44:04.648977 351 storage/replica.go:1730  [n2,s2,r10/3:/Table/{15-50}] context deadline exceeded while in command queue: PushTxn [/Local/Range/"\x97"/RangeDescriptor,/Min)
W170501 19:44:04.650485 432 storage/replica.go:1730  [n2,s2,r10/3:/Table/{15-50}] context deadline exceeded while in command queue: PushTxn [/Local/Range/"\x97"/RangeDescriptor,/Min)
W170501 19:44:04.765874 384 storage/intent_resolver.go:313  [n2,s2,r1/2:/{Min-System/}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: [NotLeaseHolderError] range 10: replica (n2,s2):3 not lease holder; lease holder unknown
I170501 19:44:05.237539 130 storage/split_queue.go:92  [split,n2,s2,r21/3:/{Table/50-Max}] splitting at key /Table/51/0
I170501 19:44:05.266566 130 storage/replica_command.go:2630  [split,n2,s2,r21/3:/{Table/50-Max}] initiating a split of this range at key /Table/51 [r31]
I170501 19:44:05.376124 134 storage/replica_raftstorage.go:415  [raftsnapshot,n2,s2,r21/3:/Table/5{0-1}] generated Raft snapshot 625b3c6c at index 17
I170501 19:44:05.376993 134 storage/store.go:3341  [raftsnapshot,n2,s2,r21/3:/Table/5{0-1}] streamed snapshot to (n3,s3):2: kv pairs: 13, log entries: 7, rate-limit: 8.0 MiB/sec, 1ms

Full logs for a failing test are here:

failure-logs.zip

Someone from the core team should look at them.

@cuongdo
Contributor

cuongdo commented May 1, 2017

Actually, one other node in the aforementioned test had this warning:

W170501 19:44:04.650896 503 storage/intent_resolver.go:313  [n1,s1,r1/1:/{Min-System/}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

@a-robinson
Contributor

Full logs for a failing test are here:

failure-logs.zip

What's the actual failure in these logs? I don't see anything that looks like an error message in jepsen.log.

@cuongdo
Contributor

cuongdo commented May 1, 2017 via email

@a-robinson
Contributor

Ping - can anyone who's looked at our jepsen logs before tell what's actually failing? It'd be really nice to have these tests working on a release candidate.

@cuongdo @knz @bdarnell @jordanlewis

@bdarnell
Contributor

bdarnell commented May 2, 2017

I've looked at jepsen logs before, but only when running the tests by hand, not as run in TeamCity. In these instances I haven't found any errors I can recognize (and I don't remember the details of running the tests well enough to know where else to look).

@jordanlewis
Member Author

I'll take another look tomorrow. Note that even running the tests manually produces inscrutable results. See the Google Drive link I posted above for a full archive of the artifacts of a run I did by hand.

@jordanlewis
Member Author

Err, that link might be to a TC run, but it was no worse than the manual run IIRC.

@knz
Contributor

knz commented May 3, 2017

There are two problems here.

  1. apparently TC does not gather jepsen.log in the artifacts, only the sub-directories containing each node's data. We need to scrutinize the test runner. Jordan, perhaps we can sit together and do that.
  2. in the specific archive you linked in your comment, @cuongdo, the lack of errors indicates that jepsen was unable to initialize the test properly; you can see this in jepsen.log:
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.cockroach.auto: 35.185.67.241 Cockroach started
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.cockroach.auto: 104.196.104.156 Cockroach started
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.cockroach.auto: 35.185.107.220 Cockroach started
2017-05-01 19:43:58,078{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.cockroach.auto: 104.196.216.177 Cockroach started
2017-05-01 19:44:03,078{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.cockroach: 104.196.104.156 Restarted to work around balancing bug
2017-05-01 19:44:03,078{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.cockroach: 35.185.67.241 Restarted to work around balancing bug
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.cockroach: 35.185.107.220 Restarted to work around balancing bug
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.cockroach: 104.196.104.156 Setup complete
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.cockroach: 104.196.216.177 Restarted to work around balancing bug
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.cockroach: 35.185.67.241 Setup complete
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.cockroach: 35.185.107.220 Setup complete
2017-05-01 19:44:03,079{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.cockroach: 104.196.216.177 Setup complete
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-4] jepsen.cockroach.register: 35.185.85.207 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-2] jepsen.cockroach.register: 35.185.107.220 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-0] jepsen.cockroach.register: 35.185.67.241 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-1] jepsen.cockroach.register: 104.196.216.177 Connected
2017-05-01 19:44:03,144{GMT}    INFO    [clojure-agent-send-off-pool-3] jepsen.cockroach.register: 104.196.104.156 Connected
2017-05-01 19:44:05,165{GMT}    INFO    [clojure-agent-send-off-pool-4] jepsen.cockroach.register: 35.185.85.207 Creating table
2017-05-01 19:44:05,448{GMT}    INFO    [jepsen nemesis] jepsen.core: Nemesis starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 0] jepsen.core: Worker 0 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 1] jepsen.core: Worker 1 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 2] jepsen.core: Worker 2 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 3] jepsen.core: Worker 3 starting
2017-05-01 19:44:05,450{GMT}    INFO    [jepsen worker 4] jepsen.core: Worker 4 starting
2017-05-01 19:44:05,453{GMT}    INFO    [jepsen test runner] jepsen.core: Tearing down nemesis
2017-05-01 19:44:05,656{GMT}    INFO    [jepsen test runner] jepsen.core: Nemesis torn down
2017-05-01 19:44:05,659{GMT}    INFO    [jepsen test runner] jepsen.core: Snarfing log files
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 35.185.67.241] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 35.185.107.220] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 35.185.85.207] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 104.196.104.156] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2017-05-01 19:44:05,662{GMT}    INFO    [jepsen node 104.196.216.177] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr

(jepsen shuts down the test immediately after initialization)

Here we could extend the jepsen code to be a bit more verbose about what went wrong.

The reason for this failure is to be found in the node log files: the CREATE statement fails to complete (the server is killed forcefully before the CREATE finishes processing), presumably because of the lease holder problem also reported earlier in the log files.

The root cause is that jepsen is simply too forceful when killing the servers. Now that cockroach quit works better, we should teach the cockroachdb code in jepsen to use it. Again, I can probably help with this, but so could anyone with minimal Clojure exposure.
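
(A minimal sketch of what a graceful stop could look like, assuming an insecure cluster and the binary under /opt/cockroach; it falls back to the old kill -9 only if quit fails. This is not the actual jepsen.cockroach code.)

(require '[jepsen.control :as c])

(defn graceful-stop!
  "Drain and stop cockroach on the current node via `cockroach quit` instead
  of kill -9. Assumes a control session to the node is already open, as it is
  during db teardown."
  []
  (c/su
    (try
      (c/exec "/opt/cockroach/cockroach" :quit :--insecure)
      (catch Exception e
        ;; quit failed or hung; fall back to the old forceful kill
        (c/exec :pkill :-9 :cockroach)))))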

@bdarnell
Contributor

Fixed by #16874

@andreimatei
Contributor

Well, I've done another run manually and it failed... But it looks to me like it failed differently than before. I don't understand much from the log at the moment but... looking...

https://teamcity.cockroachdb.com/viewLog.html?buildId=304925&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_Jepsen#testNameId2988281092275661567

@bdarnell
Contributor

Yeah, I don't think I've seen that one before:

ERROR [2017-07-25 16:01:32,197] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
org.postgresql.util.PSQLException: This connection has been closed.
