nightlies: Jepsen tests intermittently failing #14171
Comments
The stack trace here is not a symptom of the test failing. There are two things in the TC short log which point in different directions:
I can have a brief look at (1); however, for (2) I would suggest changing the test runner to upload the artifacts to S3 instead of TC, and merely provide TC with a link to the S3 directory where the artifacts are. |
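To make the S3 suggestion concrete, here is a minimal sketch of the kind of link TC could store instead of the artifacts themselves. The bucket name and key layout are hypothetical, not the project's actual scheme; the upload itself (e.g. via the AWS CLI or SDK) is omitted.

```python
# Hypothetical sketch: build the S3 key and browse URL for one build's Jepsen
# artifacts, so TeamCity only needs to record the link. Bucket name and key
# layout are assumptions for illustration.
def artifact_url(bucket: str, build_id: str, test_name: str) -> str:
    key = f"jepsen/{build_id}/{test_name}/artifacts.tgz"
    # Virtual-hosted-style S3 URL
    return f"https://{bucket}.s3.amazonaws.com/{key}"

print(artifact_url("nightly-artifacts", "179909", "bank"))
```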
So the log file also says there was a connection error when retrieving the final bank account states. This is less expected because by that time all the nodes should have been restarted. Investigating further now. |
@jordanlewis is it always the same test that fails? |
I'm looking at this and I'm having trouble interpreting the logs. Where exactly do the failures show up? In the run from the last comment, 11 of the 19 configurations failed, with no error messages that I can find in the logs (and they run quickly - just a few seconds from "setup complete" to "tearing down nemesis"). It seems significant that the failing configurations are lexicographically consecutive: all runs of |
Nope, that's a red herring. We don't run the tests in lexicographic order so the |
I have a hard time making sense of these logs, but I've got some on my
Full logs for a failing test are here: Someone from the core team should look at them. |
Actually, one other node in the aforementioned test had this warning:
|
What's the actual failure in these logs? I don't see anything that looks like an error message in jepsen.log. |
I don't know. This is the first time I'm looking at these logs. @knz?
|
Ping - can anyone who's looked at our jepsen logs before tell what's actually failing? It'd be really nice to have these tests working on a release candidate. |
I've looked at jepsen logs before, but only when running the tests by hand, not as run in TeamCity. In these instances I haven't found any errors I can recognize. (And I don't remember the details of running the tests well enough to know where else to look.) |
I'll take another look tomorrow. Note that even running the tests manually produces inscrutable results. See the Google Drive link I posted above for a full archive of the artifacts of a run I did by hand. |
Err, that link might be to a TC run, but it was no worse than the manual run IIRC. |
There are two problems here.
(jepsen shuts down the test immediately after initialization) Here we could extend the jepsen code to be more verbose about what went wrong. The reason for this failure is to be found in the node log files: the CREATE statement fails to complete (the server is killed forcefully before the CREATE finishes processing), presumably because of the leaseholder problem also reported earlier in the log files. The root cause is that jepsen is simply too forceful at killing the servers. Now that |
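The "too forceful at killing" diagnosis above suggests the usual gentler pattern: SIGTERM first, a grace period for in-flight statements (like the CREATE) to finish, and SIGKILL only as a fallback. A minimal sketch, with an assumed grace period; this is not the actual jepsen nemesis code, which is Clojure driving the kill over SSH:

```python
# Hypothetical sketch of a graceful stop: ask the server to drain via SIGTERM,
# and only SIGKILL it if it hasn't exited within the grace period.
import signal
import subprocess

def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    proc.send_signal(signal.SIGTERM)      # let the server finish in-flight work
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                       # forceful fallback, as jepsen does today
        return proc.wait()

# Example with a stand-in long-running process:
p = subprocess.Popen(["sleep", "60"])
print(stop_gracefully(p, grace_seconds=1.0))
```

On POSIX, a process ended by SIGTERM reports a negative return code, so the caller can tell a clean drain from a forced kill.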
Fixed by #16874 |
Well, I've done another run manually and it failed... But it looks to me like it failed differently than before. I don't understand much from the log at the moment but... looking... |
Yeah, I don't think I've seen that one before:
|
The nightly Jepsen tests have been intermittently failing. This has been hard to diagnose because they were generating logs larger than TeamCity's default maximum artifact size, which I've since increased.
Here's an example failing run: https://teamcity.cockroachdb.com/viewLog.html?buildId=179909&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_Jepsen
The problem seems to be that during some tests the Jepsen controller fails to contact one or more of the Cockroach nodes after a short period of successful testing. This continues for a while until the test throws a NullPointerException and quits. In jepsen.log, which is the controller log, the affected nodes throw errors like this:

Investigating the stderr log of the unhappy nodes, I see that they get restarted repeatedly. I'm not sure why that's happening, but it coincides with the periods in the jepsen log during which the test controller can't connect to the nodes.
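To line those restarts up against the connection errors, one approach is to pull the timestamps of fresh-start lines out of a node's stderr log and compare them with the error windows in jepsen.log. A sketch, where the `build:` start-line marker and the timestamp shape are assumptions about the log format:

```python
# Hypothetical sketch: extract timestamps of server-start lines from a node
# log, to correlate with connection-error windows in jepsen.log.
import re

def restart_times(log_text: str, marker: str = "build:") -> list:
    """Return HH:MM:SS timestamps of lines containing the start marker."""
    times = []
    for line in log_text.splitlines():
        if marker in line:
            m = re.search(r"\d{2}:\d{2}:\d{2}", line)
            if m:
                times.append(m.group())
    return times

sample = (
    "I170501 17:31:02 build: CockroachDB ...\n"
    "W170501 17:31:05 some warning\n"
    "I170501 17:40:11 build: CockroachDB ...\n"
)
print(restart_times(sample))  # one entry per restart
```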
There are also a few error messages like the following:
I've uploaded the full logs for this test (13M compressed) to Google Drive for further forensics: https://drive.google.com/a/cockroachlabs.com/file/d/0BxxYvfwgwim6UDYxc0UydGhYcTg/view?usp=sharing
cc @knz