jepsen: Bank test failed #17815

Closed
bdarnell opened this issue Aug 22, 2017 · 10 comments

@bdarnell
Contributor

The jepsen bank test run on Aug 19 failed with a bunch of wrong-total errors:

INFO [2017-08-19 06:23:31,518] main - jepsen.core {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :timeline {:valid? true},
 :details
 {:valid? false,
  :bad-reads
  [{:type :wrong-total,
    :expected 50,
    :found 48,
    :op
    {:type :ok,
     :f :read,
     :process 346,
     :time 119962611486,
     :value [1 7 4 20 16]}}
   {:type :wrong-total,
    :expected 50,
    :found 48,
    :op
    {:type :ok,
     :f :read,
     :process 506,
     :time 119962752422,
     :value [1 7 4 20 16]}}
   {:type :wrong-total,
    :expected 50,
    :found 48,
    :op
    {:type :ok,
     :f :read,
     :process 261,
     :time 119962852232,
     :value [1 7 4 20 16]}}

(The errors continue; it looks like a single error changed the total from 50 to 48, and every subsequent read then reported the same discrepancy.)
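
For context, the wrong-total complaint is simple arithmetic: each read returns the balances of every account, and any read whose balances don't sum to the fixed starting total is flagged. Below is a minimal sketch of that check in Go; the real checker is Jepsen's Clojure bank workload, and the names here are purely illustrative.

package main

import "fmt"

// checkRead flags a bank-test read whose balances do not sum to the
// expected invariant total. Illustrative sketch only; the real checker
// is Jepsen's Clojure bank workload.
func checkRead(expected int, balances []int) error {
	found := 0
	for _, b := range balances {
		found += b
	}
	if found != expected {
		return fmt.Errorf("wrong-total: expected %d, found %d (read %v)", expected, found, balances)
	}
	return nil
}

func main() {
	// The failing reads above returned [1 7 4 20 16], which sums to 48, not 50.
	if err := checkRead(50, []int{1, 7, 4, 20, 16}); err != nil {
		fmt.Println(err)
	}
}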

The bank test has no recent history of failures like this, but it does have a history of flakiness for other reasons, so it's hard to tell when this rare failure was introduced.

bdarnell added this to the 1.1 milestone Aug 22, 2017
bdarnell self-assigned this Aug 22, 2017
@bdarnell
Contributor Author

jepsen-bank-logs.zip

@nvanbenschoten
Member

Are you able to reproduce this easily? If so, could you try with a patch to beginCmds?

@bdarnell
Contributor Author

Yes. It's not quick (it takes several hours at least), but it's easy to try.

@nvanbenschoten
Member

Could you kick off a test using my branch from #17939?

@bdarnell
Contributor Author

I'm running a test with #17939 now (I had tried to run it this morning, but forgot that the jepsen test scripts don't automatically rebuild the binary, so I was testing the wrong build). We'll know tonight whether this fixes it, or at least have some supporting evidence: the failure appears to reproduce roughly every 3 hours, but I don't have many data points behind that estimate.

We see a lot of "context canceled while in command queue" messages in the logs (often affecting PushTxn), so I think it's likely to be the same issue.

35.196.90.5/cockroach.stderr:W170828 18:06:09.389224 82983 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: PushTxn [/Table/51/1/0/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:09.389368 82845 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: PushTxn [/Table/51/1/0/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:09.389490 83010 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: ResolveIntent [/Table/51/1/2/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:09.389506 82925 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: PushTxn [/Table/51/1/0/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:11.226157 83578 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: QueryTxn [/Table/51/1/2/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:11.482199 83673 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: QueryTxn [/Table/51/1/2/0,/Min)
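
For readers unfamiliar with the warning, these messages come from commands whose context is canceled while they are still waiting in the command queue. The following is a rough, hypothetical sketch of that wait-with-cancellation pattern in Go; it is not the actual beginCmds/replica.go code, and all names are made up.

package main

import (
	"context"
	"fmt"
	"time"
)

// waitForPrereqs is a simplified illustration of a command-queue style wait:
// a command blocks until overlapping commands ahead of it finish, but must
// also bail out (and be cleaned out of the queue) if its context is canceled.
func waitForPrereqs(ctx context.Context, prereqs []<-chan struct{}) error {
	for _, done := range prereqs {
		select {
		case <-done:
			// Prerequisite command finished; move on to the next one.
		case <-ctx.Done():
			// The caller gave up (e.g. a client timeout). The command must
			// be removed from the queue at this point, or commands queued
			// behind it can block on an entry that will never complete.
			return fmt.Errorf("context canceled while in command queue: %v", ctx.Err())
		}
	}
	return nil
}

func main() {
	prereq := make(chan struct{})
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	// The prerequisite never completes, so the wait ends with cancellation.
	fmt.Println(waitForPrereqs(ctx, []<-chan struct{}{prereq}))
}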

@bdarnell
Contributor Author

With #17939, it has passed six hours of test runs (before, it hadn't passed more than three hours in a row). I'll keep running it, but so far it's looking good.

@nvanbenschoten
Member

@bdarnell that's a very good sign. Did this run overnight?

@bdarnell
Contributor Author

No, I've just been doing a series of manual 1-hour runs. Longer runs can produce tons of logs and require more analysis time at the end of the run.

@nvanbenschoten
Member

OK, I just merged #17939, so if we see this issue again we'll know the CommandQueue fix didn't fully resolve it.

@bdarnell
Contributor Author

OK, after a few more passing test runs I'm going to call this fixed by #17939.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 29, 2017
The failure in cockroachdb#17815 took several hours of testing to show up. We run
30 different configurations (combinations of test and nemesis) so we
don't want to run everything for too long every night, but we can do
more than the current 3m per configuration.