jepsen: Bank test failed #17815

Closed
bdarnell opened this issue Aug 22, 2017 · 10 comments

@bdarnell
Contributor

The jepsen bank test run on Aug 19 failed with a bunch of wrong-total errors:

INFO [2017-08-19 06:23:31,518] main - jepsen.core {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :timeline {:valid? true},
 :details
 {:valid? false,
  :bad-reads
  [{:type :wrong-total,
    :expected 50,
    :found 48,
    :op
    {:type :ok,
     :f :read,
     :process 346,
     :time 119962611486,
     :value [1 7 4 20 16]}}
   {:type :wrong-total,
    :expected 50,
    :found 48,
    :op
    {:type :ok,
     :f :read,
     :process 506,
     :time 119962752422,
     :value [1 7 4 20 16]}}
   {:type :wrong-total,
    :expected 50,
    :found 48,
    :op
    {:type :ok,
     :f :read,
     :process 261,
     :time 119962852232,
     :value [1 7 4 20 16]}}

(The errors continue; it looks like a single error changed the total from 50 to 48, and every subsequent read then reported the same discrepancy.)
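
For context, the wrong-total complaint is simple arithmetic: each read returns the balances of every account, and any read whose balances don't sum to the fixed starting total is flagged. Below is a minimal sketch of that check in Go; the real checker is Jepsen's Clojure bank workload, and the names here are purely illustrative.

package main

import "fmt"

// checkRead flags a bank-test read whose balances do not sum to the
// expected invariant total. Illustrative sketch only; the real checker
// is Jepsen's Clojure bank workload.
func checkRead(expected int, balances []int) error {
	found := 0
	for _, b := range balances {
		found += b
	}
	if found != expected {
		return fmt.Errorf("wrong-total: expected %d, found %d (read %v)", expected, found, balances)
	}
	return nil
}

func main() {
	// The failing reads above returned [1 7 4 20 16], which sums to 48, not 50.
	if err := checkRead(50, []int{1, 7, 4, 20, 16}); err != nil {
		fmt.Println(err)
	}
}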

The bank test has no recent history of failures like this, but it does have a history of flakiness for other reasons, so it's hard to tell when this rare failure was introduced.

bdarnell added this to the 1.1 milestone Aug 22, 2017
bdarnell self-assigned this Aug 22, 2017
@bdarnell
Contributor Author

jepsen-bank-logs.zip

@nvanbenschoten
Member

Are you able to reproduce this easily? If so, could you try with a patch to beginCmds?

@bdarnell
Contributor Author

Yes. It's not quick (it takes several hours at least), but it's easy to try.

@nvanbenschoten
Member

Could you kick off a test using my branch from #17939?

@bdarnell
Contributor Author

I'm running a test with #17939 now (I had tried to run it this morning, but forgot that the jepsen test scripts don't automatically rebuild the binary, so I was testing the wrong build). We'll know tonight whether this fixes it, or at least have some supporting evidence: the failure appears to reproduce roughly every 3 hours, but I don't have many data points behind that estimate.

We see a lot of "context canceled while in command queue" messages in the logs (often affecting PushTxn), so I think it's likely to be the same issue.

35.196.90.5/cockroach.stderr:W170828 18:06:09.389224 82983 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: PushTxn [/Table/51/1/0/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:09.389368 82845 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: PushTxn [/Table/51/1/0/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:09.389490 83010 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: ResolveIntent [/Table/51/1/2/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:09.389506 82925 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: PushTxn [/Table/51/1/0/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:11.226157 83578 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: QueryTxn [/Table/51/1/2/0,/Min)
35.196.90.5/cockroach.stderr:W170828 18:06:11.482199 83673 storage/replica.go:2129  [n3,s3,r41/2:/{Table/51-Max}] context canceled while in command queue: QueryTxn [/Table/51/1/2/0,/Min)
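
For readers unfamiliar with the warning, these messages come from commands whose context is canceled while they are still waiting in the command queue. The following is a rough, hypothetical sketch of that wait-with-cancellation pattern in Go; it is not the actual beginCmds/replica.go code, and all names are made up.

package main

import (
	"context"
	"fmt"
	"time"
)

// waitForPrereqs is a simplified illustration of a command-queue style wait:
// a command blocks until overlapping commands ahead of it finish, but must
// also bail out (and be cleaned out of the queue) if its context is canceled.
func waitForPrereqs(ctx context.Context, prereqs []<-chan struct{}) error {
	for _, done := range prereqs {
		select {
		case <-done:
			// Prerequisite command finished; move on to the next one.
		case <-ctx.Done():
			// The caller gave up (e.g. a client timeout). The command must
			// be removed from the queue at this point, or commands queued
			// behind it can block on an entry that will never complete.
			return fmt.Errorf("context canceled while in command queue: %v", ctx.Err())
		}
	}
	return nil
}

func main() {
	prereq := make(chan struct{})
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	// The prerequisite never completes, so the wait ends with cancellation.
	fmt.Println(waitForPrereqs(ctx, []<-chan struct{}{prereq}))
}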

@bdarnell
Contributor Author

With #17939, it has passed six hours of test runs (before, it hadn't passed more than three hours in a row). I'll keep running it, but so far it's looking good.

@nvanbenschoten
Member

@bdarnell that's a very good sign. Did this run overnight?

@bdarnell
Contributor Author

No, I've just been doing a series of manual 1-hour runs. Longer runs can produce tons of logs and require more analysis time at the end of the run.

@nvanbenschoten
Member

OK, I just merged #17939, so if we see this issue again we'll know the CommandQueue fix didn't fully resolve it.

@bdarnell
Contributor Author

OK, after a few more passing test runs I'm going to call this fixed by #17939.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 29, 2017
The failure in cockroachdb#17815 took several hours of testing to show up. We run
30 different configurations (combinations of test and nemesis) so we
don't want to run everything for too long every night, but we can do
more than the current 3m per configuration.