
release: v19.2.0-alpha.20190805 #39036

Closed
14 tasks done
nvanbenschoten opened this issue Jul 22, 2019 · 11 comments

@nvanbenschoten
Member

nvanbenschoten commented Jul 22, 2019

Candidate SHA: 5bd37e8
Deployment status: testing on nathan-release-v1920-alpha20190701 (node 1) and nathan-release-v1920-alpha20190729 (node 1)
Release Qualification Suite: TeamCity (passed)
Nightly Suite: TeamCity (signing off)

Release process checklist

Prep date: 7/22/2019

Release date: 8/5/2019

nvanbenschoten self-assigned this Jul 22, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application
craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a Raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder, the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex error) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have such obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added the proposal back to the proposal map. But then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the command would have been reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <[email protected]>
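
To make the failure mode above easier to follow, here is a minimal, self-contained sketch of the lease-applied-index check and how two entries for the same proposal can slip past it. It is a toy model, not CockroachDB's apply loop: the `entry` and `replica` types and the index values are invented, and only the rejection rule (an entry applies only if its max lease index exceeds the replica's current LeaseAppliedIndex) mirrors the invariant linked in the PR description.

```go
// Toy model of the lease-applied-index invariant; none of these types exist
// in CockroachDB, they only illustrate the scenario described above.
package main

import "fmt"

type entry struct {
	proposalID    string
	maxLeaseIndex uint64
}

type replica struct {
	leaseAppliedIndex uint64
	applied           map[string]int // proposalID -> number of times applied
}

// applyBatch applies an entry only if its max lease index is strictly greater
// than the replica's current LeaseAppliedIndex; otherwise the entry is
// rejected as stale (the invariant from replica_raft.go).
func (r *replica) applyBatch(batch []entry) {
	for _, e := range batch {
		if e.maxLeaseIndex <= r.leaseAppliedIndex {
			fmt.Printf("rejected %s at MLI %d (LAI is already %d)\n",
				e.proposalID, e.maxLeaseIndex, r.leaseAppliedIndex)
			continue
		}
		r.leaseAppliedIndex = e.maxLeaseIndex
		r.applied[e.proposalID]++
		fmt.Printf("applied %s at MLI %d\n", e.proposalID, e.maxLeaseIndex)
	}
}

func main() {
	r := &replica{leaseAppliedIndex: 11, applied: map[string]int{}}

	// Proposal p1 was reproposed, so two log entries exist for it: a stale one
	// at MLI 11 and a fresh one at MLI 12. Both land in the same batch: the
	// stale entry is rejected, the fresh one applies and bumps the LAI to 12.
	r.applyBatch([]entry{
		{proposalID: "p1", maxLeaseIndex: 11},
		{proposalID: "p1", maxLeaseIndex: 12},
	})

	// The bug: handling the rejection reproposed p1 yet again at a new MLI 13,
	// even though p1 had already applied in the same batch. The new entry
	// passes the check and applies a second time.
	r.applyBatch([]entry{{proposalID: "p1", maxLeaseIndex: 13}})

	fmt.Println("p1 applied", r.applied["p1"], "times") // 2 — duplicate application
}
```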
@nvanbenschoten
Member Author

nvanbenschoten commented Aug 2, 2019

Roachtests that failed during signoff:

  • tpccbench/nodes=9/cpu=4/chaos/partition
  • clearrange/checks=false
  • clearrange/checks=true
  • clock/jump/large_backward_enabled
  • jepsen/register/split

@nvanbenschoten
Member Author

tpccbench/nodes=9/cpu=4/chaos/partition failed when installing haproxy due to an apt-get failure.

Fetched 26.3 MB in 5s (5,069 kB/s)
Reading package lists...
E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/dists/xenial/universe/binary-amd64/by-hash/SHA256/c8bc8c1425fef0712430d3991cf15ea96892aa5e13130b36c126fa90887ca756  Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.
Error:  exit status 100
06:37:56 test.go:269: test failure: 	tpcc.go:693,tpcc.go:559,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod install teamcity-1564706298-09-n10cpu4:10 -- haproxy returned:
		stderr:
		
		stdout:
		iverse Translation-en [8,440 B]
		Get:38 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main Sources [4,848 B]
		Get:39 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe Sources [6,740 B]
		Get:40 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [7,280 B]
		Get:41 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main Translation-en [4,456 B]
		Get:42 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7,804 B]
		Get:43 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe Translation-en [4,184 B]
		Fetched 26.3 MB in 5s (5,069 kB/s)
		Reading package lists...
		E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/dists/xenial/universe/binary-amd64/by-hash/SHA256/c8bc8c1425fef0712430d3991cf15ea96892aa5e13130b36c126fa90887ca756  Hash Sum mismatch
		E: Some index files failed to download. They have been ignored, or old ones used instead.
		Error:  exit status 100
		: exit status 1
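
The `Hash Sum mismatch` is a transient problem with the Ubuntu mirror's metadata rather than anything in the candidate build, so retrying `apt-get update` (optionally after clearing apt's local cache) normally clears it. A rough sketch of that retry is below; it is not roachprod's actual install code, and the helper name and retry policy are assumptions.

```go
// aptUpdateWithRetry is a hypothetical helper (not roachprod code) that
// retries `apt-get update` when the transient "Hash Sum mismatch" error shows
// up in its output.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

func aptUpdateWithRetry(attempts int) error {
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("sudo", "apt-get", "update").CombinedOutput()
		if err == nil && !strings.Contains(string(out), "Hash Sum mismatch") {
			return nil
		}
		log.Printf("apt-get update attempt %d failed: %v", i+1, err)
		// Clear apt's local archive cache and wait a bit before retrying; the
		// mirror's by-hash metadata usually converges within seconds.
		_ = exec.Command("sudo", "apt-get", "clean").Run()
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("apt-get update: still failing after %d attempts", attempts)
}

func main() {
	if err := aptUpdateWithRetry(3); err != nil {
		log.Fatal(err)
	}
	// With fresh package lists, the install that failed above should succeed.
	if out, err := exec.Command("sudo", "apt-get", "install", "-y", "haproxy").CombinedOutput(); err != nil {
		log.Fatalf("install haproxy: %v\n%s", err, out)
	}
	fmt.Println("haproxy installed")
}
```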

@nvanbenschoten
Member Author

clearrange/checks=true failed due to a node hitting:

F190802 10:30:59.381932 105 storage/replica_raft.go:801  [n1,s1,r17332/4:/Table/53/1/4767{4005-6941}] while committing batch: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/029340.log: No space left on device

clearrange/checks=false timed out:

11:14:31 test.go:199: test worker status: dropping table
11:14:32 test.go:199: test worker status: computing number of ranges
1: 14929
8: 29732
3: 24382
10: 16741
6: 14114
2: 1042
4: 20558
7: 24278
9: 4540
5: 23522
11:14:32 test.go:199: test worker status: dropping bank table
11:47:43 test.go:290: test failure: 	test_runner.go:706: test timed out (6h30m0s)

@ajkr you've been looking into these tests in #38772 and #38720. Could you make a determination about whether these are concerning or expected, and whether they should block the release from going out?

@nvanbenschoten
Member Author

nvanbenschoten commented Aug 2, 2019

clock/jump/large_backward_enabled is what we're seeing in #38723:

F190802 05:43:58.596975 167 util/hlc/hlc.go:241  detected forward time jump of 0.598804 seconds is not allowed with tolerance of 0.250000 seconds

@tbg could you make a determination about whether this is concerning and whether it should block the release from going out?
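
For context, the fatal message comes from the hlc package's forward clock jump detection: a background monitor samples the wall clock on a short interval and crashes the process if consecutive samples jump by more than the allowed tolerance. The sketch below is a simplified standalone version of that idea, not the util/hlc implementation; the tick interval is an assumption, and only the 250ms tolerance matches the log line.

```go
// Simplified illustration of forward clock jump detection; not the actual
// util/hlc code.
package main

import (
	"log"
	"time"
)

func monitorForwardClockJumps(interval, tolerance time.Duration) {
	prev := time.Now()
	for range time.Tick(interval) {
		now := time.Now()
		// If the wall clock advanced by much more than one tick interval,
		// something external (an NTP step, a paused VM, a test leaking clock
		// offsets onto the host) jumped the clock forward.
		if jump := now.Sub(prev) - interval; jump > tolerance {
			log.Fatalf("detected forward time jump of %f seconds is not allowed with tolerance of %f seconds",
				jump.Seconds(), tolerance.Seconds())
		}
		prev = now
	}
}

func main() {
	go monitorForwardClockJumps(100*time.Millisecond, 250*time.Millisecond)
	select {} // run the monitor until the process is killed
}
```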

@nvanbenschoten
Member Author

jepsen/register/split crashed with the error:

10:37:18 jepsen.go:227: grabbing artifacts from controller. Tail of controller log:
10:37:18 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564706298-29-n6cpu4:6 -- tail -n 100 /mnt/data1/jepsen/cockroachdb/invoke.log
	at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.8.0.jar:na]
	at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.8.0.jar:na]
	at clojure.core$apply.invokeStatic(core.clj:650) [clojure-1.8.0.jar:na]
	at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) [clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.invoke(RestFn.java:397) [clojure-1.8.0.jar:na]
	at clojure.lang.AFn.run(AFn.java:22) [clojure-1.8.0.jar:na]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
...
ERROR [2019-08-02 10:37:15,724] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.lang.InterruptedException: null
	at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:211) ~[na:1.8.0_222]
	at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:362) ~[na:1.8.0_222]
	at jepsen.generator.Synchronize.op(generator.clj:667) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.generator.Concat.op(generator.clj:609) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.generator$op_and_validate.invokeStatic(generator.clj:34) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.generator$op_and_validate.invoke(generator.clj:30) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core.NemesisWorker.run_worker_BANG_(core.clj:442) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core$do_worker_BANG_.invokeStatic(core.clj:175) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core$do_worker_BANG_.invoke(core.clj:162) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core$run_workers_BANG_$fn__4599$fn__4600.invoke(core.clj:228) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at clojure.lang.AFn.applyToHelper(AFn.java:152) ~[clojure-1.8.0.jar:na]
	at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.8.0.jar:na]
	at clojure.core$apply.invokeStatic(core.clj:646) ~[clojure-1.8.0.jar:na]
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) ~[clojure-1.8.0.jar:na]
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) ~[clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.invoke(RestFn.java:425) ~[clojure-1.8.0.jar:na]
	at clojure.lang.AFn.applyToHelper(AFn.java:156) ~[clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.applyTo(RestFn.java:132) ~[clojure-1.8.0.jar:na]
	at clojure.core$apply.invokeStatic(core.clj:650) ~[clojure-1.8.0.jar:na]
	at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) ~[clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.8.0.jar:na]
	at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.8.0.jar:na]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_222]

@bdarnell could you make a determination about whether this is concerning and whether it should block the release from going out?

@ajkr
Contributor

ajkr commented Aug 5, 2019

> @ajkr you've been looking into these tests in #38772 and #38720. Could you make a determination about whether these are concerning or expected, and whether they should block the release from going out?

Those are expected. We're not yet able to import data as fast as that test wants.

@tbg
Member

tbg commented Aug 5, 2019

@nvanbenschoten the jepsen failure is #30527

I'm relatively sure that #38723 will be something silly related to the roachtest rewrite. I believe that we'll sometimes run more than one of these tests on a cluster (sequentially) and that the contained ntp adjustments can leak between tests.

@bdarnell
Contributor

bdarnell commented Aug 5, 2019

The jepsen failure is benign. It failed with an InterruptedException (which doesn't fail the build), then it failed downloading logs. We should probably check the ignoreError flag if we fail while building failure-logs.tbz.
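
A minimal sketch of what that ignoreError check could look like is below. Only the flag name and the failure-logs.tbz artifact come from the comment above; the function, its signature, and the tar command are assumptions, not the actual roachtest jepsen harness code.

```go
package main

import (
	"context"
	"errors"
	"log"
)

// collectFailureLogs tars up the Jepsen artifacts. If the run is already in a
// state where errors are being ignored (e.g. the benign InterruptedException
// case), a failure while building failure-logs.tbz is logged and swallowed
// instead of failing the test.
func collectFailureLogs(ctx context.Context, run func(context.Context, string) error, ignoreError bool) error {
	err := run(ctx, "tar -cjf failure-logs.tbz /mnt/data1/jepsen/cockroachdb")
	if err != nil && ignoreError {
		log.Printf("ignoring error while building failure-logs.tbz: %v", err)
		return nil
	}
	return err
}

func main() {
	// Stand-in for the remote command runner; always fails, like the log grab above.
	flaky := func(context.Context, string) error { return errors.New("exit status 1") }
	if err := collectFailureLogs(context.Background(), flaky, true /* ignoreError */); err != nil {
		log.Fatal(err)
	}
	log.Print("artifact collection failed but the test result is preserved")
}
```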

nvanbenschoten changed the title from release: v19.2.0-alpha.20190729 to release: v19.2.0-alpha.20190805 on Aug 5, 2019
@nvanbenschoten
Member Author

I changed the release name while tagging the release. This forced me to re-publish provisional binaries. I'm re-running the release qualification build on these binaries here.

@nvanbenschoten
Member Author

The release is out the door.

@kenliu

kenliu commented Aug 6, 2019

thanks @nvanbenschoten
