
release: v19.2.0-alpha.20190805 #39036

Closed
14 tasks done
nvanbenschoten opened this issue Jul 22, 2019 · 11 comments

@nvanbenschoten
Member

nvanbenschoten commented Jul 22, 2019

Candidate SHA: 5bd37e8
Deployment status: testing on nathan-release-v1920-alpha20190701 (node 1) and nathan-release-v1920-alpha20190729 (node 1)
Release Qualification Suite: TeamCity (passed)
Nightly Suite: TeamCity (signing off)

Release process checklist

Prep date: 7/22/2019

Release date: 8/5/2019

nvanbenschoten self-assigned this Jul 22, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application
craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a Raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder, the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex error) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have such obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added the proposal back to the proposal map. But then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the command would have been reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <[email protected]>
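
To make the failure mode above easier to follow, here is a minimal, self-contained sketch of the lease-applied-index check and how two entries for the same proposal can slip past it. It is a toy model, not CockroachDB's apply loop: the `entry` and `replica` types and the index values are invented, and only the rejection rule (an entry applies only if its max lease index exceeds the replica's current LeaseAppliedIndex) mirrors the invariant linked in the PR description.

```go
// Toy model of the lease-applied-index invariant; none of these types exist
// in CockroachDB, they only illustrate the scenario described above.
package main

import "fmt"

type entry struct {
	proposalID    string
	maxLeaseIndex uint64
}

type replica struct {
	leaseAppliedIndex uint64
	applied           map[string]int // proposalID -> number of times applied
}

// applyBatch applies an entry only if its max lease index is strictly greater
// than the replica's current LeaseAppliedIndex; otherwise the entry is
// rejected as stale (the invariant from replica_raft.go).
func (r *replica) applyBatch(batch []entry) {
	for _, e := range batch {
		if e.maxLeaseIndex <= r.leaseAppliedIndex {
			fmt.Printf("rejected %s at MLI %d (LAI is already %d)\n",
				e.proposalID, e.maxLeaseIndex, r.leaseAppliedIndex)
			continue
		}
		r.leaseAppliedIndex = e.maxLeaseIndex
		r.applied[e.proposalID]++
		fmt.Printf("applied %s at MLI %d\n", e.proposalID, e.maxLeaseIndex)
	}
}

func main() {
	r := &replica{leaseAppliedIndex: 11, applied: map[string]int{}}

	// Proposal p1 was reproposed, so two log entries exist for it: a stale one
	// at MLI 11 and a fresh one at MLI 12. Both land in the same batch: the
	// stale entry is rejected, the fresh one applies and bumps the LAI to 12.
	r.applyBatch([]entry{
		{proposalID: "p1", maxLeaseIndex: 11},
		{proposalID: "p1", maxLeaseIndex: 12},
	})

	// The bug: handling the rejection reproposed p1 yet again at a new MLI 13,
	// even though p1 had already applied in the same batch. The new entry
	// passes the check and applies a second time.
	r.applyBatch([]entry{{proposalID: "p1", maxLeaseIndex: 13}})

	fmt.Println("p1 applied", r.applied["p1"], "times") // 2 — duplicate application
}
```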
@nvanbenschoten
Member Author

nvanbenschoten commented Aug 2, 2019

Roachtests that failed during signoff:

  • tpccbench/nodes=9/cpu=4/chaos/partition
  • clearrange/checks=false
  • clearrange/checks=true
  • clock/jump/large_backward_enabled
  • jepsen/register/split

@nvanbenschoten
Member Author

tpccbench/nodes=9/cpu=4/chaos/partition failed when installing haproxy due to an apt-get failure.

Fetched 26.3 MB in 5s (5,069 kB/s)
Reading package lists...
E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/dists/xenial/universe/binary-amd64/by-hash/SHA256/c8bc8c1425fef0712430d3991cf15ea96892aa5e13130b36c126fa90887ca756  Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.
Error:  exit status 100
06:37:56 test.go:269: test failure: 	tpcc.go:693,tpcc.go:559,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod install teamcity-1564706298-09-n10cpu4:10 -- haproxy returned:
		stderr:
		
		stdout:
		iverse Translation-en [8,440 B]
		Get:38 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main Sources [4,848 B]
		Get:39 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe Sources [6,740 B]
		Get:40 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [7,280 B]
		Get:41 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main Translation-en [4,456 B]
		Get:42 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7,804 B]
		Get:43 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe Translation-en [4,184 B]
		Fetched 26.3 MB in 5s (5,069 kB/s)
		Reading package lists...
		E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/dists/xenial/universe/binary-amd64/by-hash/SHA256/c8bc8c1425fef0712430d3991cf15ea96892aa5e13130b36c126fa90887ca756  Hash Sum mismatch
		E: Some index files failed to download. They have been ignored, or old ones used instead.
		Error:  exit status 100
		: exit status 1
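
The `Hash Sum mismatch` is a transient problem with the Ubuntu mirror's metadata rather than anything in the candidate build, so retrying `apt-get update` (optionally after clearing apt's local cache) normally clears it. A rough sketch of that retry is below; it is not roachprod's actual install code, and the helper name and retry policy are assumptions.

```go
// aptUpdateWithRetry is a hypothetical helper (not roachprod code) that
// retries `apt-get update` when the transient "Hash Sum mismatch" error shows
// up in its output.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

func aptUpdateWithRetry(attempts int) error {
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("sudo", "apt-get", "update").CombinedOutput()
		if err == nil && !strings.Contains(string(out), "Hash Sum mismatch") {
			return nil
		}
		log.Printf("apt-get update attempt %d failed: %v", i+1, err)
		// Clear apt's local archive cache and wait a bit before retrying; the
		// mirror's by-hash metadata usually converges within seconds.
		_ = exec.Command("sudo", "apt-get", "clean").Run()
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("apt-get update: still failing after %d attempts", attempts)
}

func main() {
	if err := aptUpdateWithRetry(3); err != nil {
		log.Fatal(err)
	}
	// With fresh package lists, the install that failed above should succeed.
	if out, err := exec.Command("sudo", "apt-get", "install", "-y", "haproxy").CombinedOutput(); err != nil {
		log.Fatalf("install haproxy: %v\n%s", err, out)
	}
	fmt.Println("haproxy installed")
}
```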

@nvanbenschoten
Member Author

clearrange/checks=true failed due to a node hitting:

F190802 10:30:59.381932 105 storage/replica_raft.go:801  [n1,s1,r17332/4:/Table/53/1/4767{4005-6941}] while committing batch: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/029340.log: No space left on device

clearrange/checks=false timed out:

11:14:31 test.go:199: test worker status: dropping table
11:14:32 test.go:199: test worker status: computing number of ranges
1: 14929
8: 29732
3: 24382
10: 16741
6: 14114
2: 1042
4: 20558
7: 24278
9: 4540
5: 23522
11:14:32 test.go:199: test worker status: dropping bank table
11:47:43 test.go:290: test failure: 	test_runner.go:706: test timed out (6h30m0s)

@ajkr you've been looking into these tests in #38772 and #38720. Could you make a determination about whether these are concerning or expected, and whether they should block the release from going out?

@nvanbenschoten
Member Author

nvanbenschoten commented Aug 2, 2019

clock/jump/large_backward_enabled is what we're seeing in #38723:

F190802 05:43:58.596975 167 util/hlc/hlc.go:241  detected forward time jump of 0.598804 seconds is not allowed with tolerance of 0.250000 seconds

@tbg could you make a determination about whether this is concerning and whether it should block the release from going out?
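
For context, the fatal message comes from the hlc package's forward clock jump detection: a background monitor samples the wall clock on a short interval and crashes the process if consecutive samples jump by more than the allowed tolerance. The sketch below is a simplified standalone version of that idea, not the util/hlc implementation; the tick interval is an assumption, and only the 250ms tolerance matches the log line.

```go
// Simplified illustration of forward clock jump detection; not the actual
// util/hlc code.
package main

import (
	"log"
	"time"
)

func monitorForwardClockJumps(interval, tolerance time.Duration) {
	prev := time.Now()
	for range time.Tick(interval) {
		now := time.Now()
		// If the wall clock advanced by much more than one tick interval,
		// something external (an NTP step, a paused VM, a test leaking clock
		// offsets onto the host) jumped the clock forward.
		if jump := now.Sub(prev) - interval; jump > tolerance {
			log.Fatalf("detected forward time jump of %f seconds is not allowed with tolerance of %f seconds",
				jump.Seconds(), tolerance.Seconds())
		}
		prev = now
	}
}

func main() {
	go monitorForwardClockJumps(100*time.Millisecond, 250*time.Millisecond)
	select {} // run the monitor until the process is killed
}
```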

@nvanbenschoten
Member Author

jepsen/register/split crashed with the error:

10:37:18 jepsen.go:227: grabbing artifacts from controller. Tail of controller log:
10:37:18 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564706298-29-n6cpu4:6 -- tail -n 100 /mnt/data1/jepsen/cockroachdb/invoke.log
	at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.8.0.jar:na]
	at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.8.0.jar:na]
	at clojure.core$apply.invokeStatic(core.clj:650) [clojure-1.8.0.jar:na]
	at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) [clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.invoke(RestFn.java:397) [clojure-1.8.0.jar:na]
	at clojure.lang.AFn.run(AFn.java:22) [clojure-1.8.0.jar:na]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
...
ERROR [2019-08-02 10:37:15,724] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.lang.InterruptedException: null
	at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:211) ~[na:1.8.0_222]
	at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:362) ~[na:1.8.0_222]
	at jepsen.generator.Synchronize.op(generator.clj:667) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.generator.Concat.op(generator.clj:609) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.generator$op_and_validate.invokeStatic(generator.clj:34) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.generator$op_and_validate.invoke(generator.clj:30) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core.NemesisWorker.run_worker_BANG_(core.clj:442) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core$do_worker_BANG_.invokeStatic(core.clj:175) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core$do_worker_BANG_.invoke(core.clj:162) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.core$run_workers_BANG_$fn__4599$fn__4600.invoke(core.clj:228) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at clojure.lang.AFn.applyToHelper(AFn.java:152) ~[clojure-1.8.0.jar:na]
	at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.8.0.jar:na]
	at clojure.core$apply.invokeStatic(core.clj:646) ~[clojure-1.8.0.jar:na]
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) ~[clojure-1.8.0.jar:na]
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) ~[clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.invoke(RestFn.java:425) ~[clojure-1.8.0.jar:na]
	at clojure.lang.AFn.applyToHelper(AFn.java:156) ~[clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.applyTo(RestFn.java:132) ~[clojure-1.8.0.jar:na]
	at clojure.core$apply.invokeStatic(core.clj:650) ~[clojure-1.8.0.jar:na]
	at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) ~[clojure-1.8.0.jar:na]
	at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.8.0.jar:na]
	at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.8.0.jar:na]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_222]

@bdarnell could you make a determination about whether this is concerning and whether it should block the release from going out?

@ajkr
Contributor

ajkr commented Aug 5, 2019

> @ajkr you've been looking into these tests in #38772 and #38720. Could you make a determination about whether these are concerning or expected, and whether they should block the release from going out?

Those are expected. We're not yet able to import data as fast as that test wants.

@tbg
Member

tbg commented Aug 5, 2019

@nvanbenschoten the jepsen failure is #30527

I'm relatively sure that #38723 will be something silly related to the roachtest rewrite. I believe that we'll sometimes run more than one of these tests on a cluster (sequentially) and that the contained ntp adjustments can leak between tests.

@bdarnell
Contributor

bdarnell commented Aug 5, 2019

The jepsen failure is benign. It failed with an InterruptedException (which doesn't fail the build), then it failed downloading logs. We should probably check the ignoreError flag if we fail while building failure-logs.tbz.
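
A minimal sketch of what that ignoreError check could look like is below. Only the flag name and the failure-logs.tbz artifact come from the comment above; the function, its signature, and the tar command are assumptions, not the actual roachtest jepsen harness code.

```go
package main

import (
	"context"
	"errors"
	"log"
)

// collectFailureLogs tars up the Jepsen artifacts. If the run is already in a
// state where errors are being ignored (e.g. the benign InterruptedException
// case), a failure while building failure-logs.tbz is logged and swallowed
// instead of failing the test.
func collectFailureLogs(ctx context.Context, run func(context.Context, string) error, ignoreError bool) error {
	err := run(ctx, "tar -cjf failure-logs.tbz /mnt/data1/jepsen/cockroachdb")
	if err != nil && ignoreError {
		log.Printf("ignoring error while building failure-logs.tbz: %v", err)
		return nil
	}
	return err
}

func main() {
	// Stand-in for the remote command runner; always fails, like the log grab above.
	flaky := func(context.Context, string) error { return errors.New("exit status 1") }
	if err := collectFailureLogs(context.Background(), flaky, true /* ignoreError */); err != nil {
		log.Fatal(err)
	}
	log.Print("artifact collection failed but the test result is preserved")
}
```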

nvanbenschoten changed the title from release: v19.2.0-alpha.20190729 to release: v19.2.0-alpha.20190805 on Aug 5, 2019
@nvanbenschoten
Member Author

I changed the release name while tagging the release. This forced me to re-publish provisional binaries. I'm re-running the release qualification build on these binaries here.

@nvanbenschoten
Member Author

The release is out the door.

@kenliu

kenliu commented Aug 6, 2019

thanks @nvanbenschoten
