circleci: failed tests: TestReplicaRemovalCampaign #7386

Closed
tbg opened this issue Jun 22, 2016 · 8 comments

tbg commented Jun 22, 2016

The following test appears to have failed:

#19440:

I160622 13:08:51.134679 storage/replica_raftstorage.go:441  generated snapshot for range 1 at index 14 in 857.002µs. encoded size=3558, 31 KV pairs, 4 log entries
I160622 13:08:51.136348 storage/replica_raftstorage.go:533  replica 2 received snapshot for range 1 at index 14. encoded size=3558, 31 KV pairs, 4 log entries
I160622 13:08:51.137428 storage/replica_raftstorage.go:535  replica 2 applied snapshot for range 1 in 988.607µs
I160622 13:08:51.139488 storage/raft_transport.go:215  successfully sent MsgSnap to {2 2 2} at 127.0.0.1:33563
I160622 13:08:51.207575 storage/replica_command.go:2127  initiating a split of range=1 [/Min-/Max) at key "m"
panic: requested entry at index is unavailable

goroutine 8263 [running]:
panic(0x1cdca20, 0xc82017c990)
    /usr/local/go/src/runtime/panic.go:481 +0x3ff
github.com/coreos/etcd/raft.(*raftLog).term(0xc82088b490, 0xa, 0x6181f4, 0x0, 0x0)
    /go/src/github.com/coreos/etcd/raft/log.go:238 +0x1fe
github.com/coreos/etcd/raft.(*raftLog).lastTerm(0xc82088b490, 0xa)
    /go/src/github.com/coreos/etcd/raft/log.go:212 +0x57
github.com/coreos/etcd/raft.newRaft(0xc8216317f0, 0x0)
    /go/src/github.com/coreos/etcd/raft/raft.go:238 +0x114b
github.com/coreos/etcd/raft.NewRawNode(0xc8216317f0, 0x0, 0x0, 0x0, 0x7f6f203d00a8, 0x0, 0x0)
    /go/src/github.com/coreos/etcd/raft/rawnode.go:76 +0xe1
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroupLocked(0xc8204cac40, 0xc821631998, 0x0, 0x0)
--
    /go/src/github.com/cockroachdb/cockroach/storage/replica_range_lease.go:118 +0x2c4
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask.func1(0xc8205782a0, 0x29eadb6, 0x1e, 0x99, 0xc8208f1d60)
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:181 +0x74
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:182 +0xdc
FAIL    github.com/cockroachdb/cockroach/storage    17.549s
=== RUN   TestBatchBasics
I160622 13:09:06.512914 storage/engine/rocksdb.go:143  opening in memory rocksdb instance
I160622 13:09:06.515887 storage/engine/rocksdb.go:186  closing in-memory rocksdb instance
--- PASS: TestBatchBasics (0.02s)
=== RUN   TestBatchRepr
I160622 13:09:06.536341 storage/engine/rocksdb.go:143  opening in memory rocksdb instance
I160622 13:09:06.769199 storage/engine/rocksdb.go:186  closing in-memory rocksdb instance
--- PASS: TestBatchRepr (0.26s)
=== RUN   TestBatchGet
I160622 13:09:06.791121 storage/engine/rocksdb.go:143  opening in memory rocksdb instance

Please assign, take a look and update the issue accordingly.

tbg added this to the Q2 milestone Jun 22, 2016
tbg added the C-test-failure (Broken test, automatically or manually discovered) and O-robot (Originated from a bot) labels Jun 22, 2016

petermattis commented Jun 22, 2016

This is easily reproducible with:

make stress PKG=./storage TESTS=TestReplicaRemovalCampaign


tamird commented Jun 22, 2016

This was introduced in f447eb0; it looks like we're tickling something new here.

petermattis commented Jun 22, 2016

I'll revert that PR until we figure this out.


tbg commented Jun 22, 2016

Looking.


tbg commented Jun 22, 2016

First results:

In the test, we call store0.RemoveReplica(replica2, *replica2.Desc(), true) where replica2 is the right-hand side of the split. RemoveReplica deletes all of the Replica's data, including the Raft state, leaving only a tombstone behind.

But replica2 doesn't know about this, and its Raft group is active (tickled by the previously triggered campaign, or otherwise by the replicate queue getting a leader lease), so essentially the next operation kills it.

RemoveReplica attempts to wind down the replica by calling its .Destroy() method, but that isn't enough: Destroy() mostly just cancels all pending commands to unblock waiting clients. What it really needs to do is to ensure that it won't attempt to use Raft any more.

I'll look a bit more into why this crashes our beta cluster even after the revert, with my starting point being that we destroy replica data on startup (store.Start()), though that may not be related.
Once I understand this better, I'll whip up a fix.
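
To make that direction concrete, here is a minimal, self-contained sketch of the idea. This is not the actual cockroach code: the destroyed field, Destroy, and withRaftGroup below are simplified stand-ins assumed only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Replica is a toy stand-in for storage.Replica with the proposed marker.
type Replica struct {
	mu        sync.Mutex
	destroyed error // non-nil once Destroy has run
}

// Destroy would cancel pending commands (elided here) and additionally record
// that the replica's data is gone, so no further Raft work is attempted.
func (r *Replica) Destroy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.destroyed = errors.New("replica removed from store")
}

// withRaftGroup stands in for any entry point that would lazily create or use
// the Raft group; it refuses to proceed once the replica has been destroyed.
func (r *Replica) withRaftGroup(f func() error) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed != nil {
		return r.destroyed
	}
	return f() // would create the RawNode, campaign, propose, etc.
}

func main() {
	r := &Replica{}
	r.Destroy()
	err := r.withRaftGroup(func() error { return nil })
	fmt.Println(err) // "replica removed from store" instead of a panic
}
```

The point is simply that once the on-disk state is gone, every path that would lazily recreate the Raft group has to see the marker and bail out.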

petermattis commented Jun 22, 2016

When we call Replica.Destroy(), should we mark the replica as destroyed so that it can't be used? That seems to be what you're suggesting. Sounds like the replicateQueue is operating on a replica that has been destroyed.


tbg commented Jun 22, 2016

Yes, that's the above conclusion. I want to understand the production crash better, though.


petermattis added a commit to petermattis/cockroach that referenced this issue Jun 22, 2016
Mark Replicas as destroyed in `Replica.Destroy` and do not perform any
further raft operations after destruction.

This addresses the underlying bug in cockroachdb#7386 that was exacerbated by cockroachdb#7355.

The specific sequence of events in cockroachdb#7386 was that a replica was queued
for replication and the test then removed the replica from the
store. Removal of a replica from a store removes it from the various
queues, though that is done asynchronously. The replica queue merrily
tries to process the now destroyed replica and as a first step tries to
acquire the leader lease. This leader lease acquisition is done via a
raft command, but the state underlying the replica has already been
deleted. Boom!
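
For illustration, here is a self-contained sketch of that sequence of events, with all names hypothetical rather than the real storage APIs: the replica sits in a processing queue, the store removes it (marking it destroyed), and the queue's later lease-acquisition attempt now fails cleanly instead of panicking on deleted Raft state.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDestroyed = errors.New("replica has been destroyed")

// replica and store are toy stand-ins; none of these names are the real API.
type replica struct {
	mu        sync.Mutex
	destroyed bool
}

type store struct{ queue []*replica }

// RemoveReplica deletes the replica's data (elided) and marks it destroyed,
// but it does not synchronously pull it out of the processing queue.
func (s *store) RemoveReplica(r *replica) {
	r.mu.Lock()
	r.destroyed = true
	r.mu.Unlock()
}

// requestLease stands in for the Raft-backed lease acquisition the queue
// performs; with the destroyed marker it returns an error rather than
// touching Raft state that no longer exists.
func (r *replica) requestLease() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed {
		return errDestroyed
	}
	return nil
}

func main() {
	r := &replica{}
	s := &store{queue: []*replica{r}} // 1. replica is queued for replication
	s.RemoveReplica(r)                // 2. the test removes the replica from the store
	for _, qr := range s.queue {      // 3. the queue still processes it afterwards
		if err := qr.requestLease(); err != nil {
			fmt.Println("skipping replica:", err)
			continue
		}
		// ...would acquire the lease and replicate here...
	}
}
```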

tamird commented Jun 23, 2016

Fixed in #7404.

tamird closed this as completed Jun 23, 2016