storage: TestStoreRangeDownReplicate timeout due to stuck Raft command #9673

Closed
tbg opened this issue Oct 3, 2016 · 10 comments
Labels
C-test-failure Broken test (automatically or manually discovered).

Comments

@tbg
Member

tbg commented Oct 3, 2016

seen in #9672: https://teamcity.cockroachdb.com/viewLog.html?buildId=25993&buildTypeId=Cockroach_UnitTests

It appears that a Raft command made it to a previously unreplicated Replica and is now stuck in Raft. Ideally this Replica would be eagerly GC'ed once its peers tell it that it's stale, but that doesn't appear to be happening here (or my guess is wrong).
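
For context, the eager-GC idea above is that a replica which no longer appears in its peers' newer range descriptor should queue itself for garbage collection rather than sit on proposals that can never apply. A minimal, hypothetical Go sketch of that check follows; the names (replica, rangeDescriptor, gcQueue, maybeEnqueue) are illustrative only and are not CockroachDB's actual types or API.

package main

import "fmt"

type replicaID int

// rangeDescriptor stands in for the authoritative membership of a range.
type rangeDescriptor struct {
	replicas map[replicaID]bool
}

type replica struct {
	id replicaID
}

type gcQueue struct {
	pending []*replica
}

// maybeEnqueue queues the replica for GC if a peer's newer view of the range
// descriptor no longer lists it, i.e. the replica is stale.
func (q *gcQueue) maybeEnqueue(r *replica, peerDesc rangeDescriptor) {
	if !peerDesc.replicas[r.id] {
		q.pending = append(q.pending, r)
	}
}

func main() {
	r := &replica{id: 4}
	// A peer's newer descriptor no longer contains replica 4.
	newer := rangeDescriptor{replicas: map[replicaID]bool{1: true, 2: true, 3: true}}
	q := &gcQueue{}
	q.maybeEnqueue(r, newer)
	fmt.Println("replicas queued for GC:", len(q.pending)) // prints 1
}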

tbg added the C-test-failure label on Oct 3, 2016
@jordanlewis
Member

Reproduced this (or something similar) via make stress:

6163 runs so far, 0 failures, over 2m0s

I161004 15:45:17.836507 1 rand.go:76  Random seed: -6681371670968794003
I161004 15:45:17.851255 36 storage/engine/rocksdb.go:360  opening in memory rocksdb instance
I161004 15:45:18.113884 36 gossip/gossip.go:234  initial resolvers: []
W161004 15:45:18.113918 36 gossip/gossip.go:1045  no resolvers found; use --join to specify a connected node
I161004 15:45:18.114041 36 gossip/gossip.go:268  NodeID set to 1
I161004 15:45:18.160598 36 gossip/gossip.go:268  NodeID set to 1
I161004 15:45:18.174002 36 storage/store.go:1019  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161004 15:45:18.174127 36 gossip/gossip.go:282  NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:39413" > attrs:<>
I161004 15:45:18.206337 94 storage/replica_trigger.go:252  [s1,r1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=1970-01-01 00:00:00 +0000 UTC]
I161004 15:45:18.206997 36 storage/engine/rocksdb.go:360  opening in memory rocksdb instance
I161004 15:45:18.207880 36 gossip/gossip.go:234  initial resolvers: [127.0.0.1:39413]
W161004 15:45:18.207910 36 gossip/gossip.go:1047  no incoming or outgoing connections
I161004 15:45:18.207952 36 gossip/gossip.go:268  NodeID set to 2
I161004 15:45:18.213776 36 gossip/gossip.go:268  NodeID set to 2
I161004 15:45:18.214067 36 storage/store.go:1019  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I161004 15:45:18.214126 36 gossip/gossip.go:282  NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:44302" > attrs:<>
I161004 15:45:18.228056 36 storage/engine/rocksdb.go:360  opening in memory rocksdb instance
I161004 15:45:18.240981 36 gossip/gossip.go:234  initial resolvers: [127.0.0.1:39413]
W161004 15:45:18.241009 36 gossip/gossip.go:1047  no incoming or outgoing connections
I161004 15:45:18.241069 36 gossip/gossip.go:268  NodeID set to 3
I161004 15:45:18.244199 36 gossip/gossip.go:268  NodeID set to 3
I161004 15:45:18.244636 44 gossip/client.go:125  node 3: started gossip client to 127.0.0.1:39413
I161004 15:45:18.247086 226 gossip/client.go:125  node 2: started gossip client to 127.0.0.1:39413
I161004 15:45:18.247212 36 storage/store.go:1019  [n3,s3]: failed initial metrics computation: [n3,s3]: system config not yet available
I161004 15:45:18.247244 36 gossip/gossip.go:282  NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:35135" > attrs:<>
I161004 15:45:18.249763 36 storage/engine/rocksdb.go:360  opening in memory rocksdb instance
I161004 15:45:18.250359 36 gossip/gossip.go:234  initial resolvers: [127.0.0.1:39413]
W161004 15:45:18.250373 36 gossip/gossip.go:1047  no incoming or outgoing connections
I161004 15:45:18.250401 36 gossip/gossip.go:268  NodeID set to 4
I161004 15:45:18.262821 36 gossip/gossip.go:268  NodeID set to 4
I161004 15:45:18.267082 418 gossip/client.go:125  node 4: started gossip client to 127.0.0.1:39413
I161004 15:45:18.267998 36 storage/store.go:1019  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I161004 15:45:18.268108 36 gossip/gossip.go:282  NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:45566" > attrs:<>
I161004 15:45:18.269452 36 storage/engine/rocksdb.go:360  opening in memory rocksdb instance
I161004 15:45:18.317700 36 gossip/gossip.go:234  initial resolvers: [127.0.0.1:39413]
W161004 15:45:18.317724 36 gossip/gossip.go:1047  no incoming or outgoing connections
I161004 15:45:18.317786 36 gossip/gossip.go:268  NodeID set to 5
I161004 15:45:18.318652 36 gossip/gossip.go:268  NodeID set to 5
I161004 15:45:18.318793 36 storage/store.go:1019  [n5,s5]: failed initial metrics computation: [n5,s5]: system config not yet available
I161004 15:45:18.318829 36 gossip/gossip.go:282  NodeDescriptor set to node_id:5 address:<network_field:"tcp" address_field:"127.0.0.1:36531" > attrs:<>
I161004 15:45:18.319363 545 gossip/client.go:125  node 5: started gossip client to 127.0.0.1:39413
I161004 15:45:18.319647 563 gossip/server.go:254  node 1: refusing gossip from node 5 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:44302})
I161004 15:45:18.320041 545 gossip/client.go:130  node 5: closing client to node 1 (127.0.0.1:39413): received forward from node 1 to 2 (127.0.0.1:44302)
I161004 15:45:18.323002 352 gossip/client.go:125  node 5: started gossip client to 127.0.0.1:44302
I161004 15:45:18.331942 36 storage/client_test.go:410  gossip network initialized
I161004 15:45:18.332539 36 storage/replica_raftstorage.go:443  [s1,r1:/M{in-ax}] generated snapshot 21df4c08 for range 1 at index 12 in 50.994µs.
I161004 15:45:18.333299 36 storage/store.go:2726  streamed snapshot: kv pairs: 31, log entries: 2
I161004 15:45:18.344082 363 storage/replica_raftstorage.go:573  [s2] [n2,s2,r1:{-}]: with replicaID [?], applying preemptive snapshot at index 12 (id=21df4c08, encoded size=16, 1 rocksdb batches, 2 log entries)
I161004 15:45:18.349886 363 storage/replica_raftstorage.go:576  [s2] [n2,s2,r1:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.003s
I161004 15:45:18.350860 36 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2
I161004 15:45:18.352083 601 storage/replica.go:1712  [s1,r1:/M{in-ax}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161004 15:45:18.353200 36 storage/replica_raftstorage.go:443  [s1,r1:/M{in-ax}] generated snapshot f23c0f8a for range 1 at index 14 in 56.038µs.
I161004 15:45:18.356163 36 storage/store.go:2726  streamed snapshot: kv pairs: 34, log entries: 4
I161004 15:45:18.356413 627 storage/replica_raftstorage.go:573  [s3] [n3,s3,r1:{-}]: with replicaID [?], applying preemptive snapshot at index 14 (id=f23c0f8a, encoded size=16, 1 rocksdb batches, 4 log entries)
I161004 15:45:18.357157 627 storage/replica_raftstorage.go:576  [s3] [n3,s3,r1:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.001s
I161004 15:45:18.370843 36 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3
I161004 15:45:18.375865 381 storage/raft_transport.go:418  raft transport stream to node 1 established
I161004 15:45:18.378177 658 storage/replica.go:1712  [s1,r1:/M{in-ax}] proposing ADD_REPLICA {NodeID:3 StoreID:3 ReplicaID:3} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161004 15:45:18.381126 675 storage/raft_transport.go:418  raft transport stream to node 1 established
I161004 15:45:18.382733 36 storage/replica_command.go:2328  initiating a split of this range at key "m" [r2]
I161004 15:45:18.388287 36 storage/replica_raftstorage.go:443  [s1,r2:{"m"-/Max}] generated snapshot 1d8653cd for range 2 at index 10 in 76.657µs.
I161004 15:45:18.388824 36 storage/store.go:2726  streamed snapshot: kv pairs: 28, log entries: 0
I161004 15:45:18.389103 696 storage/replica_raftstorage.go:573  [s4] [n4,s4,r2:{-}]: with replicaID [?], applying preemptive snapshot at index 10 (id=1d8653cd, encoded size=16, 1 rocksdb batches, 0 log entries)
I161004 15:45:18.389888 696 storage/replica_raftstorage.go:576  [s4] [n4,s4,r2:{"m"-/Max}]: with replicaID [?], applied preemptive snapshot in 0.001s
I161004 15:45:18.390401 36 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > next_replica_id:4
I161004 15:45:43.574997 11 gossip/gossip.go:1061  node has connected to cluster via gossip
I161004 15:45:46.738970 276 gossip/gossip.go:1061  node has connected to cluster via gossip
I161004 15:45:51.842602 487 gossip/gossip.go:1061  node has connected to cluster via gossip
I161004 15:45:52.919450 392 gossip/gossip.go:1061  node has connected to cluster via gossip
I161004 15:45:54.458350 164 gossip/gossip.go:1061  node has connected to cluster via gossip
panic: test timed out after 2m0s

goroutine 919 [running]:
panic(0x16157c0, 0xc420968e50)
    /home/jordan/go/src/runtime/panic.go:500 +0x1a1
testing.startAlarm.func1()
    /home/jordan/go/src/testing/testing.go:918 +0x10b
created by time.goFunc
    /home/jordan/go/src/time/sleep.go:154 +0x44

goroutine 1 [chan receive, 1 minutes]:
testing.(*T).Run(0xc420308240, 0x17e94c3, 0x1b, 0x1986188, 0xc420289a01)
    /home/jordan/go/src/testing/testing.go:647 +0x316
testing.RunTests.func1(0xc420308240)
    /home/jordan/go/src/testing/testing.go:793 +0x6d
testing.tRunner(0xc420308240, 0xc42067bbc0)
    /home/jordan/go/src/testing/testing.go:610 +0x81
testing.RunTests(0x1987660, 0x21f39e0, 0x14a, 0x14a, 0xc4201654f0)
    /home/jordan/go/src/testing/testing.go:799 +0x2f5
testing.(*M).Run(0xc42067bee8, 0xc420289d30)
    /home/jordan/go/src/testing/testing.go:743 +0x85
github.com/cockroachdb/cockroach/storage_test.TestMain(0xc42067bee8)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/main_test.go:57 +0x1ff
main.main()
    github.com/cockroachdb/cockroach/storage/_test/_testmain.go:722 +0xc6

goroutine 17 [syscall, 1 minutes, locked to thread]:
runtime.goexit()
    /home/jordan/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 50 [chan receive]:
github.com/cockroachdb/cockroach/util/log.(*loggingT).flushDaemon(0x2555ac0)
    /home/jordan/src/github.com/cockroachdb/cockroach/util/log/clog.go:998 +0x77
created by github.com/cockroachdb/cockroach/util/log.init.1
    /home/jordan/src/github.com/cockroachdb/cockroach/util/log/clog.go:579 +0x95

goroutine 6 [select, 1 minutes]:
google.golang.org/grpc/transport.(*http2Client).controller(0xc42015a400)
    /home/jordan/src/google.golang.org/grpc/transport/http2_client.go:999 +0x59d
created by google.golang.org/grpc/transport.newHTTP2Client
    /home/jordan/src/google.golang.org/grpc/transport/http2_client.go:233 +0xe78

goroutine 36 [select]:
github.com/cockroachdb/cockroach/kv.(*DistSender).sendToReplicas(0xc420304480, 0x7fbfa3f54340, 0xc4204f1440, 0x1dcd6500, 0xc420324580, 0x1, 0xc4206a14f0, 0x3, 0x3, 0x0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:992 +0xc02
github.com/cockroachdb/cockroach/kv.(*DistSender).sendRPC(0xc420304480, 0x7fbfa3f54340, 0xc4204f1440, 0x1, 0xc4206a14f0, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:367 +0x161
github.com/cockroachdb/cockroach/kv.(*DistSender).sendSingleRange(0xc420304480, 0x7fbfa3f54340, 0xc4204f1440, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4202af800, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:457 +0x148
github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk.func1(0xc420ab1b10, 0xc420ab19c0, 0xc420ab1f88, 0xc420304480, 0xc420ab1f78, 0xd, 0xc420165fb3)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:732 +0x18e
github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk(0xc420304480, 0x7fbfa3f54340, 0xc4204f1440, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4202af800, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:733 +0x43d
github.com/cockroachdb/cockroach/kv.(*DistSender).Send(0xc420304480, 0x7fbfa3f54340, 0xc4204f1440, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4202af740, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:570 +0x1d6
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send(0xc4201ba420, 0x7fbfa3f54340, 0xc4204f1440, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc420b6f228, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:407 +0x7e1
github.com/cockroachdb/cockroach/internal/client.(*Txn).sendInternal(0xc420b6f1e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc420b6f228, 0x0, 0x0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/txn.go:572 +0xfd
github.com/cockroachdb/cockroach/internal/client.(*Txn).send(0xc420b6f1e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/txn.go:696 +0x498
github.com/cockroachdb/cockroach/internal/client.(*Txn).(github.com/cockroachdb/cockroach/internal/client.send)-fm(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/txn.go:309 +0x58
github.com/cockroachdb/cockroach/internal/client.sendAndFill(0xc420ab2f28, 0xc42093f200, 0x0, 0xc4204f1320)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/db.go:416 +0x168
github.com/cockroachdb/cockroach/internal/client.(*Txn).Run(0xc420b6f1e0, 0xc42093f200, 0x0, 0x0)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/txn.go:309 +0xc6
github.com/cockroachdb/cockroach/storage.(*Replica).ChangeReplicas.func1(0xc420b6f1e0, 0x0, 0x1)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/replica_command.go:3248 +0x30c
github.com/cockroachdb/cockroach/internal/client.(*DB).Txn.func1(0xc420b6f1e0, 0xc420165fd0, 0x4000000000000000, 0x0)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/db.go:466 +0x2f
github.com/cockroachdb/cockroach/internal/client.(*Txn).Exec(0xc420b6f1e0, 0xc420160101, 0x0, 0xc420165fc0, 0x0, 0x0)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/txn.go:507 +0x215
github.com/cockroachdb/cockroach/internal/client.(*DB).Txn(0xc420161ef0, 0x7fbfa3f0b1e8, 0xc4200a63f8, 0xc4201bff20, 0xa, 0x10)
    /home/jordan/src/github.com/cockroachdb/cockroach/internal/client/db.go:467 +0x1fa
github.com/cockroachdb/cockroach/storage.(*Replica).ChangeReplicas(0xc42021ce00, 0x7fbfa3f0b1e8, 0xc4200a63f8, 0x400000000, 0x400000004, 0xc4201be600, 0x0, 0x0)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/replica_command.go:3290 +0xb83
github.com/cockroachdb/cockroach/storage_test.(*multiTestContext).replicateRange(0xc42053a9c0, 0x2, 0xc420ab3c48, 0x2, 0x2)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/client_test.go:892 +0x3cf
github.com/cockroachdb/cockroach/storage_test.TestStoreRangeDownReplicate(0xc420308300)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/client_raft_test.go:995 +0x55a
testing.tRunner(0xc420308300, 0x1986188)
    /home/jordan/go/src/testing/testing.go:610 +0x81
created by testing.(*T).Run
    /home/jordan/go/src/testing/testing.go:646 +0x2ec

@tbg
Member Author

tbg commented Oct 4, 2016

Seems likely, though the more interesting goroutine would be the one stuck in addWriteCmd (the one you see stuck is the coordinator waiting for the node's response). You didn't post the whole dump, but I'm sure it's in there.

@tbg
Member Author

tbg commented Oct 4, 2016

This may also be relevant for/connected to #9624

@jordanlewis
Member

Here's addWriteCmd:

goroutine 698 [select, 1 minutes]:
github.com/cockroachdb/cockroach/storage.(*Replica).addWriteCmd(0xc42021ce00, 0x7fbfa3f54340, 0xc4204f1680, 0x0, 0xb5, 0x100000001, 0x1, 0x2, 0x0, 0xc4202af8c0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/replica.go:1557 +0x513
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc42021ce00, 0x7fbfa3f54340, 0xc4204f1680, 0x0, 0xb5, 0x100000001, 0x1, 0x2, 0x0, 0xc4202af8c0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/replica.go:1075 +0x277
github.com/cockroachdb/cockroach/storage.(*Store).Send(0xc4202a2000, 0x7fbfa3f54340, 0xc4204f1620, 0x0, 0xb5, 0x100000001, 0x1, 0x2, 0x0, 0xc4202af8c0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/store.go:2099 +0x752
github.com/cockroachdb/cockroach/storage.(*Stores).Send(0xc4200af8b0, 0x7fbfa3f54300, 0xc42051a5c0, 0x0, 0x0, 0x100000001, 0x1, 0x2, 0x0, 0xc4202af8c0, ...)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/stores.go:186 +0x202
github.com/cockroachdb/cockroach/storage_test.(*multiTestContextKVTransport).SendNext.func1(0x7fbfa3f54300, 0xc42051a5c0)
    /home/jordan/src/github.com/cockroachdb/cockroach/storage/client_test.go:467 +0x136
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask.func1(0xc4203042d0, 0x1f0f156, 0x16, 0x1f1, 0x1986638, 0xc4209409c0, 0x7fbfa3f54300, 0xc42051a5c0)
    /home/jordan/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:239 +0xc6
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask
    /home/jordan/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:240 +0x227
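
The select at replica.go:1557 in the trace above is the proposing goroutine parked until the Raft apply loop signals that its command has applied. Below is a minimal sketch of that propose-then-wait pattern, using hypothetical names rather than the actual Replica code: if the command never commits (for instance because the replica has quietly become stale), the wait blocks until something cancels it, which is exactly the hang seen here.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type proposal struct {
	command string
	done    chan error // filled by the apply loop once the command applies
}

func propose(proposals chan<- *proposal, cmd string) *proposal {
	p := &proposal{command: cmd, done: make(chan error, 1)}
	proposals <- p
	return p
}

// waitForApply blocks until the proposal is applied or the context gives up.
// Without a deadline this is exactly the indefinite wait in the dump above.
func waitForApply(ctx context.Context, p *proposal) error {
	select {
	case err := <-p.done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	proposals := make(chan *proposal, 1)
	p := propose(proposals, "put k=v")

	// Simulate an apply loop that never applies the command (stale replica):
	// the only way out is the context deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := waitForApply(ctx, p)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // prints true
}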

tbg self-assigned this on Oct 4, 2016
@tbg
Member Author

tbg commented Oct 4, 2016

I can take this one. If you're interested, let's pair up (but ok if you just wanna do something else).

@tbg
Member Author

tbg commented Oct 4, 2016

It's a party in TestStoreRangeDownReplicate:

--- FAIL: TestStoreRangeDownReplicate (0.81s)
client_raft_test.go:1028: replica 5 (range 2) was garbage collected

@jordanlewis
Member

How about this one:

client_raft_test.go:985: storage/replica_command.go:2408: split at key "m" failed: conflict updating range descriptors

@tbg
Member Author

tbg commented Oct 4, 2016

Also expected (as far as I can tell, though I'm not sure why in this test). I'll send a PR.

@tbg
Member Author

tbg commented Oct 11, 2016

I obviously never sent that PR. My suspicion was that we're doing concurrent splits by accident, in which case it's perfectly reasonable that some of them would fail. I've got to clear out my queue now, so maybe @jordanlewis can pull this thread further when he gets back.
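
As a rough illustration of why accidental concurrent splits would produce "conflict updating range descriptors": each split (or replica change) reads the current range descriptor and then writes the updated one conditionally on the old value still being in place, so whichever transaction commits second sees a mismatch and fails. The sketch below is illustrative only and uses hypothetical names, not the actual replica_command.go logic.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type descriptorStore struct {
	mu   sync.Mutex
	desc string // serialized range descriptor
}

var errConflict = errors.New("conflict updating range descriptors")

// conditionalPut succeeds only if the stored descriptor still equals expected.
func (s *descriptorStore) conditionalPut(expected, updated string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.desc != expected {
		return errConflict
	}
	s.desc = updated
	return nil
}

func main() {
	s := &descriptorStore{desc: "r1:[a,z) replicas=1,2,3"}

	old := s.desc
	// First updater (e.g. a replica change) commits.
	_ = s.conditionalPut(old, "r1:[a,z) replicas=1,2,3,4")
	// A concurrent split still holds the stale expected value and fails.
	err := s.conditionalPut(old, "r1:[a,m) replicas=1,2,3")
	fmt.Println(err) // prints: conflict updating range descriptors
}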

@tamird
Contributor

tamird commented Oct 31, 2016

"conflict updating range descriptors" is #10171. I'm going to close this guy for now since nobody's working on it, the test in question has at least two other issues (#10171 and #10306), and this particular issue has not come up in automated runs.

tamird closed this as completed on Oct 31, 2016