: TestStoreRangeDownReplicate failed under stress #10306

Closed
cockroach-teamcity opened this issue Oct 29, 2016 · 6 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.


@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/539acaedf458b09a5246992d9d05561231ecca4e

Stress build found a failed test:

I161029 07:27:42.654090 14077 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161029 07:27:42.655635 14077 gossip/gossip.go:237  [n?] initial resolvers: []
W161029 07:27:42.656046 14077 gossip/gossip.go:1055  [n?] no resolvers found; use --join to specify a connected node
I161029 07:27:42.656351 14077 base/node_id.go:62  NodeID set to 1
I161029 07:27:42.673919 14077 storage/store.go:1152  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161029 07:27:42.683893 14077 gossip/gossip.go:280  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:53834" > attrs:<> locality:<> 
I161029 07:27:42.690397 14467 storage/replica_proposal.go:292  [s1,r1/1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=1970-01-01 00:00:00 +0000 UTC]
I161029 07:27:42.700384 14077 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161029 07:27:42.722057 14077 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:53834]
W161029 07:27:42.722483 14077 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161029 07:27:42.722808 14077 base/node_id.go:62  NodeID set to 2
I161029 07:27:42.739486 14519 gossip/client.go:126  [n2] node 2: started gossip client to 127.0.0.1:53834
I161029 07:27:42.740300 14077 storage/store.go:1152  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I161029 07:27:42.740621 14077 gossip/gossip.go:280  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:37244" > attrs:<> locality:<> 
I161029 07:27:42.743556 14077 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161029 07:27:42.746932 14077 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:53834]
W161029 07:27:42.747634 14077 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161029 07:27:42.747961 14077 base/node_id.go:62  NodeID set to 3
I161029 07:27:42.762794 14077 storage/store.go:1152  [n3,s3]: failed initial metrics computation: [n3,s3]: system config not yet available
I161029 07:27:42.763118 14077 gossip/gossip.go:280  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:50742" > attrs:<> locality:<> 
I161029 07:27:42.764415 14565 gossip/client.go:126  [n3] node 3: started gossip client to 127.0.0.1:53834
I161029 07:27:42.766098 14077 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161029 07:27:42.768266 14077 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:53834]
W161029 07:27:42.768702 14077 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161029 07:27:42.769251 14077 base/node_id.go:62  NodeID set to 4
I161029 07:27:42.827099 14077 storage/store.go:1152  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I161029 07:27:42.827427 14077 gossip/gossip.go:280  [n4] NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:50512" > attrs:<> locality:<> 
I161029 07:27:42.831559 14077 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161029 07:27:42.832321 14688 gossip/client.go:126  [n4] node 4: started gossip client to 127.0.0.1:53834
I161029 07:27:42.901812 14077 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:53834]
W161029 07:27:42.902222 14077 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161029 07:27:42.902559 14077 base/node_id.go:62  NodeID set to 5
I161029 07:27:42.931299 14077 storage/store.go:1152  [n5,s5]: failed initial metrics computation: [n5,s5]: system config not yet available
I161029 07:27:42.931625 14077 gossip/gossip.go:280  [n5] NodeDescriptor set to node_id:5 address:<network_field:"tcp" address_field:"127.0.0.1:40621" > attrs:<> locality:<> 
I161029 07:27:42.945613 14702 gossip/client.go:126  [n5] node 5: started gossip client to 127.0.0.1:53834
I161029 07:27:42.976262 14780 gossip/server.go:263  [n1] refusing gossip from node 5 (max 3 conns); forwarding to 4 ({tcp 127.0.0.1:50512})
I161029 07:27:42.999164 14702 gossip/client.go:131  [n5] node 5: closing client to node 1 (127.0.0.1:53834): received forward from node 1 to 4 (127.0.0.1:50512)
I161029 07:27:43.046369 14742 gossip/client.go:126  [n5] node 5: started gossip client to 127.0.0.1:50512
I161029 07:27:43.166896 14077 storage/client_test.go:420  gossip network initialized
I161029 07:27:43.233872 14077 storage/replica_raftstorage.go:446  [s1,r1/1:/M{in-ax}] generated snapshot 88cda945 for range 1 at index 26 in 99.8µs.
I161029 07:27:43.236081 14077 storage/store.go:3061  streamed snapshot: kv pairs: 45, log entries: 16
I161029 07:27:43.244703 14716 storage/replica_raftstorage.go:577  [s2] [n2,s2,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 26 (id=88cda945, encoded size=16, 1 rocksdb batches, 16 log entries)
I161029 07:27:43.248215 14716 storage/replica_raftstorage.go:580  [s2] [n2,s2,r1/?:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.003s
I161029 07:27:43.268639 14077 storage/replica_command.go:3240  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2 
I161029 07:27:43.290429 14903 storage/replica.go:1860  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161029 07:27:43.317819 14077 storage/replica_raftstorage.go:446  [s1,r1/1:/M{in-ax}] generated snapshot a18aa062 for range 1 at index 29 in 106.701µs.
I161029 07:27:43.333164 14940 storage/replica_raftstorage.go:577  [s3] [n3,s3,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 29 (id=a18aa062, encoded size=16, 1 rocksdb batches, 19 log entries)
I161029 07:27:43.335235 14077 storage/store.go:3061  streamed snapshot: kv pairs: 49, log entries: 19
I161029 07:27:43.337887 14940 storage/replica_raftstorage.go:580  [s3] [n3,s3,r1/?:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.004s
I161029 07:27:43.354105 14077 storage/replica_command.go:3240  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3 
I161029 07:27:43.368418 14909 storage/raft_transport.go:423  raft transport stream to node 1 established
I161029 07:27:43.659243 14974 storage/replica.go:1860  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA {NodeID:3 StoreID:3 ReplicaID:3} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:43.857200 15091 storage/raft_transport.go:423  raft transport stream to node 1 established
I161029 07:27:44.057299 14077 storage/replica_command.go:2363  initiating a split of this range at key "m" [r2]
I161029 07:27:44.513477 14077 storage/replica_raftstorage.go:446  [s1,r2/1:{"m"-/Max}] generated snapshot 39f3f9f3 for range 2 at index 10 in 113.001µs.
I161029 07:27:44.555412 14077 storage/store.go:3061  streamed snapshot: kv pairs: 28, log entries: 0
I161029 07:27:44.581406 15007 storage/replica_raftstorage.go:577  [s4] [n4,s4,r2/?:{-}]: with replicaID [?], applying preemptive snapshot at index 10 (id=39f3f9f3, encoded size=16, 1 rocksdb batches, 0 log entries)
I161029 07:27:44.627480 15007 storage/replica_raftstorage.go:580  [s4] [n4,s4,r2/?:{"m"-/Max}]: with replicaID [?], applied preemptive snapshot in 0.046s
I161029 07:27:44.683390 14077 storage/replica_command.go:3240  change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > next_replica_id:4 
W161029 07:27:44.853022 15221 storage/stores.go:218  range not contained in one range: [/Meta2/Max,"m\x00"), but have [/Min,"m")
I161029 07:27:45.227065 15235 storage/replica.go:1860  [s1,r2/1:{"m"-/Max}] proposing ADD_REPLICA {NodeID:4 StoreID:4 ReplicaID:4} for range 2: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3} {NodeID:4 StoreID:4 ReplicaID:4}]
I161029 07:27:45.348026 14077 storage/replica_raftstorage.go:446  [s1,r2/1:{"m"-/Max}] generated snapshot 93dd05f4 for range 2 at index 14 in 108.801µs.
I161029 07:27:45.408775 15266 storage/raft_transport.go:423  raft transport stream to node 1 established
I161029 07:27:45.420530 14077 storage/store.go:3061  streamed snapshot: kv pairs: 30, log entries: 4
I161029 07:27:45.533147 15232 storage/replica_raftstorage.go:577  [s5] [n5,s5,r2/?:{-}]: with replicaID [?], applying preemptive snapshot at index 14 (id=93dd05f4, encoded size=16, 1 rocksdb batches, 4 log entries)
I161029 07:27:45.539973 15232 storage/replica_raftstorage.go:580  [s5] [n5,s5,r2/?:{"m"-/Max}]: with replicaID [?], applied preemptive snapshot in 0.007s
I161029 07:27:45.558194 14077 storage/replica_command.go:3240  change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:4 store_id:4 replica_id:4 > next_replica_id:5 
I161029 07:27:46.201625 15299 storage/replica.go:1860  [s1,r2/1:{"m"-/Max}] proposing ADD_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3} {NodeID:4 StoreID:4 ReplicaID:4} {NodeID:5 StoreID:5 ReplicaID:5}]
I161029 07:27:46.610939 15145 storage/raft_transport.go:423  raft transport stream to node 1 established
I161029 07:27:47.270279 15394 storage/raft_transport.go:423  raft transport stream to node 3 established
I161029 07:27:47.279489 15398 storage/raft_transport.go:423  raft transport stream to node 2 established
I161029 07:27:48.176503 14535 storage/replica_proposal.go:341  [s2,r1/2:{/Min-"m"}] range [n2,s2,r1/2:{/Min-"m"}]: transferring raft leadership to replica ID 1
I161029 07:27:48.298796 15438 storage/raft_transport.go:423  raft transport stream to node 5 established
I161029 07:27:48.305360 15437 storage/raft_transport.go:423  raft transport stream to node 4 established
I161029 07:27:48.761156 15471 storage/raft_transport.go:423  raft transport stream to node 3 established
I161029 07:27:48.764167 15469 storage/raft_transport.go:423  raft transport stream to node 4 established
I161029 07:27:48.769110 15470 storage/raft_transport.go:423  raft transport stream to node 2 established
I161029 07:27:48.775612 15501 storage/raft_transport.go:423  raft transport stream to node 5 established
I161029 07:27:48.831409 15530 storage/raft_transport.go:423  raft transport stream to node 5 established
I161029 07:27:49.316646 15571 storage/raft_transport.go:423  raft transport stream to node 2 established
I161029 07:27:49.320757 15570 storage/raft_transport.go:423  raft transport stream to node 3 established
I161029 07:27:50.009553 15638 storage/raft_transport.go:423  raft transport stream to node 4 established
I161029 07:27:51.264923 14608 storage/replica_proposal.go:292  [s3,r2/3:{"m"-/Max}] new range lease replica {3 3 3} 1970-01-01 00:00:04.5 +0000 UTC 5.4s following replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 4.5s [physicalTime=1970-01-01 00:00:25.2 +0000 UTC]
I161029 07:27:54.020064 14537 storage/replica_proposal.go:341  [s2,r2/2:{"m"-/Max}] range [n2,s2,r2/2:{"m"-/Max}]: transferring raft leadership to replica ID 3
I161029 07:27:54.110629 14077 storage/replica_command.go:3240  [replicate,s3,r2/3:{"m"-/Max}] change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:4 store_id:4 replica_id:4 > replicas:<node_id:5 store_id:5 replica_id:5 > next_replica_id:6 
W161029 07:27:54.443812 15763 storage/stores.go:218  range not contained in one range: [/Meta2/Max,"m\x00"), but have [/Min,"m")
I161029 07:27:55.414846 15779 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:1 StoreID:1 ReplicaID:1} for range 2: [{NodeID:5 StoreID:5 ReplicaID:5} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3} {NodeID:4 StoreID:4 ReplicaID:4}]
I161029 07:27:55.629694 14077 storage/replica_command.go:3240  [replicate,s3,r2/3:{"m"-/Max}] change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:5 store_id:5 replica_id:5 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:4 store_id:4 replica_id:4 > next_replica_id:6 
I161029 07:27:55.823732 15017 storage/store.go:2922  [s1] [n1,s1,r2/1:{"m"-/Max}]: added to replica GC queue (peer suggestion)
I161029 07:27:56.511637 15802 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:56.682496 14613 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:56.942215 14600 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:57.010094 14615 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:57.365011 14609 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:57.443666 15529 storage/store.go:2922  [s5] [n5,s5,r2/5:{"m"-/Max}]: added to replica GC queue (peer suggestion)
I161029 07:27:57.444190 15529 storage/store.go:2922  [s5] [n5,s5,r2/5:{"m"-/Max}]: added to replica GC queue (peer suggestion)
I161029 07:27:57.640619 15988 util/stop/stopper.go:435  quiescing; tasks left:
1      storage/queue.go:467
I161029 07:27:57.646085 15986 util/stop/stopper.go:435  quiescing; tasks left:
2      storage/client_test.go:507
1      storage/intent_resolver.go:383
1      kv/txn_coord_sender.go:911
I161029 07:27:57.650327 15905 util/stop/stopper.go:435  quiescing; tasks left:
1      storage/client_test.go:507
I161029 07:27:57.650825 15986 util/stop/stopper.go:435  quiescing; tasks left:
2      storage/client_test.go:507
1      storage/intent_resolver.go:383
I161029 07:27:57.651038 15904 util/stop/stopper.go:435  quiescing; tasks left:
9      storage/client_test.go:507
1      storage/queue.go:467
I161029 07:27:57.709508 15986 util/stop/stopper.go:435  quiescing; tasks left:
1      storage/intent_resolver.go:383
1      storage/client_test.go:507
E161029 07:27:57.749719 14503 storage/node_liveness.go:141  [hb] failed liveness heartbeat: transaction commit result is ambiguous
E161029 07:27:57.750665 14503 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161029 07:27:57.751347 14628 storage/node_liveness.go:141  [hb] failed liveness heartbeat: transaction commit result is ambiguous
E161029 07:27:57.752261 14628 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161029 07:27:57.752618 14689 storage/node_liveness.go:141  [hb] failed liveness heartbeat: transaction commit result is ambiguous
I161029 07:27:57.772971 15904 util/stop/stopper.go:435  quiescing; tasks left:
8      storage/client_test.go:507
1      storage/queue.go:467
W161029 07:27:57.774641 15771 storage/intent_resolver.go:101  [s1] asynchronous resolveIntents failed: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
E161029 07:27:57.774824 14689 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161029 07:27:57.785277 14786 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
I161029 07:27:57.792279 15904 util/stop/stopper.go:435  quiescing; tasks left:
4      storage/client_test.go:507
1      storage/queue.go:467
E161029 07:27:57.796826 14689 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161029 07:27:57.797868 14786 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161029 07:27:57.799319 15986 util/stop/stopper.go:435  quiescing; tasks left:
1      storage/intent_resolver.go:383
I161029 07:27:57.808964 15904 util/stop/stopper.go:435  quiescing; tasks left:
3      storage/client_test.go:507
1      storage/queue.go:467
E161029 07:27:57.815655 14795 storage/queue.go:558  [replicaGC] on [n5,s5,r2/5:{"m"-/Max}]: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
I161029 07:27:57.816933 15904 util/stop/stopper.go:435  quiescing; tasks left:
2      storage/client_test.go:507
1      storage/queue.go:467
E161029 07:27:57.826353 14443 storage/queue.go:558  [replicaGC] on [n1,s1,r2/1:{"m"-/Max}]: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
E161029 07:27:57.829090 14503 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161029 07:27:57.831214 14550 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161029 07:27:57.834995 15904 util/stop/stopper.go:435  quiescing; tasks left:
1      storage/client_test.go:507
W161029 07:27:57.847470 15971 storage/intent_resolver.go:337  [n3,s3,r2/3:{"m"-/Max}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
I161029 07:27:57.847718 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.871677 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
E161029 07:27:57.879041 14550 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161029 07:27:57.879290 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.883155 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.885085 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.887143 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.889561 14511 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161029 07:27:57.889954 14625 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:50742->127.0.0.1:38796: use of closed network connection
I161029 07:27:57.890351 14591 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:40621->127.0.0.1:47892: use of closed network connection
I161029 07:27:57.890684 14517 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:37244->127.0.0.1:46179: use of closed network connection
I161029 07:27:57.890938 14642 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:50512->127.0.0.1:34763: use of closed network connection
I161029 07:27:57.891205 14485 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161029 07:27:57.891526 14495 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161029 07:27:57.891871 14497 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:50512: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:50512" <nil>}
I161029 07:27:57.892075 14434 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:53834->127.0.0.1:46266: use of closed network connection
W161029 07:27:57.892295 15394 storage/raft_transport.go:428  raft transport stream to node 3 failed: EOF
I161029 07:27:57.892546 14450 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
W161029 07:27:57.892855 15571 storage/raft_transport.go:428  raft transport stream to node 2 failed: EOF
W161029 07:27:57.894260 15091 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
W161029 07:27:57.894490 15471 storage/raft_transport.go:428  raft transport stream to node 3 failed: EOF
W161029 07:27:57.894661 14909 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
W161029 07:27:57.894810 15437 storage/raft_transport.go:428  raft transport stream to node 4 failed: EOF
W161029 07:27:57.894957 14991 storage/raft_transport.go:428  raft transport stream to node 3 failed: EOF
W161029 07:27:57.895116 15470 storage/raft_transport.go:428  raft transport stream to node 2 failed: EOF
I161029 07:27:57.895270 14497 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: context canceled
W161029 07:27:57.895409 15236 storage/raft_transport.go:428  raft transport stream to node 4 failed: EOF
W161029 07:27:57.895570 15145 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
I161029 07:27:57.895727 14513 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:50742: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:50742" <nil>}
I161029 07:27:57.895829 14722 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161029 07:27:57.896046 14487 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:37244: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:37244" <nil>}
I161029 07:27:57.896158 14452 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:53834: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:53834" <nil>}
W161029 07:27:57.896252 15530 storage/raft_transport.go:428  raft transport stream to node 5 failed: EOF
I161029 07:27:57.896486 14724 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:40621: operation was canceled"; Reconnecting to {"127.0.0.1:40621" <nil>}
I161029 07:27:57.896676 14724 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161029 07:27:57.897205 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.897594 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.897917 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.898329 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
I161029 07:27:57.898695 15903 util/stop/stopper.go:363  stop has been called, stopping or quiescing all running tasks
    client_raft_test.go:1010: Failed to achieve proper replication within 10 seconds
@cockroach-teamcity added the O-robot and C-test-failure labels on Oct 29, 2016
@spencerkimball
Member

@petermattis Here's the section where things go batty:

I161029 07:27:51.264923 14608 storage/replica_proposal.go:292  [s3,r2/3:{"m"-/Max}] new range lease replica {3 3 3} 1970-01-01 00:00:04.5 +0000 UTC 5.4s following replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 4.5s [physicalTime=1970-01-01 00:00:25.2 +0000 UTC]
I161029 07:27:54.020064 14537 storage/replica_proposal.go:341  [s2,r2/2:{"m"-/Max}] range [n2,s2,r2/2:{"m"-/Max}]: transferring raft leadership to replica ID 3
I161029 07:27:54.110629 14077 storage/replica_command.go:3240  [replicate,s3,r2/3:{"m"-/Max}] change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:4 store_id:4 replica_id:4 > replicas:<node_id:5 store_id:5 replica_id:5 > next_replica_id:6 
W161029 07:27:54.443812 15763 storage/stores.go:218  range not contained in one range: [/Meta2/Max,"m\x00"), but have [/Min,"m")
I161029 07:27:55.414846 15779 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:1 StoreID:1 ReplicaID:1} for range 2: [{NodeID:5 StoreID:5 ReplicaID:5} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3} {NodeID:4 StoreID:4 ReplicaID:4}]
I161029 07:27:55.629694 14077 storage/replica_command.go:3240  [replicate,s3,r2/3:{"m"-/Max}] change replicas: read existing descriptor range_id:2 start_key:"m" end_key:"\377\377" replicas:<node_id:5 store_id:5 replica_id:5 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:4 store_id:4 replica_id:4 > next_replica_id:6 
I161029 07:27:55.823732 15017 storage/store.go:2922  [s1] [n1,s1,r2/1:{"m"-/Max}]: added to replica GC queue (peer suggestion)
I161029 07:27:56.511637 15802 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:56.682496 14613 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:56.942215 14600 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:57.010094 14615 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:57.365011 14609 storage/replica.go:1860  [s3,r2/3:{"m"-/Max}] proposing REMOVE_REPLICA {NodeID:5 StoreID:5 ReplicaID:5} for range 2: [{NodeID:4 StoreID:4 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161029 07:27:57.443666 15529 storage/store.go:2922  [s5] [n5,s5,r2/5:{"m"-/Max}]: added to replica GC queue (peer suggestion)
I161029 07:27:57.444190 15529 storage/store.go:2922  [s5] [n5,s5,r2/5:{"m"-/Max}]: added to replica GC queue (peer suggestion)

What does this line mean:

W161029 07:27:54.443812 15763 storage/stores.go:218 range not contained in one range: [/Meta2/Max,"m\x00"), but have [/Min,"m")

@petermattis
Collaborator

I'm not familiar with that warning, but looking at the code, it seems to be saying that the node received a request whose start and end keys are not contained in a single range. Perhaps the range recently split.

@bdarnell
Contributor

bdarnell commented Nov 1, 2016

Yeah, that message means the range recently split. We should probably just remove this log line (it comes before we return an error that will be handled transparently by DistSender, so it's innocuous).

However, the keys used in this message are suspicious. What's trying to operate on a batch that spans from the beginning of the user keyspace to the non-existent key m\x00? I'm not sure where those keys would get constructed.
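
That check is essentially a span-containment test against the replica's range descriptor. As a rough illustration only (the Key, RangeDescriptor, and checkRequestSpan names below are simplified stand-ins, not the actual stores.go code), a minimal Go sketch of such a check could look like this:

package main

import (
	"fmt"
	"log"
)

// Key is a simplified range key; real CockroachDB keys are byte slices with
// special Min/Max/Meta encodings.
type Key string

// RangeDescriptor holds the [StartKey, EndKey) bounds a replica believes it owns.
type RangeDescriptor struct {
	StartKey, EndKey Key
}

// contains reports whether the span [start, end) lies entirely inside the
// descriptor's bounds.
func (d RangeDescriptor) contains(start, end Key) bool {
	return start >= d.StartKey && end <= d.EndKey
}

// checkRequestSpan mimics the shape of the check behind the warning: if the
// batch's span is not fully contained in the local replica's bounds (for
// example because the range just split), log a warning and return an error
// for the sender to handle.
func checkRequestSpan(d RangeDescriptor, start, end Key) error {
	if !d.contains(start, end) {
		log.Printf("range not contained in one range: [%q,%q), but have [%q,%q)",
			start, end, d.StartKey, d.EndKey)
		return fmt.Errorf("key span [%q,%q) outside replica bounds", start, end)
	}
	return nil
}

func main() {
	// After the split at "m", the local replica covers only [/Min, "m"),
	// but the incoming batch still reaches up to "m\x00".
	d := RangeDescriptor{StartKey: "", EndKey: "m"}
	if err := checkRequestSpan(d, "", "m\x00"); err != nil {
		fmt.Println("returned to the sender for retry:", err)
	}
}

The returned error is the part that matters: as noted above, DistSender handles it transparently, which is why the log line itself is innocuous.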

@vivekmenezes
Contributor

I'm seeing the read in this test occasionally take more than a second, which might explain why the test is failing. If you reduce the timeout in the test to 1 second and time the Get() within the test, I've seen it take 2 seconds. I suspect this might have to do with range quiescence, so I'm assigning this to Peter.

@petermattis
Collaborator

I can't reproduce a failure here. Can you provide details on what commands you're running?

@vivekmenezes
Contributor

Run make stress PKG=./pkg/storage TESTS=TestStoreRangeDownReplicate on your local machine with the following diff, and after a bit you'll see a failure with a log message indicating that a lot of time was spent hung on a read issued to request a lease.

diff --git a/pkg/storage/client_raft_test.go b/pkg/storage/client_raft_test.go
index 6937317..2f8b5a1 100644
--- a/pkg/storage/client_raft_test.go
+++ b/pkg/storage/client_raft_test.go
@@ -1129,15 +1129,19 @@ func TestStoreRangeDownReplicate(t *testing.T) {
        desc := replica.Desc()
        mtc.replicateRange(desc.RangeID, 3, 4)

-       maxTimeout := time.After(10 * time.Second)
+       maxTimeout := time.After(1000 * time.Millisecond)
+       now := time.Now()
        succeeded := false
        i := 0
        for !succeeded {
                select {
                case <-maxTimeout:
+                       t.Logf("%s time for down replication", time.Since(now))
                        t.Fatalf("Failed to achieve proper replication within 10 seconds")
                case <-time.After(10 * time.Millisecond):
+                       start := time.Now()
                        rangeDesc := getRangeMetadata(rightKeyAddr, mtc, t)
+                       t.Logf("%s time for get meta", time.Since(start))
                        if count := len(rangeDesc.Replicas); count < 3 {
                                t.Fatalf("Removed too many replicas; expected at least 3 replicas, found %d", count)
                        } else if count == 3 {
@@ -1154,21 +1158,27 @@ func TestStoreRangeDownReplicate(t *testing.T) {
                                if i >= len(mtc.stores) {
                                        i = 0
                                }
+                               start = time.Now()
                                rep := mtc.stores[i].LookupReplica(rightKeyAddr, nil)
+                               t.Logf("%s time for lookup replica", time.Since(start))
                                if rep != nil {
+                                       start = time.Now()
                                        mtc.expireLeases()
+                                       t.Logf("%s time for expire leases", time.Since(start))
                                        // Force the read command request a new lease.
                                        getArgs := getArgs(rightKey)
+                                       start = time.Now()
                                        if _, err := client.SendWrapped(context.Background(), mtc.distSenders[i], &getArgs); err != nil {
                                                t.Fatal(err)
                                        }
+                                       t.Logf("%s time for read", time.Since(start))
                                        mtc.stores[i].ForceReplicationScanAndProcess()
                                        break
                                }
                        }
                }
        }
-
+       t.Logf("%s time for down replication", time.Since(now))
        // Expire range leases one more time, so that any remaining resolutions can
        // get a range lease.
        // TODO(bdarnell): understand why some tests need this.

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 7, 2016
Previously we were releasing the snapshot (i.e. calling
`Replica.CloseOutSnap()`) when the ChangeReplicas operation
completed. Now we release the snapshot as soon as the remote node has
applied it. This is important to allow other ranges to make progress
which might be required for the current ChangeReplicas operation to
complete.

Fixes cockroachdb#10483
Fixes cockroachdb#10306
See cockroachdb#10409
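
To make the ordering change concrete, here is a toy Go sketch of the idea described in this commit message; the names (outgoingSnapshot, addReplicaWithSnapshot) are illustrative stand-ins, not the actual Replica/ChangeReplicas code:

package main

import "fmt"

// outgoingSnapshot stands in for the resources held open while streaming a
// snapshot to another store; the real counterpart is released via
// Replica.CloseOutSnap.
type outgoingSnapshot struct{ id string }

func (s *outgoingSnapshot) close() { fmt.Printf("snapshot %s released\n", s.id) }

// addReplicaWithSnapshot sketches the new ordering: the snapshot is released
// as soon as the remote store reports it has applied it, instead of being
// held until the whole ChangeReplicas operation has committed.
func addReplicaWithSnapshot() {
	snap := &outgoingSnapshot{id: "88cda945"}

	// 1. Stream the preemptive snapshot and wait for the remote to apply it.
	fmt.Println("streaming preemptive snapshot")
	remoteApplied := true // stand-in for the ack from the remote store
	if remoteApplied {
		snap.close() // new behavior: released here ...
	}

	// 2. Propose and commit the ADD_REPLICA configuration change. Under the
	//    old ordering the snapshot would only have been released after this
	//    step, which could block other ranges whose progress this step
	//    depends on.
	fmt.Println("proposing ADD_REPLICA / committing ChangeReplicas")
}

func main() { addReplicaWithSnapshot() }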