storage: TestStoreRangeDownReplicate is skipped #10171

Closed
cockroach-teamcity opened this issue Oct 24, 2016 · 12 comments · Fixed by #10624

Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/4d203e83dbc49ab1bd9283c3e15caa06e5288bb8

Stress build found a failed test:

I161024 06:51:40.576601 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.579059 10966 gossip/gossip.go:236  initial resolvers: []
W161024 06:51:40.579719 10966 gossip/gossip.go:1063  no resolvers found; use --join to specify a connected node
I161024 06:51:40.580376 10966 gossip/gossip.go:271  NodeID set to 1
I161024 06:51:40.596320 10966 gossip/gossip.go:271  NodeID set to 1
I161024 06:51:40.600642 10966 storage/store.go:1125  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161024 06:51:40.601048 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:59774" > attrs:<> locality:<> 
I161024 06:51:40.611689 10939 storage/replica_proposal.go:292  [s1,r1/1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=1970-01-01 00:00:00 +0000 UTC]
I161024 06:51:40.622131 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.646019 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.646743 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.647474 10966 gossip/gossip.go:271  NodeID set to 2
I161024 06:51:40.675251 10966 gossip/gossip.go:271  NodeID set to 2
I161024 06:51:40.677729 11034 gossip/client.go:126  node 2: started gossip client to 127.0.0.1:59774
I161024 06:51:40.678434 10966 storage/store.go:1125  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I161024 06:51:40.678771 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:46756" > attrs:<> locality:<> 
I161024 06:51:40.681992 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.687995 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.691073 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.691879 10966 gossip/gossip.go:271  NodeID set to 3
I161024 06:51:40.738680 10966 gossip/gossip.go:271  NodeID set to 3
I161024 06:51:40.741169 11097 gossip/client.go:126  node 3: started gossip client to 127.0.0.1:59774
I161024 06:51:40.749594 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:42684" > attrs:<> locality:<> 
I161024 06:51:40.753436 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.759178 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.769471 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.776008 10966 gossip/gossip.go:271  NodeID set to 4
I161024 06:51:40.831878 10966 gossip/gossip.go:271  NodeID set to 4
I161024 06:51:40.836458 10966 storage/store.go:1125  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I161024 06:51:40.837020 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:46145" > attrs:<> locality:<> 
I161024 06:51:40.838503 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.840045 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.840440 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.840861 10966 gossip/gossip.go:271  NodeID set to 5
I161024 06:51:40.842555 11205 gossip/client.go:126  node 4: started gossip client to 127.0.0.1:59774
I161024 06:51:41.027340 10966 gossip/gossip.go:271  NodeID set to 5
I161024 06:51:41.032362 11247 gossip/client.go:126  node 5: started gossip client to 127.0.0.1:59774
I161024 06:51:41.044333 11267 gossip/server.go:260  node 1: refusing gossip from node 5 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:42684})
I161024 06:51:41.052512 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:5 address:<network_field:"tcp" address_field:"127.0.0.1:48690" > attrs:<> locality:<> 
I161024 06:51:41.068319 11267 gossip/server.go:260  node 1: refusing gossip from node 5 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:42684})
I161024 06:51:41.069734 11247 gossip/client.go:131  node 5: closing client to node 1 (127.0.0.1:59774): received forward from node 1 to 3 (127.0.0.1:42684)
I161024 06:51:41.070414 11295 gossip/client.go:126  node 5: started gossip client to 127.0.0.1:42684
I161024 06:51:41.214533 10966 storage/client_test.go:416  gossip network initialized
I161024 06:51:41.249266 10966 storage/replica_raftstorage.go:446  [s1,r1/1:/M{in-ax}] generated snapshot 37e29b9f for range 1 at index 22 in 158.033µs.
I161024 06:51:41.265511 10966 storage/store.go:2969  streamed snapshot: kv pairs: 41, log entries: 12
I161024 06:51:41.269193 11380 storage/replica_raftstorage.go:577  [s2] [n2,s2,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 22 (id=37e29b9f, encoded size=16, 1 rocksdb batches, 12 log entries)
I161024 06:51:41.300007 11380 storage/replica_raftstorage.go:580  [s2] [n2,s2,r1/?:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.030s
I161024 06:51:41.326393 10966 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2 
I161024 06:51:41.430656 11375 storage/replica.go:1765  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161024 06:51:41.514999 11455 storage/raft_transport.go:423  raft transport stream to node 1 established
I161024 06:51:41.546919 10966 storage/replica_raftstorage.go:446  [s1,r1/1:/M{in-ax}] generated snapshot 16b1b56a for range 1 at index 34 in 151.451µs.
I161024 06:51:41.603553 10966 storage/store.go:2969  streamed snapshot: kv pairs: 54, log entries: 24
I161024 06:51:41.607691 11481 storage/replica_raftstorage.go:577  [s3] [n3,s3,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 34 (id=16b1b56a, encoded size=16, 1 rocksdb batches, 24 log entries)
I161024 06:51:41.631844 11481 storage/replica_raftstorage.go:580  [s3] [n3,s3,r1/?:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.024s
I161024 06:51:41.752513 10966 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3 
I161024 06:51:42.161780 11467 storage/replica.go:1765  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA {NodeID:3 StoreID:3 ReplicaID:3} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161024 06:51:42.619013 11577 storage/raft_transport.go:423  raft transport stream to node 1 established
I161024 06:51:42.920303 10966 storage/replica_command.go:2354  initiating a split of this range at key "m" [r2]
W161024 06:51:43.981295 11682 storage/stores.go:218  range not contained in one range: [/Min,"m\x00"), but have [/Min,"m")
W161024 06:51:43.987345 11655 storage/replica.go:1620  [s1,r1/1:{/Min-"m"}] unable to cancel expired Raft command [txn: 8b44ce5f], ConditionalPut [/Local/Range/"m"/RangeDescriptor,/Min), Put [/Meta2/Max,/Min), Put [/Meta2/"m",/Min), Put [/Meta1/Max,/Min), EndTransaction [/Local/Range/""/RangeDescriptor,/Min)
E161024 06:51:45.216154 10966 internal/client/txn.go:326  failure aborting transaction: does not exist; abort caused by: conflict updating range descriptors
I161024 06:51:45.232117 11769 util/stop/stopper.go:425  quiescing; tasks left:
5      storage/client_test.go:503
I161024 06:51:45.235813 11769 util/stop/stopper.go:425  quiescing; tasks left:
4      storage/client_test.go:503
I161024 06:51:45.243490 11769 util/stop/stopper.go:425  quiescing; tasks left:
3      storage/client_test.go:503
I161024 06:51:45.272817 11771 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:503
E161024 06:51:45.273832 11138 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161024 06:51:45.275219 11770 util/stop/stopper.go:425  quiescing; tasks left:
3      storage/client_test.go:503
I161024 06:51:45.299906 11769 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:503
I161024 06:51:45.369998 11770 util/stop/stopper.go:425  quiescing; tasks left:
2      storage/client_test.go:503
I161024 06:51:45.386395 11770 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:503
E161024 06:51:45.387292 11065 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
E161024 06:51:45.405877 11065 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161024 06:51:45.406107 11149 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161024 06:51:45.406331 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
E161024 06:51:45.414787 10983 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
E161024 06:51:45.417587 11327 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
W161024 06:51:45.430509 11619 storage/store.go:2834  [s3] got error from range 0, replica {1 1 0}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:0}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:0}
E161024 06:51:45.438260 11327 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161024 06:51:45.445592 11138 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
W161024 06:51:45.482757 11619 storage/store.go:2834  [s3] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:3}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.483429 11619 storage/store.go:2834  [s3] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:3}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.484085 11619 storage/store.go:2834  [s3] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:3}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.493807 11457 storage/store.go:2834  [s2] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:2 StoreID:2 ReplicaID:2}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.494189 11457 storage/store.go:2834  [s2] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:2 StoreID:2 ReplicaID:2}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.494530 11457 storage/store.go:2834  [s2] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:2 StoreID:2 ReplicaID:2}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
I161024 06:51:45.514464 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.518655 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.521735 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.524798 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.527787 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.529764 11228 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:48690->127.0.0.1:49755: use of closed network connection
I161024 06:51:45.530204 11244 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161024 06:51:45.531047 11171 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:46145->127.0.0.1:35652: use of closed network connection
I161024 06:51:45.531663 11032 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:46756->127.0.0.1:53541: use of closed network connection
I161024 06:51:45.532204 10990 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:42684->127.0.0.1:36229: use of closed network connection
I161024 06:51:45.532741 11024 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
W161024 06:51:45.533355 11577 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
I161024 06:51:45.534290 10960 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:59774->127.0.0.1:55083: use of closed network connection
I161024 06:51:45.534745 10975 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161024 06:51:45.535211 10977 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:59774: operation was canceled"; Reconnecting to {"127.0.0.1:59774" <nil>}
I161024 06:51:45.535565 10977 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.535982 11246 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:48690: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:48690" <nil>}
I161024 06:51:45.536245 11246 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.536726 11044 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161024 06:51:45.537216 11046 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:46756: operation was canceled"; Reconnecting to {"127.0.0.1:46756" <nil>}
I161024 06:51:45.538111 11046 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.538616 11186 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:46145: operation was canceled"; Reconnecting to {"127.0.0.1:46145" <nil>}
I161024 06:51:45.538897 11186 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.539149 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.539857 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.540572 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.541279 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.541889 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
    client_raft_test.go:988: storage/replica_command.go:2434: split at key "m" failed: conflict updating range descriptors
@cockroach-teamcity cockroach-teamcity added O-robot Originated from a bot. C-test-failure Broken test (automatically or manually discovered). labels Oct 24, 2016
@tamird
Contributor

tamird commented Oct 26, 2016

Seems distinct from #9673.

@tamird
Contributor

tamird commented Nov 1, 2016

Looks similar to #10172.

@vivekmenezes
Contributor

I'm able to reproduce this issue on GCE with make stress PKG=./pkg/storage TESTS=TestStoreRangeDownReplicate STRESSFLAGS='-stderr -maxfails 1 -p 100'. Assigning to Peter because he has been playing around in this area. Perhaps I used too much stress, but it's worth investigating.

@petermattis
Collaborator

Sort of a bad feedback loop to get assigned flaky tests because I've been fixing other flaky tests in the same area. In this case, the flakiness is an expected error that the test should be handling. I'll fix.
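
For context, handling the expected error generally means retrying the split when it fails with "conflict updating range descriptors", which can occur while a concurrent replica change is rewriting the range descriptor. A minimal Go sketch of such a retry, assuming a hypothetical doSplit callback rather than the test's actual split call (this is not the actual fix):

```go
package storage_test

import (
	"strings"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// splitWithRetry is a hypothetical helper: it keeps retrying a split while it
// fails with the transient "conflict updating range descriptors" error, and
// fails the test on any other error.
func splitWithRetry(t *testing.T, doSplit func() error) {
	testutils.SucceedsSoon(t, func() error {
		err := doSplit()
		switch {
		case err == nil:
			return nil
		case strings.Contains(err.Error(), "conflict updating range descriptors"):
			return err // transient: SucceedsSoon retries until its timeout
		default:
			t.Fatal(err)
			return err // unreachable
		}
	})
}
```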

@vivekmenezes
Contributor

Just retrying the command results in it hitting another error further down in the test. Is this just about handling errors and retrying?

@petermattis
Collaborator

This test probably deserves to be rewritten to use TestCluster, similar to the soon-to-be-added TestReplicateQueueRebalance.
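
For reference, a TestCluster-based down-replication test has roughly the shape sketched below. This is only an illustration, not the actual rewrite; the AddReplicas/LookupRange calls and the descriptor's Replicas field are assumptions whose names and signatures differ across CockroachDB versions.

```go
package storage_test

import (
	"context"
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

// Sketch of a down-replication test on TestCluster: start more nodes than the
// replication factor, over-replicate one range, then wait for the replicate
// queue to bring it back down to the configured factor (3 by default).
func TestDownReplicateSketch(t *testing.T) {
	tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
	})
	defer tc.Stopper().Stop(context.Background())

	key := roachpb.Key("a")
	// Manually add two extra replicas so the range is 5x replicated.
	if _, err := tc.AddReplicas(key, tc.Target(3), tc.Target(4)); err != nil {
		t.Fatal(err)
	}

	// The replicate queue should eventually remove the extra replicas.
	testutils.SucceedsSoon(t, func() error {
		desc, err := tc.LookupRange(key)
		if err != nil {
			return err
		}
		// Note: the replica set was a plain slice (desc.Replicas) in 2016;
		// newer versions expose it via desc.Replicas().Descriptors().
		if n := len(desc.Replicas); n != 3 {
			return fmt.Errorf("range still has %d replicas", n)
		}
		return nil
	})
}
```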

@vivekmenezes
Contributor

I can take that on. Reassigning back to me.

@petermattis
Collaborator

See #10515.

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Nov 11, 2016
…ownReplicate

The new test tests the down replication functionality using TestCluster.

fixes cockroachdb#10536 fixes cockroachdb#10171 fixes cockroachdb#9603
@tamird changed the title from ": TestStoreRangeDownReplicate failed under stress" to "storage: TestStoreRangeDownReplicate is skipped" on Nov 13, 2016
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Nov 18, 2016
…ownReplicate

The new test tests the down replication functionality using TestCluster.

This change also makes the replica_gc_queue queue up replicas that have
been removed reliably. This functionality is tested by the new test.

fixes cockroachdb#10536 fixes cockroachdb#10171 fixes cockroachdb#9603
@vivekmenezes
Contributor

The original test failure seen on this issue was fixed by #10728.

@tamird
Contributor

tamird commented Nov 21, 2016

Can we unskip it and call this good?

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Nov 22, 2016
…ownReplicate

The new test tests the down replication functionality using TestCluster.

fixes cockroachdb#10536 fixes cockroachdb#10171 fixes cockroachdb#9603
@tamird
Contributor

tamird commented Nov 22, 2016

This was actually fixed by #10728, right?

@vivekmenezes
Contributor

yes
