storage: TestStoreRangeDownReplicate is skipped #10171

Closed
cockroach-teamcity opened this issue Oct 24, 2016 · 12 comments · Fixed by #10624

Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/4d203e83dbc49ab1bd9283c3e15caa06e5288bb8

Stress build found a failed test:

I161024 06:51:40.576601 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.579059 10966 gossip/gossip.go:236  initial resolvers: []
W161024 06:51:40.579719 10966 gossip/gossip.go:1063  no resolvers found; use --join to specify a connected node
I161024 06:51:40.580376 10966 gossip/gossip.go:271  NodeID set to 1
I161024 06:51:40.596320 10966 gossip/gossip.go:271  NodeID set to 1
I161024 06:51:40.600642 10966 storage/store.go:1125  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161024 06:51:40.601048 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:59774" > attrs:<> locality:<> 
I161024 06:51:40.611689 10939 storage/replica_proposal.go:292  [s1,r1/1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=1970-01-01 00:00:00 +0000 UTC]
I161024 06:51:40.622131 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.646019 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.646743 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.647474 10966 gossip/gossip.go:271  NodeID set to 2
I161024 06:51:40.675251 10966 gossip/gossip.go:271  NodeID set to 2
I161024 06:51:40.677729 11034 gossip/client.go:126  node 2: started gossip client to 127.0.0.1:59774
I161024 06:51:40.678434 10966 storage/store.go:1125  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I161024 06:51:40.678771 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:46756" > attrs:<> locality:<> 
I161024 06:51:40.681992 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.687995 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.691073 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.691879 10966 gossip/gossip.go:271  NodeID set to 3
I161024 06:51:40.738680 10966 gossip/gossip.go:271  NodeID set to 3
I161024 06:51:40.741169 11097 gossip/client.go:126  node 3: started gossip client to 127.0.0.1:59774
I161024 06:51:40.749594 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:42684" > attrs:<> locality:<> 
I161024 06:51:40.753436 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.759178 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.769471 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.776008 10966 gossip/gossip.go:271  NodeID set to 4
I161024 06:51:40.831878 10966 gossip/gossip.go:271  NodeID set to 4
I161024 06:51:40.836458 10966 storage/store.go:1125  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I161024 06:51:40.837020 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:46145" > attrs:<> locality:<> 
I161024 06:51:40.838503 10966 storage/engine/rocksdb.go:349  opening in memory rocksdb instance
I161024 06:51:40.840045 10966 gossip/gossip.go:236  initial resolvers: [127.0.0.1:59774]
W161024 06:51:40.840440 10966 gossip/gossip.go:1065  no incoming or outgoing connections
I161024 06:51:40.840861 10966 gossip/gossip.go:271  NodeID set to 5
I161024 06:51:40.842555 11205 gossip/client.go:126  node 4: started gossip client to 127.0.0.1:59774
I161024 06:51:41.027340 10966 gossip/gossip.go:271  NodeID set to 5
I161024 06:51:41.032362 11247 gossip/client.go:126  node 5: started gossip client to 127.0.0.1:59774
I161024 06:51:41.044333 11267 gossip/server.go:260  node 1: refusing gossip from node 5 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:42684})
I161024 06:51:41.052512 10966 gossip/gossip.go:286  NodeDescriptor set to node_id:5 address:<network_field:"tcp" address_field:"127.0.0.1:48690" > attrs:<> locality:<> 
I161024 06:51:41.068319 11267 gossip/server.go:260  node 1: refusing gossip from node 5 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:42684})
I161024 06:51:41.069734 11247 gossip/client.go:131  node 5: closing client to node 1 (127.0.0.1:59774): received forward from node 1 to 3 (127.0.0.1:42684)
I161024 06:51:41.070414 11295 gossip/client.go:126  node 5: started gossip client to 127.0.0.1:42684
I161024 06:51:41.214533 10966 storage/client_test.go:416  gossip network initialized
I161024 06:51:41.249266 10966 storage/replica_raftstorage.go:446  [s1,r1/1:/M{in-ax}] generated snapshot 37e29b9f for range 1 at index 22 in 158.033µs.
I161024 06:51:41.265511 10966 storage/store.go:2969  streamed snapshot: kv pairs: 41, log entries: 12
I161024 06:51:41.269193 11380 storage/replica_raftstorage.go:577  [s2] [n2,s2,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 22 (id=37e29b9f, encoded size=16, 1 rocksdb batches, 12 log entries)
I161024 06:51:41.300007 11380 storage/replica_raftstorage.go:580  [s2] [n2,s2,r1/?:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.030s
I161024 06:51:41.326393 10966 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2 
I161024 06:51:41.430656 11375 storage/replica.go:1765  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161024 06:51:41.514999 11455 storage/raft_transport.go:423  raft transport stream to node 1 established
I161024 06:51:41.546919 10966 storage/replica_raftstorage.go:446  [s1,r1/1:/M{in-ax}] generated snapshot 16b1b56a for range 1 at index 34 in 151.451µs.
I161024 06:51:41.603553 10966 storage/store.go:2969  streamed snapshot: kv pairs: 54, log entries: 24
I161024 06:51:41.607691 11481 storage/replica_raftstorage.go:577  [s3] [n3,s3,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 34 (id=16b1b56a, encoded size=16, 1 rocksdb batches, 24 log entries)
I161024 06:51:41.631844 11481 storage/replica_raftstorage.go:580  [s3] [n3,s3,r1/?:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.024s
I161024 06:51:41.752513 10966 storage/replica_command.go:3236  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3 
I161024 06:51:42.161780 11467 storage/replica.go:1765  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA {NodeID:3 StoreID:3 ReplicaID:3} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161024 06:51:42.619013 11577 storage/raft_transport.go:423  raft transport stream to node 1 established
I161024 06:51:42.920303 10966 storage/replica_command.go:2354  initiating a split of this range at key "m" [r2]
W161024 06:51:43.981295 11682 storage/stores.go:218  range not contained in one range: [/Min,"m\x00"), but have [/Min,"m")
W161024 06:51:43.987345 11655 storage/replica.go:1620  [s1,r1/1:{/Min-"m"}] unable to cancel expired Raft command [txn: 8b44ce5f], ConditionalPut [/Local/Range/"m"/RangeDescriptor,/Min), Put [/Meta2/Max,/Min), Put [/Meta2/"m",/Min), Put [/Meta1/Max,/Min), EndTransaction [/Local/Range/""/RangeDescriptor,/Min)
E161024 06:51:45.216154 10966 internal/client/txn.go:326  failure aborting transaction: does not exist; abort caused by: conflict updating range descriptors
I161024 06:51:45.232117 11769 util/stop/stopper.go:425  quiescing; tasks left:
5      storage/client_test.go:503
I161024 06:51:45.235813 11769 util/stop/stopper.go:425  quiescing; tasks left:
4      storage/client_test.go:503
I161024 06:51:45.243490 11769 util/stop/stopper.go:425  quiescing; tasks left:
3      storage/client_test.go:503
I161024 06:51:45.272817 11771 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:503
E161024 06:51:45.273832 11138 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161024 06:51:45.275219 11770 util/stop/stopper.go:425  quiescing; tasks left:
3      storage/client_test.go:503
I161024 06:51:45.299906 11769 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:503
I161024 06:51:45.369998 11770 util/stop/stopper.go:425  quiescing; tasks left:
2      storage/client_test.go:503
I161024 06:51:45.386395 11770 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:503
E161024 06:51:45.387292 11065 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
E161024 06:51:45.405877 11065 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161024 06:51:45.406107 11149 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161024 06:51:45.406331 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
E161024 06:51:45.414787 10983 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
E161024 06:51:45.417587 11327 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: failed to send RPC: store is stopped
W161024 06:51:45.430509 11619 storage/store.go:2834  [s3] got error from range 0, replica {1 1 0}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:0}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:0}
E161024 06:51:45.438260 11327 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161024 06:51:45.445592 11138 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
W161024 06:51:45.482757 11619 storage/store.go:2834  [s3] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:3}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.483429 11619 storage/store.go:2834  [s3] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:3}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.484085 11619 storage/store.go:2834  [s3] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:3}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.493807 11457 storage/store.go:2834  [s2] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:2 StoreID:2 ReplicaID:2}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.494189 11457 storage/store.go:2834  [s2] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:2 StoreID:2 ReplicaID:2}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
W161024 06:51:45.494530 11457 storage/store.go:2834  [s2] got error from range 1, replica {1 1 1}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:2 StoreID:2 ReplicaID:2}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:1}
I161024 06:51:45.514464 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.518655 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.521735 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.524798 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.527787 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.529764 11228 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:48690->127.0.0.1:49755: use of closed network connection
I161024 06:51:45.530204 11244 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161024 06:51:45.531047 11171 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:46145->127.0.0.1:35652: use of closed network connection
I161024 06:51:45.531663 11032 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:46756->127.0.0.1:53541: use of closed network connection
I161024 06:51:45.532204 10990 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:42684->127.0.0.1:36229: use of closed network connection
I161024 06:51:45.532741 11024 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
W161024 06:51:45.533355 11577 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
I161024 06:51:45.534290 10960 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:59774->127.0.0.1:55083: use of closed network connection
I161024 06:51:45.534745 10975 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161024 06:51:45.535211 10977 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:59774: operation was canceled"; Reconnecting to {"127.0.0.1:59774" <nil>}
I161024 06:51:45.535565 10977 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.535982 11246 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:48690: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:48690" <nil>}
I161024 06:51:45.536245 11246 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.536726 11044 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161024 06:51:45.537216 11046 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:46756: operation was canceled"; Reconnecting to {"127.0.0.1:46756" <nil>}
I161024 06:51:45.538111 11046 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.538616 11186 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:46145: operation was canceled"; Reconnecting to {"127.0.0.1:46145" <nil>}
I161024 06:51:45.538897 11186 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161024 06:51:45.539149 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.539857 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.540572 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.541279 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161024 06:51:45.541889 11768 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
    client_raft_test.go:988: storage/replica_command.go:2434: split at key "m" failed: conflict updating range descriptors
@cockroach-teamcity cockroach-teamcity added O-robot Originated from a bot. C-test-failure Broken test (automatically or manually discovered). labels Oct 24, 2016
@tamird
Contributor

tamird commented Oct 26, 2016

Seems distinct from #9673.

@tamird
Contributor

tamird commented Nov 1, 2016

Looks similar to #10172.

@vivekmenezes
Contributor

I'm able to reproduce this issue on GCE with make stress PKG=./pkg/storage TESTS=TestStoreRangeDownReplicate STRESSFLAGS='-stderr -maxfails 1 -p 100'. Assigning to Peter because he has been playing around in this area. Perhaps I used too much stress, but it's worth investigating.

@petermattis
Collaborator

Sort of a bad feedback loop to get assigned flaky tests because I've been fixing other flaky tests in the same area. In this case, the flakiness is an expected error that the test should be handling. I'll fix.
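
For context, handling the expected error generally means retrying the split when it fails with "conflict updating range descriptors", which can occur while a concurrent replica change is rewriting the range descriptor. A minimal Go sketch of such a retry, assuming a hypothetical doSplit callback rather than the test's actual split call (this is not the actual fix):

```go
package storage_test

import (
	"strings"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// splitWithRetry is a hypothetical helper: it keeps retrying a split while it
// fails with the transient "conflict updating range descriptors" error, and
// fails the test on any other error.
func splitWithRetry(t *testing.T, doSplit func() error) {
	testutils.SucceedsSoon(t, func() error {
		err := doSplit()
		switch {
		case err == nil:
			return nil
		case strings.Contains(err.Error(), "conflict updating range descriptors"):
			return err // transient: SucceedsSoon retries until its timeout
		default:
			t.Fatal(err)
			return err // unreachable
		}
	})
}
```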

@vivekmenezes
Contributor

Just retrying the command results in it hitting another error further down in the test. Is this just about handling errors and retrying?

@petermattis
Collaborator

This test probably deserves to be rewritten to use TestCluster, similar to the soon-to-be-added TestReplicateQueueRebalance.
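
For reference, a TestCluster-based down-replication test has roughly the shape sketched below. This is only an illustration, not the actual rewrite; the AddReplicas/LookupRange calls and the descriptor's Replicas field are assumptions whose names and signatures differ across CockroachDB versions.

```go
package storage_test

import (
	"context"
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

// Sketch of a down-replication test on TestCluster: start more nodes than the
// replication factor, over-replicate one range, then wait for the replicate
// queue to bring it back down to the configured factor (3 by default).
func TestDownReplicateSketch(t *testing.T) {
	tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
	})
	defer tc.Stopper().Stop(context.Background())

	key := roachpb.Key("a")
	// Manually add two extra replicas so the range is 5x replicated.
	if _, err := tc.AddReplicas(key, tc.Target(3), tc.Target(4)); err != nil {
		t.Fatal(err)
	}

	// The replicate queue should eventually remove the extra replicas.
	testutils.SucceedsSoon(t, func() error {
		desc, err := tc.LookupRange(key)
		if err != nil {
			return err
		}
		// Note: the replica set was a plain slice (desc.Replicas) in 2016;
		// newer versions expose it via desc.Replicas().Descriptors().
		if n := len(desc.Replicas); n != 3 {
			return fmt.Errorf("range still has %d replicas", n)
		}
		return nil
	})
}
```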

@vivekmenezes
Contributor

I can take that on. Reassigning back to me.

@petermattis
Collaborator

See #10515.

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Nov 11, 2016
…ownReplicate

The new test tests the down replication functionality using TestCluster.

fixes cockroachdb#10536 fixes cockroachdb#10171 fixes cockroachdb#9603
@tamird changed the title from ": TestStoreRangeDownReplicate failed under stress" to "storage: TestStoreRangeDownReplicate is skipped" on Nov 13, 2016
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Nov 18, 2016
…ownReplicate

The new test tests the down replication functionality using TestCluster.

This change also makes the replica_gc_queue queue up replicas that have
been removed reliably. This functionality is tested by the new test.

fixes cockroachdb#10536 fixes cockroachdb#10171 fixes cockroachdb#9603
@vivekmenezes
Contributor

The original test failure seen on this issue was fixed by #10728.

@tamird
Contributor

tamird commented Nov 21, 2016

Can we unskip it and call this good?

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Nov 22, 2016
…ownReplicate

The new test tests the down replication functionality using TestCluster.

fixes cockroachdb#10536 fixes cockroachdb#10171 fixes cockroachdb#9603
@tamird
Contributor

tamird commented Nov 22, 2016

This was actually fixed by #10728, right?

@vivekmenezes
Contributor

yes
