acceptance: localcluster nodes sometimes fail to get a NodeID assigned to them #10409

Closed
cockroach-teamcity opened this issue Nov 2, 2016 · 37 comments
Labels: C-test-failure Broken test (automatically or manually discovered).

@cockroach-teamcity (Member)

The following tests appear to have failed:

#40720:

--- FAIL: acceptance/TestGossipPeerings (462.293s)
Test ended in panic.
I161102 19:31:58.720205 600 acceptance/cluster/localcluster.go:290  creating docker network with name: cockroachdb_acceptance-102ad794
I161102 19:31:58.824013 600 acceptance/cluster/localcluster.go:322  Initializing Cluster AdHoc 3x1:
{"name":"AdHoc 3x1","nodes":[{"count":3,"stores":[{"count":1,"max_ranges":0}]}],"duration":5000000000}
I161102 19:31:58.829739 600 acceptance/cluster/docker.go:107  ImagePull cockroachdb/builder:20161024-143446 already exists
I161102 19:32:00.615904 600 acceptance/cluster/localcluster.go:641  creating certs (1024bit) in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928
I161102 19:32:00.671832 600 acceptance/cluster/localcluster.go:425  creating docker container with name: roach-102ad794-0
I161102 19:32:01.312168 600 acceptance/cluster/localcluster.go:536  *** started roach-102ad794-0 ***
  ui:        https://127.0.0.1:33157
  trace:     https://127.0.0.1:33157/debug/requests
  logs:      /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestGossipPeerings-102ad794/roach-102ad794-0/cockroach.INFO
  pprof:     docker exec -it b1741 /bin/bash -c 'go tool pprof /cockroach <(wget --no-check-certificate -qO- https://$(hostname):8080/debug/pprof/heap)'
  cockroach: [start --ca-cert=/certs/ca.crt --cert=/certs/node.crt --key=/certs/node.key --host=roach-102ad794-0 --alsologtostderr=INFO --verbosity=1 --store=path=/data1.0 --log-dir=/logs/roach-102ad794-0 --logtostderr=false]

  cli-env:   COCKROACH_INSECURE=false COCKROACH_CA_CERT=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/ca.crt COCKROACH_CERT=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/node.crt COCKROACH_KEY=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/node.key COCKROACH_HOST=127.0.0.1 COCKROACH_PORT=33157
I161102 19:32:01.312261 600 acceptance/cluster/localcluster.go:425  creating docker container with name: roach-102ad794-1
I161102 19:32:01.963417 600 acceptance/cluster/localcluster.go:536  *** started roach-102ad794-1 ***
  ui:        https://127.0.0.1:33159
  trace:     https://127.0.0.1:33159/debug/requests
  logs:      /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestGossipPeerings-102ad794/roach-102ad794-1/cockroach.INFO
  pprof:     docker exec -it 32608 /bin/bash -c 'go tool pprof /cockroach <(wget --no-check-certificate -qO- https://$(hostname):8080/debug/pprof/heap)'
  cockroach: [start --ca-cert=/certs/ca.crt --cert=/certs/node.crt --key=/certs/node.key --host=roach-102ad794-1 --alsologtostderr=INFO --verbosity=1 --store=path=/data2.0 --join=roach-102ad794-0:26257 --log-dir=/logs/roach-102ad794-1 --logtostderr=false]

  cli-env:   COCKROACH_INSECURE=false COCKROACH_CA_CERT=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/ca.crt COCKROACH_CERT=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/node.crt COCKROACH_KEY=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/node.key COCKROACH_HOST=127.0.0.1 COCKROACH_PORT=33159
I161102 19:32:01.963507 600 acceptance/cluster/localcluster.go:425  creating docker container with name: roach-102ad794-2
I161102 19:32:02.611323 600 acceptance/cluster/localcluster.go:536  *** started roach-102ad794-2 ***
  ui:        https://127.0.0.1:33161
  trace:     https://127.0.0.1:33161/debug/requests
  logs:      /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestGossipPeerings-102ad794/roach-102ad794-2/cockroach.INFO
  pprof:     docker exec -it 10e43 /bin/bash -c 'go tool pprof /cockroach <(wget --no-check-certificate -qO- https://$(hostname):8080/debug/pprof/heap)'
  cockroach: [start --ca-cert=/certs/ca.crt --cert=/certs/node.crt --key=/certs/node.key --host=roach-102ad794-2 --alsologtostderr=INFO --verbosity=1 --store=path=/data3.0 --join=roach-102ad794-0:26257 --log-dir=/logs/roach-102ad794-2 --logtostderr=false]

  cli-env:   COCKROACH_INSECURE=false COCKROACH_CA_CERT=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/ca.crt COCKROACH_CERT=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/node.crt COCKROACH_KEY=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster.certs.985011928/node.key COCKROACH_HOST=127.0.0.1 COCKROACH_PORT=33161
I161102 19:32:02.611341 600 acceptance/replication_test.go:55  waiting for first range to have 3 replicas
I161102 19:32:03.613792 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161102 19:32:04.616604 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161102 19:32:05.619039 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161102 19:32:06.621365 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161102 19:32:07.624011 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161102 19:32:08.629248 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161102 19:32:09.633077 600 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
panic: test timed out after 10m0s

goroutine 666 [running]:
panic(0x13d3280, 0xc4203efd80)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
testing.startAlarm.func1()
	/usr/local/go/src/testing/testing.go:918 +0x10b
created by time.goFunc
	/usr/local/go/src/time/sleep.go:154 +0x44

goroutine 1 [chan receive, 7 minutes]:
testing.(*T).Run(0xc420424180, 0x156811a, 0x12, 0x16ac130, 0xc42028dc01)
	/usr/local/go/src/testing/testing.go:647 +0x316
testing.RunTests.func1(0xc420424180)
	/usr/local/go/src/testing/testing.go:793 +0x6d
testing.tRunner(0xc420424180, 0xc420044da8)
	/usr/local/go/src/testing/testing.go:610 +0x81
testing.RunTests(0x16aea68, 0x1e8dfe0, 0x24, 0x24, 0xc42028ddf0)
	/usr/local/go/src/testing/testing.go:799 +0x2f5
testing.(*M).Run(0xc420044ef8, 0x16ac1e0)
	/usr/local/go/src/testing/testing.go:743 +0x85
github.com/cockroachdb/cockroach/pkg/acceptance.runTests(0xc420044ef8)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/util_test.go:138 +0x48
github.com/cockroachdb/cockroach/pkg/acceptance.TestMain(0xc420044ef8)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/main_test.go:38 +0x45
main.main()
	github.com/cockroachdb/cockroach/pkg/acceptance/_test/_testmain.go:122 +0xc6

goroutine 17 [syscall, 9 minutes, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 7 [syscall, 9 minutes]:
os/signal.signal_recv(0xc42002ffa8)
	/usr/local/go/src/runtime/sigqueue.go:116 +0x157
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
	/usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 6 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).flushDaemon(0x21f4b60)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1001 +0x77
created by github.com/cockroachdb/cockroach/pkg/util/log.init.1
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:579 +0x95

goroutine 371 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc420502400)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 21 [chan receive, 9 minutes]:
github.com/cockroachdb/cockroach/pkg/acceptance.runTests.func1()
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/util_test.go:129 +0xb6
created by github.com/cockroachdb/cockroach/pkg/acceptance.runTests
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/util_test.go:137 +0x3a

goroutine 600 [select, 7 minutes]:
google.golang.org/grpc/transport.(*Stream).Header(0xc42028c4b0, 0x16ae8a0, 0xc4205711e8, 0x1eb00c0)
	/go/src/google.golang.org/grpc/transport/transport.go:239 +0x299
google.golang.org/grpc.recvResponse(0x1eaa040, 0x222dc30, 0x0, 0x0, 0x0, 0x0, 0x1e9db80, 0xc420380a80, 0x0, 0x0, ...)
	/go/src/google.golang.org/grpc/call.go:62 +0xad
google.golang.org/grpc.Invoke(0x7fa82bf64208, 0xc420014558, 0x1576756, 0x21, 0x15412e0, 0xc4204d6380, 0x14e1480, 0xc420326600, 0xc4204eec00, 0x0, ...)
	/go/src/google.golang.org/grpc/call.go:202 +0x8a1
github.com/cockroachdb/cockroach/pkg/roachpb.(*externalClient).Batch(0xc420144210, 0x7fa82bf64208, 0xc420014558, 0xc4204d6380, 0x0, 0x0, 0x0, 0xc420571500, 0x72b3f8, 0xc42027c870)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:1562 +0xd2
github.com/cockroachdb/cockroach/pkg/internal/client.sender.Send(0x1e97980, 0xc420144210, 0x7fa82bf64208, 0xc420014558, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/rpc_sender.go:48 +0xbf
github.com/cockroachdb/cockroach/pkg/internal/client.(*sender).Send(0xc4203c9a60, 0x7fa82bf64208, 0xc420014558, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	<autogenerated>:4 +0xb0
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).send(0xc420284640, 0x7fa82bf64208, 0xc420014558, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:511 +0xbf
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Run.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:445 +0x6e
github.com/cockroachdb/cockroach/pkg/internal/client.sendAndFill(0xc420552100, 0xc4201c4000, 0x0, 0x1eb0b40)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:416 +0x168
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Run(0xc420284640, 0x7fa82bf64208, 0xc420014558, 0xc4201c4000, 0x8, 0x18)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:447 +0xac
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Get(0xc420284640, 0x7fa82bf64208, 0xc420014558, 0x14b8e40, 0xc4205520e0, 0xc4205520e0, 0x18, 0x18, 0x1, 0xc4205520e0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:214 +0x9a
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).GetProto(0xc420284640, 0x7fa82bf64208, 0xc420014558, 0x14b8e40, 0xc4205520e0, 0x1ea40c0, 0xc4203265a0, 0xc4205719e0, 0x60033c)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:222 +0x69
github.com/cockroachdb/cockroach/pkg/acceptance.countRangeReplicas(0xc420284640, 0xc420219b00, 0x0, 0xc420284601)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/replication_test.go:36 +0x157
github.com/cockroachdb/cockroach/pkg/acceptance.checkRangeReplication.func1(0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/replication_test.go:71 +0x1f4
github.com/cockroachdb/cockroach/pkg/util.RetryForDuration(0xa7a358200, 0xc420571ce0, 0x2, 0x1583329)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/testing.go:129 +0xcd
github.com/cockroachdb/cockroach/pkg/util.SucceedsSoonDepth(0x1, 0x1ead480, 0xc4204243c0, 0xc420571ce0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/testing.go:115 +0x41
github.com/cockroachdb/cockroach/pkg/util.SucceedsSoon(0x1ead480, 0xc4204243c0, 0xc420571ce0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/testing.go:109 +0x48
github.com/cockroachdb/cockroach/pkg/acceptance.checkRangeReplication(0xc4204243c0, 0x1eb1be0, 0xc420336200, 0x4a817c800)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/replication_test.go:83 +0x2d3
github.com/cockroachdb/cockroach/pkg/acceptance.StartCluster(0xc4204243c0, 0xc420515750, 0x9, 0xc42026a320, 0x1, 0x1, 0x12a05f200, 0x1eb1be0, 0xc420336200)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/util_test.go:345 +0x188
github.com/cockroachdb/cockroach/pkg/acceptance.runTestOnConfigs.func1(0xc4204243c0, 0xc420571ec8, 0x16ac228)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/util_test.go:312 +0x4e
github.com/cockroachdb/cockroach/pkg/acceptance.runTestOnConfigs(0xc4204243c0, 0x16ac228)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/util_test.go:315 +0xe2
github.com/cockroachdb/cockroach/pkg/acceptance.TestGossipPeerings(0xc4204243c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/gossip_peerings_test.go:106 +0x37
testing.tRunner(0xc4204243c0, 0x16ac130)
	/usr/local/go/src/testing/testing.go:610 +0x81
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:646 +0x2ec

goroutine 9 [select, 9 minutes, locked to thread]:
runtime.gopark(0x16af320, 0x0, 0x155b925, 0x6, 0x18, 0x2)
	/usr/local/go/src/runtime/proc.go:259 +0x13a
runtime.selectgoImpl(0xc420030f30, 0x0, 0x18)
	/usr/local/go/src/runtime/select.go:423 +0x11d9
runtime.selectgo(0xc420030f30)
	/usr/local/go/src/runtime/select.go:238 +0x1c
runtime.ensureSigM.func1()
	/usr/local/go/src/runtime/signal1_unix.go:304 +0x2f3
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 24 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1d0b0, 0x72, 0x4)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420463720, 0x72, 0xc42003e9d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420463720, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4204636c0, 0xc420469000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420144140, 0xc420469000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420146a00, 0xc420469000, 0x1000, 0x1000, 0x30, 0xc42003eb58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42014cc60)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42014cc60, 0x1, 0x0, 0x1, 0x0, 0xc42014c4e0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420146a00)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 25 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc420146a00)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 244 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1cff0, 0x72, 0xa)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc42036a220, 0x72, 0xc4200409d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc42036a220, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc42036a1c0, 0xc420480000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420510040, 0xc420480000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee400, 0xc420480000, 0x1000, 0x1000, 0x30, 0xc420040b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42031c660)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42031c660, 0x1, 0x0, 0x1, 0x0, 0xc42031cde0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee400)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 47 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1cf30, 0x72, 0x6)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4203d14f0, 0x72, 0xc4200419d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4203d14f0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4203d1490, 0xc420485000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420144128, 0xc420485000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502200, 0xc420485000, 0x1000, 0x1000, 0x30, 0xc420041b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420264fc0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420264fc0, 0x1, 0x0, 0x1, 0x0, 0xc42014ccc0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502200)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 48 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc420502200)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 434 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc420502100)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 255 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee600)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 245 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee400)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 208 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1c9f0, 0x72, 0xe)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc42029a8b0, 0x72, 0xc4205599d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc42029a8b0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc42029a850, 0xc42046a000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc42012a200, 0xc42046a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee700, 0xc42046a000, 0x1000, 0x1000, 0x30, 0xc420559b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42031d9e0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42031d9e0, 0x1, 0x0, 0x1, 0x0, 0xc420166a80, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee700)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 185 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee300)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 171 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc420336100)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 209 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee700)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 514 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc4204eed00)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 353 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c4b0, 0x72, 0x15)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4202780d0, 0x72, 0xc4205b59d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4202780d0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420278070, 0xc42027a000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420510018, 0xc42027a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502100, 0xc42027a000, 0x1000, 0x1000, 0x30, 0xc4205b5b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4200542a0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc4200542a0, 0x1, 0x0, 0x1, 0x0, 0xc4202643c0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502100)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 266 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1c870, 0x72, 0x10)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4201698e0, 0x72, 0xc4202c79d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4201698e0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420169880, 0xc4200f8000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420510148, 0xc4200f8000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502900, 0xc4200f8000, 0x1000, 0x1000, 0x30, 0xc4202c7b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420326a20)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420326a20, 0x1, 0x0, 0x1, 0x0, 0xc420427080, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502900)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 128 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1ccf0, 0x72, 0x7)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420010450, 0x72, 0xc4201389d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420010450, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4200103f0, 0xc42029f000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc42012a188, 0xc42029f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc42000b400, 0xc42029f000, 0x1000, 0x1000, 0x30, 0xc420138b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42028a840)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42028a840, 0x1, 0x0, 0x1, 0x0, 0xc42050e000, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc42000b400)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 198 [chan receive, 9 minutes]:
database/sql.(*DB).connectionOpener(0xc4203a02c0)
	/usr/local/go/src/database/sql/sql.go:730 +0x4a
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:493 +0x1e9

goroutine 170 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1ce70, 0x72, 0x8)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4203d1170, 0x72, 0xc4201339d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4203d1170, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4203d1110, 0xc420282000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420144130, 0xc420282000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420336100, 0xc420282000, 0x1000, 0x1000, 0x30, 0xc420133b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4200fa960)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc4200fa960, 0x1, 0x0, 0x1, 0x0, 0xc42050e000, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420336100)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 254 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1cab0, 0x72, 0xd)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4204628b0, 0x72, 0xc42055f9d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4204628b0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420462850, 0xc420383000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc42012a220, 0xc420383000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee600, 0xc420383000, 0x1000, 0x1000, 0x30, 0xc42055fb58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42050f8c0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42050f8c0, 0x1, 0x0, 0x1, 0x0, 0xc4205a7740, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee600)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 129 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc42000b400)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 663 [select]:
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc4202a2d80, 0xc4204eec00, 0xc4203ee900, 0xf, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:292 +0x650
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1()
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:227 +0x69
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42027c870, 0xc4203d9500)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x7d
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:197 +0x66

goroutine 299 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc420502300)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 199 [chan receive, 9 minutes]:
database/sql.(*DB).connectionOpener(0xc4201711e0)
	/usr/local/go/src/database/sql/sql.go:730 +0x4a
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:493 +0x1e9

goroutine 184 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1cdb0, 0x72, 0x9)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc42036a4c0, 0x72, 0xc4201379d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc42036a4c0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc42036a460, 0xc420560000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc42012a1a8, 0xc420560000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee300, 0xc420560000, 0x1000, 0x1000, 0x30, 0xc420137b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4200fb500)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc4200fb500, 0x1, 0x0, 0x1, 0x0, 0xc42050e600, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee300)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 298 [IO wait, 9 minutes]:
net.runtime_pollWait(0x7fa82bf1c930, 0x72, 0xf)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420010ae0, 0x72, 0xc4205799d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420010ae0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420010a80, 0xc420324000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc4201441b0, 0xc420324000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502300, 0xc420324000, 0x1000, 0x1000, 0x30, 0xc420579b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4202ab1a0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc4202ab1a0, 0x1, 0x0, 0x1, 0x0, 0xc4201ca9c0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502300)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 481 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c1b0, 0x72, 0x19)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420209100, 0x72, 0xc4205779d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420209100, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4202090a0, 0xc42039f000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420144218, 0xc42039f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204eed00, 0xc42039f000, 0x1000, 0x1000, 0x30, 0xc420577b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42028b740)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42028b740, 0x1, 0x0, 0x1, 0x0, 0xc4204d32c0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204eed00)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 370 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c6f0, 0x72, 0x12)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4202d8c30, 0x72, 0xc4202c39d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4202d8c30, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4202d8bd0, 0xc4202d6000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc42012a238, 0xc4202d6000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502400, 0xc4202d6000, 0x1000, 0x1000, 0x30, 0xc4202c3b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420327aa0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420327aa0, 0x1, 0x0, 0x1, 0x0, 0xc42050eea0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502400)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 362 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c630, 0x72, 0x13)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420463170, 0x72, 0xc42055e9d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420463170, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420463110, 0xc420506000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420510108, 0xc420506000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee800, 0xc420506000, 0x1000, 0x1000, 0x30, 0xc42055eb58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc42050f800)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc42050f800, 0x1, 0x0, 0x1, 0x0, 0xc42014d800, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee800)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 267 [select, 9 minutes]:
net/http.(*persistConn).writeLoop(0xc420502900)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 303 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c7b0, 0x72, 0x11)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420011f00, 0x72, 0xc4202c59d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420011f00, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420011ea0, 0xc420206000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420510138, 0xc420206000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee500, 0xc420206000, 0x1000, 0x1000, 0x30, 0xc4202c5b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420427800)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420427800, 0x1, 0x0, 0x1, 0x0, 0xc4205c66c0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee500)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 304 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee500)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 363 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee800)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 369 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c570, 0x72, 0x14)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4204d6d80, 0x72, 0xc4205759d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4204d6d80, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4204d6d20, 0xc42033e000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc4204f20d0, 0xc42033e000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204eeb00, 0xc42033e000, 0x1000, 0x1000, 0x30, 0xc420575b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420219860)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420219860, 0x1, 0x0, 0x1, 0x0, 0xc42026c720, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204eeb00)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 526 [select, 7 minutes]:
net/http.(*persistConn).writeLoop(0xc4204ee900)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 602 [select, 7 minutes]:
net/http.(*persistConn).readLoop(0xc420336300)
	/usr/local/go/src/net/http/transport.go:1541 +0x9c9
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 402 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc4204eeb00)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 525 [IO wait, 7 minutes]:
net.runtime_pollWait(0x7fa82bf1c330, 0x72, 0x18)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420279410, 0x72, 0xc4201349d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420279410, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4202793b0, 0xc4204e1000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc4204f20e8, 0xc4204e1000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc4204ee900, 0xc4204e1000, 0x1000, 0x1000, 0x30, 0xc420134b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420327440)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420327440, 0x1, 0x0, 0x1, 0x0, 0xc4203264e0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc4204ee900)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 447 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c3f0, 0x72, 0x16)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420208ae0, 0x72, 0xc4202c49d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420208ae0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420208a80, 0xc42026e000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc4204f2088, 0xc42026e000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502700, 0xc42026e000, 0x1000, 0x1000, 0x30, 0xc4202c4b58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc420569140)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420569140, 0x1, 0x0, 0x1, 0x0, 0xc420218b40, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502700)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 448 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc420502700)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 553 [IO wait, 7 minutes]:
net.runtime_pollWait(0x7fa82bf252b0, 0x72, 0x1a)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420278760, 0x72, 0xc42037b818, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420278760, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420278700, 0xc420340000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420144150, 0xc420340000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420336300, 0xc420340000, 0x1000, 0x1000, 0x9cc868, 0xc4205298a8, 0x7d)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4203270e0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).ReadSlice(0xc4203270e0, 0x10a, 0x2, 0x0, 0x199, 0x186, 0x186)
	/usr/local/go/src/bufio/bufio.go:330 +0xb5
net/http/internal.readChunkLine(0xc4203270e0, 0x5d3659, 0xc420369d38, 0x0, 0x1, 0xc42037ba88)
	/usr/local/go/src/net/http/internal/chunked.go:110 +0x34
net/http/internal.(*chunkedReader).beginChunk(0xc4201d89c0)
	/usr/local/go/src/net/http/internal/chunked.go:47 +0x32
net/http/internal.(*chunkedReader).Read(0xc4201d89c0, 0xc420310001, 0x5ff, 0x5ff, 0x619db0, 0xc42037bb70, 0xc42037bb78)
	/usr/local/go/src/net/http/internal/chunked.go:77 +0x7b
net/http.(*body).readLocked(0xc4201407c0, 0xc420310001, 0x5ff, 0x5ff, 0x2, 0x2, 0x0)
	/usr/local/go/src/net/http/transfer.go:651 +0x61
net/http.(*body).Read(0xc4201407c0, 0xc420310001, 0x5ff, 0x5ff, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transfer.go:643 +0xf6
net/http.(*bodyEOFSignal).Read(0xc420140840, 0xc420310001, 0x5ff, 0x5ff, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transport.go:1974 +0x116
encoding/json.(*Decoder).refill(0xc420529860, 0xa, 0x9)
	/usr/local/go/src/encoding/json/stream.go:152 +0xfa
encoding/json.(*Decoder).readValue(0xc420529860, 0x0, 0x0, 0x80)
	/usr/local/go/src/encoding/json/stream.go:128 +0x198
encoding/json.(*Decoder).Decode(0xc420529860, 0x13a1f80, 0xc420148100, 0x0, 0x0)
	/usr/local/go/src/encoding/json/stream.go:57 +0x8e
github.com/docker/docker/client.(*Client).Events.func1(0xc420556060, 0xc420278540, 0x0, 0x0, 0x0, 0x0, 0xc4201d83f0, 0x7fa82bf23670, 0xc4201b7080, 0xc420556000)
	/go/src/github.com/docker/docker/client/events.go:50 +0x2a7
created by github.com/docker/docker/client.(*Client).Events
	/go/src/github.com/docker/docker/client/events.go:63 +0xe2

goroutine 679 [chan receive, 7 minutes]:
github.com/cockroachdb/cockroach/pkg/rpc.NewContext.func1()
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:138 +0x76
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42027c870, 0xc4203171a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x7d
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:197 +0x66

goroutine 681 [select, 7 minutes]:
google.golang.org/grpc/transport.(*http2Client).controller(0xc420336500)
	/go/src/google.golang.org/grpc/transport/http2_client.go:999 +0x59d
created by google.golang.org/grpc/transport.newHTTP2Client
	/go/src/google.golang.org/grpc/transport/http2_client.go:233 +0xe78

goroutine 501 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7fa82bf1c270, 0x72, 0x17)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420203330, 0x72, 0xc42003f9d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420203330, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc4202032d0, 0xc420341000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc4204f20f0, 0xc420341000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420502b00, 0xc420341000, 0x1000, 0x1000, 0x30, 0xc42003fb58, 0x60033c)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4204d3aa0)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc4204d3aa0, 0x1, 0x0, 0x1, 0x0, 0xc420326ae0, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420502b00)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 502 [select, 8 minutes]:
net/http.(*persistConn).writeLoop(0xc420502b00)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 604 [select, 7 minutes]:
github.com/cockroachdb/cockroach/pkg/acceptance/cluster.(*LocalCluster).monitor.func1(0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/cluster/localcluster.go:608 +0x4aa
github.com/cockroachdb/cockroach/pkg/acceptance/cluster.(*LocalCluster).monitor(0xc420336200)
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/cluster/localcluster.go:628 +0x61
created by github.com/cockroachdb/cockroach/pkg/acceptance/cluster.(*LocalCluster).Start
	/go/src/github.com/cockroachdb/cockroach/pkg/acceptance/cluster/localcluster.go:652 +0x57e

goroutine 603 [select, 7 minutes]:
net/http.(*persistConn).writeLoop(0xc420336300)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 606 [IO wait, 7 minutes]:
net.runtime_pollWait(0x7fa82bf24d70, 0x72, 0x1b)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420153bf0, 0x72, 0xc4204ce9d0, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420153bf0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420153b90, 0xc420541000, 0x1000, 0x1000, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc4204f20b0, 0xc420541000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc42000a800, 0xc420541000, 0x1000, 0x1000, 0x8a0d50, 0xc4204ceb58, 0x5c90ad)
	/usr/local/go/src/net/http/transport.go:1261 +0x154
bufio.(*Reader).fill(0xc4204c6300)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc4204c6300, 0x1, 0xc4204cebbd, 0x1, 0x0, 0xc420369e60, 0x0)
	/usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc42000a800)
	/usr/local/go/src/net/http/transport.go:1418 +0x1a1
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1062 +0x4e9

goroutine 607 [select, 7 minutes]:
net/http.(*persistConn).writeLoop(0xc42000a800)
	/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 680 [IO wait]:
net.runtime_pollWait(0x7fa82bf25670, 0x72, 0x1c)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc420279560, 0x72, 0xc420576828, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc420279560, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc420279500, 0xc4204e2c00, 0x400, 0x400, 0x0, 0x1e9e980, 0xc420014068)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420144228, 0xc4204e2c00, 0x400, 0x400, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
crypto/tls.(*block).readFromUntil(0xc420453560, 0x7fa82bf65ec0, 0xc420144228, 0x5, 0xc420144228, 0xc420348100)
	/usr/local/go/src/crypto/tls/conn.go:476 +0x91
crypto/tls.(*Conn).readRecord(0xc42016ee00, 0x16af417, 0xc42016ef08, 0xc4203efd40)
	/usr/local/go/src/crypto/tls/conn.go:578 +0xc4
crypto/tls.(*Conn).Read(0xc42016ee00, 0xc4204d8000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:1113 +0x116
bufio.(*Reader).fill(0xc4205c7380)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Read(0xc4205c7380, 0xc4201e0578, 0x9, 0x9, 0xc420576ca8, 0x619eb0, 0xc420358d80)
	/usr/local/go/src/bufio/bufio.go:209 +0x1bc
io.ReadAtLeast(0x1e97000, 0xc4205c7380, 0xc4201e0578, 0x9, 0x9, 0x9, 0xc4201d2460, 0xc4201d2458, 0x1e9adc0)
	/usr/local/go/src/io/io.go:307 +0xa4
io.ReadFull(0x1e97000, 0xc4205c7380, 0xc4201e0578, 0x9, 0x9, 0xc420576eb0, 0xee1f10, 0xc42028d2c0)
	/usr/local/go/src/io/io.go:325 +0x58
golang.org/x/net/http2.readFrameHeader(0xc4201e0578, 0x9, 0x9, 0x1e97000, 0xc4205c7380, 0x0, 0xc400000000, 0xc420348100, 0xc)
	/go/src/golang.org/x/net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc4201e0540, 0x0, 0x0, 0x0, 0x0)
	/go/src/golang.org/x/net/http2/frame.go:464 +0xa4
google.golang.org/grpc/transport.(*framer).readFrame(0xc420453ec0, 0xc42017da10, 0xc42017da10, 0x0, 0x0)
	/go/src/google.golang.org/grpc/transport/http_util.go:505 +0x2f
google.golang.org/grpc/transport.(*http2Client).reader(0xc420336500)
	/go/src/google.golang.org/grpc/transport/http2_client.go:921 +0xb6
created by google.golang.org/grpc/transport.newHTTP2Client
	/go/src/google.golang.org/grpc/transport/http2_client.go:203 +0x8b1

goroutine 662 [select, 7 minutes]:
google.golang.org/grpc.(*addrConn).transportMonitor(0xc4201fc500)
	/go/src/google.golang.org/grpc/clientconn.go:720 +0x63e
google.golang.org/grpc.(*ClientConn).resetAddrConn.func1(0xc4201fc500)
	/go/src/google.golang.org/grpc/clientconn.go:479 +0x1de
created by google.golang.org/grpc.(*ClientConn).resetAddrConn
	/go/src/google.golang.org/grpc/clientconn.go:480 +0x3c7

Please assign, take a look and update the issue accordingly.
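
Note on the trace: goroutine 600 is the test goroutine, and it is not spinning in its retry loop; it has been blocked for 7 minutes inside a single client.(*DB).Get (via countRangeReplicas) waiting on a gRPC response header that never arrives, so util.RetryForDuration never regains control and only the global 10-minute test alarm can fire. A minimal, self-contained sketch of that dynamic (slowGet is a hypothetical stand-in for the hung KV call, not CockroachDB code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowGet stands in for the hung client.(*DB).Get in goroutine 600: it
// blocks until its context is cancelled, like a gRPC stream that never
// delivers a response header.
func slowGet(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// With context.Background() this call would block until the `go test`
	// 10m alarm panics the whole binary -- the failure above. With a
	// deadline, the error surfaces to the caller, and a SucceedsSoon-style
	// retry loop could fail fast or retry instead.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	fmt.Println(slowGet(ctx)) // context deadline exceeded
}
```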

@jordanlewis (Member)

I noticed something interesting about this failure mode by examining the node logs produced by the failing tests and comparing them to those produced by passing tests.

In failing tests, the node logs are full of complaints about missing ranges or insufficient nodes:

E161103 20:02:11.194962 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:11.195373 155 storage/queue.go:568  [n2,timeSeriesMaintenance] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:11.398045 152 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r5/2:/{Table/14-Max}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:11.398176 168 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r5/2:/{Table/14-Max}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:11.598533 152 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r4/2:/Table/1{3-4}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:11.796616 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r2/?:/Table/1{1-2}]: [n2,s2,r2/?:/Table/1{1-2}]: could not obtain lease: range 2 was not found
E161103 20:02:11.997039 155 storage/queue.go:568  [n2,timeSeriesMaintenance] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:11.997158 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:12.030089 168 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r5/2:/{Table/14-Max}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:12.030179 168 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r4/2:/Table/1{3-4}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:12.397596 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r2/?:/Table/1{1-2}]: [n2,s2,r2/?:/Table/1{1-2}]: could not obtain lease: range 2 was not found
E161103 20:02:12.774816 168 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r5/2:/{Table/14-Max}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:12.774951 168 storage/queue.go:579  [n2,replicate] (purgatory) on [n2,s2,r4/2:/Table/1{3-4}]: 0 of 2 stores with an attribute matching []; likely not enough nodes in cluster
E161103 20:02:13.198830 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r2/?:/Table/1{1-2}]: [n2,s2,r2/?:/Table/1{1-2}]: could not obtain lease: range 2 was not found
E161103 20:02:13.399122 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:13.399228 155 storage/queue.go:568  [n2,timeSeriesMaintenance] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:13.799951 155 storage/queue.go:568  [n2,timeSeriesMaintenance] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:13.800030 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r1/?:/{Min-Table/11}]: [n2,s2,r1/?:/{Min-Table/11}]: could not obtain lease: range 1 was not found
E161103 20:02:14.200794 152 storage/queue.go:568  [n2,replicate] on [n2,s2,r2/?:/Table/1{1-2}]: [n2,s2,r2/?:/Table/1{1-2}]: could not obtain lease: range 2 was not found

and

E161103 20:02:02.059187 132 storage/queue.go:579  [n1,replicate] (purgatory) on [n1,s1,r4/1:/Table/1{3-4}]: 0 of 1 store with an attribute matching []; likely not enough nodes in cluster

and, at the end of the test,

W161103 20:02:48.979540 291 gossip/client.go:120  [n2] node 2: failed to start gossip client: rpc error: code = 14 desc = grpc: the connection is unavailable
W161103 20:02:48.979559 103 kv/dist_sender.go:1136  [n2,hb] RPC error: rpc error: code = 14 desc = grpc: the connection is unavailable
W161103 20:02:49.023125 105 kv/dist_sender.go:1136  [n2,ts-poll,range-lookup] RPC error: rpc error: code = 14 desc = grpc: the connection is unavailable
W161103 20:02:49.024286 106 kv/dist_sender.go:1136  [n2,summaries] RPC error: rpc error: code = 14 desc = grpc: the connection is unavailable

Passing tests do not contain any log messages like these. Perhaps there is an infrequently-triggered bug in cluster startup?
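
For anyone comparing runs the same way, here is a throwaway sketch of that triage step: scan each node's cockroach.INFO under the artifacts directory and count lines matching the error signatures above. The glob pattern and signatures are assumptions taken from the excerpts in this thread, not part of the repo:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"regexp"
)

// badSigns matches the error signatures that showed up only in failing runs.
var badSigns = regexp.MustCompile(
	`could not obtain lease: range \d+ was not found` +
		`|likely not enough nodes in cluster` +
		`|failed to start gossip client`)

func main() {
	if len(os.Args) < 2 {
		fmt.Println(`usage: logscan 'artifacts/acceptance/*/roach-*/cockroach.INFO'`)
		return
	}
	paths, err := filepath.Glob(os.Args[1])
	if err != nil {
		fmt.Println("bad glob:", err)
		return
	}
	for _, path := range paths {
		f, err := os.Open(path)
		if err != nil {
			continue
		}
		n := 0
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			if badSigns.MatchString(sc.Text()) {
				n++
			}
		}
		f.Close()
		fmt.Printf("%s: %d suspicious lines\n", path, n)
	}
}
```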

@tamird (Contributor) commented Nov 3, 2016

I think Peter just fixed this.

@jordanlewis
Copy link
Member

I can reproduce it on current master.

@jordanlewis
Copy link
Member

Okay, digging a little deeper, I notice that on failing runs one of the nodes fails to get assigned a node ID. Later in the logs of that node, we see the following connection errors:

I161103 20:02:48.979169 37 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.18.0.2:26257: getsockopt: connection refused"; Reconnecting to {"roach-ab05ac12-0:26257" <nil>}
W161103 20:02:48.979206 110 gossip/client.go:120  [n?] node 0: failed to start gossip client: rpc error: code = 14 desc = grpc: the connection is unavailable

Seems like some kind of issue with docker networking, perhaps?

@tamird
Copy link
Contributor

tamird commented Nov 3, 2016

How are you reproducing it? Is it specific to Azure? Why is gossip not retrying? Is it failing on retry? That line is only logged on the first attempt AFAICT.

@jordanlewis
Copy link
Member

I reproduced it by running the acceptance tests again on a TeamCity agent. I haven't tried reproducing it outside of Azure, but the logs I'm looking at were captured from one of the runs linked above.

It does seem like gossip gets retried, and it also seems like the node thinks it's connected via gossip:

I161103 20:02:37.128178 76 gossip/gossip.go:1071  [n?] node has connected to cluster via gossip
I161103 20:02:37.128328 76 storage/stores.go:312  [n?] wrote 2 node addresses to persistent storage
I161103 20:02:37.775265 84 gossip/client.go:225  [n?] received [cluster-id first-range sentinel store:1 store:2 system-db liveness:1 liveness:2 node:1 node:2] from node 1 (1 fresh)
<snip>
W161103 20:02:48.977447 76 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161103 20:02:48.979169 37 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.18.0.2:26257: getsockopt: connection refused"; Reconnecting to {"roach-ab05ac12-0:26257" <nil>}
W161103 20:02:48.979206 110 gossip/client.go:120  [n?] node 0: failed to start gossip client: rpc error: code = 14 desc = grpc: the connection is unavailable

@jordanlewis
Copy link
Member

Here is a gist containing the full node log: https://gist.github.com/jordanlewis/919306a9c66fa882d93e40cb2e0fd489

@jordanlewis
Copy link
Member

Just confirmed this is reproducible outside of Azure. I reproduced it on the first try on my dev machine.

@irfansharif
Copy link
Contributor

Running into this as well here.

@jordanlewis jordanlewis changed the title teamcity: failed tests on master: acceptance/TestGossipPeerings acceptance: localcluster nodes sometimes fail to get a NodeID assigned to them Nov 3, 2016
@jordanlewis
Copy link
Member

I reproduced the failure with --verbosity=3 set on each node, producing this log. I noticed that on line 432 there's an attempt to get a new node ID by sending a 1 Inc command via the dist_sender, which fails; this seems to be the root cause of the node being unable to acquire a NodeID. The errors about being unable to dial or connect to various nodes are red herrings caused by the shutdown of the test after a 45-second timeout. dist_sender produces the following errors and doesn't seem to recover:

I161103 22:07:56.243154 1 kv/leaseholder_cache.go:61  [n?] lookup lease holder for range 1: not found
I161103 22:07:56.243187 1 kv/dist_sender.go:1081  [n?] sending RPC for batch: 1 Inc
I161103 22:07:56.243216 1 kv/transport.go:168  [n?] sending request to roach-e9340988-0:26257: Increment [/System/"node-idgen",/Min)
I161103 22:07:56.743831 1 kv/dist_sender.go:1096  [n?] timeout, trying next peer
I161103 22:07:56.743897 1 kv/transport.go:168  [n?] sending request to roach-e9340988-1:26257: Increment [/System/"node-idgen",/Min)
I161103 22:07:56.745562 1 kv/dist_sender.go:1106  [n?] RPC reply: (err: range 1: replica {0 0 0} not lease holder; node_id:1 store_id:1 replica_id:1  is)
I161103 22:07:56.745616 1 kv/leaseholder_cache.go:61  [n?] lookup lease holder for range 1: not found
I161103 22:07:56.745924 1 kv/dist_sender.go:1229  [n?] range 1: caching new lease holder {NodeID:1 StoreID:1 ReplicaID:1}
I161103 22:07:56.745958 1 kv/leaseholder_cache.go:81  [n?] updating lease holder for range 1: {1 1 1}

Have there been any changes to dist_sender or gossip recently that could have produced or exacerbated this behavior?
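For reference, the NodeID allocation that is failing here boils down to an atomic increment of a shared counter key. A minimal Go sketch follows, assuming a hypothetical kvDB interface rather than CockroachDB's actual client API; the point is that if the KV layer never answers, the allocation never completes and the node stays at n?:

```go
package allocator

import (
	"context"
	"fmt"
	"time"
)

// kvDB abstracts the one KV operation the allocation needs: atomically
// add delta to the value at key and return the new value.
type kvDB interface {
	Inc(ctx context.Context, key string, delta int64) (int64, error)
}

// allocateNodeID obtains a fresh NodeID by incrementing the cluster-wide
// "node-idgen" counter, retrying transient errors with a small backoff.
func allocateNodeID(ctx context.Context, db kvDB) (int64, error) {
	for {
		id, err := db.Inc(ctx, "node-idgen", 1)
		if err == nil {
			return id, nil
		}
		select {
		case <-ctx.Done():
			return 0, fmt.Errorf("allocating NodeID: %v", ctx.Err())
		case <-time.After(50 * time.Millisecond):
			// brief backoff, then retry the increment
		}
	}
}
```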

@bdarnell
Copy link
Contributor

bdarnell commented Nov 4, 2016

The most recent change that might apply is a477ff1. It looks like we try the first node, get a timeout, then try the second one and get redirected back to the first one. But there's never any indication that we either make a second attempt or return an error to the caller. I think something may be getting stuck because the first request is still pending when it's told to retry.
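To picture the mechanism under suspicion, here is a toy version of the timeout-then-try-next-peer loop (invented names, not the actual DistSender code). The delicate part is that a timed-out request is not canceled; it stays pending while the next peer is tried, and every outstanding reply must still be collected, or the batch silently stalls as described:

```go
package transport

import (
	"context"
	"time"
)

// sendToReplicas tries each replica address in order. A replica gets
// sendNextTimeout to respond before the same request is also fired at the
// next one; the first successful reply wins, while slower attempts remain
// pending in the background and are still collected.
func sendToReplicas(ctx context.Context, replicas []string,
	send func(context.Context, string) (string, error)) (string, error) {

	type result struct {
		reply string
		err   error
	}
	const sendNextTimeout = 500 * time.Millisecond
	results := make(chan result, len(replicas)) // buffered: pending goroutines never block
	pending := 0

	for next := 0; ; {
		if next < len(replicas) {
			addr := replicas[next]
			next++
			pending++
			go func() {
				reply, err := send(ctx, addr)
				results <- result{reply, err}
			}()
		}
		select {
		case res := <-results:
			pending--
			if res.err == nil {
				return res.reply, nil
			}
			if pending == 0 && next == len(replicas) {
				return "", res.err // every replica was tried and failed
			}
			// otherwise loop: an error means "try the next peer" immediately
		case <-time.After(sendNextTimeout):
			// Loop around and fire at the next peer, leaving the slow
			// request pending. If none remain, keep waiting.
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}
```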

@petermattis
Copy link
Collaborator

Were you able to make sense of those numbers?

I haven't yet tracked down why the raft log size is that large. Or are you asking why the truncation was occurring? We truncate whenever the raft log size is larger than the range size.

Why again are we denying the snapshot? We don't have one active, do we?

Yes, there was an active snapshot for range 2 that was denying the snapshot for range 1. The range 2 snapshot was not completing because range 1 was below quorum: it had been up-replicated to 2 replicas, but the 2nd replica is behind, so no operations on range 1 can proceed. Hence my suggestion that we never delay snapshots for range 1. Another thought is that we shouldn't up-replicate from 1 to 2 replicas until there is a 3rd node operational in the cluster. That wouldn't fix this problem, but it seems like good practice.
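To make that interaction concrete, here is a minimal sketch of a store-level one-snapshot-at-a-time reservation of the kind described (illustrative names, not the actual CockroachDB code). With a single global slot, a snapshot that can never complete on one range starves every other range, including the range whose progress the stuck snapshot itself depends on:

```go
package storage

import "sync"

// snapshotGate lets a store generate at most one outgoing snapshot at a
// time, mirroring the behavior discussed in this thread.
type snapshotGate struct {
	mu          sync.Mutex
	activeRange int64 // 0 means no snapshot is currently being generated
}

// tryReserve claims the single snapshot slot for rangeID, or reports that
// another range already holds it. If the holder's snapshot can never
// complete, every other range is denied indefinitely.
func (g *snapshotGate) tryReserve(rangeID int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.activeRange != 0 && g.activeRange != rangeID {
		return false
	}
	g.activeRange = rangeID
	return true
}

// release frees the slot once the snapshot has been applied (or abandoned).
func (g *snapshotGate) release(rangeID int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.activeRange == rangeID {
		g.activeRange = 0
	}
}
```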

I still need to track down why the raft log size is so large that we're performing a truncation. And I'm not clear why we're allowing the truncation there in any case. We just added a second node, but the RaftStatus.Commit index is past that new node's index even though both nodes are needed for quorum. Perhaps there is another bit of information we should be looking at in computeTruncatableIndexes.

@bdarnell Not sure if you've been following along, but would appreciate your thoughts here.

@tamird
Copy link
Contributor

tamird commented Nov 5, 2016

Isn't the truncation happening between the preemptive snapshot and the conf change? Why is that unexpected? The raft group didn't yet know about the new replica, so it doesn't make any attempt to preserve the portion of the raft log that the new replica needs.


@petermattis
Copy link
Collaborator

Isn't the truncation happening between the preemptive snapshot and the conf change?

No, the truncation is happening after the conf change. Before the conf change completes we prevent any truncation from occurring using the Replica.setPendingSnapshotIndex mechanism.

@petermattis
Copy link
Collaborator

No, the truncation is happening after the conf change.

Oops, I was mistaken. The truncation is happening before the conf change. The current computeTruncatableIndex logic always truncates to the raft commit index when the raft log size is greater than the range size, ignoring the pending snapshot index. It is arguable whether this logic should be changed as we're always supposed to be able to bring a replica up to date via a snapshot.
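Schematically, the logic being described reduces to something like this (a simplified sketch with illustrative parameters, not the exact signature):

```go
package storage

// computeTruncatableIndex sketches the decision under discussion:
// normally the log is preserved for the slowest follower, but once it
// outgrows the range's target size we truncate straight to the commit
// index, ignoring any pending preemptive snapshot that still needs
// earlier entries.
func computeTruncatableIndex(commitIndex, behindIndex uint64,
	raftLogSize, targetSize int64) uint64 {
	if raftLogSize > targetSize {
		return commitIndex // oversized log: truncate aggressively
	}
	return behindIndex // keep everything the slowest follower still needs
}
```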

The raft log size is exceeding the range size because it is filled with DeleteRange operations for time series culling which are deleting nothing because the cluster is brand new. The relatively large size of the raft log is due to a handful of large entries for time series merge operations.

@tamird
Copy link
Contributor

tamird commented Nov 5, 2016

It is arguable whether this logic should be changed as we're always supposed to be able to bring a replica up to date via a snapshot.

I would argue that it should be changed; the deadlock here is obvious now, and this would solve it.

It's either that, or we always need to allow snapshot generation for range 1, or we generally need to allow more than one snapshot to be generated at a time (#7307).

@petermattis
Copy link
Collaborator

I would argue that it should be changed; the deadlock here is obvious now, and this would solve it.

I'm not sure it solves the problem, though. I think there would still be a window after committing the conf change where the raft log could be truncated before the new replica is brought up to date.

It's either that, or we always need to allow snapshot generation for range 1, or we generally need to allow more than one snapshot to be generated at a time (#7307).

I'm not sure always allowing snapshot generation for range 1 solves the problem, either. (Though it certainly solves it for the current crop of flaky tests.) I think we would need to allow snapshot generation for any range containing meta1 or meta2 keys. That's easy enough to detect from the range descriptor. Allowing more than one snapshot to be generated at a time doesn't solve the problem, as whatever you set N to, you can imagine scenarios where that N is consumed and we need to snapshot a meta range to make progress.
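The descriptor check suggested here is cheap. A hedged sketch, assuming the meta1/meta2 addressing keys occupy a contiguous prefix of the key space with some exclusive upper bound metaMax (the constant below is illustrative, not the real key layout):

```go
package storage

import "bytes"

// metaMax is an assumed exclusive upper bound of the meta1/meta2 key
// space; in a real system this would come from the keys package.
var metaMax = []byte("\x04")

// containsMetaKeys reports whether a range [startKey, endKey) overlaps
// the meta key space. Because meta keys sit at the very start of the key
// space, checking the range's start key suffices.
func containsMetaKeys(startKey []byte) bool {
	return bytes.Compare(startKey, metaMax) < 0
}
```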

@tamird
Copy link
Contributor

tamird commented Nov 5, 2016

I'm not sure it solves the problem, though. I think there would still be a window after committing the conf change where the raft log could be truncated before the new replica is brought up to date.

How? getBehindIndex takes into account each follower's index; after the conf change this ought to include the new replica and prevent truncation behind its committed index.

I'm not sure always allowing snapshot generation for range 1 solves the problem, either. (Though it certainly solves it for the current crop of flaky tests.) I think we would need to allow snapshot generation for any range containing meta1 or meta2 keys. That's easy enough to detect from the range descriptor. Allowing more than one snapshot to be generated at a time doesn't solve the problem, as whatever you set N to, you can imagine scenarios where that N is consumed and we need to snapshot a meta range to make progress.

Agreed.

@petermattis
Copy link
Collaborator

How? getBehindIndex takes into account each follower's index; after the conf change this ought to include the new replica and prevent truncation behind its committed index.

We don't call getBehindIndex if the raft log size is greater than the target size. It was assumed that the commit index would be sufficient, but that isn't true if we've just up-replicated from 1 to 2. I've just verified that after such an up-replication, the commit index can be ahead of the quorum index (the largest index that a quorum of the nodes are at) until the new replica is brought up to date.
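That observation points at the eventual fix (see the commits below): compute the quorum index from the per-replica match indexes instead of trusting raft.Status.Commit, and fold in the pending snapshot index. A sketch with illustrative names:

```go
package storage

import "sort"

// computeQuorumIndex returns the highest log index that a quorum of
// replicas has reached, capped by an in-flight snapshot's index so the
// log is never truncated out from under it.
func computeQuorumIndex(matchIndexes []uint64, pendingSnapshotIndex uint64) uint64 {
	if len(matchIndexes) == 0 {
		return 0
	}
	indexes := append([]uint64(nil), matchIndexes...)
	// Sort descending; indexes[quorum-1] is then the largest index that
	// at least a quorum of replicas has reached.
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })
	quorum := len(indexes)/2 + 1
	quorumIndex := indexes[quorum-1]
	if pendingSnapshotIndex != 0 && pendingSnapshotIndex < quorumIndex {
		quorumIndex = pendingSnapshotIndex
	}
	return quorumIndex
}
```

For a just-up-replicated group with match indexes [10, 5], this returns 5, whereas raft.Status.Commit can still report 10 from the single-replica era; that is exactly the gap observed above.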

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 5, 2016
Preserve the raft log, no matter how large it grows, during
up-replication.

Fixes cockroachdb#10409
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 6, 2016
Compute the index at which a quorum of nodes have committed instead of
using raft.Status.Commit. The latter can be in advance of the computed
quorum commit index just after a replica has been added to the
group. And by computing the value ourselves we can include the pending
snapshot index in the computation.

Fixes cockroachdb#10409
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 7, 2016
Previously we were releasing the snapshot (i.e. calling
`Replica.CloseOutSnap()`) when the ChangeReplicas operation
completed. Now we release the snapshot as soon as the remote node has
applied it. This is important to allow other ranges to make progress
which might be required for the current ChangeReplicas operation to
complete.

Fixes cockroachdb#10483
Fixes cockroachdb#10306
See cockroachdb#10409