roachtest: acceptance/bank/node-restart failed #35326

Closed
cockroach-teamcity opened this issue Mar 2, 2019 · 29 comments · Fixed by #40997
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.


@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1211: stall detected at round 5, no forward progress for 30s
	cluster.go:1152,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1158877-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1158877-acceptance: starting.............................: signal: killed

@cockroach-teamcity added the C-test-failure, O-roachtest, and O-robot labels Mar 2, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog

The test failed on master:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1211: stall detected at round 3, no forward progress for 30s
	cluster.go:1192,bank.go:247: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1160394-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1160394-acceptance: stopping and waiting..............................: signal: killed

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog

The test failed on master:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1214: stall detected at round 5, no forward progress for 30s
	cluster.go:1170,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1170795-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1170795-acceptance: starting.............................: signal: killed

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/5b36cc6276340282cb333ff4a9cb4f1fbd6c3348

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189990&tab=buildLog

The test failed on release-2.1:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1214: stall detected at round 2, no forward progress for 30s
	cluster.go:1193,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1189990-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1189990-acceptance: starting.............................: signal: killed

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/7f8a0969e8e9eb7e9fc0d2fe96e03849d30dd561

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199677&tab=buildLog

The test failed on release-19.1:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1214: stall detected at round 1, no forward progress for 30s
	cluster.go:1193,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1199677-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1199677-acceptance: starting.............................: signal: killed

@tbg commented Apr 1, 2019

It's curious that every time this test fails, there's a pending roachprod start invocation. In the latest logs, the node being started is n4 (after having been killed moments earlier). I didn't find anything in any of the logs, except for this:

E190326 06:17:19.126750 346 sql/distsqlrun/flow_registry.go:224  [intExec=read orphaned table leases] flow id:9b544064-4968-49a6-9e94-6995eabba1ac : 1 inbound streams timed out after 10s; propagated error throughout flow
I190326 06:17:19.187538 375 sql/lease.go:1822  released orphaned table lease: {id:53 version:1 expiration:{Time:{wall:495315000 ext:63689178174 loc:<nil>}}}

cc @vivekmenezes / @jordanlewis, could this error be responsible here? I'm thinking of a case in which the node getting killed holds the table lease and for some reason fails to release it after the restart, blocking writes to the table for north of 30s and thus failing the test.

@tbg commented Apr 1, 2019

Oh hey, I found the same error in the logs of the failure two comments up:

E190321 06:17:02.295548 387 sql/distsqlrun/flow_registry.go:230  [intExec=read orphaned table leases] flow id:2ea8fd60-fc32-4691-b82d-75f9fd793767 : 1 inbound streams timed out after 10s; propagated error throughout flow
I190321 06:17:02.365213 352 sql/lease.go:1815  released orphaned table lease: {id:53 version:1 expiration:{Time:{wall:696560000 ext:63688746141 loc:<nil>}}}
I190321 06:17:02.365218 351 sql/lease.go:1815  released orphaned table lease: {id:53 version:1 expiration:{Time:{wall:693236000 ext:63688746081 loc:<nil>}}}

@tbg assigned vivekmenezes and unassigned tbg Apr 1, 2019
@vivekmenezes commented Apr 1, 2019

@tbg this looks like a situation where the "read orphaned table leases" query (ROTL) was attempted a few times and eventually succeeded. The results of that query were leases that had been orphaned, and they were released. But I don't see how this could have blocked writes to the table; there is no schema change happening in this test. Besides, all this logic runs asynchronously and doesn't hold up the node's restart.
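For illustration, a minimal sketch of that async shape (stop.Stopper.RunAsyncTask is the real primitive; the lease helpers here are hypothetical, not the actual lease.go code):

```
// Orphaned-lease cleanup is kicked off as an async task during startup,
// so it can retry for a while without holding up the node's restart.
_ = stopper.RunAsyncTask(ctx, "read orphaned table leases", func(ctx context.Context) {
	for _, lease := range readOrphanedLeases(ctx) { // hypothetical: runs the ROTL query
		releaseLease(ctx, lease) // logs "released orphaned table lease: ..."
	}
})
```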

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog

The test failed on master:
	bank.go:189,bank.go:501,acceptance.go:78,test.go:1228: dial tcp 35.231.201.221:26257: connect: connection refused
	cluster.go:1295,bank.go:247: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1217763-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1217763-acceptance: stopping and waiting............................................................: signal: killed
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:776,test.go:762,bank.go:189,bank.go:501,acceptance.go:78,test.go:1228: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1217763-acceptance --oneshot --ignore-empty-nodes: exit status 1 1: 5268
		2: 6342
		3: 5244
		4: dead
		Error:  4: dead

@tbg commented Apr 3, 2019

roachprod stop c:4 got stuck in the test, which then caused the verifyAccounts goroutine to fail:

err := retry.ForDuration(30*time.Second, func() error {
	// Hold the read lock on the client to prevent it being restarted by
	// chaos monkey.
	client.RLock()
	defer client.RUnlock()
	err := client.db.QueryRow("SELECT count(*), sum(balance) FROM bank.accounts").Scan(&accounts, &sum)
	if err != nil && !testutils.IsSQLRetryableError(err) {
		t.Fatal(err)
	}
	return err
})
if err != nil {
	t.Fatal(err)
}

The node actually died when the stop was invoked. Unclear to me why it didn't return. @petermattis any idea? The roachprod.log seems to indicate (if I read it correctly) that the process died quickly: https://gist.github.com/tbg/fc3e0b20eacceb7fc65aeff1438b2ccb

@petermattis

I don't see anything that would account for the stuck roachprod stop c:4 command. The roachprod.log seems to indicate that pid 5244 is dead. There is no evidence of a subsequent attempt to start the node, either in roachprod.log or test.log or in the node's logs.

The problem ranges report in the debug.zip shows a number of under-replicated ranges, but no unavailable ones:

{
  "node_id": 1,
  "problems_by_node_id": {
    "1": {
      "underreplicated_range_ids": [
        2,
        3,
        5,
        8,
        11,
        12,
        14,
        15,
        18
      ]
    },
    "2": {
      "underreplicated_range_ids": [
        1,
        6,
        16
      ]
    },
    "3": {
      "underreplicated_range_ids": [
        19,
        20,
        21
      ]
    },
    "4": {
      "error_message": "initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.142.15.239:26257: connect: connection refused\""
    }
  }
}

I believe r21 corresponds to the bank table. It is under-replicated because n4 is down. n3 is the leaseholder for that range.

Hmm, this is curious in n3's logs:

W190403 06:15:06.281045 92 gossip/gossip.go:1501  [n3] first range unavailable; trying remaining resolvers

Might be nothing. The debug.zip indicates that n3 was the leaseholder for r1 and gossiped out the first range descriptor.

Seems like the error from verifyAccounts wasn't printed anywhere. Did I just miss it?

@petermattis

I'm not seeing anything wedged in the goroutines on n1, but I believe the client connections have been closed by the time the debug.zip is generated. Nothing interesting on n3 either. Hrmm.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/6da68d7fe2c9a29b85e2ec0c7e545a0d6bdc4c5c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1226521&tab=buildLog

The test failed on release-19.1:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1237: stall detected at round 7, no forward progress for 30s
	cluster.go:1255,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1226521-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1226521-acceptance: starting............................: signal: killed
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:775,bank.go:394,bank.go:498,acceptance.go:78,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1226521-acceptance --oneshot --ignore-empty-nodes: exit status 1 3: 6345
		1: 5653
		4: dead
		2: 5448
		Error:  4: dead

@tbg commented Apr 23, 2019

The last one is a stall during roachprod start. Higher up we're seeing roachprod stop stall (which should never happen, assuming the test infra works). I suspect there's something infra-related going on, but not only that.

I'm going to keep this open for now -- it reproduces every 5-10 days on average it seems, so we're due for another one soon.

@tbg assigned andreimatei and unassigned vivekmenezes Apr 23, 2019
@tbg commented Apr 23, 2019

@andreimatei could you keep an eye on this? The difficulty here, assuming it's not test infra, is observing what goes wrong during the start sequence.

Also potentially related to infra fix #37001

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/73765b6d168fb999466756b112fd590747a3a8c4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266059&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	bank.go:395,bank.go:499,acceptance.go:92,test.go:1253: stall detected at round 2, no forward progress for 30s
	cluster.go:1402,bank.go:249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1266059-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1266059-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:790,test.go:780,bank.go:395,bank.go:499,acceptance.go:92,test.go:1253: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1266059-acceptance --oneshot --ignore-empty-nodes: exit status 1 4: dead
		3: 5567
		2: 5408
		1: 5418
		Error:  4: dead

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog

The test failed on branch=master, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:92,test.go:1251: stall detected at round 5, no forward progress for 30s
	cluster.go:1402,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1268176-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1268176-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:92,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1268176-acceptance --oneshot --ignore-empty-nodes: exit status 1 1: 6216
		3: 5323
		4: dead
		2: 5803
		Error:  4: dead

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/84dc682eca4b11e6abaf390fc8883f32afe81fb4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1283539&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 4, no forward progress for 30s
	cluster.go:1400,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1283539-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1283539-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1283539-acceptance --oneshot --ignore-empty-nodes: exit status 1 4: dead
		1: 5291
		3: 5386
		2: 5385
		Error:  4: dead

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog

The test failed on branch=master, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 4, no forward progress for 30s
	cluster.go:1400,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1290143-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1290143-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1290143-acceptance --oneshot --ignore-empty-nodes: exit status 1 2: 5397
		3: 5676
		4: dead
		1: 6231
		Error:  4: dead

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/9671342fead0509bec0913bae4ae1f244660788e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1298500&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 5, no forward progress for 30s
	cluster.go:1440,bank.go:256: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1298500-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1298500-acceptance: stopping and waiting..............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1298500-acceptance --oneshot --ignore-empty-nodes: exit status 1 1: 6276
		2: 5711
		3: 6282
		4: dead
		Error:  4: dead

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/c280de40c2bcab93c41fe82bef8353a5ecd95ac4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1311970&tab=buildLog

The test failed on branch=master, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 3, no forward progress for 30s
	cluster.go:1442,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1311970-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1311970-acceptance: starting............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1311970-acceptance --oneshot --ignore-empty-nodes: exit status 1 2: 5308
		1: 6504
		3: 5453
		4: dead
		Error:  4: dead
	test.go:1237: test timed out (10m0s)

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/f1c9693da739fa5fc2c94d4d978fadd6710d17da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1371441&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190703-1371441/acceptance/bank/node-restart/run_1
	bank.go:398,bank.go:502,acceptance.go:68,test_runner.go:680: stall detected at round 2, no forward progress for 30s

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/f74db5e81f8eaa190a41d708a9ccafb3eba9370a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1380786&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190710-1380786/acceptance/bank/node-restart/run_1
	bank.go:398,bank.go:502,acceptance.go:68,test_runner.go:678: stall detected at round 6, no forward progress for 30s
	cluster.go:1691,bank.go:251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1562739066-02-n4cpu4:4 returned:
		stderr:
		
		stdout:
		teamcity-1562739066-02-n4cpu4: stopping and waiting..............................: signal: killed

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1446993&tab=buildLog

The test failed on branch=provisional_201908202216_v19.2.0-beta.20190826, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1446993/acceptance/bank/node-restart/run_1
	bank.go:398,bank.go:502,acceptance.go:69,test_runner.go:673: stall detected at round 2, no forward progress for 30s

@irfansharif

Running on bd27eb3, I ran into the following:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5e70446]

goroutine 68 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0008367e0, 0x74e0620, 0xc000956270)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x121
panic(0x6983380, 0x92382e0)
        /usr/local/Cellar/go/1.12.7/libexec/src/runtime/panic.go:522 +0x1b5
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*IntAlloc).Freeze(0x0)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/quotapool/intpool.go:72 +0x26
main.(*cluster).Save(0xc000518780, 0x74e0620, 0xc0009562a0, 0xc0002a45a0, 0x4f, 0xc0000d0960)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1067 +0xb7
main.(*testRunner).runWorker(0xc00022af00, 0x74e0620, 0xc000956270, 0xc0002062fc, 0x2, 0xc0009561b0, 0xc00027aa40, 0xc0004cc420, 0xc000956201, 0x6d1829e, ...)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:478 +0xb26
main.(*testRunner).Run.func2(0x74e0620, 0xc000956270)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:274 +0x226
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00027aa70, 0xc0008367e0, 0xc00022e0e0)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xfb
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xa8

This then triggers the dead node detection failure seen by the roachtest.
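The nil receiver here is the *quotapool.IntAlloc that (*cluster).Save hands to Freeze. A hedged sketch of the missing guard (field and signature are assumptions, not the actual cluster.go code):

```
// Hypothetical guard: only freeze the cluster's quota allocation if one
// was actually acquired; some cluster creation paths appear to leave it nil.
func (c *cluster) Save(ctx context.Context) {
	if alloc := c.quotaAlloc; alloc != nil { // c.quotaAlloc is an assumed field name
		alloc.Freeze()
	}
	// ... rest of Save ...
}
```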

@irfansharif commented Sep 20, 2019

Ok, I have a handle on what's happening here and in #38785 (TL;DR: flaky
networking causes roachprod to hang). This turned out to be pretty difficult
to reproduce. I tried all variants of running locally on my laptop, locally
on a GCE worker, 100+ runs across multiple roachprod clusters, etc. (though
through this I learned that we have some neat in-house tooling now). This,
coupled with #40830 for the previous GCE failures, made for a lot of lost
time here.

Below I've included roachtest logs from all the failures I've been able to
generate:

16:42:15 test.go:182: test status: starting cluster
16:42:15 cluster.go:315: > /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/bin/roachprod start irfansharif-flake:1
irfansharif-flake: starting
16:42:15 bank.go:453: round 2: client counts: (101)
16:42:16 bank.go:453: round 2: client counts: (127)
16:42:17 bank.go:453: round 2: client counts: (150)
irfansharif-flake: COCKROACH_DEV_LICENSE unset: enterprise features will be unavailable
16:42:57 bank.go:309: round 2: monkey sleeping while cluster recovers...
16:42:58 bank.go:316: round 2: not waiting for recovery due to signal that we're done
16:42:58 test.go:284: test failure: 	bank.go:431,bank.go:541,acceptance.go:69,test_runner.go:673: stall detected at round 2, no forward progress for 30s
16:42:58 bank.go:185: client 1 shutting down

Note the long stall between 16:42:17 and 16:42:57. This corresponds to our
roachprod start invocation, which just seems to have hung. Looking at
cockroach.log and roachprod.log for irfansharif-flake:1, we see that we did
in fact start the node. journalctl.txt also tells us that an SSH connection
was made right around 16:42:15.

Here's another:

15:59:43 bank.go:270: round 34: restarting nodes [2]
15:59:43 bank.go:275: round 34: restarting 2
15:59:43 test.go:182: test status: stopping cluster
15:59:43 cluster.go:315: > /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/bin/roachprod stop irfansharif-flake:2
irfansharif-flake: stopping and waiting
15:59:44 bank.go:453: round 34: client counts: (2749)
16:08:06 bank.go:185: client 1 shutting down 
16:05:56 test.go:284: test failure: 	test_runner.go:688: test timed out (10m0s)
16:08:06 test.go:284: [not the first failure] test failure: 	cluster.go:1664,bank.go:280: cluster.StartE: context canceled
16:08:06 bank.go:548: 2951 transfers (4.2/sec) in 700.0s

The out-of-order timestamps are funky but, I think, not a problem. There's
another roachprod stop stall seen here. Same idea: roachprod.log and the
corresponding cockroach.log for irfansharif-flake:2 tell us that the node
was in fact shut down. journalctl.txt also tells us that an SSH connection
was made right around 15:59:44.

Another one:

17:06:45 bank.go:321: round 31: cluster recovered
17:06:45 bank.go:270: round 32: restarting nodes [2]
17:16:21 bank.go:309: round 32: monkey sleeping while cluster recovers...
17:06:46 bank.go:453: round 32: client counts: (2641)
17:08:41 bank.go:185: client 1 shutting down
17:13:17 test.go:284: test failure: 	test_runner.go:688: test timed out (10m0s)
17:16:21 bank.go:316: round 32: not waiting for recovery due to signal that we're done
17:16:21 bank.go:548: 5484 transfers (7.2/sec) in 759.9s

Except this time roachprod stop didn't even show up in roachprod.log. The
corresponding cockroach.log shows the node was never stopped; we just
stalled completely. journalctl.txt showed no new SSH connections made
anywhere around 17:06:45.

I also saw this failure, which was unlike the ones above, but still related:

18:54:46 test.go:182: test status: starting cluster
18:54:46 cluster.go:315: > /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/bin/roachprod start irfansharif-flake:4
irfansharif-flake: starting..
18:54:48 bank.go:310: round 28: monkey sleeping while cluster recovers...
18:54:48 bank.go:305: round 28: no progress made by clients
18:54:49 bank.go:305: round 28: no progress made by clients
18:54:50 bank.go:305: round 28: no progress made by clients
18:54:51 bank.go:305: round 28: no progress made by clients
18:54:52 bank.go:305: round 28: no progress made by clients
18:54:53 bank.go:305: round 28: no progress made by clients
18:54:54 bank.go:305: round 28: no progress made by clients
18:54:55 bank.go:305: round 28: no progress made by clients
18:54:56 bank.go:305: round 28: no progress made by clients
18:54:57 bank.go:305: round 28: no progress made by clients
18:54:58 bank.go:305: round 28: no progress made by clients
18:54:59 bank.go:305: round 28: no progress made by clients
18:55:00 bank.go:305: round 28: no progress made by clients
18:55:01 bank.go:180: client 1 shutting down
18:55:01 test.go:284: test failure: 	bank.go:421,bank.go:542,acceptance.go:69,test_runner.go:673: after 15.7s: dial tcp 35.237.3.40:26257: connect: operation timed out
18:55:01 bank.go:317: round 28: not waiting for recovery due to signal that we're done
18:55:02 cluster.go:1303: running (fast) consistency checks on node 1

Here roachprod start seemed to have worked just fine, but a client timed out?


I was pretty sure at this point that this was some random infra flake. I then
tried introducing network delays from my local mac (where I'm running
{roachtest,roachprod} from, against a roachprod cluster). I used Network Link
Conditioner and introduced 50-100ms of uplink/downlink delay right around when
the roachtest invokes roachprod {start,stop}. Now I'm able to immediately
reproduce the failures above: the client "operation timed out" errors, the
"test timed out" context-cancellation errors, and, most importantly, all the
roachprod {start,stop} stalls. My current understanding is that because
everything roachprod does, it does through SSH, we're particularly susceptible
to network delays, packet drops, etc. We've seen this before, or at least
pointed to this being a problem before, over at the issues referring to #37001.
For now I think setting a timeout around our calls to roachprod (basically
(*cluster).Stop(ctx, ...) and friends) would better surface these kinds of
errors.
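A minimal sketch of that direction (assumed names and signatures, not the actual roachtest API; note that op has to respect ctx cancellation for the bound to bite):

```
// runWithTimeout bounds a roachprod invocation so that a hung SSH session
// surfaces as a test error instead of stalling until the global timeout.
func runWithTimeout(ctx context.Context, d time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()
	return op(ctx)
}

// e.g. around a stop (c.StopE and c.Node are assumed names):
// err := runWithTimeout(ctx, 2*time.Minute, func(ctx context.Context) error {
// 	return c.StopE(ctx, c.Node(4)) // roachprod stop c:4
// })
```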

@andreimatei

Very interesting about Network Link Conditioner helping to reproduce this. 100ms of delay really causes our ssh invocations to get in trouble?!?
Were you able to extract any kind of signature for these problems from the ssh logs? (I believe we enable both client-side and server-side ssh logging these days).

@irfansharif

There's a fair bit of tweaking and waiting around for it to happen, and I found increasing the test duration to a longer 5 minutes or so more conducive to shaking these out. Dropped packets help, higher latencies help, but yes, even 100ms, if timed just right, can make things go wonky. I haven't seen this ssh logging; I was working off an older branch. I'll check it out.

@irfansharif

(cross-posting from #38785 (comment)) For some more confirmation bias that it's flaky infrastructure at play:

Also if you notice, the failures for cluster-recovery and node-restart as posted above and over at #35326 happened in quick succession. (10th Jul, 02:14 and 10th Jul '19, 02:28; 21st Aug, 01:14 and 21st Aug, 01:14)

craig bot pushed a commit that referenced this issue Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config,
  the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
  builds (because the license failure channel was not written in that
  control branch).
- the error/warning messages would be interleaved on the same line as
  the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the set up of the example database and workload would be performed
  asynchronously, with unclear signalling of when the user
  can expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with
  COCKROACH_DEMO_LICENSE_URL, this is used in tests.
- setting up the example database, partitioning and workload is done
  before presenting the interactive prompt.
- partitioning the example database, if requested by
  --geo-partitioned-replicas, waits for license acquisition to
  complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early(earlier).

For example:

- OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an
unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@irfansharif self-assigned this Sep 24, 2019
craig bot closed this as completed in 5296a27 Sep 24, 2019