roachtest: clearrange/checks=true failed #44845

Closed
cockroach-teamcity opened this issue Feb 7, 2020 · 26 comments
Assignees
Labels
A-testing Testing tools and infrastructure C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Member

(roachtest).clearrange/checks=true failed on release-19.1@407017cad14dfa63f19578055082dc10f3283cc4:

		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_080356.531_n1_cockroach_workload_fixtures_import_bank:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.runClearRange
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:47
		    main.registerClearRange.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:33
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1734688-1581059457-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		    stderr:
		    I200207 08:03:57.341594 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		    I200207 09:35:34.295700 66 ccl/workloadccl/fixture.go:516  imported bank (1h31m37s, 0 rows, 0 index entries, 0 B)
		    Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error:  exit status 1
		    
		    stdout::
		  - exit status 1

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1734688-1581059457-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 9: dead
		1: 3489
		2: 3315
		3: 3379
		5: 3775
		4: 3403
		7: 3375
		10: 3422
		8: 3377
		6: 3462
		Error:  9: dead


Artifacts: /clearrange/checks=true
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-19.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 7, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Feb 7, 2020
@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@ffbadbb6e8ac7d7376611e9487f505428a24d90d:

		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_080614.005_n1_cockroach_workload_fixtures_import_bank:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.runClearRange
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:47
		    main.registerClearRange.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:33
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1766912-1582701158-25-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		    stderr:
		    I200226 08:06:14.774554 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		    I200226 09:49:53.190754 15 ccl/workloadccl/fixture.go:516  imported bank (1h43m38s, 0 rows, 0 index entries, 0 B)
		    Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error:  exit status 1
		    
		    stdout::
		  - exit status 1

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1766912-1582701158-25-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 5: dead
		8: 4729
		7: 4105
		1: 4328
		2: 4223
		10: 4218
		3: 4019
		4: 4654
		6: 4600
		9: 4238
		Error:  5: dead


Artifacts: /clearrange/checks=true
Related:


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@1fcf7104d19c5c7634cfb52c4302bc9e70c4b9ea:

		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_080414.271_n1_cockroach_workload_fixtures_import_bank:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.runClearRange
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:47
		    main.registerClearRange.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:33
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1770229-1582787440-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		    stderr:
		    I200227 08:04:15.086220 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		    I200227 09:39:06.576459 25 ccl/workloadccl/fixture.go:516  imported bank (1h34m51s, 0 rows, 0 index entries, 0 B)
		    Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error:  exit status 1
		    
		    stdout::
		  - exit status 1

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1770229-1582787440-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 7: dead
		2: 3833
		8: 3780
		5: 3734
		3: 3807
		6: 3789
		1: 3848
		9: 3746
		4: 4186
		10: 3755
		Error:  7: dead


Artifacts: /clearrange/checks=true
Related:


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@ca235a18adac0241b4e3baf144c7ff7689d952c9:

		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1857128-1586240245-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		    stderr:
		    I200407 07:09:31.651336 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		    I200407 08:37:55.047415 15 ccl/workloadccl/fixture.go:516  imported bank (1h28m23s, 0 rows, 0 index entries, 0 B)
		    Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error: DEAD_ROACH_PROBLEM:
		      - error with user detail: Node 1. Command with error:
		        ```
		        ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		        ```
		      - exit status 1
		    
		    stdout::
		  - exit status 30

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1857128-1586240245-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead
		8: 3751
		1: 3813
		3: 3736
		7: 3699
		6: 3730
		5: 4180
		10: 3748
		9: 3726
		4: 3730
		Error: UNCLASSIFIED_PROBLEM:
		  - 2: dead
		    main.glob..func13
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		    main.wrap.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		    main.main
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:203
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357


Artifacts: /clearrange/checks=true


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@c406bb10543ca97010c64cc230a3c45690a7eb6c:

		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1862644-1586413264-27-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		    stderr:
		    I200409 07:18:35.106734 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		    I200409 08:47:01.073076 67 ccl/workloadccl/fixture.go:516  imported bank (1h28m26s, 0 rows, 0 index entries, 0 B)
		    Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error: DEAD_ROACH_PROBLEM:
		      - error with user detail: Node 1. Command with error:
		        ```
		        ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		        ```
		      - exit status 1
		    
		    stdout::
		  - exit status 30

	cluster.go:1420,context.go:135,cluster.go:1409,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1862644-1586413264-27-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 8: dead
		6: 4127
		2: 4562
		5: 3685
		1: 4415
		10: 4034
		7: 4589
		9: 4213
		3: 4035
		4: 4509
		Error: UNCLASSIFIED_PROBLEM:
		  - 8: dead
		    main.glob..func13
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		    main.wrap.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		    main.main
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:203
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357


Artifacts: /clearrange/checks=true


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@d556976a57c52e188157469ec9a64d8f388a79e9:

		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_071514.728_n1_cockroach_workload_fixtures_import_bank
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1916036-1588486750-28-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200503 07:15:15.500419 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200503 08:55:12.857591 15 ccl/workloadccl/fixture.go:516  imported bank (1h39m57s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1916036-1588486750-28-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 7: dead
		3: 4222
		5: 3818
		1: 4541
		4: 4362
		2: 4366
		8: 4513
		6: 4666
		9: 4241
		10: 4720
		Error: UNCLASSIFIED_PROBLEM: 7: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) 7: dead
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Error types: (1) errors.Unclassified (2) *errors.fundamental


Artifacts: /clearrange/checks=true


@petermattis
Collaborator

In the most recent failure, node 7 died with:

F200503 08:54:46.943158 184 storage/replica_raft.go:927  [n7,s7,r21011/3:/Table/53/1/5999{4141-7413}] during sideloading: during sideloading: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/auxiliary/sideloading/r2XXXX/r21011/i23.t7: No space left on device

Looks like this happened during the import phase of the test, which is surprising. The last compaction stats output to the logs show:

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       7.2      7.2       0.1   0.0      0.0    212.6        34       128    0.270       0      0
  L2      0/0    0.00 KB   0.0      3.9     2.5      1.4       3.9      2.5       0.0   1.5    212.2    210.3        19        38    0.498   2147K    63K
  L3      4/0   12.42 KB   0.0      8.9     5.4      3.4       8.8      5.4       1.5   1.6    159.9    159.0        57       119    0.477   4319K    36K
  L4    168/0   633.63 MB   0.5      6.2     3.4      2.8       6.1      3.4       4.4   1.8    162.3    160.7        39       265    0.147   1928K    17K
  L5   1096/1   20.66 GB   1.3     12.1     1.2     11.0       8.5     -2.4     106.4   7.4     74.0     52.1       167      3811    0.044   2516K   390K
  L6   8236/1   167.49 GB   0.0    248.7    79.6    169.2     239.9     70.8      96.7   3.0    157.1    151.5      1622      5852    0.277     26M  1473K
 Sum   9504/2   188.77 GB   0.0    279.8    92.1    187.7     274.4     86.7     209.1   1.4    147.8    145.0      1938     10213    0.190     36M  1981K

That seems reasonable, and not terribly different from another node:

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0      18.2     18.2       0.3   0.1      0.0    266.0        70       286    0.244       0      0
  L2     15/0   55.94 MB   0.9     24.5    17.1      7.3      24.5     17.1       0.3   1.4    145.0    145.0       173       239    0.723   6010K    65K
  L3     23/0   78.31 MB   0.5     13.2    10.8      2.4      13.2     10.8       7.7   1.2    130.8    130.8       104       505    0.205   3287K    16K
  L4    151/1   890.70 MB   1.0     20.5    10.8      9.7      20.4     10.7       7.9   1.9    118.9    118.3       177       875    0.202   6192K    29K
  L5   2087/5   13.52 GB   1.0     11.4     4.9      6.6      10.8      4.3      14.7   2.2     94.5     89.6       124       555    0.223   3106K    85K
  L6   6451/0   157.90 GB   0.0     80.5     0.6     80.0      72.9     -7.1     165.0 125.0     65.8     59.6      1253      5887    0.213   8293K   837K
 Sum   8727/6   172.42 GB   0.0    150.2    44.2    105.9     159.9     54.0     195.9   0.9     81.0     86.2      1899      8347    0.228     26M  1034K

Not sure what happened here. Perhaps a lot of disk space is being used elsewhere.

@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@cd9ecd90d2ce0f5caf362d6ffa6f782e91640837:

		Wraps: (3) output in run_070453.339_n1_cockroach_workload_fixtures_import_bank
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1967593-1590473552-27-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200526 07:04:54.114326 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200526 08:34:07.142888 23 ccl/workloadccl/fixture.go:516  imported bank (1h29m13s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1967593-1590473552-27-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 6: dead
		7: 3739
		2: 3728
		8: 3714
		5: 3730
		1: 3812
		3: 4114
		10: 3706
		4: 3766
		9: 3698
		Error: UNCLASSIFIED_PROBLEM: 6: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 6: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString


Artifacts: /clearrange/checks=true
Related:


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@73a373fb8c138c8ef6e4a05d7c1757207efa0a8d:

		Wraps: (3) output in run_070651.858_n1_cockroach_workload_fixtures_import_bank
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1980374-1590905635-27-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200531 07:06:52.615429 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200531 08:32:00.687259 66 ccl/workloadccl/fixture.go:516  imported bank (1h25m8s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1512,context.go:135,cluster.go:1501,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1980374-1590905635-27-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 1: dead
		8: 3784
		7: 3772
		9: 3729
		10: 3813
		4: 3781
		5: 4154
		6: 3766
		2: 3731
		3: 3742
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString


Artifacts: /clearrange/checks=true
Related:


@petermattis
Collaborator

Similar to what was reported in #44845 (comment), one of the nodes died during the import due to being out of space:

F200531 08:32:00.108957 147 storage/store.go:3779 [n1,s1,r20432/2:/Table/53/1/56{697017-700289}] during sideloading: during sideloading: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/auxiliary/sideloading/r2XXXX/r20432/i15.t6: No space left on device

@knz
Contributor

knz commented Jun 4, 2020

@jlinder what was the plan to deal with out-of-disk errors?

@knz knz added the A-testing Testing tools and infrastructure label Jun 4, 2020
@petermattis
Collaborator

> @jlinder what was the plan to deal with out-of-disk errors?

Was a plan ever discussed? We might just be pushing up too close to the cluster capacity with this setup. We could reduce the size of the import to provide more breathing room. Or we could switch to using EBS and larger volumes to provide more breathing room.
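
To put a rough number on "too close to the cluster capacity", here is a back-of-the-envelope sketch using the `--rows` and `--payload-bytes` values from the failing command. The replication factor of 3 and the perfectly even spread across 10 nodes are assumptions; key overhead, MVCC metadata, compression, and temporary import files are all ignored.

```go
package main

import "fmt"

// logicalBytes returns the raw payload volume of the fixture:
// rows * payloadBytes. Key and MVCC overhead are ignored.
func logicalBytes(rows, payloadBytes int64) int64 {
	return rows * payloadBytes
}

// perNodeBytes spreads the replicated data evenly across all nodes.
// An even spread is an idealization; real clusters will deviate from it.
func perNodeBytes(logical, replicationFactor, nodes int64) int64 {
	return logical * replicationFactor / nodes
}

func main() {
	const (
		rows         = 65104166 // --rows from the failing command
		payloadBytes = 10240    // --payload-bytes from the failing command
		replication  = 3        // assumption: default replication factor
		nodes        = 10       // assumption: taken from the n10cpu4 cluster name
	)
	logical := logicalBytes(rows, payloadBytes)
	perNode := perNodeBytes(logical, replication, nodes)
	fmt.Printf("logical:  %.1f GiB\n", float64(logical)/(1<<30))
	fmt.Printf("per node: %.1f GiB\n", float64(perNode)/(1<<30))
}
```

Under these assumptions the fixture is about 621 GiB of payload, or about 186 GiB per node after replication. That is in the same ballpark as the `Sum` sizes in the compaction stats quoted earlier in this thread, so a node that ends up holding a modestly larger share would have little headroom.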

@jlinder
Collaborator

jlinder commented Jun 5, 2020

I don't remember discussion of such a plan.

The obvious fixes to me are to increase disk size for the tests in question or change the tests to be more considerate of how they are using disk (if that's an option). Since roachprod can be told the machine type and amount of disk to use in cluster nodes, would updating roachtest to use different machine types or more disk work?

@petermattis
Collaborator

> Since roachprod can be told the machine type and amount of disk to use in cluster nodes, would updating roachtest to use different machine types or more disk work?

roachtest already allows individual tests to choose the machine type they want, but I don't think we have an ability to ask for bigger disks (yet). Adding that should be doable with a bit of elbow grease.

Reducing the size of the import used by clearrange is certainly the more straightforward path as we just have to change 1 line of code. I would like to understand why we seem to be hitting this problem more frequently of late. That could be indicative of a regression.

Cc @dt in case you know of a recent change that could have affected disk imbalances during IMPORTs.

@dt
Member

dt commented Jun 6, 2020

I don't know of anything that has changed there -- I don't think we've touched anything on the bulk side. How recent is "of late"? Pebble compaction differences could have changed it, or, going back a lot further, the switch to larger ranges could be relevant.

In IMPORT we issue a split and scatter the following range any time the data producer process has sent out 48 MB of data without hitting a range boundary, i.e. when it has sent that much to a single range. This was picked back when the range size was 64 MB, since it meant the range was 75% full. We left it that way with the move to larger ranges and just let merges clean up afterwards, since we were already fighting with hotspots and inverted LSMs and didn't want to make things any worse at the time. The normal KV background splitting and rebalancing is also enabled throughout the IMPORT, for ranges that fill bit by bit over time from separate small flushes.
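
The threshold rule described above can be sketched roughly as follows. This is not the actual IMPORT code: `splitTracker` and its `add` method are hypothetical names used only to illustrate counting bytes sent to a single range and requesting a split-and-scatter once 48 MiB has gone out without crossing a range boundary.

```go
package main

import "fmt"

// splitThreshold is 48 MiB, i.e. 75% of the old 64 MiB range size.
const splitThreshold = 48 << 20

// splitTracker is a hypothetical helper that counts bytes sent to the
// current range since the last range boundary or split request.
type splitTracker struct {
	sentToCurrentRange int64
}

// add records n bytes sent by the producer. crossedBoundary reports
// whether this batch started in a new range. It returns true when the
// caller should issue a split and scatter.
func (t *splitTracker) add(n int64, crossedBoundary bool) bool {
	if crossedBoundary {
		// Hitting a range boundary resets the count.
		t.sentToCurrentRange = 0
	}
	t.sentToCurrentRange += n
	if t.sentToCurrentRange >= splitThreshold {
		// A split was requested, so start counting afresh.
		t.sentToCurrentRange = 0
		return true
	}
	return false
}

func main() {
	var t splitTracker
	splits := 0
	// Ten 16 MiB batches, all landing in the same range.
	for i := 0; i < 10; i++ {
		if t.add(16<<20, false) {
			splits++
		}
	}
	fmt.Println("splits requested:", splits)
}
```

With a fixed 48 MiB threshold and much larger ranges, this rule asks for many more splits per range than it did at 64 MiB, which is why merges are left to clean up afterwards.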

That said, we've seen frequent cases of the allocator just doing nothing when we ask it to scatter a range, even when disk space usage or load is not balanced, sometimes because it looks at MVCC byte counts and not actual storage bytes.
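
As an illustration of how a balance check based on logical bytes can miss a physical imbalance, here is a small sketch. The `storeUsage` type, the `spread` function, and all numbers are made up for the example; this is not how the allocator is implemented.

```go
package main

import "fmt"

// storeUsage is a hypothetical per-store summary. logical stands in for
// MVCC byte counts, physical for bytes actually occupied on disk.
type storeUsage struct {
	node     int
	logical  int64
	physical int64
}

// spread returns (max - min) / mean for the metric selected by pick.
// A value near 0 means the stores look balanced under that metric.
func spread(stores []storeUsage, pick func(storeUsage) int64) float64 {
	if len(stores) == 0 {
		return 0
	}
	lo, hi := pick(stores[0]), pick(stores[0])
	var sum int64
	for _, s := range stores {
		v := pick(s)
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
		sum += v
	}
	mean := float64(sum) / float64(len(stores))
	if mean == 0 {
		return 0
	}
	return float64(hi-lo) / mean
}

func main() {
	// Invented numbers (GiB): equal logical bytes, unequal disk usage.
	stores := []storeUsage{
		{node: 1, logical: 100, physical: 100},
		{node: 2, logical: 100, physical: 100},
		{node: 3, logical: 100, physical: 160},
	}
	byLogical := spread(stores, func(s storeUsage) int64 { return s.logical })
	byPhysical := spread(stores, func(s storeUsage) int64 { return s.physical })
	fmt.Printf("spread by logical bytes:  %.2f\n", byLogical)
	fmt.Printf("spread by physical bytes: %.2f\n", byPhysical)
}
```

A check that only looks at the logical metric sees nothing to rebalance here, even though one store is using noticeably more disk than the others.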

@petermattis
Collaborator

> I don't know of anything that has changed there -- I don't think we've touched anything on the bulk side. How recent is "of late"? Pebble compaction differences could have changed it, or, going back a lot further, the switch to larger ranges could be relevant.

The failures predate the switch to using Pebble as the default. For example, my message on May 3 was before that switch. It might be unreasonable to assume the first message on this issue is due to out-of-disk, but that might put a bound on it. The switch to larger ranges landed on Feb 19. The first failure on this issue was Feb 7, but many more failures have occurred since then.

@petermattis
Collaborator

@jbowens You've been running the clearrange roachtests recently. Have you ever encountered this "no space" error? Can we add some additional instrumentation to help identify why one node is running out of space? From the graphs on #50508 we should have plenty of space per node. Is there some sort of severe space imbalance going on?

@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@0c04a92ba19eedd4762ca7feb8361433682f3ded:

		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2060752-1593731282-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200703 00:02:55.709944 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200703 01:36:31.945897 25 ccl/workloadccl/fixture.go:516  imported bank (1h33m36s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1512,context.go:135,cluster.go:1501,test_runner.go:829: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2060752-1593731282-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: dead
		9: 3801
		5: 3836
		1: 3881
		10: 3797
		4: 3745
		6: 3847
		7: 3819
		2: 4184
		8: 3802
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1789
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


Artifacts: /clearrange/checks=true
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@jbowens
Collaborator

jbowens commented Jul 6, 2020

F200706 19:09:33.569043 167 storage/replica_raft.go:927  [n7,s7,r20463/1:/Table/53/1/55{899181-902453}] during sideloading: during sideloading: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/auxiliary/sideloading/r2XXXX/r20463/i21.t7: No space left on device

From the debug.zip, node 7's last reported capacity used is 381.9 GB with 1.34 GB available, and the compactor queue shows:

"compactor.suggestionbytes.queued": 62684108629,

I'll try to reproduce this with instrumentation on the release-19.1 branch tomorrow.
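For a sense of scale, the numbers quoted above can be put side by side. This is only back-of-the-envelope arithmetic: it assumes "GB" means 10^9 bytes and that `compactor.suggestionbytes.queued` is a plain byte count. The variable names are made up for the example.

```python
GB = 10**9  # assumption: decimal gigabytes

used_bytes = 381.9 * GB                 # node 7 capacity used
available_bytes = 1.34 * GB             # node 7 capacity available
queued_suggestion_bytes = 62_684_108_629  # compactor.suggestionbytes.queued

queued_gb = queued_suggestion_bytes / GB
share_of_used = queued_suggestion_bytes / used_bytes

print(f"queued suggestions: {queued_gb:.1f} GB")                # 62.7 GB
print(f"share of used capacity: {share_of_used:.1%}")           # 16.4%
print(f"queued / available: {queued_suggestion_bytes / available_bytes:.0f}x")
```

So the queued compaction suggestions are roughly a sixth of the used capacity and far larger than the space still available on the node.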

@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@8ecf958ac06ee10391ceb108ba11a745de8ff4b1:

		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2072432-1594187551-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200708 06:45:10.892019 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200708 08:11:43.646069 25 ccl/workloadccl/fixture.go:516  imported bank (1h26m33s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 20
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1516,context.go:135,cluster.go:1505,test_runner.go:829: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2072432-1594187551-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 5: dead
		3: 3898
		6: 3806
		10: 3836
		7: 3876
		8: 3815
		4: 3833
		9: 3868
		2: 3826
		1: 4369
		Error: UNCLASSIFIED_PROBLEM: 5: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1789
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 5: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@7c03505d8daa19dee7f5f0268c9e728e38d4ba6d:

		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2137346-1596261039-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200801 06:17:14.244803 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200801 07:48:17.862745 25 ccl/workloadccl/fixture.go:516  imported bank (1h31m4s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 20
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1571,context.go:135,cluster.go:1560,test_runner.go:823: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2137346-1596261039-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead
		1: 3941
		4: 3841
		5: 3854
		10: 4053
		3: 4054
		9: 4082
		6: 4215
		8: 3833
		7: 4108
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1808
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@86b7271623ad797e9c42d5f7900a5cb424fed436:

		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2145078-1596520433-26-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200804 06:32:21.749636 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200804 08:10:15.645276 26 ccl/workloadccl/fixture.go:516  imported bank (1h37m54s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 20
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1571,context.go:135,cluster.go:1560,test_runner.go:823: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2145078-1596520433-26-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 5: dead
		4: 3730
		10: 3754
		6: 3791
		1: 4234
		3: 3767
		7: 3793
		8: 3790
		2: 3775
		9: 3842
		Error: UNCLASSIFIED_PROBLEM: 5: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1808
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 5: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


@cockroach-teamcity
Member Author

(roachtest).clearrange/checks=true failed on release-19.1@efeb30fcc83c76819a832e7f12c91c891dbe0e68:

		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2190652-1597729965-27-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I200818 06:37:38.620381 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		  | I200818 08:08:18.513048 13 ccl/workloadccl/fixture.go:516  imported bank (1h30m40s, 0 rows, 0 index entries, 0 B)
		  | Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 20
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1612,context.go:135,cluster.go:1601,test_runner.go:823: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2190652-1597729965-27-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead
		9: 3908
		4: 4293
		3: 3913
		10: 3840
		8: 3784
		6: 3784
		1: 3857
		7: 3781
		5: 3741
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1808
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


@jbowens
Collaborator

jbowens commented Aug 18, 2020

@itsbilal got a reproduction:

jbowens@jbowensmbp cockroach % roachprod run bilal-1597757897-02-n10cpu4 -- 'df | grep /mnt/data1'
bilal-1597757897-02-n10cpu4: df | grep /mnt/data1 10/10
   1: /dev/nvme1n1    95990980 90801212    290572 100% /mnt/data1
   2: /dev/nvme1n1    95990980  370768  90721016   1% /mnt/data1
   3: /dev/nvme1n1    95990980  282332  90809452   1% /mnt/data1
   4: /dev/nvme1n1    95990980  298144  90793640   1% /mnt/data1
   5: /dev/nvme1n1    95990980  359868  90731916   1% /mnt/data1
   6: /dev/nvme1n1    95990980  297052  90794732   1% /mnt/data1
   7: /dev/nvme1n1    95990980  370096  90721688   1% /mnt/data1
   8: /dev/nvme1n1    95990980  294928  90796856   1% /mnt/data1
   9: /dev/nvme1n1    95990980  299720  90792064   1% /mnt/data1
  10: /dev/nvme1n1    95990980  292512  90799272   1% /mnt/data1
                  L0     L1     L2     L3        L4        L5        L6         TOTAL
count             0      0      0      15        172       1010      3071       4268
seq num
  smallest        0      0      0      93601     83522     40167     37166      37166
  largest         0      0      0      457320    470215    470204    470251     470251
size
  data            0 B    0 B    0 B    62 M      2.3 G     13 G      69 G       84 G
    blocks        0      0      0      2240      80346     453233    2400881    2936700
  index           0 B    0 B    0 B    68 K      2.2 M     13 M      68 M       83 M
    blocks        0      0      0      15        172       1010      3071       4268
    top-level     0 B    0 B    0 B    0 B       0 B       0 B       0 B        0 B
  filter          0 B    0 B    0 B    132 K     310 K     1.7 M     1.3 M      3.5 M
  raw-key         0 B    0 B    0 B    4.1 M     5.9 M     33 M      176 M      220 M
  raw-value       0 B    0 B    0 B    64 M      2.3 G     13 G      69 G       84 G
records
  set             0      0      0      58 K      243 K     1.4 M     7.2 M      8.9 M
  delete          0      0      0      89 K      2.4 K     11 K      17         102 K
  range-delete    0      0      0      51        1         48        17         117
  merge           0      0      0      0         0         0         0          0
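In case it helps anyone scripting this check, here is a small sketch that parses the per-node `df` lines in the shape shown above and flags nodes that are nearly full. The helper names and the 95% threshold are arbitrary choices for the example, and it assumes the size columns are 1K blocks (the usual `df` default).

```python
SAMPLE = """\
   1: /dev/nvme1n1    95990980 90801212    290572 100% /mnt/data1
   2: /dev/nvme1n1    95990980  370768  90721016   1% /mnt/data1
   3: /dev/nvme1n1    95990980  282332  90809452   1% /mnt/data1
"""


def parse_df_line(line):
    """Parse one 'N: device total used avail use% mount' line."""
    node, _dev, total_kb, used_kb, avail_kb, use_pct, _mount = line.split()
    return {
        "node": int(node.rstrip(":")),
        "total_kb": int(total_kb),
        "used_kb": int(used_kb),
        "avail_kb": int(avail_kb),
        "use_pct": int(use_pct.rstrip("%")),
    }


def nearly_full(rows, threshold_pct=95):
    """Return node IDs whose reported use% is at or above the threshold."""
    return [r["node"] for r in rows if r["use_pct"] >= threshold_pct]


rows = [parse_df_line(l) for l in SAMPLE.splitlines() if l.strip()]
print(nearly_full(rows))  # [1]
```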

@jbowens
Collaborator

jbowens commented Aug 18, 2020

@itsbilal noticed this test failing often on AWS and never on GCP while trying to reproduce #52720. I never noticed that all the failures were specifically on AWS, and I only tried to reproduce it on GCP. Oops.

None of the nodes had much disk space headroom around the time n1 ran out of space.

debug/nodes.json:            "capacity.available": 742469632, 742 MB
debug/nodes.json:            "capacity.available": 1200033792, 1200 MB
debug/nodes.json:            "capacity.available": 6198439936, 6198 MB
debug/nodes.json:            "capacity.available": 4542550016, 4543 MB
debug/nodes.json:            "capacity.available": 6815735808, 6816 MB
debug/nodes.json:            "capacity.available": 2204524544, 2205 MB 
debug/nodes.json:            "capacity.available": 474505216, 474.5 MB 
debug/nodes.json:            "capacity.available": 1852162048, 1852 MB
debug/nodes.json:            "capacity.available": 5351120896, 5351 MB
debug/nodes.json:            "capacity.available": 2944262144, 2944 MB
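The MB figures can be recomputed from the raw byte counts with a few lines. This is just a convenience sketch using decimal megabytes (10^6 bytes); the list is copied from the `capacity.available` values above and the variable names are illustrative.

```python
# capacity.available per node, in bytes, in the order listed above
available = [
    742469632, 1200033792, 6198439936, 4542550016, 6815735808,
    2204524544, 474505216, 1852162048, 5351120896, 2944262144,
]

MB = 10**6  # decimal megabytes, matching the annotations above

for b in available:
    print(f"{b:>12} bytes = {round(b / MB)} MB")

total = sum(available)
print(f"least headroom: {min(available) / MB:.1f} MB")
print(f"total headroom across {len(available)} nodes: {total / 10**9:.1f} GB")
```

Summed over all ten nodes that is about 32 GB of free space in total, with the tightest node under 500 MB.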

On AWS, this test uses a c5d.xlarge, which has a 100 GB instance store disk, as opposed to GCP's 375 GB local SSD.
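To relate those disk sizes to the import itself, here is a rough estimate of the raw payload volume implied by the `--rows` and `--payload-bytes` flags in the failing command. It is a sketch under loud assumptions: a replication factor of 3 (not stated in this thread), perfectly even distribution across the 10 nodes, and no accounting for compression, key/MVCC overhead, or LSM space amplification.

```python
rows = 65_104_166          # --rows from the import command
payload_bytes = 10_240     # --payload-bytes from the import command
nodes = 10                 # n10cpu4 cluster
replication_factor = 3     # assumption, not taken from this thread

GB = 10**9

raw_bytes = rows * payload_bytes
per_node_bytes = raw_bytes * replication_factor / nodes

print(f"raw payload: {raw_bytes / GB:.1f} GB")             # 666.7 GB
print(f"per node (3x, even spread): {per_node_bytes / GB:.1f} GB")  # 200.0 GB

for name, disk_gb in [("AWS c5d.xlarge", 100), ("GCP local SSD", 375)]:
    print(f"{name}: {per_node_bytes / (disk_gb * GB):.0%} of disk")
```

Under these assumptions the uncompressed payload alone would be about twice the AWS disk size per node and a bit over half of the GCP disk, so treat it as an order-of-magnitude check rather than a measurement.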

On the dead n1:

du -sh auxiliary/sideloading/
4.9G	auxiliary/sideloading

ubuntu@ip-10-12-29-83:/mnt/data1/cockroach$ ls -l auxiliary/sideloading/r0XXXX/ | head -n 4
total 1208
drwxr-x--- 2 ubuntu ubuntu 4096 Aug 18 14:22 r2113
drwxr-x--- 2 ubuntu ubuntu 4096 Aug 18 14:10 r2135
drwxr-x--- 2 ubuntu ubuntu 4096 Aug 18 14:23 r2440

The earliest of these sideloaded sstables, r2113, appears in the logs here:

W200818 14:09:04.177418 143 kv/kvserver/store_raft.go:502  [n1,s1,r2113/3:/Table/53/1/208{78500-83752}] handle raft ready: 1.1s [applied=0, batches=0, state_assertions=0]
W200818 14:09:04.460892 167 kv/kvserver/store_raft.go:502  [n1,s1,r2145/3:/Table/53/1/145{55630-63750}] handle raft ready: 1.1s [applied=0, batches=0, state_assertions=0]
I200818 14:09:04.748230 207 server/status/runtime.go:522  [n1] runtime stats: 5.8 GiB RSS, 530 goroutines, 1.4 GiB/2.3 GiB/3.3 GiB GO alloc/idle/total, 2.0 GiB/2.5 GiB CGO alloc/total, 5589.9 CGO/sec, 32.1/20.6 %(u/s)time, 0.1 %gc (4x), 412 MiB/199 MiB (r/w)net
W200818 14:09:04.924298 168 kv/kvserver/store_raft.go:502  [n1,s1,r2187/1:/{Table/53/1/6…-Max}] handle raft ready: 1.1s [applied=0, batches=0, state_assertions=0]
W200818 14:09:05.207114 148 kv/kvserver/store_raft.go:502  [n1,s1,r2113/3:/Table/53/1/208{78500-83752}] handle raft ready: 0.9s [applied=0, batches=0, state_assertions=0]
W200818 14:09:05.630115 167 kv/kvserver/store_raft.go:502  [n1,s1,r2145/3:/Table/53/1/145{55630-63750}] handle raft ready: 0.8s [applied=0, batches=0, state_assertions=0]

The node's last log line before panicking was at 14:32:06.691529. Is it expected for a sideloaded sstable to be sitting around for > 20 minutes?
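For reference, the gap between the first r2113 log line above and the last log line can be computed directly from the timestamps. This sketch assumes both timestamps are from the same day (Aug 18, 2020) and that the log prefix is a severity letter followed by `yymmdd HH:MM:SS.ffffff`. `parse_log_ts` is a helper written for this example.

```python
from datetime import datetime, timedelta


def parse_log_ts(prefix: str) -> datetime:
    """Parse a log prefix such as 'W200818 14:09:04.177418'."""
    # Drop the leading severity letter, then parse date and time.
    return datetime.strptime(prefix[1:], "%y%m%d %H:%M:%S.%f")


first_seen = parse_log_ts("W200818 14:09:04.177418")
# Only the time of the last log line is given above; the date is assumed.
last_line = parse_log_ts("I200818 14:32:06.691529")

elapsed = last_line - first_seen
print(elapsed)                            # 0:23:02.514111
print(elapsed > timedelta(minutes=20))    # True
```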

@itsbilal
Member

itsbilal commented Sep 1, 2020

Fixed in #53572.

@itsbilal itsbilal closed this as completed Sep 1, 2020