roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #100804

Closed
cockroach-teamcity opened this issue Apr 6, 2023 · 11 comments · Fixed by #101437
Labels: branch-release-23.1, C-bug, C-test-failure, GA-blocker, O-roachtest, O-robot

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 6, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ b432e8c20339de5cfa7c811a9ee6f5dc98d15a1e:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1010).runTest: test timed out (5h0m0s)
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_081748.107819932_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05T17:30:00Z' " returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_081748.118180757_n1_cockroach-sql-insecu.log: exit status 137

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-26634

@cockroach-teamcity added the branch-release-23.1, C-test-failure, O-roachtest, O-robot, and release-blocker labels Apr 6, 2023
@cockroach-teamcity added this to the 23.1 milestone Apr 6, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ de239a7438f44d382c9aefceb65d9c39911dabd2:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ 2f96695f75b07c872ec5f146acc1fa198135768f:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 7: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

@msbutler
Collaborator

msbutler commented Apr 10, 2023

The latest roachtest failure ran on cluster teamcity-9516100-1681103961-27-n10cpu8 according to test.log, and the cluster spec shows it used c5.2xlarge machines. This confirms that #100286 is behaving as expected: it removes the "d" from the machine type.

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ 2f96695f75b07c872ec5f146acc1fa198135768f:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1010).runTest: test timed out (5h0m0s)
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_074205.134359961_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05T17:30:00Z' " returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_074205.147800821_n1_cockroach-sql-insecu.log: exit status 137

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

@msbutler
Collaborator

msbutler commented Apr 12, 2023

On the failure two days ago: before node 7 died, raft was consuming 9 GB of memory:
[screenshot of memory usage; image not preserved]

And 7.dmesg.txt confirms the oom killer was in the building:

 [Mon Apr 10 08:17:51 2023] cockroach invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
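
For anyone triaging similar failures, here is a minimal sketch of how the two signals in this thread fit together. It assumes the dmesg artifact contains lines shaped like the one quoted above, and it relies on the common shell convention that an exit status above 128 means the process was killed by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL), which is what the OOM killer sends. The function names are hypothetical and are not part of roachtest.

```python
import re

# Matches kernel log lines shaped like the one quoted from 7.dmesg.txt:
#   [Mon Apr 10 08:17:51 2023] cockroach invoked oom-killer: gfp_mask=...
# Assumption: timestamps are in the human-readable bracketed form shown above.
OOM_RE = re.compile(
    r"^\s*\[(?P<when>[^\]]+)\]\s+(?P<proc>\S+) invoked oom-killer: (?P<details>.*)$"
)


def find_oom_events(dmesg_text):
    """Return (timestamp, process) pairs for each 'invoked oom-killer' line."""
    events = []
    for line in dmesg_text.splitlines():
        match = OOM_RE.match(line)
        if match:
            events.append((match.group("when"), match.group("proc")))
    return events


def signal_number_from_exit_status(status):
    """Return the signal number implied by a shell exit status, or None.

    By shell convention, an exit status of 128 + N means the process was
    terminated by signal N. Status 137 therefore implies signal 9 (SIGKILL).
    """
    if status > 128:
        return status - 128
    return None
```

Running `find_oom_events` over a node's dmesg file and checking that `signal_number_from_exit_status(137)` is 9 is a quick way to tell an OOM kill apart from other causes of a dead node. It does not explain which component was holding the memory.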

@msbutler
Collaborator

And last night, the restore test timed out after 5 hours: throughout the restore, ranges were under-replicated according to test.log. In fact, the system.replication_stats.txt file in the debug zip indicated that there were unavailable ranges, but I couldn't find anything in the cockroach logs to confirm this.

@msbutler added the GA-blocker label and removed the release-blocker label Apr 12, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ c8a4703f9853a442ac676b44d074b43eb387f60c:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1010).runTest: test timed out (5h0m0s)
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_082619.980953729_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05T17:30:00Z' " returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_082619.998953143_n1_cockroach-sql-insecu.log: exit status 137

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

@erikgrinaker
Contributor

Keeping this open until the backport lands in #101507.

@erikgrinaker reopened this Apr 13, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ ad16885ca3b4567ed5eb34646fe8281fd2d740e3:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

@blathers-crl

blathers-crl bot commented Apr 14, 2023

cc @cockroachdb/replication

@aliher1911 added the C-bug label Apr 14, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ 6cce6c746150307a9ecf3b529dfd633a6985c110:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1010).runTest: test timed out (5h0m0s)
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_075204.139392467_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05T17:30:00Z' " returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_075204.160753301_n1_cockroach-sql-insecu.log: exit status 137

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0
