roachtest: restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 failed #106486

Closed
cockroach-teamcity opened this issue Jul 8, 2023 · 9 comments

Labels:
  A-kv-replication: Relating to Raft, consensus, and coordination.
  A-kv-test-failure-complex: A kv C-test-failure which requires a medium-large amount of work to address.
  branch-master: Failures and bugs on the master branch.
  C-test-failure: Broken test (automatically or manually discovered).
  O-roachtest
  O-robot: Originated from a bot.
  X-infra-flake: The automatically generated issue was closed due to an infrastructure problem, not a product issue.

Milestone: 23.2

Comments


cockroach-teamcity commented Jul 8, 2023

roachtest.restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 failed with artifacts on master @ 43c26aec0072f76e02e6d5ffc1b7079026b24630:

(monitor.go:137).Wait: monitor failure: monitor task failed: dial tcp 3.142.42.194:26257: connect: connection timed out
test artifacts and logs in: /artifacts/restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29582

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels on Jul 8, 2023
cockroach-teamcity added this to the 23.2 milestone on Jul 8, 2023

dt commented Jul 9, 2023

n8 oomkilled. unfortunately the last heap profile in the artifacts doesn't show anything unusual.

It passed in the last three runs (6/24, 6/27, and 7/1), so this might be new and maybe related to the removal of the req limiter? /cc @irfansharif


dt commented Jul 9, 2023

Given that the other restore failures last night did have usable profiles, and those pointed to raft, I'm going to guess this is the same.

blathers-crl bot commented Jul 10, 2023

cc @cockroachdb/replication

irfansharif commented:

> this might be new and maybe related to the removal of the req limiter?

I don't think so, given there were also 23.1 failures (#106496, #106248) and #104861 isn't on that branch.

irfansharif commented:

+cc #73376.

erikgrinaker commented:

Why are these all popping up now, on 23.1 even? Did we change machine types or something, or backport anything related?

tbg assigned tbg and unassigned pav-kv on Jul 19, 2023

tbg commented Jul 19, 2023

This test is new [1] and weekly. It has passed exactly once, on July 15. The first failure was a panic during a split, which was resolved here. So we have very little signal. It's also a massive test that is expected to show problems such as #102840 more prominently. Either way, it's hard to justify that this should be a release blocker.


Footnotes

[1] https://teamcity.cockroachdb.com/test/-379436985926334303?currentProjectId=Cockroach_Nightlies&expandTestHistoryChartSection=true&orderBy=status&order=asc&pager.currentPage=2

tbg removed the release-blocker label on Jul 19, 2023
tbg assigned pav-kv and unassigned tbg on Jul 19, 2023

tbg commented Jul 19, 2023

I should also add that some of the existing restore tests OOMed in the July 6 to July 10 period and then stopped doing so, across the master and release branches. This hints at something in the infrastructure that was temporarily different.

x-ref #106248 (comment)
x-ref #106496 (comment)

Handing this back to @pavelkalinnikov to track the eventual resolution through cockroachlabs.atlassian.net/browse/CRDB-25503.

tbg added the A-kv-test-failure-complex label on Jul 24, 2023
tbg added the X-infra-flake label on Jul 25, 2023

tbg commented Jul 25, 2023

> n8 oomkilled. unfortunately the last heap profile in the artifacts doesn't show anything unusual.

I took a look and I don't see evidence of n8 being oomkilled. In fact, the post-test checks all confirm it's running and all nodes return 200 on the ready endpoint.

Details
test-post-assertions: 23:07:12 cluster.go:1531: checking for dead nodes
test-post-assertions: 23:07:13 cluster.go:1547: n15: err=<nil>,msg=111584
test-post-assertions: 23:07:13 cluster.go:1547: n10: err=<nil>,msg=111969
test-post-assertions: 23:07:13 cluster.go:1547: n14: err=<nil>,msg=111731
test-post-assertions: 23:07:13 cluster.go:1547: n13: err=<nil>,msg=110991
test-post-assertions: 23:07:13 cluster.go:1547: n9: err=<nil>,msg=111352
test-post-assertions: 23:07:13 cluster.go:1547: n3: err=<nil>,msg=112188
test-post-assertions: 23:07:13 cluster.go:1547: n4: err=<nil>,msg=112171
test-post-assertions: 23:07:13 cluster.go:1547: n7: err=<nil>,msg=110831
test-post-assertions: 23:07:13 cluster.go:1547: n1: err=<nil>,msg=113098
test-post-assertions: 23:07:13 cluster.go:1547: n12: err=<nil>,msg=112364
test-post-assertions: 23:07:13 cluster.go:1547: n2: err=<nil>,msg=110503
test-post-assertions: 23:07:13 cluster.go:1547: n8: err=<nil>,msg=112440
test-post-assertions: 23:07:13 cluster.go:1547: n6: err=<nil>,msg=113842
test-post-assertions: 23:07:13 cluster.go:1547: n5: err=<nil>,msg=112266
test-post-assertions: 23:07:13 cluster.go:1547: n11: err=<nil>,msg=110689
test-post-assertions: 23:07:13 test_runner.go:1165: n1:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n2:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n3:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n4:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n5:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n6:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n7:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n8:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n9:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n10:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n11:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n12:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n13:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n14:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n15:/health?ready=1 status=200 ok
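
For reference, that readiness check boils down to hitting /health?ready=1 on each node's HTTP endpoint and expecting a 200. A minimal sketch of such a probe (hypothetical addresses and the default HTTP port, not the actual roachtest plumbing):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical node addresses; the real post-test assertions derive these
	// from the provisioned cluster. 8080 is cockroach's default HTTP port.
	nodes := []string{"10.0.0.1:8080", "10.0.0.2:8080"}

	client := &http.Client{Timeout: 5 * time.Second}
	for i, addr := range nodes {
		url := fmt.Sprintf("http://%s/health?ready=1", addr)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("n%d: err=%v\n", i+1, err)
			continue
		}
		resp.Body.Close()
		// ready=1 asks whether the node is actually accepting traffic,
		// not just whether the process is alive.
		fmt.Printf("n%d: %s status=%d\n", i+1, url, resp.StatusCode)
	}
}
```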

I spot-checked a few logs and they look fine (i.e. no obvious breakage, no slow readies, etc.).

This looks more like a network infra flake to me:

health: 23:07:12 health_checker.go:84: health check terminated on node 8 with dial tcp 3.142.42.194:26257: connect: connection timed out
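
The failing step there is essentially just a TCP dial to the node's SQL port. This isn't the health checker's actual code, just a sketch of how that error surfaces:

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Plain TCP dial to n8's SQL port. If the SYN never gets an answer
	// (e.g. a broken network path), the OS eventually gives up and the dial
	// returns an error of the form
	//   dial tcp 3.142.42.194:26257: connect: connection timed out
	// even though the remote cockroach process may still be running.
	conn, err := net.Dial("tcp", "3.142.42.194:26257")
	if err != nil {
		log.Fatalf("health check dial failed: %v", err)
	}
	defer conn.Close()
	log.Println("node reachable")
}
```

A connect timeout at this layer, while the node itself reports ready, points at the network path rather than the process.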

Without more debug info (we'd want at least #97788, and the debug zip didn't work here either), I'm not sure what else to do, so I'll close.

Note that the test has passed a few times since.

tbg closed this as completed on Jul 25, 2023