roachtest: restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 failed #106486

Closed
cockroach-teamcity opened this issue Jul 8, 2023 · 9 comments

Labels:
  A-kv-replication: Relating to Raft, consensus, and coordination.
  A-kv-test-failure-complex: A kv C-test-failure which requires a medium-large amount of work to address.
  branch-master: Failures and bugs on the master branch.
  C-test-failure: Broken test (automatically or manually discovered).
  O-roachtest
  O-robot: Originated from a bot.
  X-infra-flake: The automatically generated issue was closed due to an infrastructure problem, not a product issue.

Milestone: 23.2

Comments


cockroach-teamcity commented Jul 8, 2023

roachtest.restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 failed with artifacts on master @ 43c26aec0072f76e02e6d5ffc1b7079026b24630:

(monitor.go:137).Wait: monitor failure: monitor task failed: dial tcp 3.142.42.194:26257: connect: connection timed out
test artifacts and logs in: /artifacts/restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29582

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels on Jul 8, 2023
cockroach-teamcity added this to the 23.2 milestone on Jul 8, 2023

dt commented Jul 9, 2023

n8 oomkilled. unfortunately the last heap profile in the artifacts doesn't show anything unusual.

It passed in the last three runs (6/24, 6/27, and 7/1), so this might be new and maybe related to the removal of the req limiter? /cc @irfansharif


dt commented Jul 9, 2023

Given that the other restore failures last night did have usable profiles, and those pointed to raft, I'm going to guess this is the same.

blathers-crl bot commented Jul 10, 2023

cc @cockroachdb/replication

irfansharif commented:

> this might be new and maybe related to the removal of the req limiter?

I don't think so, given there were also 23.1 failures (#106496, #106248) and #104861 isn't on that branch.

irfansharif commented:

+cc #73376.

erikgrinaker commented:

Why are these all popping up now, on 23.1 even? Did we change machine types or something, or backport anything related?

tbg assigned tbg and unassigned pav-kv on Jul 19, 2023

tbg commented Jul 19, 2023

This test is new [1] and weekly. It has passed exactly once, on July 15. The first failure was a panic during a split, which was resolved here. So we have very little signal. It's also a massive test that is expected to show problems such as #102840 more prominently. Either way, it's hard to justify that this should be a release blocker.


Footnotes

[1] https://teamcity.cockroachdb.com/test/-379436985926334303?currentProjectId=Cockroach_Nightlies&expandTestHistoryChartSection=true&orderBy=status&order=asc&pager.currentPage=2

tbg removed the release-blocker label on Jul 19, 2023
tbg assigned pav-kv and unassigned tbg on Jul 19, 2023

tbg commented Jul 19, 2023

I should also add that some of the existing restore tests OOMed in the July 6 to July 10 period and then stopped doing so, across the master and release branches. This hints at something in the infrastructure that was temporarily different.

x-ref #106248 (comment)
x-ref #106496 (comment)

Handing this back to @pavelkalinnikov to track the eventual resolution through cockroachlabs.atlassian.net/browse/CRDB-25503.

tbg added the A-kv-test-failure-complex label on Jul 24, 2023
tbg added the X-infra-flake label on Jul 25, 2023

tbg commented Jul 25, 2023

> n8 oomkilled. unfortunately the last heap profile in the artifacts doesn't show anything unusual.

I took a look and I don't see evidence of n8 being oomkilled. In fact, the post-test checks all confirm it's running and all nodes return 200 on the ready endpoint.

Details
test-post-assertions: 23:07:12 cluster.go:1531: checking for dead nodes
test-post-assertions: 23:07:13 cluster.go:1547: n15: err=<nil>,msg=111584
test-post-assertions: 23:07:13 cluster.go:1547: n10: err=<nil>,msg=111969
test-post-assertions: 23:07:13 cluster.go:1547: n14: err=<nil>,msg=111731
test-post-assertions: 23:07:13 cluster.go:1547: n13: err=<nil>,msg=110991
test-post-assertions: 23:07:13 cluster.go:1547: n9: err=<nil>,msg=111352
test-post-assertions: 23:07:13 cluster.go:1547: n3: err=<nil>,msg=112188
test-post-assertions: 23:07:13 cluster.go:1547: n4: err=<nil>,msg=112171
test-post-assertions: 23:07:13 cluster.go:1547: n7: err=<nil>,msg=110831
test-post-assertions: 23:07:13 cluster.go:1547: n1: err=<nil>,msg=113098
test-post-assertions: 23:07:13 cluster.go:1547: n12: err=<nil>,msg=112364
test-post-assertions: 23:07:13 cluster.go:1547: n2: err=<nil>,msg=110503
test-post-assertions: 23:07:13 cluster.go:1547: n8: err=<nil>,msg=112440
test-post-assertions: 23:07:13 cluster.go:1547: n6: err=<nil>,msg=113842
test-post-assertions: 23:07:13 cluster.go:1547: n5: err=<nil>,msg=112266
test-post-assertions: 23:07:13 cluster.go:1547: n11: err=<nil>,msg=110689
test-post-assertions: 23:07:13 test_runner.go:1165: n1:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n2:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n3:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n4:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n5:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n6:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n7:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n8:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n9:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n10:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n11:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n12:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n13:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n14:/health?ready=1 status=200 ok
test-post-assertions: 23:07:13 test_runner.go:1165: n15:/health?ready=1 status=200 ok
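
For reference, that readiness check boils down to hitting /health?ready=1 on each node's HTTP endpoint and expecting a 200. A minimal sketch of such a probe (hypothetical addresses and the default HTTP port, not the actual roachtest plumbing):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical node addresses; the real post-test assertions derive these
	// from the provisioned cluster. 8080 is cockroach's default HTTP port.
	nodes := []string{"10.0.0.1:8080", "10.0.0.2:8080"}

	client := &http.Client{Timeout: 5 * time.Second}
	for i, addr := range nodes {
		url := fmt.Sprintf("http://%s/health?ready=1", addr)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("n%d: err=%v\n", i+1, err)
			continue
		}
		resp.Body.Close()
		// ready=1 asks whether the node is actually accepting traffic,
		// not just whether the process is alive.
		fmt.Printf("n%d: %s status=%d\n", i+1, url, resp.StatusCode)
	}
}
```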

I spot-checked a few logs and they look fine (i.e. no obvious breakage, no slow readies, etc.).

This looks more like a network infra flake to me:

health: 23:07:12 health_checker.go:84: health check terminated on node 8 with dial tcp 3.142.42.194:26257: connect: connection timed out
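
The failing step there is essentially just a TCP dial to the node's SQL port. This isn't the health checker's actual code, just a sketch of how that error surfaces:

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Plain TCP dial to n8's SQL port. If the SYN never gets an answer
	// (e.g. a broken network path), the OS eventually gives up and the dial
	// returns an error of the form
	//   dial tcp 3.142.42.194:26257: connect: connection timed out
	// even though the remote cockroach process may still be running.
	conn, err := net.Dial("tcp", "3.142.42.194:26257")
	if err != nil {
		log.Fatalf("health check dial failed: %v", err)
	}
	defer conn.Close()
	log.Println("node reachable")
}
```

A connect timeout at this layer, while the node itself reports ready, points at the network path rather than the process.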

Without more debug info (we'd want at least #97788, and the debug zip didn't work here either), I'm not sure what else to do, so I'll close.

Note that the test has passed a few times since.

tbg closed this as completed on Jul 25, 2023