roachtest: admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 failed #103692

Closed
cockroach-teamcity opened this issue May 20, 2023 · 2 comments · Fixed by #103764
Labels
A-observability-inf A-server-networking Pertains to network addressing, routing, initialization branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

cockroach-teamcity (Member) commented May 20, 2023

roachtest.admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 failed with artifacts on master @ a226099dfee6ac0159c0c40a54aaa753d646cf6e:

test artifacts and logs in: /artifacts/admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96/run_1
(admission_control_tpcc_overload.go:135).verifyNodeLiveness: failed to fetch liveness metrics: status: 503 Service Unavailable, content-type: application/json, body: {
  "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "code": 14,
  "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "details": [
  ]
}, error: <nil>

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-28138
Epic: CRDB-28893

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv (KV Team) labels May 20, 2023
cockroach-teamcity added this to the 23.1 milestone May 20, 2023

erikgrinaker (Contributor) commented May 22, 2023

All nodes were still running, but gRPC connections appeared to be failing on n1 and n2. The check itself tries to fetch time series metrics via the gRPC JSON gateway, retrying up to 60 times with the backoff capped at 500 ms, i.e. for about 30 seconds:

    if err := retry.WithMaxAttempts(ctx, retry.Options{
        MaxBackoff: 500 * time.Millisecond,
    }, 60, func() (err error) {
        response, err = getMetrics(adminURLs[0], now.Add(-runDuration), now, []tsQuery{
            {
                name:      "cr.node.liveness.heartbeatfailures",
                queryType: total,
            },
        })
        return err
    }); err != nil {
        t.Fatalf("failed to fetch liveness metrics: %v", err)
    }

    err := httputil.PostJSON(http.Client{Timeout: 500 * time.Millisecond}, url, &request, &response)
    return response, err

05:56:02 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (admission_control_tpcc_overload.go:135).verifyNodeLiveness: failed to fetch liveness metrics: status: 503 Service Unavailable, content-type: application/json, body: {
  "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "code": 14,
  "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "details": [
  ]
}, error: <nil>

However, we're also seeing node readiness checks fail with the same error on n1 and n2:

teardown: 05:56:02 cluster.go:1346: checking for dead nodes
teardown: 05:56:03 cluster.go:1362: n3: err=<nil>,msg=11335
teardown: 05:56:03 cluster.go:1362: n2: err=<nil>,msg=11258
teardown: 05:56:03 cluster.go:1362: n1: err=<nil>,msg=11551
teardown: 05:56:03 cluster.go:1362: n4: err=<nil>,msg=skipped
teardown: 05:56:03 test_runner.go:1131: n1:/health?ready=1 status=503 body={
  "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "code": 14,
  "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "details": [
  ]
}
teardown: 05:56:03 test_runner.go:1131: n2:/health?ready=1 status=503 body={
  "error": "connection error: desc = \"transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "code": 14,
  "message": "connection error: desc = \"transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "details": [
  ]
}
teardown: 05:56:03 test_runner.go:1139: n3:/health?ready=1 status=200 ok
teardown: 05:56:03 test_runner.go:1126: n4:/health?ready=1 error=Get "http://34.171.146.150:26258/health?ready=1": dial tcp 34.171.146.150:26258: connect: connection refused

@tbg The connection reuse errors seem suspect. I know you've been working in the RPC area lately; does any of this seem related?

craig bot pushed a commit that referenced this issue May 22, 2023
103609: roachtest: fix drain test's log searching r=rafiss a=rafiss

c4b6958 changed how the --drain-wait calculation is done, so we must update the test to account for that.

fixes #103577
Release note: None

103716: tracingpb: fix missing return in `Recording.SafeFormat()` r=erikgrinaker a=erikgrinaker

This led to out of bounds panics, e.g.:

```
<empty recording>%!v(PANIC=SafeFormat method: runtime error: index out of range [0] with length 0)
```

Follows #103034.
Touches #103692.

Epic: none
Release note: None

Co-authored-by: Rafi Shamim
Co-authored-by: Erik Grinaker

tbg (Member) commented May 23, 2023

#99261 (comment)

I'll have to double-check, but I don't think I've merged any of my changes related to this yet, so this looks like a pre-existing bug: not newly introduced over the last few weeks, but possibly new in 23.1. We'll have to investigate.

erikgrinaker added the C-bug and A-server-networking labels May 23, 2023
craig bot closed this as completed in 741c91b May 23, 2023
craig bot closed this as completed in #103764 May 23, 2023
knz added the T-observability-inf label and removed the T-kv label May 23, 2023
4 participants