roachtest: admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 failed #96543

Closed
cockroach-teamcity opened this issue Feb 4, 2023 · 1 comment · Fixed by #96782
Labels
branch-master: Failures and bugs on the master branch.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
T-kv: KV Team
Comments

cockroach-teamcity commented Feb 4, 2023

roachtest.admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:

test artifacts and logs in: /artifacts/admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96/run_1
(admission_control_tpcc_overload.go:113).verifyNodeLiveness: failed to fetch liveness metrics: status: 503 Service Unavailable, content-type: application/json, body: {
  "error": "connection error: desc = \"transport: Error while dialing connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "code": 14,
  "message": "connection error: desc = \"transport: Error while dialing connection interrupted (did the remote node shut down or are there networking issues?)\"",
  "details": [
  ]
}, error: <nil>
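
The `"code": 14` in the body above is the numeric value of gRPC's UNAVAILABLE status, which is why the HTTP layer reports a 503. As a minimal sketch (not the roachtest's actual code), a caller could decode such a body and classify it as follows. The `gatewayError` type and the `isUnavailable` helper are hypothetical names introduced only for illustration, and the sample body in `main` is shortened.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gatewayError mirrors the fields visible in the 503 body quoted above.
// The type name is hypothetical; it is not taken from the CockroachDB source.
type gatewayError struct {
	Err     string `json:"error"`
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// codeUnavailable is the numeric value of the gRPC UNAVAILABLE status code.
const codeUnavailable = 14

// isUnavailable reports whether body is a JSON error payload whose code
// field equals gRPC UNAVAILABLE. It returns an error if body is not JSON.
func isUnavailable(body []byte) (bool, error) {
	var ge gatewayError
	if err := json.Unmarshal(body, &ge); err != nil {
		return false, err
	}
	return ge.Code == codeUnavailable, nil
}

func main() {
	// Shortened stand-in for the response body in the failure report.
	body := []byte(`{"error": "connection error", "code": 14, "message": "connection error", "details": []}`)
	ok, err := isUnavailable(body)
	fmt.Println(ok, err)
}
```

Classifying on the numeric code rather than on the message text keeps the check independent of the exact wording of the dial error.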

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

Jira issue: CRDB-24178

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Feb 4, 2023
cockroach-teamcity added this to the 23.1 milestone Feb 4, 2023
blathers-crl bot added the T-kv label Feb 4, 2023
tbg commented Feb 7, 2023

These connection errors are likely related to the overload this test deliberately induces. I see them in the node logs too:

W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655  ‹[core]›‹[Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹  "Addr": "10.128.0.188:26257",›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹  "ServerName": "10.128.0.188:26257",›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹  "Attributes": null,›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹  "BalancerAttributes": null,›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹  "Type": 0,›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹  "Metadata": null›
W230204 06:55:36.780806 42344 google.golang.org/grpc/grpclog/component.go:41 ⋮ [T1] 655 +‹}. Err: connection error: desc = "transport: Error while dialing connection interrupted (did the remote node shut down or are there networking issues?)"›

It's curious that connections aren't stable under high CPU load. I wish there were more information in the logs.

But it looks like this is not a new failure mode, since the test already retries around it:

    var response tspb.TimeSeriesQueryResponse
    // Retry because timeseries queries can fail if the underlying inter-node
    // connections are in a failed state which can happen due to overload.
    // Now that the load has stopped, this should resolve itself soon.
    // Even with 60 retries we'll at most spend 30s attempting to fetch
    // the metrics.
    if err := retry.WithMaxAttempts(ctx, retry.Options{
        MaxBackoff: 500 * time.Millisecond,
    }, 60, func() (err error) {
        response, err = getMetrics(adminURLs[0], now.Add(-runDuration), now, []tsQuery{
            {
                name:      "cr.node.liveness.heartbeatfailures",
                queryType: total,
            },
        })
        return err

The retries do not seem to have been sufficient in this case.

Ideally we would look into making gRPC more resilient here, or at least into producing better errors. The error we're getting comes from here:

cockroach/pkg/rpc/context.go

Lines 1813 to 1845 in 79e3bd7

    func (ood *onlyOnceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
        ood.mu.Lock()
        defer ood.mu.Unlock()
        if err := ood.mu.err; err != nil {
            // Not first dial.
            if !errors.Is(err, grpcutil.ErrConnectionInterrupted) {
                // Hitting this path would indicate that gRPC retried even though it
                // received an error not marked as retriable. At the time of writing, at
                // least in simple experiments, as expected we don't see this error.
                err = errors.Wrap(err, "previous dial failed")
            }
            return nil, err
        }
        // First dial.
        dialer := net.Dialer{
            LocalAddr: sourceAddr,
        }
        conn, err := dialer.DialContext(ctx, "tcp", addr)
        if err != nil {
            // Return an error and make sure it's not marked as temporary, so that
            // ideally we don't even use the dialer again. (If the caller still does
            // it's fine, but it will cause more confusing error logging, etc).
            err = &notTemporaryError{error: err}
            ood.mu.err = err
            return nil, err
        }
        ood.mu.err = grpcutil.ErrConnectionInterrupted
        return conn, nil
    }

meaning that we dialed successfully, but then the connection broke and we're preventing gRPC from re-dialing.

tbg removed the release-blocker label Feb 7, 2023
tbg added a commit to tbg/cockroach that referenced this issue Feb 8, 2023
While looking into cockroachdb#96543, I wasn't 100% sure we weren't accidentally
redialing a connection internally. This improved logging and the test
makes it more obvious that things are working as intended.

Touches cockroachdb#96543.

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Feb 8, 2023
I ran the test[^1] and it passed, so hopefully this isn't obviously breaking
anything.

See cockroachdb#96543.

[^1]: `GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c 1 -u admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 -- tag:weekly`

Epic: none
Release note: None
tbg added the C-bug label Feb 8, 2023
craig bot closed this as completed in 72e3b1f Feb 8, 2023
kathancox pushed a commit to kathancox/cockroach that referenced this issue Feb 9, 2023
96768: server: remove stale comment r=andreimatei a=andreimatei

Release note: None
Epic: None

96781: rpc: improve detection of onlyOnceDialer redials r=erikgrinaker a=tbg

While looking into cockroachdb#96543, I wasn't 100% sure we weren't accidentally
redialing a connection internally. This improved logging and the test
makes it more obvious that things are working as intended.

Touches cockroachdb#96543.

Epic: none
Release note: None

96782: roachtest: verbose gRPC logging for admission-control/tpcc-olap r=irfansharif a=tbg

Closes cockroachdb#96543.

Next time we'll know more.

Epic: none
Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>