server: TestStatusEngineStatsJson failed #99261

Closed
cockroach-teamcity opened this issue Mar 22, 2023 · 17 comments · Fixed by #103764 or #105629
Labels: A-observability-inf, branch-master, branch-release-23.1, C-test-failure, O-robot, skipped-test

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 22, 2023

server.TestStatusEngineStatsJson failed with artifacts on release-23.1 @ dc054b59d684e421238d7044539463df051186b7:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/74885ac1e7f972ae9e97608ed3fdb458/logTestStatusEngineStatsJson1821925975
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:282: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/74885ac1e7f972ae9e97608ed3fdb458/logTestStatusEngineStatsJson1821925975
--- FAIL: TestStatusEngineStatsJson (15.37s)
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/obs-inf-prs @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-25791
Epic: CRDB-28893

cockroach-teamcity added the branch-release-23.1, C-test-failure, and O-robot labels Mar 22, 2023
cockroach-teamcity added this to the 23.1 milestone Mar 22, 2023
knz added the branch-master label Apr 3, 2023
@knz
Contributor

knz commented Apr 3, 2023

This continues to flake on master and 23.1 despite my attempt to make it better in #100115 and #100284.

craig bot pushed a commit that referenced this issue Apr 3, 2023
99934: changefeedccl: Remove skipped tests that decayed over time r=miretskiy a=miretskiy

Remove
Fixes #32232
Remove TestChangefeedNodeShutdown.  This test has been disabled since 2018; Other tests exist (e.g. `TestChangefeedHandlesDrainingNodes`) that verify restart behavior.

Fixes #51842
Remove the BenchmarkChangefeedTicks benchmark. This benchmark has been skipped since 2019. Attempts could be made to revive it; however, it had a lot of code that accomplished questionable goals. The benchmark itself was unrepresentative (it used dependency injection), too small to be meaningful (1000 rows), and would most likely be too noisy and inconclusive. We have added other microbenchmarks over time, and we conduct large-scale testing, including with roachtests.

Release note: None

100342: upgrades: remove migration that waits for schema changes r=rafiss a=rafiss

We can also remove some skipped tests, since they no longer apply.

informs: #96751
Release note: None

100345: upgrades: unskip TestIsAtLeastVersionBuiltin r=rafiss a=rafiss

informs: #96751
Release note: None

100484: server,testutils: add some extra logging for TestStatusEngineStatsJson r=abarganier a=knz

Informs #99261

Release note: None
Epic: None

Co-authored-by: Yevgeniy Miretskiy
Co-authored-by: Rafi Shamim
Co-authored-by: Raphael 'kena' Poss
@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ 568c68a5b2ba48ba08cd69c96876dd8ac25ade3c:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/38c80cbed3f79a398570824e9fcc636b/logTestStatusEngineStatsJson2636581669
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:279: using admin URL https://127.0.0.1:44241
    status_test.go:286: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/38c80cbed3f79a398570824e9fcc636b/logTestStatusEngineStatsJson2636581669
--- FAIL: TestStatusEngineStatsJson (84.09s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

dhartunian self-assigned this Apr 21, 2023
@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ 3c23346a6d1c91dfe8ed1c5285966f9b9e487601:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/04ff595e8b8a4cdf61532aa9937d7edf/logTestStatusEngineStatsJson3168219894
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:279: using admin URL https://127.0.0.1:36583
    status_test.go:286: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/04ff595e8b8a4cdf61532aa9937d7edf/logTestStatusEngineStatsJson3168219894
--- FAIL: TestStatusEngineStatsJson (87.12s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ ddc247fd1fa4c7c1d2f212fcdd101ef200b772ff:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/04ff595e8b8a4cdf61532aa9937d7edf/logTestStatusEngineStatsJson2340365415
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:279: using admin URL https://127.0.0.1:43193
    status_test.go:286: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/04ff595e8b8a4cdf61532aa9937d7edf/logTestStatusEngineStatsJson2340365415
--- FAIL: TestStatusEngineStatsJson (109.60s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on release-23.1 @ 1f41dc622e53e7f2dcf27d80ed808a47605510b8:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/38c80cbed3f79a398570824e9fcc636b/logTestStatusEngineStatsJson2954868437
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:284: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/38c80cbed3f79a398570824e9fcc636b/logTestStatusEngineStatsJson2954868437
--- FAIL: TestStatusEngineStatsJson (100.05s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ 5d775e992ade315becd6b4fc41ed1fb85d8e35d6:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/115b56b7b7bd2dfb6f4ea1f5f6513edb/logTestStatusEngineStatsJson1398007110
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:279: using admin URL https://127.0.0.1:35639
    status_test.go:286: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/115b56b7b7bd2dfb6f4ea1f5f6513edb/logTestStatusEngineStatsJson1398007110
--- FAIL: TestStatusEngineStatsJson (56.25s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ 439c515b2a0058648731da73993c409544404da1:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestStatusEngineStatsJson483250565
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:280: using admin URL https://127.0.0.1:41589
    status_test.go:287: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestStatusEngineStatsJson483250565
--- FAIL: TestStatusEngineStatsJson (73.18s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ 09ec0ee7c658bc139adab023fca8a61f1cd6789f:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestStatusEngineStatsJson835366702
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:280: using admin URL https://127.0.0.1:36079
    status_test.go:287: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestStatusEngineStatsJson835366702
--- FAIL: TestStatusEngineStatsJson (73.25s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on master @ 76078816b69ee6d1f8dacd581ef464660fb1b697:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/115b56b7b7bd2dfb6f4ea1f5f6513edb/logTestStatusEngineStatsJson3831849725
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:280: using admin URL https://127.0.0.1:35511
    status_test.go:287: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/115b56b7b7bd2dfb6f4ea1f5f6513edb/logTestStatusEngineStatsJson3831849725
--- FAIL: TestStatusEngineStatsJson (64.35s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@andrewbaptist
Collaborator

smg260 pushed a commit to smg260/cockroach that referenced this issue May 18, 2023
Refs: cockroachdb#99261

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes
Release note: None
Epic: None
craig bot pushed a commit that referenced this issue May 19, 2023
103624: server: skip TestStatusEngineStatsJson r=smg260 a=smg260

Refs: #99261

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes
Release note: None
Epic: None

Co-authored-by: Miral Gadani
@tbg
Member

tbg commented May 19, 2023

I've been poking at this test a bit. So far I got to the point where I'm certain that the EngineStats endpoint isn't even reached. So I think the problem is somewhere in the http-rpc bridge. I'll try to remember how this all works and printf-debug my way through...

@tbg
Member

tbg commented May 19, 2023

Ah, I see that we are setting up the dialer for grpc-gateway here:

// Eschew `(*rpc.Context).GRPCDial` to avoid unnecessary moving parts on the
// uniquely in-process connection.
dialOpts, err := rpcContext.GRPCDialOptions(ctx, GRPCAddr, rpc.DefaultClass)
if err != nil {
	return nil, nil, nil, err
}

It uses GRPCDialOptions, which also installs an onlyOnceDialer, but I don't think we want that here. Under stress, the dial can probably fail due to some timeout, and since the onlyOnceDialer doesn't allow reconnections, the connection is then permanently broken.

@tbg
Member

tbg commented May 19, 2023

Yeah, I think that's it. I changed the above call to not use the onlyOnceDialer by classifying it as loopbackTransport [1].

@knz could you chime in on whether that's the semantically correct fix? It does seem like that to me but there might be nuance that I'm missing.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/3d3910b15061844ff809bcb6fa8c0ca52f8fcf95/pkg/rpc/context.go#L1631-L1637

@cockroach-teamcity
Member Author

server.TestStatusEngineStatsJson failed with artifacts on release-23.1 @ 128261565cccf92b771975db161ceda9340a659a:

=== RUN   TestStatusEngineStatsJson
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestStatusEngineStatsJson1794869553
    test_log_scope.go:79: use -show-logs to present logs inline
    status_test.go:284: condition failed to evaluate within 45s: status: 503 Service Unavailable, content-type: application/json, body: {
          "error": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "code": 14,
          "message": "connection error: desc = \"transport: error while dialing: gRPC connection unexpectedly re-dialed: connection interrupted (did the remote node shut down or are there networking issues?)\"",
          "details": [
          ]
        }, error: <nil>
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestStatusEngineStatsJson1794869553
--- FAIL: TestStatusEngineStatsJson (62.62s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@rafiss
Collaborator

rafiss commented Jun 27, 2023

reopening as the test is still skipped

8 participants