pkg/ccl/serverccl/statusccl/statusccl_test: TestTenantStatusAPI failed #125404

Closed
cockroach-teamcity opened this issue Jun 10, 2024 · 3 comments · Fixed by #126498
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-observability

Comments


cockroach-teamcity commented Jun 10, 2024

pkg/ccl/serverccl/statusccl/statusccl_test.TestTenantStatusAPI failed with artifacts on master @ e924a27896b9fbf495609d3965612366de6ebf78:

=== RUN   TestTenantStatusAPI
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI1737636433
    tenant_status_test.go:144: -- test log scope end --
CREATE UNIQUE INDEX test_pkey ON test_db1.public.test USING btree (k ASC)
test logs left over in: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI1737636433
--- FAIL: TestTenantStatusAPI (17.28s)
=== RUN   TestTenantStatusAPI/tenant_cancel_session
    tenant_status_test.go:1095: error executing 'SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'': read tcp 127.0.0.1:42566->127.0.0.1:35893: read: connection reset by peer
    tenant_status_test.go:1136: 
        	Error Trace:	pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:1136
        	            				pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:97
        	Error:      	Condition never satisfied
        	Test:       	TestTenantStatusAPI/tenant_cancel_session
    --- FAIL: TestTenantStatusAPI/tenant_cancel_session (5.11s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/obs-prs

This test on roachdash | Improve this report!

Jira issue: CRDB-39439

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability labels Jun 10, 2024
@cockroach-teamcity

pkg/ccl/serverccl/statusccl/statusccl_test.TestTenantStatusAPI failed with artifacts on master @ ba7de435128f4d538168c61d749e5a4d2f1496dd:

=== RUN   TestTenantStatusAPI
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI3715395111
    tenant_status_test.go:144: -- test log scope end --
CREATE UNIQUE INDEX test_pkey ON test_db1.public.test USING btree (k ASC)
test logs left over in: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI3715395111
--- FAIL: TestTenantStatusAPI (18.19s)
=== RUN   TestTenantStatusAPI/tenant_cancel_session
    tenant_status_test.go:1095: error executing 'SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'': read tcp 127.0.0.1:56308->127.0.0.1:39391: read: connection reset by peer
    tenant_status_test.go:1136: 
        	Error Trace:	pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:1136
        	            				pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:97
        	Error:      	Condition never satisfied
        	Test:       	TestTenantStatusAPI/tenant_cancel_session
    --- FAIL: TestTenantStatusAPI/tenant_cancel_session (5.11s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@dhartunian dhartunian added the P-1 Issues/test failures with a fix SLA of 1 month label Jun 26, 2024
@exalate-issue-sync exalate-issue-sync bot assigned dhartunian and unassigned xinhaoz Jun 28, 2024
@dhartunian

I've bisected this test failure to #124373

@rafiss is it possible that, given your change, we can no longer guarantee that the session will be closed within 5 seconds in our test?

The code in question is here; it's the require.Eventually on line 1136 that never succeeds. It checks crdb_internal.cluster_sessions to confirm that the session we cancelled via the API has been removed. When run under --stress, the session sometimes isn't removed after 5 seconds, and bumping the timeout to 10 seconds still doesn't do it, so I think some sort of signal is missing.

func selectClusterSessionIDs(t *testing.T, conn *sqlutils.SQLRunner) []string {
	var sessionIDs []string
	rows := conn.QueryStr(t,
		"SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'")
	for _, row := range rows {
		sessionIDs = append(sessionIDs, row[0])
	}
	return sessionIDs
}

func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelper) {
	// Open a SQL session on tenant SQL pod 0.
	sqlPod0 := helper.TestCluster().TenantConn(0)
	sqlPod0.Exec(t, "SELECT 1")

	// See the session over HTTP on tenant SQL pod 1.
	httpPod1 := helper.TestCluster().TenantAdminHTTPClient(t, 1)
	defer httpPod1.Close()
	listSessionsResp := serverpb.ListSessionsResponse{}
	httpPod1.GetJSON("/_status/sessions?exclude_closed_sessions=true", &listSessionsResp)
	var session serverpb.Session
	for _, s := range listSessionsResp.Sessions {
		if s.LastActiveQuery == "SELECT 1" {
			session = s
			break
		}
	}
	require.NotNil(t, session.ID, "session not found")

	// See the session over SQL on tenant SQL pod 0.
	sessionID := hex.EncodeToString(session.ID)
	require.Eventually(t, func() bool {
		return strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
	}, 5*time.Second, 100*time.Millisecond)

	// Cancel the session over HTTP from tenant SQL pod 1.
	cancelSessionReq := serverpb.CancelSessionRequest{SessionID: session.ID}
	cancelSessionResp := serverpb.CancelSessionResponse{}
	httpPod1.PostJSON("/_status/cancel_session/"+session.NodeID.String(), &cancelSessionReq, &cancelSessionResp)
	require.Equal(t, true, cancelSessionResp.Canceled, cancelSessionResp.Error)

	// No longer see the session over SQL from tenant SQL pod 0.
	// (The SQL client maintains an internal connection pool and automatically reconnects.)
	require.Eventually(t, func() bool {
		return !strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
	}, 5*time.Second, 100*time.Millisecond)

	// Attempt to cancel the session again over HTTP from tenant SQL pod 1, so that we can see the error message.
	httpPod1.PostJSON("/_status/cancel_session/"+session.NodeID.String(), &cancelSessionReq, &cancelSessionResp)
	require.Equal(t, false, cancelSessionResp.Canceled)
	require.Equal(t, fmt.Sprintf("session ID %s not found", sessionID), cancelSessionResp.Error)
}


rafiss commented Jun 28, 2024

It looks like the failure is read: connection reset by peer

    tenant_status_test.go:1095: error executing 'SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'': read tcp 127.0.0.1:56308->127.0.0.1:39391: read: connection reset by peer

So I don't think the issue is that the session is never cancelled. Rather, the test is trying to use a connection that the server closed. It appears to be a test bug: the test runs this query on the same SQL session as the one that was cancelled. As such, I'll remove the blocker label. Prior to #124373, the test was probably getting away with this, since the session took longer to react to the cancel signal, which might affect how the connection pool automatically reconnects.

Here's an example of how it could be fixed. I've used overly verbose variable names for clarity.

diff --git a/pkg/ccl/serverccl/statusccl/tenant_status_test.go b/pkg/ccl/serverccl/statusccl/tenant_status_test.go
index 2b8f857e774..6db66436a63 100644
--- a/pkg/ccl/serverccl/statusccl/tenant_status_test.go
+++ b/pkg/ccl/serverccl/statusccl/tenant_status_test.go
@@ -1101,9 +1101,16 @@ func selectClusterSessionIDs(t *testing.T, conn *sqlutils.SQLRunner) []string {
 }
 
 func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelper) {
-	// Open a SQL session on tenant SQL pod 0.
-	sqlPod0 := helper.TestCluster().TenantConn(0)
-	sqlPod0.Exec(t, "SELECT 1")
+	ctx := context.TODO()
+	// Open two different SQL sessions on tenant SQL pod 0.
+	sqlPod0 := helper.TestCluster().TenantDB(0)
+	sqlPod0SessionToCancel, err := sqlPod0.Conn(ctx)
+	require.NoError(t, err)
+	sqlPod0SessionForIntrospection, err := sqlPod0.Conn(ctx)
+	require.NoError(t, err)
+	_, err = sqlPod0SessionToCancel.ExecContext(ctx, "SELECT 1")
+	require.NoError(t, err)
+	introspectionRunner := sqlutils.MakeSQLRunner(sqlPod0SessionForIntrospection)
 
 	// See the session over HTTP on tenant SQL pod 1.
 	httpPod1 := helper.TestCluster().TenantAdminHTTPClient(t, 1)
@@ -1122,7 +1129,7 @@ func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelp
 	// See the session over SQL on tenant SQL pod 0.
 	sessionID := hex.EncodeToString(session.ID)
 	require.Eventually(t, func() bool {
-		return strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
+		return strings.Contains(strings.Join(selectClusterSessionIDs(t, introspectionRunner), ","), sessionID)
 	}, 5*time.Second, 100*time.Millisecond)
 
 	// Cancel the session over HTTP from tenant SQL pod 1.
@@ -1134,7 +1141,7 @@ func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelp
 	// No longer see the session over SQL from tenant SQL pod 0.
 	// (The SQL client maintains an internal connection pool and automatically reconnects.)
 	require.Eventually(t, func() bool {
-		return !strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
+		return !strings.Contains(strings.Join(selectClusterSessionIDs(t, introspectionRunner), ","), sessionID)
 	}, 5*time.Second, 100*time.Millisecond)
 
 	// Attempt to cancel the session again over HTTP from tenant SQL pod 1, so that we can see the error message.

@rafiss rafiss removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 28, 2024
@dhartunian dhartunian added P-2 Issues/test failures with a fix SLA of 3 months and removed P-1 Issues/test failures with a fix SLA of 1 month labels Jul 1, 2024
craig bot pushed a commit that referenced this issue Jul 2, 2024
126352: cli: add fallback query support for debug zip r=xinhaoz a=dhartunian

Previously, when SQL queries for dumping tables to debug zip would fail, we would have no follow-up. Engineers can now define "fallback" queries for tables in debug zip in order to make a second attempt with a simpler query. Often we want to run a more complex query to gather more debug data but these queries can fail when the cluster is experiencing problems. This change gives us a chance to define a simpler approach that can be attempted when necessary.

In order to define a fallback, there are two new optional fields in the `TableRegistryConfig` struct for redacted and unredacted queries respectively.

Debug zip output will still include the failed attempts at the original query along with the error message file as before. If a fallback query is defined, that query will produce its own output (and error) file with an additional `.fallback` suffix added to the base table name to identify it.

Resolves: #123964
Epic: CRDB-35278

Release note: None

126354: ui: alter role events render correctly r=xinhaoz a=dhartunian

Previously, ALTER ROLE events without role options would render with an "undefined" option in the event log on the DB Console. This change amends the rendering logic to correctly render events without any options.

Resolves #124871
Epic: None

Release note (bug fix,ui change): ALTER ROLE events in the DB Console event log now render correctly when the event does not contain any role options.

126486: kvserver/rangefeed: remove lockedRangefeedStream r=nvanbenschoten a=wenyihu6

**kvserver: wrap kvpb.RangeFeedEventSink in Stream**

Previously, we declared the same interface signature twice: once in
kvpb.RangeFeedEventSink and again in rangefeed.Stream. This patch embeds
kvpb.RangeFeedEventSink inside rangefeed.Stream, making rangefeed.Stream a
superset of kvpb.RangeFeedEventSink. This approach makes sense, as each
rangefeed server stream should be a rangefeed event sink, capable of making
thread-safe rangefeed event sends.

Epic: none
Release note: none
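The interface embedding described above can be shown with a stripped-down sketch; the names below are simplified stand-ins, not the actual CockroachDB definitions:

```go
package main

import "fmt"

// EventSink is the narrower interface (standing in for
// kvpb.RangeFeedEventSink): anything that can receive events.
type EventSink interface {
	Send(event string) error
}

// Stream embeds EventSink (as rangefeed.Stream embeds
// kvpb.RangeFeedEventSink), so every Stream is also an EventSink
// without re-declaring Send; Stream only adds lifecycle methods.
type Stream interface {
	EventSink
	Disconnect(reason error)
}

// memStream is a toy implementation showing one concrete type
// satisfying both interfaces.
type memStream struct {
	events []string
}

func (m *memStream) Send(event string) error {
	m.events = append(m.events, event)
	return nil
}

func (m *memStream) Disconnect(reason error) {}

func main() {
	var s Stream = &memStream{}
	_ = s.Send("checkpoint")
	var sink EventSink = s // a Stream can be used wherever an EventSink is expected
	_ = sink.Send("value")
	fmt.Println(s.(*memStream).events) // [checkpoint value]
}
```

Embedding avoids declaring the same method set twice and makes the superset relationship explicit in the type system.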

---

**kvserver/rangefeed: remove lockedRangefeedStream**

Previously, we created separate locked rangefeed streams for each individual
rangefeed stream to ensure Send can be called concurrently as the underlying
grpc stream is not thread safe. However, since the introduction of the mux
rangefeed support, we already have a dedicated lock for the underlying mux
stream, making the Send method on each rangefeed stream thread safe already.
This patch removes the redundant locks from each individual rangefeed stream.

Epic: none
Release note: none

126487: kvserver/rangefeed: remove non-mux rangefeed metrics r=nvanbenschoten a=wenyihu6

Previously, we removed non-mux rangefeed code in
#125610. However, that patch forgot
to remove non-mux rangefeed metrics. This patch removes these metrics as they
are no longer needed.

Epic: none
Release note: none

126498: status: fix TestTenantStatusAPI test r=xinhaoz a=dhartunian

Previously, this test would use a single connection, cancel it, and then use the connection to verify the cancellation.

The test is adjusted here to use two separate sessions, one to cancel for testing, and another to observe the cancellation.

Resolves: #125404
Epic: None

Release note: None

126524: sql: unskip Insights test r=dhartunian a=dhartunian

This test has been flaky for a while because the async tagging of the TransactionID onto the insight sometimes takes too long to complete. This change removes that check and unskips the test so that we can catch regressions for this feature. In the future we may want to write a separate test that verifies the async TransactionID tagging on its own.

Resolves: #125771
Resolves: #121986

Epic: None
Release note: None

126533: kv: hook Raft StoreLiveness into storeliveness package r=nvanbenschoten a=nvanbenschoten

Fixes #125242.

This commit adds a `replicaRLockedStoreLiveness` adapter type to hook the raft store liveness into the storeliveness package.

This is currently unused.

Release note: None

126536: roachpb: add Leader lease type definition r=nvanbenschoten a=nvanbenschoten

Fixes #125225.

This commit adds a new `Term` field to the Lease struct. This field defines the term of the raft leader that a leader lease is associated with. The lease is valid for as long as the raft leader has a guarantee from store liveness that it remains the leader under this term. The lease is invalid if the raft leader loses leadership (i.e. changes its term).

The field is not yet used.

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #126498 Jul 2, 2024
@craig craig bot closed this as completed in 7d4d517 Jul 2, 2024