pkg/ccl/serverccl/statusccl/statusccl_test: TestTenantStatusAPI failed #125404

Closed
cockroach-teamcity opened this issue Jun 10, 2024 · 3 comments · Fixed by #126498
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-observability

Comments


cockroach-teamcity commented Jun 10, 2024

pkg/ccl/serverccl/statusccl/statusccl_test.TestTenantStatusAPI failed with artifacts on master @ e924a27896b9fbf495609d3965612366de6ebf78:

=== RUN   TestTenantStatusAPI
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI1737636433
    tenant_status_test.go:144: -- test log scope end --
CREATE UNIQUE INDEX test_pkey ON test_db1.public.test USING btree (k ASC)
test logs left over in: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI1737636433
--- FAIL: TestTenantStatusAPI (17.28s)
=== RUN   TestTenantStatusAPI/tenant_cancel_session
    tenant_status_test.go:1095: error executing 'SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'': read tcp 127.0.0.1:42566->127.0.0.1:35893: read: connection reset by peer
    tenant_status_test.go:1136: 
        	Error Trace:	pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:1136
        	            				pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:97
        	Error:      	Condition never satisfied
        	Test:       	TestTenantStatusAPI/tenant_cancel_session
    --- FAIL: TestTenantStatusAPI/tenant_cancel_session (5.11s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/obs-prs

This test on roachdash | Improve this report!

Jira issue: CRDB-39439

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability labels Jun 10, 2024
@cockroach-teamcity

pkg/ccl/serverccl/statusccl/statusccl_test.TestTenantStatusAPI failed with artifacts on master @ ba7de435128f4d538168c61d749e5a4d2f1496dd:

=== RUN   TestTenantStatusAPI
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI3715395111
    tenant_status_test.go:144: -- test log scope end --
CREATE UNIQUE INDEX test_pkey ON test_db1.public.test USING btree (k ASC)
test logs left over in: /artifacts/tmp/_tmp/7067ad80d000d85ce2eeee021e911c58/logTestTenantStatusAPI3715395111
--- FAIL: TestTenantStatusAPI (18.19s)
=== RUN   TestTenantStatusAPI/tenant_cancel_session
    tenant_status_test.go:1095: error executing 'SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'': read tcp 127.0.0.1:56308->127.0.0.1:39391: read: connection reset by peer
    tenant_status_test.go:1136: 
        	Error Trace:	pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:1136
        	            				pkg/ccl/serverccl/statusccl/statusccl_test/pkg/ccl/serverccl/statusccl/tenant_status_test.go:97
        	Error:      	Condition never satisfied
        	Test:       	TestTenantStatusAPI/tenant_cancel_session
    --- FAIL: TestTenantStatusAPI/tenant_cancel_session (5.11s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@dhartunian dhartunian added the P-1 Issues/test failures with a fix SLA of 1 month label Jun 26, 2024
@exalate-issue-sync exalate-issue-sync bot assigned dhartunian and unassigned xinhaoz Jun 28, 2024
@dhartunian

I've bisected this test failure to #124373

@rafiss is it possible that, given your change, we can no longer guarantee that the session will be closed within 5 seconds in our test?

The code in question is here; it's the require.Eventually on line 1136 that never succeeds. It checks crdb_internal.cluster_sessions to confirm that the session we cancelled via the API has been removed. When run under --stress, the session sometimes isn't removed after 5 seconds, and bumping the timeout to 10 seconds still doesn't do it, so I think some sort of signal is missing.

func selectClusterSessionIDs(t *testing.T, conn *sqlutils.SQLRunner) []string {
	var sessionIDs []string
	rows := conn.QueryStr(t,
		"SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'")
	for _, row := range rows {
		sessionIDs = append(sessionIDs, row[0])
	}
	return sessionIDs
}

func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelper) {
	// Open a SQL session on tenant SQL pod 0.
	sqlPod0 := helper.TestCluster().TenantConn(0)
	sqlPod0.Exec(t, "SELECT 1")

	// See the session over HTTP on tenant SQL pod 1.
	httpPod1 := helper.TestCluster().TenantAdminHTTPClient(t, 1)
	defer httpPod1.Close()
	listSessionsResp := serverpb.ListSessionsResponse{}
	httpPod1.GetJSON("/_status/sessions?exclude_closed_sessions=true", &listSessionsResp)
	var session serverpb.Session
	for _, s := range listSessionsResp.Sessions {
		if s.LastActiveQuery == "SELECT 1" {
			session = s
			break
		}
	}
	require.NotNil(t, session.ID, "session not found")

	// See the session over SQL on tenant SQL pod 0.
	sessionID := hex.EncodeToString(session.ID)
	require.Eventually(t, func() bool {
		return strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
	}, 5*time.Second, 100*time.Millisecond)

	// Cancel the session over HTTP from tenant SQL pod 1.
	cancelSessionReq := serverpb.CancelSessionRequest{SessionID: session.ID}
	cancelSessionResp := serverpb.CancelSessionResponse{}
	httpPod1.PostJSON("/_status/cancel_session/"+session.NodeID.String(), &cancelSessionReq, &cancelSessionResp)
	require.Equal(t, true, cancelSessionResp.Canceled, cancelSessionResp.Error)

	// No longer see the session over SQL from tenant SQL pod 0.
	// (The SQL client maintains an internal connection pool and automatically reconnects.)
	require.Eventually(t, func() bool {
		return !strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
	}, 5*time.Second, 100*time.Millisecond)

	// Attempt to cancel the session again over HTTP from tenant SQL pod 1, so that we can see the error message.
	httpPod1.PostJSON("/_status/cancel_session/"+session.NodeID.String(), &cancelSessionReq, &cancelSessionResp)
	require.Equal(t, false, cancelSessionResp.Canceled)
	require.Equal(t, fmt.Sprintf("session ID %s not found", sessionID), cancelSessionResp.Error)
}


rafiss commented Jun 28, 2024

It looks like the failure is read: connection reset by peer

    tenant_status_test.go:1095: error executing 'SELECT session_id FROM crdb_internal.cluster_sessions WHERE status = 'ACTIVE' OR status = 'IDLE'': read tcp 127.0.0.1:56308->127.0.0.1:39391: read: connection reset by peer

So I don't think the issue is that the session is never cancelled. Rather, the test is trying to use a connection that the server closed. It appears to be a test bug: the test runs this query on the same SQL session as the one that was cancelled. As such, I'll remove the blocker label. Prior to #124373, the test was probably getting away with this, since the session took longer to react to the cancel signal, which might affect how the connection pool automatically reconnects.

Here's an example of how it could be fixed. I've used overly verbose variable names for clarity.

diff --git a/pkg/ccl/serverccl/statusccl/tenant_status_test.go b/pkg/ccl/serverccl/statusccl/tenant_status_test.go
index 2b8f857e774..6db66436a63 100644
--- a/pkg/ccl/serverccl/statusccl/tenant_status_test.go
+++ b/pkg/ccl/serverccl/statusccl/tenant_status_test.go
@@ -1101,9 +1101,16 @@ func selectClusterSessionIDs(t *testing.T, conn *sqlutils.SQLRunner) []string {
 }
 
 func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelper) {
-	// Open a SQL session on tenant SQL pod 0.
-	sqlPod0 := helper.TestCluster().TenantConn(0)
-	sqlPod0.Exec(t, "SELECT 1")
+	ctx := context.TODO()
+	// Open two different SQL sessions on tenant SQL pod 0.
+	sqlPod0 := helper.TestCluster().TenantDB(0)
+	sqlPod0SessionToCancel, err := sqlPod0.Conn(ctx)
+	require.NoError(t, err)
+	sqlPod0SessionForIntrospection, err := sqlPod0.Conn(ctx)
+	require.NoError(t, err)
+	_, err = sqlPod0SessionToCancel.ExecContext(ctx, "SELECT 1")
+	require.NoError(t, err)
+	introspectionRunner := sqlutils.MakeSQLRunner(sqlPod0SessionForIntrospection)
 
 	// See the session over HTTP on tenant SQL pod 1.
 	httpPod1 := helper.TestCluster().TenantAdminHTTPClient(t, 1)
@@ -1122,7 +1129,7 @@ func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelp
 	// See the session over SQL on tenant SQL pod 0.
 	sessionID := hex.EncodeToString(session.ID)
 	require.Eventually(t, func() bool {
-		return strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
+		return strings.Contains(strings.Join(selectClusterSessionIDs(t, introspectionRunner), ","), sessionID)
 	}, 5*time.Second, 100*time.Millisecond)
 
 	// Cancel the session over HTTP from tenant SQL pod 1.
@@ -1134,7 +1141,7 @@ func testTenantStatusCancelSession(t *testing.T, helper serverccl.TenantTestHelp
 	// No longer see the session over SQL from tenant SQL pod 0.
 	// (The SQL client maintains an internal connection pool and automatically reconnects.)
 	require.Eventually(t, func() bool {
-		return !strings.Contains(strings.Join(selectClusterSessionIDs(t, sqlPod0), ","), sessionID)
+		return !strings.Contains(strings.Join(selectClusterSessionIDs(t, introspectionRunner), ","), sessionID)
 	}, 5*time.Second, 100*time.Millisecond)
 
 	// Attempt to cancel the session again over HTTP from tenant SQL pod 1, so that we can see the error message.

@rafiss rafiss removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 28, 2024
@dhartunian dhartunian added P-2 Issues/test failures with a fix SLA of 3 months and removed P-1 Issues/test failures with a fix SLA of 1 month labels Jul 1, 2024
craig bot pushed a commit that referenced this issue Jul 2, 2024
126352: cli: add fallback query support for debug zip r=xinhaoz a=dhartunian

Previously, when SQL queries for dumping tables to debug zip would fail, we would have no follow-up. Engineers can now define "fallback" queries for tables in debug zip in order to make a second attempt with a simpler query. Often we want to run a more complex query to gather more debug data but these queries can fail when the cluster is experiencing problems. This change gives us a chance to define a simpler approach that can be attempted when necessary.

In order to define a fallback, there are two new optional fields in the `TableRegistryConfig` struct for redacted and unredacted queries respectively.

Debug zip output will still include the failed attempts at the original query along with the error message file as before. If a fallback query is defined, that query will produce its own output (and error) file with an additional `.fallback` suffix added to the base table name to identify it.

Resolves: #123964
Epic: CRDB-35278

Release note: None

126354: ui: alter role events render correctly r=xinhaoz a=dhartunian

Previously, ALTER ROLE events without role options would render with an "undefined" option in the event log on the DB Console. This change amends the rendering logic to correctly render events without any options.

Resolves #124871
Epic: None

Release note (bug fix,ui change): ALTER ROLE events in the DB Console event log now render correctly when the event does not contain any role options.

126486: kvserver/rangefeed: remove lockedRangefeedStream r=nvanbenschoten a=wenyihu6

**kvserver: wrap kvpb.RangeFeedEventSink in Stream**

Previously, we declared the same interface signature twice: once in
kvpb.RangeFeedEventSink and again in rangefeed.Stream. This patch embeds
kvpb.RangeFeedEventSink inside rangefeed.Stream, making rangefeed.Stream a
superset of kvpb.RangeFeedEventSink. This approach makes sense, as each
rangefeed server stream should be a rangefeed event sink, capable of making
thread-safe rangefeed event sends.

Epic: none
Release note: none
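The interface embedding described above can be shown with a stripped-down sketch; the names below are simplified stand-ins, not the actual CockroachDB definitions:

```go
package main

import "fmt"

// EventSink is the narrower interface (standing in for
// kvpb.RangeFeedEventSink): anything that can receive events.
type EventSink interface {
	Send(event string) error
}

// Stream embeds EventSink (as rangefeed.Stream embeds
// kvpb.RangeFeedEventSink), so every Stream is also an EventSink
// without re-declaring Send; Stream only adds lifecycle methods.
type Stream interface {
	EventSink
	Disconnect(reason error)
}

// memStream is a toy implementation showing one concrete type
// satisfying both interfaces.
type memStream struct {
	events []string
}

func (m *memStream) Send(event string) error {
	m.events = append(m.events, event)
	return nil
}

func (m *memStream) Disconnect(reason error) {}

func main() {
	var s Stream = &memStream{}
	_ = s.Send("checkpoint")
	var sink EventSink = s // a Stream can be used wherever an EventSink is expected
	_ = sink.Send("value")
	fmt.Println(s.(*memStream).events) // [checkpoint value]
}
```

Embedding avoids declaring the same method set twice and makes the superset relationship explicit in the type system.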

---

**kvserver/rangefeed: remove lockedRangefeedStream**

Previously, we created separate locked rangefeed streams for each individual
rangefeed stream to ensure Send can be called concurrently as the underlying
grpc stream is not thread safe. However, since the introduction of the mux
rangefeed support, we already have a dedicated lock for the underlying mux
stream, making the Send method on each rangefeed stream thread safe already.
This patch removes the redundant locks from each individual rangefeed stream.

Epic: none
Release note: none

126487: kvserver/rangefeed: remove non-mux rangefeed metrics r=nvanbenschoten a=wenyihu6

Previously, we removed non-mux rangefeed code in
#125610. However, that patch forgot
to remove non-mux rangefeed metrics. This patch removes these metrics as they
are no longer needed.

Epic: none
Release note: none

126498: status: fix TestTenantStatusAPI test r=xinhaoz a=dhartunian

Previously, this test would use a single connection, cancel it, and then use the connection to verify the cancellation.

The test is adjusted here to use two separate sessions, one to cancel for testing, and another to observe the cancellation.

Resolves: #125404
Epic: None

Release note: None

126524: sql: unskip Insights test r=dhartunian a=dhartunian

This test has been flaky for a while because the async tagging of the TransactionID onto the insight sometimes takes too long to complete. This change removes that check and unskips the test so that we can catch regressions for this feature. In the future we may want to write a separate test that verifies the async TransactionID tagging on its own.

Resolves: #125771
Resolves: #121986

Epic: None
Release note: None

126533: kv: hook Raft StoreLiveness into storeliveness package r=nvanbenschoten a=nvanbenschoten

Fixes #125242.

This commit adds a `replicaRLockedStoreLiveness` adapter type to hook the raft store liveness into the storeliveness package.

This is currently unused.

Release note: None

126536: roachpb: add Leader lease type definition r=nvanbenschoten a=nvanbenschoten

Fixes #125225.

This commit adds a new `Term` field to the Lease struct. This field defines the term of the raft leader that a leader lease is associated with. The lease is valid for as long as the raft leader has a guarantee from store liveness that it remains the leader under this term. The lease is invalid if the raft leader loses leadership (i.e. changes its term).

The field is not yet used.

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #126498 Jul 2, 2024
@craig craig bot closed this as completed in 7d4d517 Jul 2, 2024