
server: TestServerShutdownReleasesSession failed #107592

Closed
cockroach-teamcity opened this issue Jul 26, 2023 · 6 comments · Fixed by #107727


cockroach-teamcity commented Jul 26, 2023

server.TestServerShutdownReleasesSession failed with artifacts on master @ 640a56415d3aa263b85d29fcd0089a43f2f7e95a:

=== RUN   TestServerShutdownReleasesSession
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/7f3306bd25c63f593c1cefe649bebfca/logTestServerShutdownReleasesSession4183567719
    test_log_scope.go:81: use -show-logs to present logs inline
    drain_test.go:336: error executing 'SELECT session_id FROM system.sqlliveness WHERE session_id = $1': pq: failed to connect to n2 at : initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp: missing address" [code 14/Unavailable]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/7f3306bd25c63f593c1cefe649bebfca/logTestServerShutdownReleasesSession4183567719
--- FAIL: TestServerShutdownReleasesSession (6.43s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-30119

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team labels Jul 26, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 26, 2023
@knz knz added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-kv KV Team labels Jul 26, 2023

rafiss commented Jul 26, 2023

@jeffswenson Could you help me with my understanding of this test?

func TestServerShutdownReleasesSession(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	ctx := context.Background()
	s, _, _ := serverutils.StartServer(t, base.TestServerArgs{
		DefaultTestTenant: base.TestControlsTenantsExplicitly,
	})
	defer s.Stopper().Stop(ctx)

	tenantArgs := base.TestTenantArgs{
		TenantID: serverutils.TestTenantID(),
	}
	tenant, tenantSQLRaw := serverutils.StartTenant(t, s, tenantArgs)
	defer tenant.Stopper().Stop(ctx)
	tenantSQL := sqlutils.MakeSQLRunner(tenantSQLRaw)

	queryOwner := func(id base.SQLInstanceID) (owner *string) {
		tenantSQL.QueryRow(t, "SELECT session_id FROM system.sql_instances WHERE id = $1", id).Scan(&owner)
		return owner
	}
	sessionExists := func(session string) bool {
		rows := tenantSQL.QueryStr(t, "SELECT session_id FROM system.sqlliveness WHERE session_id = $1", session)
		return 0 < len(rows)
	}

	tmpTenant, err := s.StartTenant(ctx, tenantArgs)
	require.NoError(t, err)
	tmpSQLInstance := tmpTenant.SQLInstanceID()
	session := queryOwner(tmpSQLInstance)
	require.NotNil(t, session)
	require.True(t, sessionExists(*session))

	require.NoError(t, tmpTenant.DrainClients(context.Background()))
	require.False(t, sessionExists(*session), "expected session %s to be deleted from the sqlliveness table, but it still exists", *session)
	require.Nil(t, queryOwner(tmpSQLInstance), "expected sql_instance %d to have no owning session_id", tmpSQLInstance)
}

tmpTenant.DrainClients is called, and afterwards the test executes tenantSQL.QueryStr(t, "SELECT session_id FROM system.sqlliveness WHERE session_id = $1", session). I'm wondering how the second step can work if the first step drained all the clients.


rafiss commented Jul 27, 2023

More details on the failure that happened:

After calling DrainClients, the test executes this query explicitly: SELECT session_id FROM system.sqlliveness WHERE session_id = $1

The query failed with an error from the grpc layer: pq: failed to connect to n2 at : initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp: missing address" [code 14/Unavailable].

Here are some related server logs:

I230726 02:46:45.284098 2523 server/drain.go:451 ⋮ [-] 295  SQL server drained successfully; SQL queries cannot execute any more
W230726 02:46:45.284128 7958 sql/sqlliveness/slinstance/slinstance.go:334 ⋮ [T10,nsql2] 296  exiting heartbeat loop
I230726 02:46:45.284137 7958 sql/sqlliveness/slinstance/slinstance.go:319 ⋮ [T10,nsql2] 297  draining heartbeat loop
E230726 02:46:45.292188 8465 2@rpc/peer.go:611 ⋮ [T10,nsql1,rnode=2,raddr,class=default,rpc] 298  failed connection attempt‹ (last connected 0s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp: missing address"› [code 14/Unavailable]
E230726 02:46:45.293316 8548 2@rpc/peer.go:611 ⋮ [T10,nsql1,rnode=2,raddr,class=default,rpc] 299  failed connection attempt‹ (last connected 1ms ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp: missing address"› [code 14/Unavailable]
I230726 02:46:45.294127 5786 sql/stats/automatic_stats.go:572 ⋮ [T10,nsql1] 300  quiescing auto stats refresher
I230726 02:46:45.294198 5787 sql/stats/automatic_stats.go:624 ⋮ [T10,nsql1] 301  quiescing stats garbage collector
I230726 02:46:45.294321 5667 server/start_listen.go:103 ⋮ [T10,nsql1] 302  server shutting down: instructing cmux to stop accepting
W230726 02:46:45.294510 5671 kv/kvclient/kvtenant/connector.go:387 ⋮ [T10,nsql1,tenant-connector] 303  error consuming GossipSubscription RPC: recv msg error: grpc: ‹context canceled› [code 1/Canceled]
I230726 02:46:45.294749 5989 jobs/registry.go:1606 ⋮ [T10,nsql1] 304  AUTO SPAN CONFIG RECONCILIATION job 885618127142125569: stepping through state succeeded
W230726 02:46:45.296828 5750 sql/sqlliveness/slinstance/slinstance.go:334 ⋮ [T10,nsql1] 305  exiting heartbeat loop
W230726 02:46:45.296852 5750 sql/sqlliveness/slinstance/slinstance.go:321 ⋮ [T10,nsql1] 306  exiting heartbeat loop with error: node unavailable; try another peer
W230726 02:46:45.296932 5672 kv/kvclient/kvtenant/setting_overrides.go:59 ⋮ [T10,nsql1,tenant-connector] 307  error consuming TenantSettings RPC: recv msg error: grpc: ‹context canceled› [code 1/Canceled]
E230726 02:46:45.297048 7152 jobs/registry.go:1004 ⋮ [-] 308  error getting live session: node unavailable; try another peer
I230726 02:46:45.298703 8553 server/node.go:1341 ⋮ [T1,n1,tenant=10] 309  batch request ‹Scan [/Tenant/10/Table/15/1/885618127142125569,/Tenant/10/Table/15/1/885618127142125570), [txn: 8a3e7723], [can-forward-ts]› failed with error: aborted during Replica.Send: context canceled
I230726 02:46:45.298703 8553 server/node.go:1341 ⋮ [T1,n1,tenant=10] 309 +trace:
I230726 02:46:45.298703 8553 server/node.go:1341 ⋮ [T1,n1,tenant=10] 309 +     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch
W230726 02:46:45.301934 5989 jobs/adopt.go:534 ⋮ [-] 310  could not clear job claim: ‹clear-job-claim: failed to send RPC: sending to all replicas failed; last error: ba: Scan [/Tenant/10/Table/15/1/885618127142125569,/Tenant/10/Table/15/1/885618127142125570), [txn: 8a3e7723], [can-forward-ts] RPC error: grpc: error reading from server: read tcp 127.0.0.1:57440->127.0.0.1:38771: use of closed network connection [code 14/Unavailable]›
W230726 02:46:45.306597 2523 server/server_sql.go:1676 ⋮ [T10,nsql1] 311  server shutdown without a prior graceful drain
I230726 02:46:45.306822 3423 sql/stats/automatic_stats.go:624 ⋮ [T1,n1] 312  quiescing stats garbage collector
I230726 02:46:45.307056 2648 server/start_listen.go:103 ⋮ [T1,n1] 313  server shutting down: instructing cmux to stop accepting
W230726 02:46:45.307122 7537 kv/kvserver/intentresolver/intent_resolver.go:826 ⋮ [-] 314  failed to gc transaction record: could not GC completed transaction anchored at /Tenant/10/Table/15/1/‹885618130467422209›: node unavailable; try another peer

@rafiss rafiss self-assigned this Jul 27, 2023

chengxiong-ruan commented Jul 27, 2023

Yeah, so we call StartTenant twice with the same tenantID in tenantArgs, which means there is only one tenant. So it looks like DrainClients should also cut off the connection of tenantSQL. Maybe a more stable way is to read the tenant's KV space directly through the server instead of the tenant.

Edited: oh, it looks like StartTenant starts a new SQL instance, so there are actually two SQL instances for the tenant. Is it possible that the connection was shifted to the drained instance for some reason?
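(For reference, here are the relevant lines of the test above, annotated: both servers are started with the same TenantID, so they are two SQL instances of a single tenant rather than two tenants.)

	tenantArgs := base.TestTenantArgs{
		TenantID: serverutils.TestTenantID(),
	}

	// Instance 1: tenantSQL runs all of the test's queries against this server.
	tenant, tenantSQLRaw := serverutils.StartTenant(t, s, tenantArgs)

	// Instance 2: a second SQL instance of the same tenant; this is the one
	// the test drains via DrainClients.
	tmpTenant, err := s.StartTenant(ctx, tenantArgs)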


rafiss commented Jul 27, 2023

Oops, I totally misread that — tmpTenant is started as a second SQL instance of the same tenant, not as a separate tenant.

OK, so then I think the question here is: why does this gRPC request from the tenant server that was not stopped fail? (I was only able to repro under stress.) @knz I saw that you moved this to SQL Foundations, but is it correct that the gRPC layer is more of a KV area?


rafiss commented Jul 27, 2023

nvm, I think I found a solution: #107727

craig bot pushed a commit that referenced this issue Jul 28, 2023
…107752 #107802 #107803

106508: util/must: add runtime assertion API r=erikgrinaker a=erikgrinaker

For details and usage examples, see the [package documentation](https://github.com/erikgrinaker/cockroach/blob/must/pkg/util/must/must.go).

---

This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.

Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via `log.Fatalf`, but this should be the exception rather than the norm.

It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.

This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:

* `log.Fatalf`: kills the node even in release builds, which can cause severe disruption over often minor issues.

* `errors.AssertionFailedf`: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.

* `logcrash.ReportOrPanic`: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.

* `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.

* `util.RaceEnabled`: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.

For more details and examples, see the `must` package documentation.

Resolves #94986.
Epic: none
Release note: None
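
For a rough illustration of the failure-handling pattern described above, here is a minimal sketch with hypothetical names (`releaseBuild`, `Assert`) — not the actual must API:

	package main

	import (
		"errors"
		"fmt"
		"log"
	)

	// releaseBuild stands in for a build-time flag; a real implementation
	// would set this via build tags or linker flags.
	const releaseBuild = false

	// Assert follows the pattern described above: fatal in non-release builds
	// so failures get noticed, but in release builds report the failure and
	// return an assertion error to the caller for propagation.
	func Assert(cond bool, format string, args ...any) error {
		if cond {
			return nil
		}
		msg := fmt.Sprintf(format, args...)
		if !releaseBuild {
			log.Fatalf("assertion failed: %s", msg)
		}
		// Release build: report the failure (e.g. to Sentry) and propagate
		// an error instead of killing the node.
		return errors.New("assertion failed: " + msg)
	}

	func main() {
		if err := Assert(2+2 == 4, "arithmetic broke"); err != nil {
			log.Print(err)
		}
	}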

107094: streamingest: unskip TestTenantStreamingUnavailableStreamAddress r=lidorcarmel a=lidorcarmel

Changing a few things to get this test to pass under stress:
- use 50 ranges instead of 10, because there are already 50-ish system ranges,
  so if we write only 10 more ranges those might not get distributed on all
  servers.
- avoid reading from the source cluster after stopping a node, it's flaky,
  see #107499 for more info.

Epic: none
Fixes: #107023
Fixes: #106865

Release note: None

107717: server/profiler: remove `server.cpu_profile.enabled` setting r=xinhaoz a=xinhaoz

CPU profiling can be enabled by setting the cluster setting `server.cpu_profile.cpu_usage_combined_threshold`. This makes `server.cpu_profile.enabled` redundant and makes it more difficult and confusing to enable CPU profiling. This commit removes the `server.cpu_profile.enabled` setting entirely. Note that the default values of both cluster settings leave profiling off.

Closes: #102024

Release note (sql change): The cluster setting
`server.cpu_profile.enabled` has been removed.
`server.cpu_profile.cpu_usage_combined_threshold` can enable and disable cpu profiling.

107720: cli: add probe_range in debug.zip r=j82w a=j82w

PR #79546 introduced `crdb_internal.probe_range`. This PR adds `crdb_internal.probe_range` to the debug.zip. The LIMIT gives a rough target of ~1000ms*100 on how long this can take, so that running debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): debug.zip now includes the `crdb_internal.probe_range` table with a limit of 100 rows, to prevent the query from taking too long.

107727: server: deflake TestServerShutdownReleasesSession r=rafiss a=rafiss

The tenant was not being fully stopped, so the test could encounter flakes.

fixes #107592
Release note: None
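
One plausible shape for the deflake (a sketch only; not necessarily the actual diff in #107727) is to fully stop the temporary tenant after draining it, so its address is torn down and the surviving instance no longer tries to dial it:

	require.NoError(t, tmpTenant.DrainClients(ctx))
	// Fully stop the second SQL instance instead of leaving it drained but
	// running; stopping the stopper tears down its RPC listeners as well.
	tmpTenant.Stopper().Stop(ctx)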

107742: ui: show txn fingerprint details page with unspecified app r=xinhaoz a=xinhaoz

Previously, when the app was not specified in the URL search params for the txn details fingerprint page, the page would fail to load. This commit allows the page to load when no app is specified but the payload contains a fingerprint ID that matches the requested page. The first matching fingerprint ID is loaded.

Additionally, the TransactionDetailsLink will not include the appNames search param unless the provided prop is non-nullish.

Fixes: #107731

Release note (bug fix): Txn fingerprint details page in the console UI should load with the fingerprint details even if no app is specified in the URL.




Demo:
https://www.loom.com/share/810308d3dcd74ca888c42287ebafaecf

107745: kvserver: fix test merge queue when grunning unsupported r=irfansharif a=kvoli

`TestMergeQueue/load-based-merging/switch...below-threshold` asserts that switching the split objective between CPU and QPS will not cause ranges to merge, even if their pre-switch load qualified them for merging.

This test was broken when `grunning` was unsupported, as the objective never actually switches to anything other than QPS.

Add a check for `grunning` support, and assert that a merge occurs if unsupported.

Fixes: #106937
Epic: none
Release note: None

107749: opt: add enable_durable_locking_for_serializable session variable r=DrewKimball,nvanbenschoten a=michae2

Follow-up from #105857

This commit amends 6a3e43d to add a session variable to control whether guaranteed-durable locks are used under serializable isolation.

Informs: #100194

Epic: CRDB-25322

Release note (sql change): Add a new session variable, `enable_durable_locking_for_serializable`, which controls locking durability under serializable isolation. With this set to true, SELECT FOR UPDATE locks, SELECT FOR SHARED locks, and constraint check locks (e.g. locks acquired during foreign key checks if
`enable_implicit_fk_locking_for_serializable` is set to true) will be guaranteed-durable under serializable isolation, meaning they will always be held to transaction commit. (These locks are always guaranteed-durable under weaker isolation levels.)

By default, under serializable isolation these locks are best-effort rather than guaranteed-durable, meaning in some cases (e.g. leaseholder transfer, node loss, etc.) they could be released before transaction commit. Serializable isolation does not rely on locking for correctness, only using it to improve performance under contention, so this default is a deliberate choice to avoid the performance overhead of lock replication.

107752: changefeedccl: prevent deadlock in TestChangefeedKafkaMessageTooLarge r=miretskiy a=jayshrivastava

Previously, this test would deadlock due to kafka retrying messages too many times. These messages are stored in a buffer of size 1024 created by the CDC testing infra: https://github.com/cockroachdb/cockroach/blob/5c3f96d38cdc3a2d953ca3ffb1e39e97d7e5110e/pkg/ccl/changefeedccl/testfeed_test.go#L1819

The test asserts that 2000 messages pass through the buffer. When the test finishes, it stops reading from the buffer. The problem is that, due to retries, more messages may be sent to the buffer than are read out of it. Even after the 2000 messages are read and the test is shutting down, the sink may be blocked trying to put resolved messages (plus retries) into the buffer. If this happens, the changefeed resumer (the same goroutine as the kafka sink) gets blocked and does not terminate when the job is cancelled at the end of the test.

This change caps the number of retries at 200 for this test, so there should be no more than 200 extra messages plus a few resolved messages during this test. This is far less than the buffer size of 1024.

See detailed explanation in #107591.

Fixes: #107591
Epic: none
Release note: None
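
The blocking mechanism is the classic producer stuck on a full buffer once the consumer stops reading. A minimal standalone sketch (sizes and names are illustrative, not taken from the test):

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		buf := make(chan int, 4) // stand-in for the size-1024 test buffer

		// Producer: like the kafka sink, it keeps writing messages plus retries.
		go func() {
			for i := 0; ; i++ {
				buf <- i // blocks forever once the buffer is full and unread
			}
		}()

		// Consumer: like the test, it reads a fixed quota and then stops.
		for i := 0; i < 2; i++ {
			fmt.Println("consumed:", <-buf)
		}

		time.Sleep(100 * time.Millisecond)
		fmt.Println("producer is now blocked on the full buffer; anything waiting for it to exit would hang")
		// Capping retries (the fix) bounds how much the producer can ever
		// write, so everything fits in the buffer and the goroutine terminates.
	}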

107802: teamcity-trigger: don't start a job for an empty target r=healthy-pod a=rickystewart

This makes no sense, so skip these cases.

Closes: #107779
Closes: #107780
Closes: #107781

Epic: none
Release note: None
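
The guard described above is presumably just an early skip on empty targets (a hypothetical sketch, not the actual diff):

	package main

	import "fmt"

	func triggerJobs(targets []string) {
		for _, target := range targets {
			if target == "" {
				continue // starting a job for an empty target makes no sense
			}
			fmt.Println("starting TeamCity job for", target)
		}
	}

	func main() {
		triggerJobs([]string{"//pkg/sql", "", "//pkg/kv"})
	}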

107803: githubpost: set `map` field if `null` r=healthy-pod a=rickystewart

Go is a really good language.

Informs: #107779

Epic: none
Release note: None
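
The footgun being worked around here: assigning into a nil Go map panics at runtime, so a map field deserialized as null must be initialized before use (an illustrative sketch; the field name is hypothetical):

	package main

	import "fmt"

	func main() {
		var labels map[string]string // zero value: nil
		// labels["branch"] = "master" // would panic: assignment to entry in nil map
		if labels == nil {
			labels = make(map[string]string) // set the map field if null
		}
		labels["branch"] = "master"
		fmt.Println(labels)
	}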

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@cockroach-teamcity

server.TestServerShutdownReleasesSession failed with artifacts on master @ 167da6563fbc62ff7ad447bcf44bc24f336aef6b:

=== RUN   TestServerShutdownReleasesSession
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/7f3306bd25c63f593c1cefe649bebfca/logTestServerShutdownReleasesSession3453402071
    test_log_scope.go:81: use -show-logs to present logs inline
    drain_test.go:336: error executing 'SELECT session_id FROM system.sqlliveness WHERE session_id = $1': pq: failed to connect to n2 at : initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp: missing address" [code 14/Unavailable]
    panic.go:522: -- test log scope end --
--- FAIL: TestServerShutdownReleasesSession (4.26s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!
