
sql,admission: ERROR: liveness session expired XXXms before transaction #78691

Closed
cucaroach opened this issue Mar 29, 2022 · 7 comments · Fixed by #98785
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@cucaroach
Contributor

cucaroach commented Mar 29, 2022

Multiple tenants hitting a kv server with a large batched-insert load can cause this error. For example, see:

#77481

Changing the warmup batch size from 100 to 1000 reliably reproduces the issue.

Jira issue: CRDB-14219

@cucaroach cucaroach added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 29, 2022
@ajwerner
Contributor

I think there are two questions here:

  1. How did we get to a place where the session was about to expire, and
  2. Why are we returning this error as opposed to just forcing the system to retry internally?

I feel like it's possible that kv server rate limiting alone is starving off session renewal, given the shape of the test. As for the error, we should probably treat it as a restart internally, like we do when failing to meet a transaction deadline.
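
For illustration only, here's a minimal, self-contained sketch (hypothetical names, not CockroachDB's actual internals) of what "treat it as a restart" could mean: the session check returns an error marked retryable so the layer above restarts the transaction instead of surfacing a hard error to the client.

```
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRetryable is a hypothetical sentinel marking errors that should be
// handled by restarting the transaction rather than returned to the client.
var errRetryable = errors.New("retryable")

// checkSessionBeforeCommit sketches the idea: if the liveness session expired
// before the transaction's commit timestamp, return an error wrapped with the
// retryable sentinel instead of a terminal error.
func checkSessionBeforeCommit(sessionExpiry, commitTS time.Time) error {
	if sessionExpiry.Before(commitTS) {
		return fmt.Errorf("liveness session expired %s before transaction: %w",
			commitTS.Sub(sessionExpiry), errRetryable)
	}
	return nil
}

func main() {
	err := checkSessionBeforeCommit(time.Now().Add(-time.Second), time.Now())
	if errors.Is(err, errRetryable) {
		fmt.Println("would restart the transaction internally:", err)
	}
}
```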

@cucaroach
Contributor Author

I hit this in another, simpler test with just one node in the storage tier. In this case I wasn't getting any table statistics, so I looked at the error from:

```
> github.com/cockroachdb/cockroach/pkg/sql/stats.(*Refresher).refreshStats() github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/automatic_stats.go:773 (hits goroutine(2072):2 total:2) (PC: 0x1a9ca91)
Warning: debugging optimized function
(dlv) p err.cause.cause
error(*github.com/cockroachdb/errors/withstack.withStack) *{
        cause: error(*github.com/cockroachdb/errors/errutil.leafError) *{
                msg: "liveness session expired 8m6.020911792s before transaction",},
        stack: *github.com/cockroachdb/errors/withstack.stack len: 11, cap: 32, [27814108,52907324,52907254,52905276,52968880,52903647,52858847,52852936,52847627,54041388,4978305],}
```
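
As an aside, the same cause chain can be walked programmatically with the github.com/cockroachdb/errors package rather than by hand in the debugger; a minimal sketch (the wrapped error below is just a stand-in):

```
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// dumpCauses prints each layer of a wrapped error, mirroring the manual
// err.cause.cause inspection above.
func dumpCauses(err error) {
	for err != nil {
		fmt.Printf("%T: %v\n", err, err)
		err = errors.UnwrapOnce(err)
	}
}

func main() {
	err := errors.Wrap(
		errors.New("liveness session expired 8m6.020911792s before transaction"),
		"refreshing stats")
	dumpCauses(err)
}
```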

Then I noticed my kv node log was full of:

```
W220923 13:24:46.348441 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r122878/1:‹/Tenant/11/Table/104/1/0/968…›] 1941239  error processing replica: ‹0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster›‹›
W220923 13:24:46.348441 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r122878/1:‹/Tenant/11/Table/104/1/0/968…›] 1941239 +‹trace:›
W220923 13:24:46.348441 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r122878/1:‹/Tenant/11/Table/104/1/0/968…›] 1941239 +‹     0.000ms      0.000ms    === operation:process replica _unfinished:1 _verbose:1 node:1 replicate: store:1 range:122878/1:/Tenant/11/Table/104/1/0/968…›
W220923 13:24:46.348441 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r122878/1:‹/Tenant/11/Table/104/1/0/968…›] 1941239 +‹     0.036ms      0.036ms    event:kv/kvserver/allocator/allocatorimpl/allocator.go:682 [n1,replicate,s1,r122878/1:/Tenant/11/Table/104/1/0/968…] add voter - missing voter need=3, have=1, priority=10001.00›
W220923 13:24:46.348441 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r122878/1:‹/Tenant/11/Table/104/1/0/968…›] 1941239 +‹     0.050ms      0.014ms    event:kv/kvserver/replicate_queue.go:864 [n1,replicate,s1,r122878/1:/Tenant/11/Table/104/1/0/968…] next replica action: add voter›
W220923 13:24:46.348441 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r122878/1:‹/Tenant/11/Table/104/1/0/968…›] 1941239 +‹     0.087ms      0.037ms    event:kv/kvserver/allocator/allocatorimpl/allocator.go:1037 [n1,replicate,s1,r122878/1:/Tenant/11/Table/104/1/0/968…] allocate voter: []›
W220923 13:24:46.349130 301 13@kv/kvserver/replicate_queue.go:815 ⋮ [n1,replicate,s1,r108120/1:‹/Tenant/11/Table/104/1/5/839…›] 1941240  error processing replica: ‹0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster›‹›
```

So I added two more storage nodes and now it's happy. The current hypothesis is that these liveness errors occur because the storage tier is under-provisioned and in an unhappy state. Should I close this, or do we care about the ability to function in this situation?

@ajwerner
Contributor

If you pause in your debugger for too long, I'd expect this.

@ajwerner
Contributor

I think we should close this; I'm not sure there's much to do here. The second one is #87452.

@irfansharif
Contributor

Re-enabled these tests, which started tripping again in #97448. Re-opening to track:

Why are we returning this error as opposed to just forcing the system to retry internally?

And also maybe using higher AC priority for session renewal goroutines, which would help with:

How did we get to a place where the session was about to expire, and

@irfansharif irfansharif reopened this Mar 16, 2023
@irfansharif irfansharif self-assigned this Mar 16, 2023
@irfansharif irfansharif changed the title multitenant: ERROR: liveness session expired XXXms before transaction sql,admission: ERROR: liveness session expired XXXms before transaction Mar 16, 2023
@irfansharif
Contributor

And also maybe using higher AC priority for session renewal goroutines:

Actually the higher pri won’t help, since session renewal is bypassing AC altogether:

```
err = s.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
```

That uses AdmissionHeader_OTHER, which hooks into:

```
if source == kvpb.AdmissionHeader_OTHER {
	bypassAdmission = true
}
```

@irfansharif
Contributor

Actually the higher pri won’t help, since session renewal is bypassing AC altogether.

Actually, I’m being daft. This is a multi-tenant test, so we’re actually not bypassing AC. We’re just using regular priority then for SQL session extensions.

```
bypassAdmission := ba.IsAdmin()
source := ba.AdmissionHeader.Source
if !roachpb.IsSystemTenantID(tenantID.ToUint64()) {
	// Request is from a SQL node.
	bypassAdmission = false
	source = kvpb.AdmissionHeader_FROM_SQL
}
if source == kvpb.AdmissionHeader_OTHER {
```
I'll go back to explicitly using high pri first.
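
To make that concrete, here's a hypothetical, self-contained mirror of the decision logic quoted above (fake types and an illustrative priority constant, not the real kvserver/admission code): tenant requests lose the AdmissionHeader_OTHER bypass, so the priority carried in the header is the remaining lever.

```
package main

import "fmt"

// Hypothetical stand-ins for the real source and priority types.
type source int

const (
	fromSQL source = iota
	other
)

type workPriority int8

const (
	normalPri workPriority = 0
	highPri   workPriority = 64 // illustrative value, not admissionpb's actual constant
)

type admissionHeader struct {
	source   source
	priority workPriority
}

// decide mirrors the quoted logic: tenant traffic is always re-sourced to
// FROM_SQL, so the OTHER bypass never fires for it; the request queues at
// whatever priority its header carries.
func decide(h admissionHeader, isAdmin, isSystemTenant bool) (bypass bool, pri workPriority) {
	bypass = isAdmin
	src := h.source
	if !isSystemTenant {
		// Request is from a SQL pod.
		bypass = false
		src = fromSQL
	}
	if src == other {
		bypass = true
	}
	return bypass, h.priority
}

func main() {
	// Tenant SQL liveness extension today: no bypass, NormalPri, so it can starve.
	bypass, pri := decide(admissionHeader{source: other, priority: normalPri}, false, false)
	fmt.Println("tenant liveness extension today:", bypass, pri)
	// With an explicitly high priority in the header it still queues, but ahead
	// of the bulk insert traffic.
	bypass, pri = decide(admissionHeader{source: other, priority: highPri}, false, false)
	fmt.Println("with an explicit high priority:", bypass, pri)
}
```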

irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 16, 2023
Fixes cockroachdb#97448 (possibly).
Fixes cockroachdb#78691.

These tests run under severe CPU overload, and we see the workload
observing the following errors:

  ERROR: liveness session expired 571.043163ms before transaction

The SQL liveness lease extension work ends up getting severely starved,
despite extending leases by 40s every 5s. It turns out for tenant
SQL liveness work, we were using admissionpb.NormalPri, so such
starvation was possible. This wasn't true for the system tenant where we
bypassed AC altogether.

Release note: None
craig bot pushed a commit that referenced this issue Mar 21, 2023
98207: sql: add REPLICATION and MANAGETENANT system privileges  r=msbutler a=stevendanna

This adds two new system privileges:

- `REPLICATION`: Allows the user to call the internal functions that
produce a cross-cluster replication stream.

- `MANAGETENANT`: Allows the user to create and manage
  tenants.

A user with the MANAGETENANT privilege is now able to execute
the following statements:

- SHOW TENANT
- SHOW TENANTS
- CREATE TENANT
- CREATE TENANT FROM REPLICATION STREAM
- DROP TENANT (if it is part of an active stream)
- ALTER TENANT

A user with the REPLICATION privilege is able to call the following
functions:

- crdb_internal.start_replication_stream
- crdb_internal.replication_stream_progress
- crdb_internal.stream_partition
- crdb_internal.replication_stream_spec
- crdb_internal.complete_replication_stream

Fixes #95425

Release note: None

98785: roachtest: de-flake multitenant-fairness/read-heavy/skewed r=irfansharif a=irfansharif

Fixes #97448 (possibly).
Fixes #78691.

These tests run under severe CPU overload, and we see the workload
observing the following errors:

  ERROR: liveness session expired 571.043163ms before transaction

The SQL liveness lease extension work ends up getting severely starved,
despite extending leases by 40s every 5s. It turns out for tenant
SQL liveness work, we were using admissionpb.NormalPri, so such
starvation was possible. This wasn't true for the system tenant where we
bypassed AC altogether.

Release note: None

98821: sql: TestRandomSyntaxGeneration fixes r=cucaroach a=cucaroach

### sql: fix some TestRandomSyntaxGeneration bugs

The RSG works by calling Format on the ASTs it generates, so it's good
at finding Format bugs.

Fix a missing separator in ShowBackupOptions. Example:

```
SHOW BACKUP 'family' IN ('string', 'placeholder', 'placeholder', 'placeholder', 'string', 'placeholder', 'string', 'placeholder') WITH incremental_location = 'nullif', privilegesdebug_dump_metadata_sst
```

Fix a bad construction in ShowTenant. Example:

```
SHOW TENANT [B'10010'] WITH REPLICATION STATUS WITH CAPABILITIES
```

Epic: none
Release note: None

### copy: fix copy grammar to match PG

Previously, COPY would allow a wide range of syntax in the COPY
TO substatement. Now, like PG, we limit it to a few statement types.

PG grammar is:
```
PreparableStmt:
                        SelectStmt
                        | InsertStmt
                        | UpdateStmt
                        | DeleteStmt
                        | MergeStmt
```

We now do something similar. This prevents the wheels from coming
off when the RSG generates EXPLAINs in the substatement, for instance.

Release note: None
Epic: none


99109: kvserver: fortify TestReplicaClosedTimestamp r=erikgrinaker a=tbg

This test was flaky until ~Feb 2nd. This has since resolved, likely as a
result of some other change; however, there's an easy way to make the
test a bit more resilient by widening a critical section.

Closes #93864

Epic: none
Release note: None


99122: ccl/multiregionccl: skip flaky secondary_region test r=matthewtodd a=matthewtodd

Part of #92235.
Part of #98020.

It [flaked][1] this morning, after last night's [other skip][2] landed.

[1]: #98020 (comment)
[2]: #99031

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
@craig craig bot closed this as completed in 15dc772 Mar 21, 2023