
Nexus SEC operations panic when CockroachDB is unavailable #2416

Closed
davepacheco opened this issue Feb 23, 2023 · 13 comments · Fixed by #5390

@davepacheco
Collaborator

Nexus SEC operations currently panic if they fail:

// TODO-robustness This should be wrapped with a retry loop rather than
// unwrapping the result.
self.datastore.saga_create_event(&our_event).await.unwrap();

They should use a retry loop instead. I think this should be pretty straightforward as long as we're willing to block saga progress on CockroachDB coming back (which it seems like we should).
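
For illustration, a minimal sketch of what such a retry loop could look like (a hypothetical helper, not the actual omicron code; assumes tokio is available):

```rust
use std::future::Future;
use std::time::Duration;

/// Retry `op` until it succeeds, backing off exponentially up to `max_delay`.
/// Sketch only: a real implementation would also want structured logging and
/// a way to distinguish transient from permanent errors.
async fn retry_until_ok<T, E, F, Fut>(mut op: F, max_delay: Duration) -> T
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Debug,
{
    let mut delay = Duration::from_millis(100);
    loop {
        match op().await {
            Ok(value) => return value,
            Err(error) => {
                // e.g. CockroachDB is unavailable: wait and try again
                // instead of panicking via unwrap().
                eprintln!("saga event write failed (will retry): {error:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```

At the call site this would replace the `unwrap()`, e.g. something like `retry_until_ok(|| self.datastore.saga_create_event(&our_event), Duration::from_secs(30)).await;`, assuming the write is safe to re-issue (idempotent) when we don't know whether a previous attempt landed.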

@smklein
Collaborator

smklein commented Mar 30, 2024

Saw this on a March 28th iteration of dogfood

@sunshowers
Contributor

sunshowers commented Mar 30, 2024

Adding to what @smklein said -- we have two cores at /staff/dock/rack2/mupdate-20240329/cores/sled-08.

With mdb <core>, then $c:

core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.10458.1711663318:

Loading modules: [ libumem.so.1 libnvpair.so.1 libc.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4fd55()
_ZN95_$LT$nexus_db_queries..db..sec_store..CockroachDbSecStore$u20$as$u20$steno..store..SecStore$GT$12record_event28_$u7b$$u7b$closure$u7d$$u7d$17hc33ebb1cf636f348E+0xcec()
remaining frames
_ZN5steno3sec3Sec12executor_log28_$u7b$$u7b$closure$u7d$$u7d$17hc49d437d71bb12e7E+0x53()
_ZN117_$LT$futures_util..stream..futures_unordered..FuturesUnordered$LT$Fut$GT$$u20$as$u20$futures_core..stream..Stream$GT$9poll_next17hfd98f22f6167186cE+0x1ac()
_ZN88_$LT$tokio..future..poll_fn..PollFn$LT$F$GT$$u20$as$u20$core..future..future..Future$GT$4poll17h87f05f880b8d9347E+0x28c()
_ZN5steno3sec3sec28_$u7b$$u7b$closure$u7d$$u7d$17he80dac5e7ee8228cE.llvm.5074909589756572123+0x3b9()
_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17hb16ad800f6b83c3eE+0x49()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h4c987821e435a118E+0x48()
_ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h5720ac884dcfa807E+0x146()
_ZN5tokio7runtime7context6scoped15Scoped$LT$T$GT$3set17h146604d8d66fce1bE+0xabe()
_ZN5tokio7runtime7context7runtime13enter_runtime17had12f1b61675d596E+0x193()
_ZN5tokio7runtime9scheduler12multi_thread6worker3run17hfc17fe6c20967720E+0x4b()
_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h9cfb48bd55732e89E+0x73()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h5f94343b6f772f15E+0x97()
_ZN5tokio7runtime8blocking4pool5Inner3run17h35df801e4bd32ce5E+0xe4()
_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h53b7bfcbc6b4686bE+0x3f()
_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h9026fcd303042064E+0x75()
_ZN3std3sys4unix6thread6Thread3new12thread_start17h1783cbcbbf061711E+0x29()
libc.so.1`_thrp_setup+0x77(fffff5ffeddcda40)
libc.so.1`_lwp_start()

And core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.5125.1711667683:

libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4f48a()
0x399e8a2()
_ZN98_$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$GT$9from_iter17h1c0a0b91677a258cE+0xc4()
_ZN159_$LT$omicron_nexus..app..background..sync_switch_configuration..SwitchPortSettingsManager$u20$as$u20$omicron_nexus..app..background..common..BackgroundTask$GT$8activate28_$u7b$$u7b$closure$u7d$$u7d$17h003d11b741f553f2E+0x26f91()
remaining frames
_ZN13omicron_nexus3app10background6common8TaskExec8activate28_$u7b$$u7b$closure$u7d$$u7d$17h60138c0e224896edE+0x233()
_ZN13omicron_nexus3app10background6common8TaskExec3run28_$u7b$$u7b$closure$u7d$$u7d$17hb86814c79d6f1a8fE.llvm.10434374178345103703+0x467()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17hdf5fcd0b9767bf4aE+0x83()
_ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h5720ac884dcfa807E+0x190()
_ZN5tokio7runtime7context6scoped15Scoped$LT$T$GT$3set17h146604d8d66fce1bE+0xabe()
_ZN5tokio7runtime7context7runtime13enter_runtime17had12f1b61675d596E+0x193()
_ZN5tokio7runtime9scheduler12multi_thread6worker3run17hfc17fe6c20967720E+0x4b()
_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h9cfb48bd55732e89E+0x73()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h5f94343b6f772f15E+0x97()
_ZN5tokio7runtime8blocking4pool5Inner3run17h35df801e4bd32ce5E+0xe4()
_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h53b7bfcbc6b4686bE+0x3f()
_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h9026fcd303042064E+0x75()
_ZN3std3sys4unix6thread6Thread3new12thread_start17h1783cbcbbf061711E+0x29()
libc.so.1`_thrp_setup+0x77(fffff5ffedddda40)
libc.so.1`_lwp_start()

@sunshowers
Contributor

The first one (pid 10458) seems to be

self.datastore.saga_create_event(&our_event).await.unwrap();

The second one (pid 5125) is from somewhere in the `SwitchPortSettingsManager` background task, judging by its stack trace above.

@sunshowers
Contributor

sunshowers commented Mar 30, 2024

The second pid 5125 one is unrelated to this issue -- have filed #5361 for that.

sunshowers added a commit that referenced this issue Apr 4, 2024
We observed during a mupdate on Friday that Nexus panicked because of
network flakiness. This is an attempt to address that by adding a retry
loop.

Fixes #2416.
@smklein reopened this Jul 15, 2024
@smklein
Collaborator

smklein commented Jul 15, 2024

I believe there is still an unwrap here, which can cause Nexus to panic if CRDB is unavailable:

async fn saga_update(&self, id: SagaId, update: steno::SagaCachedState) {
    // TODO-robustness We should track the current generation of the saga
    // and use it. We'll know this either from when it was created or when
    // it was recovered.
    info!(&self.log, "updating state";
        "saga_id" => id.to_string(),
        "new_state" => update.to_string()
    );
    // TODO-robustness This should be wrapped with a retry loop rather than
    // unwrapping the result. See omicron#2416.
    self.datastore
        .saga_update_state(id, update, self.sec_id, Generation::new())
        .await
        .unwrap();
}

This was the root cause of #6090

@sunshowers
Contributor

Ah yeah you're right, shouldn't have closed this. Sorry!

@smklein
Collaborator

smklein commented Jul 15, 2024

> Ah yeah you're right, shouldn't have closed this. Sorry!

No worries, it was easy to miss. Do you wanna take fixing this one, or should I?

@sunshowers
Contributor

I'll pick it up, thanks!

@davepacheco
Collaborator Author

@sunshowers see also #6090 (comment)

@sunshowers
Contributor

@davepacheco thanks -- the first one can be done easily enough I hope, but does the second one need optimistic concurrency/a generation number? If so, then we should just implement that.

@davepacheco
Collaborator Author

I think it already has that, just using a generation number that's made up? The "adopt_generation" was intended to be bumped whenever a takeover happens, but we haven't implemented that.

I'm not sure it's worth implementing a case we can't have in production and so can't test. What would we do if the OCC update fails for some reason?
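
For reference, the generation-based optimistic concurrency being discussed boils down to a conditional update: the write only applies if the stored current_sec and adopt_generation still match what the caller expects. A minimal in-memory sketch with hypothetical types (not the omicron schema or datastore API):

```rust
#[derive(Debug)]
struct SagaRow {
    current_sec: String,
    adopt_generation: u64,
    state: String,
}

/// Apply a state update only if the caller's view of saga ownership is
/// still current; otherwise fail with a "preconditions not met" error.
fn saga_update_state_occ(
    row: &mut SagaRow,
    expected_sec: &str,
    expected_gen: u64,
    new_state: &str,
) -> Result<(), String> {
    if row.current_sec != expected_sec || row.adopt_generation != expected_gen {
        // Another SEC has adopted the saga (and bumped adopt_generation),
        // so this update must fail rather than clobber the new owner's state.
        return Err(format!(
            "preconditions not met: expected sec={expected_sec} gen={expected_gen}, \
             found sec={} gen={}",
            row.current_sec, row.adopt_generation
        ));
    }
    row.state = new_state.to_string();
    Ok(())
}
```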

sunshowers added a commit that referenced this issue Jul 18, 2024
…potent (#6113)

See discussion in #2416
and #6090 (comment).

A summary of the changes here:

1. Made `saga_create_event` idempotent. Previously, creating another
event that duplicated the one which already existed would fail -- now it
succeeds. These events are meant to be an append-only idempotent log, so
this is okay. Also added a test for this.
2. `saga_update_state` was already idempotent -- added a test which made
sure of this. Also added a comment about how idempotence may not be
enough in the future.
3. Added a retry loop around saga state updates, similar to the one
around recording events.
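
For illustration, a minimal sketch of the idempotent-append behavior described in item 1 above, using a hypothetical in-memory stand-in rather than the real CockroachDB-backed datastore:

```rust
use std::collections::HashMap;

/// Hypothetical key: (saga id, event index). The real table keys differ.
type EventKey = (u64, u64);

#[derive(Default)]
struct EventLog {
    events: HashMap<EventKey, String>,
}

impl EventLog {
    /// Appending an event identical to one already recorded succeeds
    /// (idempotent), which makes retries safe; only a conflicting event
    /// for the same key is an error.
    fn saga_create_event(&mut self, key: EventKey, data: &str) -> Result<(), String> {
        if let Some(existing) = self.events.get(&key) {
            if existing.as_str() == data {
                // Duplicate of an identical event: succeed without rewriting.
                return Ok(());
            }
            return Err(format!("conflict: {existing:?} vs {data:?}"));
        }
        self.events.insert(key, data.to_string());
        Ok(())
    }
}
```
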
@sunshowers
Contributor

Grepped through omicron for "2416" and didn't see any other results, so I think this is done for real now.

@davepacheco
Collaborator Author

Related: oxidecomputer/steno#302. I hit this because in my PR I had broken things so that this retry loop kept hitting a permanent error:

00:19:42.794Z ERRO c39c0c31-e34e-4b77-9616-2a0bf956f9b6 (ServerContext): client error while updating saga state (likely requires operator intervention), retrying anyway
    call_count = 12
    error = Invalid Request: failed to update saga SagaId(9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2) with state Done: preconditions not met: expected current_sec = c39c0c31-e34e-4b77-9616-2a0bf956f9b6, adopt_generation = Generation(Generation(1)), but found current_sec = Some(c39c0c31-e34e-4b77-9616-2a0bf956f9b6), adopt_generation = Generation(Generation(2)), state = SagaCachedState(Running)
    file = nexus/db-queries/src/db/sec_store.rs:147
    new_state = done
    saga_id = 9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2
    total_duration = 612.261975772s

During that time, I couldn't list sagas. My guess is that other sagas couldn't complete, either. I think Nexus is doing nothing wrong and that Steno ought to behave better here but I figured I'd mention it here so folks were aware.
