[sled agent] Serialize Nexus notification queue #1917

Open
smklein opened this issue Nov 4, 2022 · 10 comments
Assignees
Labels
Sled Agent Related to the Per-Sled Configuration and Management

Comments

@smklein
Collaborator

smklein commented Nov 4, 2022

There are many spots in the Sled Agent where we send notifications up to Nexus, such as:

```rust
// Closure that attempts to register this sled with Nexus.
let notify_nexus = || async {
    info!(
        log,
        "contacting server nexus, registering sled: {}", sled_id
    );
    // Report whether this sled is a Scrimlet (has a Tofino switch) or a Gimlet.
    let role = if is_scrimlet {
        nexus_client::types::SledRole::Scrimlet
    } else {
        nexus_client::types::SledRole::Gimlet
    };
    let nexus_client = lazy_nexus_client
        .get()
        .await
        .map_err(|err| BackoffError::transient(err.to_string()))?;
    nexus_client
        .sled_agent_put(
            &sled_id,
            &nexus_client::types::SledAgentStartupInfo {
                sa_address: sled_address.to_string(),
                role,
            },
        )
        .await
        .map_err(|err| BackoffError::transient(err.to_string()))
};
// Log each failed attempt; errors are marked transient above, so we retry.
let log_notification_failure = |err, delay| {
    warn!(
        log,
        "failed to notify nexus about sled agent: {}, will retry in {:?}", err, delay;
    );
};
// Retry with backoff until the notification succeeds.
retry_notify(
    internal_service_policy_with_max(
        std::time::Duration::from_secs(1),
    ),
    notify_nexus,
    log_notification_failure,
)
.await
.expect("Expected an infinite retry loop contacting Nexus");
```

These notification calling sites are spread around Sled Agent. For unrelated updates, this is no problem. However, for related updates, this can unfortunately result in "earlier updates" trampling over "later ones".

Example problems:

  • If we attempt to notify Nexus about a new sled - as a "Gimlet" - but later try to notify Nexus about the sled being updated to a "Scrimlet", then the ordering of the two notifications is critical. It's important that Nexus eventually sees the sled as a Scrimlet, and if a retry loop happens to mean that the "see sled as gimlet" notification arrives last, we'd end up in an inconsistent state.
  • We can try to notify Nexus about datasets before zpools, and about zpools before the sled itself. This results in a "failure + retry", which is fine, but produces some confusing log messages.

Proposal:

  • Create a more broadly-usable "notification queue", where calls to nexus may be serialized.
@smklein smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Nov 4, 2022
@smklein smklein self-assigned this Nov 4, 2022
@andrewjstone
Contributor

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

@smklein
Collaborator Author

smklein commented Nov 4, 2022

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved.

I don't think this should matter - even if we're communicating with multiple Nexuses, we won't get a response until that Nexus has committed our message to durable storage (CRDB) which should be shared among different Nexuses. If awaiting that response blocks the subsequent request, we've successfully serialized our notifications.

There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

Yeah, this is true, and a bit more complicated. Not all notifications from the Sled Agent are equal, so it isn't necessarily right to "just ignore everything but the latest update", either.

@andrewjstone
Contributor

Ah, ok. I guess I didn't realize you'd be waiting for the calls to get a reply, which I probably should have! Carry on!

@davepacheco
Collaborator

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.
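For illustration, a minimal sketch of that conditional write, using an in-memory map in place of the database and made-up type names (this is not the actual Nexus code):

```rust
use std::collections::HashMap;

// Hypothetical runtime-state record as Nexus might store it (illustrative only).
#[derive(Clone, Debug)]
struct InstanceRuntimeState {
    generation: u64,
    state: String,
}

// Apply an incoming notification only if it is newer than what is stored.
// Duplicates (retransmissions) and out-of-order arrivals are simply dropped.
fn apply_runtime_state(
    db: &mut HashMap<String, InstanceRuntimeState>,
    instance_id: &str,
    incoming: InstanceRuntimeState,
) -> bool {
    match db.get(instance_id) {
        // The stored record is at least as new: ignore the notification.
        Some(existing) if existing.generation >= incoming.generation => false,
        // Otherwise record the incoming state.
        _ => {
            db.insert(instance_id.to_string(), incoming);
            true
        }
    }
}
```

In the real path this comparison would live in the conditional update against the database rather than an in-memory map.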

My intuition (and I suspect @andrewjstone's) is that trying to synchronize all notifications is likely to be more brittle than trying to gracefully handle them on the other side in whatever order they occur, even if that just means rejecting some and having them retried, like it sounds like we do today with the zpool/sled/dataset messages. The kinds of problems I'm imagining include: queue growing without bound, starvation of some later notification because of some problem with a previous notification that's causing the whole queue to be stuck, and having notifications pile up because of any transient Nexus/database issue such that the notifications at the beginning of the queue are old and stale (and maybe superseded by later ones).

@andrewjstone
Contributor

Those are great points. I wasn't caffeinated enough when I originally objected to think more deeply about my reasons. It was just intuition indeed. The problem of a lost reply is definitely an issue here. For instance, if we time out and send up the next queued request, instead of waiting forever and building the queue backlog, the old request could be floating around in the network and get applied out of order. Generation numbers feel safer to me.

@smklein
Collaborator Author

smklein commented Nov 7, 2022

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.

The usage of a generation number makes sense to me when we care about the "latest event only" - if Sled Agent emits two different runtime states, this technique ensures that the latter one is received.

I think this example works with, for example, instance states, because:

  1. Resource Granularity: There is only a single property we care about (the state of one virtual machine)
  2. Last-Event, not All Events: We are okay with the last-one-wins policy

Sled Agent is currently notifying Nexus of a few different things, including:

  • What hardware exists on the current Sled (this is currently the is_scrimlet field, but presumably it'll expand to also include driver information)?
  • What Zpools exist?
  • What Datasets exist (though this should get flipped, to be Nexus-driven)?

I think that if we add a generation number for Sled Agent to use when reporting notifications to Nexus, we'll actually need to add multiple generation numbers. For example...

  • If a Sled reports its status ("I exist, I don't see /dev/tofino, so I think I'm a Gimlet") and then later sees the tofino driver ("I exist, I do have /dev/tofino, so I'm a Scrimlet"), the latter event should always eventually persist over the former event.
  • If a Sled reports a zpool was added ("I see a new zpool with UUID = X") and then later sees the zpool was removed ("I do not see a zpool with UUID = X"), it should settle on the latter event.

I think an implementation like this would only work if we had multiple, distinct generation numbers for each resource where we're okay just taking the "latest" info.

For example, the following would be wrong:

  • Sled state vs Zpool state
    • Sled reports it exists as a gimlet (gen # = 1)
    • Sled reports a zpool was added (gen # = 2)
    • Sled reports it exists as a scrimlet (gen # = 3), because the tofino driver loaded
    • If event 3 arrives before event 2, and causes event 2 to be discarded, that's a bug - these are different resources, and should be using different generation #s.
  • Multiple Zpools
    • Sled reports zpool A was added (gen # = 1)
    • Sled reports zpool B was added (gen # = 2)
    • Sled reports zpool A was removed (gen # = 3)
    • If event 3 arrives before event 2, and causes event 2 to be discarded, that's a bug - each zpool is a distinct resource, and needs a distinct generation number. Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit.

This seems doable, but it will require more structure to the Sled Agent -> Nexus notification pathway. I'll poke around at this approach.
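
For illustration, a rough sketch of what distinct, per-resource generation numbers might look like on the wire (the type and field names here are hypothetical, not the real nexus_client types):

```rust
// Hypothetical sled-agent -> Nexus notifications, each guarded by its own
// generation number so that unrelated resources can't trample each other.
#[derive(Clone, Debug)]
enum SledNotification {
    // Gimlet-vs-Scrimlet status for this sled, with its own generation.
    SledRole { generation: u64, is_scrimlet: bool },
    // The full set of zpools currently visible on the sled, treated as a
    // single unit with a separate generation number.
    Zpools { generation: u64, zpool_ids: Vec<String> },
}
```

Nexus would then compare the generation only against the stored value for that same resource, so a stale SledRole update could never discard a newer Zpools update or vice versa.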

@rmustacc

rmustacc commented Nov 7, 2022

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply delta as an optimization. If the above ever fails, you just send the whole current state.

That may not work, but if you assume that nexus is the one that actually knows what the state of the world is and think of this as sending a topology snapshot where it's up to nexus to figure out the diff, that could possibly make sense. In particular, Event A in the zpool case is, I think, more complex assuming a disk pull or fault. It's actually more like:

  • A disk was removed (this needs to be reported to update inventory tracking) or a disk was faulted and here's why. This involves communicating and updating two different things:
    • The inventory-centric view of this disk, that is the disk as a primary key in inventory and recording this event happened.
    • The sled-centric view of things and reporting that a device in slot N3 (for example) is gone.
  • Various storage volumes are gone (perhaps this is reported as the whole pool, as phrased above)

The reason to think about this is that Nexus can crash at any time processing this and folks may have to retry. Conditional PUTs (whether with etags or something else) help here. But also when a sled comes back up or has batched up a lot of changes because it's offline, it may just want to send the whole thing.

Obviously reporting and processing all state plus the delta can be more expensive, but given that we should be able to encapsulate a lot of this in a single topology view of the world that we serialize, that may be easier than trying to track each of these individually. Also, there's the critical point that if sled agent crashes, it can't assume it has given nexus any information, because it literally doesn't know, and it should send the entire state of the world to have it make sense again.

@smklein
Collaborator Author

smklein commented Nov 7, 2022

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply delta as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach of

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."

That I mentioned in my comment above.

@rmustacc

rmustacc commented Nov 7, 2022

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply delta as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach of

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."

That I mentioned in my comment above.

So the one difference between what you described and what I tried to describe, I think, was that you had a separate thing for each sub-resource, e.g. a separate thing for all zpools or, say, all NICs, or other resources. I was trying to say that there is one, single payload for all of the gimlet and then deltas from there, where that optimization makes sense and can be applied linearizably (with the scheme that @davepacheco and @andrewjstone described).
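
For illustration, a rough sketch of what such a single, sled-wide payload might look like (all type and field names are hypothetical, not the actual sled-agent API):

```rust
// Hypothetical "whole sled" snapshot, sent on sled-agent startup or whenever a
// conditional delta PUT fails; Nexus diffs it against whatever it already has.
#[derive(Clone, Debug)]
struct SledTopologySnapshot {
    // Single generation number covering the entire snapshot.
    generation: u64,
    // Hardware view: does this sled see /dev/tofino (i.e., is it a Scrimlet)?
    is_scrimlet: bool,
    // Everything the sled currently sees, not just what changed.
    zpools: Vec<ZpoolInfo>,
    datasets: Vec<DatasetInfo>,
}

#[derive(Clone, Debug)]
struct ZpoolInfo {
    id: String,
    total_size_bytes: u64,
}

#[derive(Clone, Debug)]
struct DatasetInfo {
    id: String,
    zpool_id: String,
}
```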

@andrewjstone
Contributor

andrewjstone commented Nov 7, 2022

@smklein Your examples about last-write-wins and the problems there are dead on, and indeed that problem can be solved with a queue. Unfortunately the queue may lead to other issues as mentioned here. I think @rmustacc makes a good suggestion overall and I don't really see much problem with conditional writes as a whole for sled-agent state.

From an implementation point of view, this issue mentions that the notification calling sites are spread across sled agent. We could collate them with a channel internally to a single task that then performs the conditional updates, so we don't do that repeatedly and end up with a bunch of optimistic concurrency failures at CockroachDB.
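
For illustration, a minimal sketch of that collation pattern, assuming a tokio mpsc channel and the hypothetical notification type from the earlier sketch (this is not the actual sled-agent code):

```rust
use tokio::sync::mpsc;

// Hypothetical notification type; mirrors the per-resource sketch above.
#[derive(Debug)]
enum SledNotification {
    SledRole { generation: u64, is_scrimlet: bool },
    Zpools { generation: u64, zpool_ids: Vec<String> },
}

// Spawn a single task that owns all communication with Nexus and applies
// notifications one at a time; the scattered call sites just clone the Sender
// and push into the channel instead of talking to Nexus directly.
fn spawn_notifier() -> mpsc::Sender<SledNotification> {
    let (tx, mut rx) = mpsc::channel::<SledNotification>(32);
    tokio::spawn(async move {
        while let Some(notification) = rx.recv().await {
            // In the real sled-agent this would be a conditional PUT to Nexus,
            // retried with backoff; here we only log the intent.
            println!("would notify Nexus: {:?}", notification);
        }
    });
    tx
}
```

Because a single task performs the updates, we avoid racing conditional writes against ourselves, and the generation checks on the Nexus side still guard against anything that slips through (e.g. a request retried from before a crash).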

smklein added a commit that referenced this issue Nov 10, 2022
… decision (#1918)

- Adds `hardware` module to Sled Agent for minimally monitoring `devinfo` output. This will certainly evolve, but this PR includes a "bare minimum" real example of tracking the tofino driver.
- Stop relying on `rss-config.toml` to decide if we're running on a scrimlet. Instead...
  - (General case) Rely on monitoring hardware
  - (Testing, manual case) Provide a `force-scrimlet` option to make the sled agent assume that it is a scrimlet

Fixes oxidecomputer/minimum-upgradable-product#19
Part of #1917
Part of #823
Pre-requisite for oxidecomputer/minimum-upgradable-product#16
Pre-requisite for oxidecomputer/minimum-upgradable-product#18
andrewjstone added a commit that referenced this issue May 26, 2023
Before underlay access has been allocated to the `StorageWorker`, there
is no chance of successfully talking to nexus. Delay any notifications
until after underlay access is available. This occurs when sled-agent
starts, after disks and zpools may have already been allocated/recorded
locally.

By waiting we reduce unnecessary backoff timer increases that are
not due to any actual faults and that delay normal sled startup. This is
especially important during rack setup since those backoff timers wait
indefinitely until RSS is run.

This is a partial fix to dependency order, but just a small one.
We should take a more systematic approach a la #1917 in the future.
andrewjstone added a commit that referenced this issue May 26, 2023
Before underlay access has been allocated to the `StorageWorker`, there
is no chance of successfully talking to nexus. Delay any notifications
until after underlay access is available. This occurs when sled-agent
starts, after disks and zpools may have already been allocated/recorded
locally.

By waiting we reduce unnecessary backoff timer increases that are not
due to any actual faults and that delay normal sled startup. This is
especially important during rack setup since those backoff timers wait
indefinitely until RSS is run.

This is a partial fix to dependency order, but just a small one. We
should take a more systematic approach a la #1917 in the future.