[sled agent] Serialize Nexus notification queue #1917

Open
smklein opened this issue Nov 4, 2022 · 10 comments
Assignees
Labels
Sled Agent Related to the Per-Sled Configuration and Management

Comments

@smklein
Collaborator

smklein commented Nov 4, 2022

There are many spots in the Sled Agent where we send notifications up to Nexus, such as:

```rust
// Closure that attempts to register this sled with Nexus.
let notify_nexus = || async {
    info!(
        log,
        "contacting server nexus, registering sled: {}", sled_id
    );
    // Report whether this sled is a Scrimlet (has a Tofino switch) or a Gimlet.
    let role = if is_scrimlet {
        nexus_client::types::SledRole::Scrimlet
    } else {
        nexus_client::types::SledRole::Gimlet
    };
    let nexus_client = lazy_nexus_client
        .get()
        .await
        .map_err(|err| BackoffError::transient(err.to_string()))?;
    nexus_client
        .sled_agent_put(
            &sled_id,
            &nexus_client::types::SledAgentStartupInfo {
                sa_address: sled_address.to_string(),
                role,
            },
        )
        .await
        .map_err(|err| BackoffError::transient(err.to_string()))
};
// Log each failed attempt; errors are marked transient above, so we retry.
let log_notification_failure = |err, delay| {
    warn!(
        log,
        "failed to notify nexus about sled agent: {}, will retry in {:?}", err, delay;
    );
};
// Retry with backoff until the notification succeeds.
retry_notify(
    internal_service_policy_with_max(
        std::time::Duration::from_secs(1),
    ),
    notify_nexus,
    log_notification_failure,
)
.await
.expect("Expected an infinite retry loop contacting Nexus");
```

These notification calling sites are spread around Sled Agent. For unrelated updates, this is no problem. However, for related updates, this can unfortunately result in "earlier updates" trampling over "later ones".

Example problems:

  • If we attempt to notify Nexus about a new sled - as a "Gimlet" - but later try to notify Nexus about the sled being updated to a "Scrimlet", then the ordering of the two notifications is critical. It's important that Nexus eventually sees the sled as a Scrimlet, and if a retry loop happens to mean that the "see sled as gimlet" notification arrives last, we'd end up in an inconsistent state.
  • We can try to notify Nexus about datasets before zpools, and about zpools before the sled itself. This results in a "failure + retry", which is fine, but produces some confusing log messages.

Proposal:

  • Create a more broadly-usable "notification queue", where calls to nexus may be serialized.
@smklein smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Nov 4, 2022
@smklein smklein self-assigned this Nov 4, 2022
@andrewjstone
Contributor

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

@smklein
Collaborator Author

smklein commented Nov 4, 2022

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved.

I don't think this should matter - even if we're communicating with multiple Nexuses, we won't get a response until that Nexus has committed our message to durable storage (CRDB) which should be shared among different Nexuses. If awaiting that response blocks the subsequent request, we've successfully serialized our notifications.

There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

Yeah, this is true, and a bit more complicated. Not all notifications from the Sled Agent are equal, so it isn't necessarily right to "just ignore everything but the latest update", either.

@andrewjstone
Contributor

Ah, ok. I guess I didn't realize you'd be waiting for the calls to get a reply, which I probably should have! Carry on!

@davepacheco
Collaborator

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.
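For illustration, a minimal sketch of that conditional write, using an in-memory map in place of the database and made-up type names (this is not the actual Nexus code):

```rust
use std::collections::HashMap;

// Hypothetical runtime-state record as Nexus might store it (illustrative only).
#[derive(Clone, Debug)]
struct InstanceRuntimeState {
    generation: u64,
    state: String,
}

// Apply an incoming notification only if it is newer than what is stored.
// Duplicates (retransmissions) and out-of-order arrivals are simply dropped.
fn apply_runtime_state(
    db: &mut HashMap<String, InstanceRuntimeState>,
    instance_id: &str,
    incoming: InstanceRuntimeState,
) -> bool {
    match db.get(instance_id) {
        // The stored record is at least as new: ignore the notification.
        Some(existing) if existing.generation >= incoming.generation => false,
        // Otherwise record the incoming state.
        _ => {
            db.insert(instance_id.to_string(), incoming);
            true
        }
    }
}
```

In the real path this comparison would live in the conditional update against the database rather than an in-memory map.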

My intuition (and I suspect @andrewjstone's) is that trying to synchronize all notifications is likely to be more brittle than trying to gracefully handle them on the other side in whatever order they occur, even if that just means rejecting some and having them retried, like it sounds like we do today with the zpool/sled/dataset messages. The kinds of problems I'm imagining include: queue growing without bound, starvation of some later notification because of some problem with a previous notification that's causing the whole queue to be stuck, and having notifications pile up because of any transient Nexus/database issue such that the notifications at the beginning of the queue are old and stale (and maybe superseded by later ones).

@andrewjstone
Contributor

Those are great points. I wasn't caffeinated enough when I originally objected to think more deeply about my reasons. It was just intuition indeed. The problem of a lost reply is definitely an issue here. For instance, if we time out and send up the next queued request, instead of waiting forever and building the queue backlog, the old request could be floating around in the network and get applied out of order. Generation numbers feel safer to me.

@smklein
Collaborator Author

smklein commented Nov 7, 2022

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.

The usage of a generation number makes sense to me when we care about the "latest event only" - if Sled Agent emits two different runtime states, this technique ensures that the latter one is received.

I think this example works with, for example, instance states, because:

  1. Resource Granularity: There is only a single property we care about (the state of one virtual machine)
  2. Last-Event, not All Events: We are okay with the last-one-wins policy

Sled Agent is currently notifying Nexus of a few different things, including:

  • What hardware exists on the current Sled (this is currently the is_scrimlet field, but presumably it'll expand to also include driver information)?
  • What Zpools exist?
  • What Datasets exist (though this should get flipped, to be Nexus-driven)?

I think that if we add a generation number for Sled Agent to use when reporting notifications to Nexus, we'll actually need to add multiple generation numbers. For example...

  • If a Sled reports its status ("I exist, I don't see /dev/tofino, so I think I'm a Gimlet") and then later sees the tofino driver ("I exist, I do have /dev/tofino, so I'm a Scrimlet"), the latter event should always eventually persist over the former event.
  • If a Sled reports a zpool was added ("I see a new zpool with UUID = X") and then later sees the zpool was removed ("I do not see a zpool with UUID = X"), it should settle on the latter event.

I think an implementation like this would only work if we had multiple, distinct generation numbers for each resource where we're okay just taking the "latest" info.

For example, the following would be wrong:

  • Sled state vs Zpool state
    • Sled reports it exists as a gimlet (gen # = 1)
    • Sled reports a zpool was added (gen # = 2)
    • Sled reports it exists as a scrimlet (gen # = 3), because the tofino driver loaded
    • If event 3 arrives before event 2, and causes event 2 to be discarded, that's a bug - these are different resources, and should be using different generation #s.
  • Multiple Zpools
    • Sled reports zpool A was added (gen # = 1)
    • Sled reports zpool B was added (gen # = 2)
    • Sled reports zpool A was removed (gen # = 3)
    • If event 3 arrives before event 2, and causes event 2 to be discarded, that's a bug - each zpool is a distinct resource, and needs a distinct generation number. Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit.

This seems doable, but it will require more structure to the Sled Agent -> Nexus notification pathway. I'll poke around at this approach.
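
For illustration, a rough sketch of what distinct, per-resource generation numbers might look like on the wire (the type and field names here are hypothetical, not the real nexus_client types):

```rust
// Hypothetical sled-agent -> Nexus notifications, each guarded by its own
// generation number so that unrelated resources can't trample each other.
#[derive(Clone, Debug)]
enum SledNotification {
    // Gimlet-vs-Scrimlet status for this sled, with its own generation.
    SledRole { generation: u64, is_scrimlet: bool },
    // The full set of zpools currently visible on the sled, treated as a
    // single unit with a separate generation number.
    Zpools { generation: u64, zpool_ids: Vec<String> },
}
```

Nexus would then compare the generation only against the stored value for that same resource, so a stale SledRole update could never discard a newer Zpools update or vice versa.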

@rmustacc

rmustacc commented Nov 7, 2022

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply delta as an optimization. If the above ever fails, you just send the whole current state.

That may not work, but if you assume that nexus is the one that actually knows what the state of the world is and think of this as sending a topology snapshot where it's up to nexus to figure out the diff, that could possibly make sense. In particular, Event A in the zpool case is, I think, more complex assuming a disk pull or fault. It's actually more like:

  • A disk was removed (this needs to be reported to update inventory tracking) or a disk was faulted and here's why. This involves communicating and updating two different things:
    • The inventory-centric view of this disk, that is the disk as a primary key in inventory and recording this event happened.
    • The sled-centric view of things and reporting that a device in slot N3 (for example) is gone.
  • Various storage volumes are gone (perhaps this is reported as the whole pool, as phrased above)

The reason to think about this is that Nexus can crash at any time processing this and folks may have to retry. Conditional PUTs (whether with etags or something else) help here. But also when a sled comes back up or has batched up a lot of changes because it's offline, it may just want to send the whole thing.

Obviously reporting and processing all state plus the delta can be more expensive, but given that we should be able to encapsulate a lot of this in a single topology view of the world that we serialize, that may be easier than trying to track each of these individually. Also, there's the critical point that if sled agent crashes, it can't assume it has given nexus any information, because it literally doesn't know, and it should send the entire state of the world to have it make sense again.

@smklein
Collaborator Author

smklein commented Nov 7, 2022

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply delta as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach of

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."

That I mentioned in my comment above.

@rmustacc

rmustacc commented Nov 7, 2022

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply delta as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach of

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."

That I mentioned in my comment above.

So the one difference between what you described and what I tried to describe, I think, was that you had a separate thing for each sub-resource, e.g. a separate thing for all zpools or, say, all NICs, or other resources. I was trying to say that there is one, single payload for all of the gimlet and then deltas from there, where that optimization makes sense and can be applied linearizably (with the scheme that @davepacheco and @andrewjstone described).
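
For illustration, a rough sketch of what such a single, sled-wide payload might look like (all type and field names are hypothetical, not the actual sled-agent API):

```rust
// Hypothetical "whole sled" snapshot, sent on sled-agent startup or whenever a
// conditional delta PUT fails; Nexus diffs it against whatever it already has.
#[derive(Clone, Debug)]
struct SledTopologySnapshot {
    // Single generation number covering the entire snapshot.
    generation: u64,
    // Hardware view: does this sled see /dev/tofino (i.e., is it a Scrimlet)?
    is_scrimlet: bool,
    // Everything the sled currently sees, not just what changed.
    zpools: Vec<ZpoolInfo>,
    datasets: Vec<DatasetInfo>,
}

#[derive(Clone, Debug)]
struct ZpoolInfo {
    id: String,
    total_size_bytes: u64,
}

#[derive(Clone, Debug)]
struct DatasetInfo {
    id: String,
    zpool_id: String,
}
```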

@andrewjstone
Contributor

andrewjstone commented Nov 7, 2022

@smklein Your examples about last-write-wins and the problems there are dead on, and indeed that problem can be solved with a queue. Unfortunately the queue may lead to other issues as mentioned here. I think @rmustacc makes a good suggestion overall and I don't really see much problem with conditional writes as a whole for sled-agent state.

From an implementation point of view, this issue mentions that the notification calling sites are spread across sled agent. We could collate them with a channel internally to a single task that then performs the conditional updates, so we don't do that repeatedly and end up with a bunch of optimistic concurrency failures at CockroachDB.
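
For illustration, a minimal sketch of that collation pattern, assuming a tokio mpsc channel and the hypothetical notification type from the earlier sketch (this is not the actual sled-agent code):

```rust
use tokio::sync::mpsc;

// Hypothetical notification type; mirrors the per-resource sketch above.
#[derive(Debug)]
enum SledNotification {
    SledRole { generation: u64, is_scrimlet: bool },
    Zpools { generation: u64, zpool_ids: Vec<String> },
}

// Spawn a single task that owns all communication with Nexus and applies
// notifications one at a time; the scattered call sites just clone the Sender
// and push into the channel instead of talking to Nexus directly.
fn spawn_notifier() -> mpsc::Sender<SledNotification> {
    let (tx, mut rx) = mpsc::channel::<SledNotification>(32);
    tokio::spawn(async move {
        while let Some(notification) = rx.recv().await {
            // In the real sled-agent this would be a conditional PUT to Nexus,
            // retried with backoff; here we only log the intent.
            println!("would notify Nexus: {:?}", notification);
        }
    });
    tx
}
```

Because a single task performs the updates, we avoid racing conditional writes against ourselves, and the generation checks on the Nexus side still guard against anything that slips through (e.g. a request retried from before a crash).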

smklein added a commit that referenced this issue Nov 10, 2022
… decision (#1918)

- Adds `hardware` module to Sled Agent for minimally monitoring `devinfo` output. This will certainly evolve, but this PR includes a "bare minimum" real example of tracking the tofino driver.
- Stop relying on `rss-config.toml` to decide if we're running on a scrimlet. Instead...
  - (General case) Rely on monitoring hardware
  - (Testing, manual case) Provide a `force-scrimlet` option to make the sled agent assume that it is a scrimlet

Fixes oxidecomputer/minimum-upgradable-product#19
Part of #1917
Part of #823
Pre-requisite for oxidecomputer/minimum-upgradable-product#16
Pre-requisite for oxidecomputer/minimum-upgradable-product#18
andrewjstone added a commit that referenced this issue May 26, 2023
Before underlay access has been allocated to the `StorageWorker`, there
is no chance of successfully talking to nexus. Delay any notifications
until after underlay access is available. This occurs when sled-agent
starts, after disks and zpools may have already been allocated/recorded
locally.

By waiting we reduce unnecessary backoff timer increases that are
not due to any actual faults and that delay normal sled startup. This is
especially important during rack setup since those backoff timers wait
indefinitely until RSS is run.

This is a partial fix to dependency order, but just a small one.
We should take a more systematic approach a la #1917 in the future.
andrewjstone added a commit that referenced this issue May 26, 2023
Before underlay access has been allocated to the `StorageWorker`, there
is no chance of successfully talking to nexus. Delay any notifications
until after underlay access is available. This occurs when sled-agent
starts, after disks and zpools may have already been allocated/recorded
locally.

By waiting we reduce unnecessary backoff timer increases that are not
due to any actual faults and that delay normal sled startup. This is
especially important during rack setup since those backoff timers wait
indefinitely until RSS is run.

This is a partial fix to dependency order, but just a small one. We
should take a more systematic approach a la #1917 in the future.