[sled agent] Serialize Nexus notification queue #1917
Comments
I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing, although that's likely minor here. We could instead use an atomic counter for updates and have Nexus order based on it, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment it, though.
I don't think this should matter - even if we're communicating with multiple Nexuses, we won't get a response until that Nexus has committed our message to durable storage (CRDB) which should be shared among different Nexuses. If awaiting that response blocks the subsequent request, we've successfully serialized our notifications.
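A minimal sketch of that idea (not the actual sled-agent code), assuming a tokio runtime and a hypothetical `notify_nexus` call that only returns once Nexus has durably recorded the update: awaiting each acknowledgment before sending the next notification is what provides the ordering, regardless of which Nexus instance services each request.

```rust
// Hypothetical stand-in for the real Nexus client call; assume it only
// returns Ok once Nexus has committed the update to CRDB.
async fn notify_nexus(update: &str) -> Result<(), String> {
    println!("committed: {update}");
    Ok(())
}

// Awaiting each response before issuing the next request serializes the
// notifications, even if successive requests land on different Nexuses.
async fn send_all_in_order(updates: Vec<String>) -> Result<(), String> {
    for update in updates {
        notify_nexus(&update).await?;
    }
    Ok(())
}

#[tokio::main]
async fn main() {
    let updates = vec![
        "zpool oxp_1 online".to_string(),
        "zpool oxp_1 faulted".to_string(),
    ];
    send_all_in_order(updates).await.unwrap();
}
```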
Yeah, this is true, and a bit more complicated. Not all notifications from the Sled Agent are equal, so it isn't necessarily right to just "ignore everything but the latest update", either.
Ah, ok. I guess I didn't realize you'd be waiting for the calls to get a reply, which I probably should have! Carry on!
If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals.

This case is relatively simple, though, because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order. My intuition (and I suspect @andrewjstone's) is that trying to synchronize all notifications is likely to be more brittle than trying to gracefully handle them on the other side in whatever order they occur, even if that just means rejecting some and having them retried, like it sounds like we do today with the zpool/sled/dataset messages. The kinds of problems I'm imagining include: the queue growing without bound; starvation of a later notification because some problem with a previous notification keeps the whole queue stuck; and notifications piling up during a transient Nexus/database issue such that the ones at the front of the queue are old and stale (and maybe superseded by later ones).
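To make that conditional write concrete, here is a minimal sketch (not the actual Nexus code) using an in-memory map in place of CRDB; the struct and function names are invented for illustration. The write only succeeds if the incoming generation is newer than what is already recorded, which drops both retransmitted duplicates and stale out-of-order arrivals.

```rust
use std::collections::HashMap;

// Hypothetical runtime state record; the generation number is owned by
// the Sled Agent that produced the state.
#[derive(Clone, Debug)]
struct RuntimeState {
    generation: u64,
    state: String,
}

// Stand-in for the database table, keyed by instance ID.
type Db = HashMap<String, RuntimeState>;

// Write the incoming state only if it is newer than what is already
// recorded; duplicates and out-of-order arrivals are simply dropped.
fn record_runtime_state(db: &mut Db, instance_id: &str, incoming: RuntimeState) -> bool {
    match db.get(instance_id) {
        Some(existing) if existing.generation >= incoming.generation => false,
        _ => {
            db.insert(instance_id.to_string(), incoming);
            true
        }
    }
}

fn main() {
    let mut db = Db::new();
    assert!(record_runtime_state(
        &mut db,
        "i-1",
        RuntimeState { generation: 2, state: "running".into() }
    ));
    // A stale (generation 1) retransmission arriving later is ignored.
    assert!(!record_runtime_state(
        &mut db,
        "i-1",
        RuntimeState { generation: 1, state: "starting".into() }
    ));
}
```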
Those are great points. I wasn't caffeinated enough when I originally objected to think more deeply about my reasons; it was just intuition indeed. The problem of a lost reply is definitely an issue here. For instance, if we time out and send the next queued request instead of waiting forever and building up a queue backlog, the old request could still be floating around in the network and get applied out of order. Generation numbers feel safer to me.
The usage of a generation number makes sense to me when we care about the "latest event only" - if Sled Agent emits two different runtime states, this technique ensures that the latter one is received. I think this example works with, for example, instance states, because:
Sled Agent is currently notifying Nexus of a few different things, including:
I think that if we add a generation number for Sled Agent to use when reporting notifications to Nexus, we'll actually need to add multiple generation numbers. For example...
I think an implementation like this would only work if we had multiple, distinct generation numbers - one for each resource where we're okay just taking the "latest" info. For example, the following would be wrong:
This seems doable, but it will require more structure in the Sled Agent -> Nexus notification pathway. I'll poke around at this approach.
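As a rough illustration of what "multiple, distinct generation numbers" might look like, here is a hypothetical sketch (the type and field names are invented): one counter per resource where taking only the latest report is acceptable, rather than a single counter shared across unrelated notifications.

```rust
use std::collections::HashMap;

// One generation counter per resource for which "latest info wins" is
// acceptable; resources that must not be collapsed this way (e.g. a
// stream of distinct events) would need a different mechanism.
#[derive(Default)]
struct NotificationGenerations {
    instance_states: HashMap<String, u64>, // per-instance runtime state
    zpool_reports: HashMap<String, u64>,   // per-zpool status
    sled_hardware: u64,                    // whole-sled hardware view
}

impl NotificationGenerations {
    // Bump and return the generation to attach to the next notification
    // about a particular zpool.
    fn next_zpool_generation(&mut self, zpool: &str) -> u64 {
        let g = self.zpool_reports.entry(zpool.to_string()).or_insert(0);
        *g += 1;
        *g
    }
}

fn main() {
    let mut gens = NotificationGenerations::default();
    // Each zpool advances independently of the others.
    assert_eq!(gens.next_zpool_generation("oxp_1"), 1);
    assert_eq!(gens.next_zpool_generation("oxp_1"), 2);
    assert_eq!(gens.next_zpool_generation("oxp_2"), 1);
}
```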
A few thoughts on this. I don't know the overall structure of the Nexus data, but here's what comes to mind:
That may not work, but if you assume that Nexus is the one that actually knows what the state of the world is, and think of this as sending a topology snapshot where it's up to Nexus to figure out the diff, that could possibly make sense. In particular, Event A in the zpool case is, I think, more complex assuming a disk pull or fault. It's actually more like:
The reason to think about this is that Nexus can crash at any time while processing this, and folks may have to retry. Conditional PUTs (whether with etags or something else) help here. But also, when a sled comes back up or has batched up a lot of changes because it was offline, it may just want to send the whole thing. Obviously, reporting and processing all state plus the delta can be more expensive, but given that we should be able to encapsulate a lot of this in a single topology view of the world that we serialize, that may be easier than trying to track each of these changes individually. There's also the critical point that if sled agent crashes, it can't assume it has given Nexus any information, because it literally doesn't know, and should send the entire state of the world so that things make sense again.
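Here is a hedged sketch of that "send the whole topology and let Nexus diff it" idea, with invented types and an in-memory stand-in for Nexus's stored view. The generation check plays the role of the conditional PUT / etag, and a restarted sled-agent can always resend the full snapshot with a new generation.

```rust
use std::collections::BTreeMap;

// Hypothetical snapshot of everything a sled knows about itself. After a
// sled-agent crash, the agent cannot know what Nexus has already seen, so
// it simply sends the whole snapshot again with a new generation.
#[derive(Clone, Debug)]
struct SledSnapshot {
    generation: u64,
    zpools: BTreeMap<String, String>, // zpool name -> state ("online", "faulted", ...)
}

// Stand-in for Nexus's stored view of one sled. The write is conditional
// on the generation, like a conditional PUT with an etag: stale or
// duplicate snapshots are rejected and the caller may retry.
fn apply_snapshot(current: &mut Option<SledSnapshot>, incoming: SledSnapshot) -> Result<(), String> {
    if let Some(existing) = current {
        if existing.generation >= incoming.generation {
            return Err(format!(
                "stale snapshot: have generation {}, got {}",
                existing.generation, incoming.generation
            ));
        }
        // Nexus, not the sled, computes the delta between the two views.
        for zpool in existing.zpools.keys() {
            if !incoming.zpools.contains_key(zpool) {
                println!("zpool {zpool} disappeared (pulled or faulted?)");
            }
        }
    }
    *current = Some(incoming);
    Ok(())
}

fn main() {
    let mut view = None;
    let mut snap = SledSnapshot { generation: 1, zpools: BTreeMap::new() };
    snap.zpools.insert("oxp_1".into(), "online".into());
    apply_snapshot(&mut view, snap.clone()).unwrap();

    // The next snapshot omits oxp_1; Nexus infers that it went away.
    let later = SledSnapshot { generation: 2, zpools: BTreeMap::new() };
    apply_snapshot(&mut view, later).unwrap();

    // Resending generation 1 (e.g. a retry of an old request) is rejected.
    assert!(apply_snapshot(&mut view, snap).is_err());
}
```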
Great point, and for a first cut we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas. This seems to favor the approach that I mentioned in my comment above.
So the one difference between what you described and what I tried to describe, I think, is that you had a separate thing for each sub-resource, e.g. a separate thing for all zpools or, say, all NICs, or other resources. I was trying to say that there is one, single payload for all of the gimlet, and then deltas from there where that optimization makes sense and can be applied linearizably (with the scheme that @davepacheco and @andrewjstone described).
@smklein Your examples about last-write-wins and the problems there are dead on, and indeed that problem can be solved with a queue. Unfortunately, the queue may lead to other issues, as mentioned here. I think @rmustacc makes a good suggestion overall, and I don't really see much problem with conditional writes as a whole for sled-agent state. From an implementation point of view, this issue mentions that the notification calling sites are spread across sled agent. We could collate them internally with a channel feeding a single task that then performs the conditional updates, so we don't issue them from many places at once and end up with a bunch of optimistic concurrency failures at CockroachDB.
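A minimal sketch of that collation pattern, assuming tokio and invented notification types: every call site pushes onto an mpsc channel, and a single task drains it and performs the (conditional) Nexus updates one at a time.

```rust
use tokio::sync::mpsc;

// Hypothetical notification types coming from the various call sites in
// sled-agent.
#[derive(Debug)]
enum NexusNotification {
    ZpoolUpdate { name: String, generation: u64 },
    InstanceState { id: String, generation: u64 },
}

// Single task that owns the Nexus client and performs the updates one at
// a time, so scattered call sites don't race each other and pile up
// optimistic-concurrency failures at CockroachDB.
async fn notifier_task(mut rx: mpsc::Receiver<NexusNotification>) {
    while let Some(msg) = rx.recv().await {
        // In the real code this would be a conditional write via the
        // Nexus API, retried with backoff on failure.
        println!("notifying Nexus: {msg:?}");
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(32);
    let task = tokio::spawn(notifier_task(rx));

    // Any call site in sled-agent just clones the sender and enqueues.
    tx.send(NexusNotification::ZpoolUpdate { name: "oxp_1".into(), generation: 1 })
        .await
        .unwrap();
    tx.send(NexusNotification::InstanceState { id: "i-1".into(), generation: 7 })
        .await
        .unwrap();

    drop(tx); // closing the channel lets the task exit
    task.await.unwrap();
}
```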
… decision (#1918)

- Adds `hardware` module to Sled Agent for minimally monitoring `devinfo` output. This will certainly evolve, but this PR includes a "bare minimum" real example of tracking the tofino driver.
- Stop relying on `rss-config.toml` to decide if we're running on a scrimlet. Instead...
  - (General case) Rely on monitoring hardware
  - (Testing, manual case) Provide a `force-scrimlet` option to make the sled agent assume that it is a scrimlet

Fixes oxidecomputer/minimum-upgradable-product#19
Part of #1917
Part of #823
Pre-requisite for oxidecomputer/minimum-upgradable-product#16
Pre-requisite for oxidecomputer/minimum-upgradable-product#18
Before underlay access has been allocated to the `StorageWorker`, there is no chance of successfully talking to Nexus, so delay any notifications until underlay access is available. This situation occurs when sled-agent starts, after disks and zpools may have already been allocated/recorded locally. By waiting, we avoid unnecessary backoff timer increases that are not due to any actual fault and that would otherwise delay normal sled startup. This is especially important during rack setup, since those backoff timers wait indefinitely until RSS is run. This is a partial fix to the dependency ordering, but just a small one. We should take a more systematic approach, ala issue #1917, in the future.
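One way such a delay could be structured (a sketch only, with hypothetical names, not the actual `StorageWorker` code) is to have the worker wait on a `tokio::sync::watch` flag that the sled-agent flips once underlay access is in place, and only then start draining queued notifications:

```rust
use tokio::sync::watch;

// Hypothetical sketch: hold back Nexus notifications until underlay
// access is available, instead of retrying against an unreachable Nexus
// and inflating the backoff timers.
async fn storage_worker(mut underlay_ready: watch::Receiver<bool>, pending: Vec<String>) {
    // Wait until the sled-agent flips the flag after underlay setup.
    while !*underlay_ready.borrow() {
        if underlay_ready.changed().await.is_err() {
            return; // sender dropped; shutting down
        }
    }
    for notification in pending {
        // Only now do we start talking to Nexus.
        println!("sending queued notification: {notification}");
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);
    let worker = tokio::spawn(storage_worker(rx, vec!["zpool oxp_1 online".into()]));

    // ... later, once the underlay network is configured ...
    tx.send(true).unwrap();
    worker.await.unwrap();
}
```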
There are many spots in the Sled Agent where we send notifications up to Nexus, such as:
omicron/sled-agent/src/server.rs
Lines 83 to 123 in 17ab9fd
These notification calling sites are spread around Sled Agent. For unrelated updates, this is no problem. However, for related updates, this can unfortunately result in "earlier updates" trampling over "later ones".
Example problems:
Proposal: