-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sled-agent] Launch switch zone automatically on scrimlets #1933
Conversation
- Expand the service manager to support initializing the switch zone as a background task. - Reserve the Switch Zone address as the second address in the sled's subnet. The first address is already reserved for the sled agent, but this ensures that any sled can "become" a scrimlet without needing to provision a new IP through Nexus. - This has some fall-out effects - we bump all the other addresses up, which are currently statically allocated. - Miscellaneous cleanups - Identify zones by a "ZoneType" enum, rather than a name. - Incorporate the portions of #1820 responsible for verifying links and devices when launching zones. Fixes oxidecomputer/minimum-upgradable-product#16 Part of oxidecomputer/minimum-upgradable-product#18
4f8d683
to
51aae10
Compare
Throwing @jgallagher on the review list too, since this also impacts when MGS gets launched |
sled-agent/src/services.rs
Outdated
} | ||
(SwitchZone::Initializing { initializer, .. }, None) => { | ||
info!(log, "Disabling switch zone (was initializing)"); | ||
initializer.abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this introduce a chance for the switch zone to be orphaned mid-startup? If we're inside the self.initialize_zone(&request).await
point inside initialize_switch_zone
and we abort the task, will the zone only be torn down if we've made it far enough to have instantiated a RunningZone
that will perform cleanup when it's dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both this ensure_switch
function and the initialize_switch_zone
task perform operations on the "zone" while holding onto the self.inner.switch_zone
lock, so the zone should "fully exist (in SwitchZone::Running
) or not exist" at any point that we try to abort that initializer task.
That was my intent at least; does that look right to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was debating using https://docs.rs/tokio/latest/tokio/sync/oneshot/index.html when implementing this, to avoid any concerns of cancellation safety, if this seems too perilous to you - I could go down that route if this seems unsafe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that we were holding the lock the whole time zone initialization is running. I think this looks right then, but just to make sure my understanding matches yours: The only ways we can get to this branch to abort the initializer is if either
a) someone calls ensure_switch(Some(_))
and then immediately calls ensure_switch(None)
, and the second ensure_switch
happens to acquire the lock before the initialize_switch_zone()
spawned by the first call can
b) zone initialization fails, in which case initialize_switch_zone()
sleeps for 1 second before retrying (so any call to ensure_switch(None)
during that sleep would hit this branch)
In either case, the abort is only cancelling a task that isn't doing anything meaningful (it either hasn't started yet or is sleeping between retries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what's here is safe because of the lock, but I think it would be clearer with message passing. Holding locks across .await
points is somewhat confusing to me personally. Most of this is just personal preference though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added 282baee - I believe this is functionally the same, but it operates by asking the worker task to terminate, rather than invoking abort
.
Does this seem preferable to y'all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like that better, but maybe just because I'm wary of cancellation issues in general? If others liked the abort version better I'd be fine going back to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit clearer now, yes. It makes it very explicit inside the task that it will only cancel at this point (or at least unless something else calls abort :-) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I don't personally have any major concerns.
sled-agent/src/services.rs
Outdated
} | ||
(SwitchZone::Initializing { initializer, .. }, None) => { | ||
info!(log, "Disabling switch zone (was initializing)"); | ||
initializer.abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what's here is safe because of the lock, but I think it would be clearer with message passing. Holding locks across .await
points is somewhat confusing to me personally. Most of this is just personal preference though.
Fixes https://github.com/oxidecomputer/minimum-upgradable-product/issues/16
Part of https://github.com/oxidecomputer/minimum-upgradable-product/issues/18