Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upstairs state machine refactoring (3/3) #1577

Open
wants to merge 21 commits into
base: mkeeter/explicit-stop-state
Choose a base branch
from

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Nov 26, 2024

(staged on top of #1570)

This PR refactors the DsState type into more specific, data-bearing types. Here's a quick look at the new types:

pub enum DsState {
    /// New connection
    Connecting {
        state: NegotiationState,
        mode: ConnectionMode,
    },

    /// Ready for and/or currently receiving IO
    Active,

    /// This downstairs is undergoing LiveRepair
    LiveRepair,

    /// The IO task for the client is being stopped
    Stopping(ClientStopReason),
}

pub enum ConnectionMode {
    Offline,
    New,
    Faulted,
    Replaced,
}

pub enum NegotiationState {
    /// Initial state, waiting to hear `YesItsMe` from the client
    ///
    /// Once this message is heard, transitions to either `WaitActive` (if
    /// `auto_promote` is `false`) or `WaitQuorum` (if `auto_promote` is `true`)
    Start {
        auto_promote: bool,
    },

    /// Waiting for activation by the guest
    WaitActive,

    /// Waiting for the minimum number of downstairs to be present.
    WaitQuorum,

    WaitForPromote,
    WaitForRegionInfo,
    GetLastFlush,
    GetExtentVersions,

    /// Initial startup, downstairs are repairing from each other.
    Reconcile,

    /// Waiting for live-repair to begin
    LiveRepairReady,
}

Many data members of the DownstairsClient – which were only valid for a single state – are now data within a specific variant, e.g. auto_promote and negotiation_state. This makes the code both more precise and harder to mess up.

+400 of this change is updating the OpenAPI spec, which includes DsState. I don't think anyone else is expecting it to be a particular shape, but let me know if I'm wrong (also, it already changed in #1570, so warn me there as well!).

Remaining work:

  • Answer XXX questions, which are places that I wasn't sure about existing behavior
  • Figure out what to do about checked_state_transition, which I just deleted for the moment.
    • Manually specifying valid state transitions and manually writing a bunch of good + bad test cases is a lot of toil; could we be smarter about this?
    • How do we handle nested state machines (negotiation_state within DsState::Connecting?)
  • Update state machine diagrams

@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from 79652b6 to 93a6788 Compare December 2, 2024 14:44
@mkeeter mkeeter changed the title Upstairs state machine refactoring Upstairs state machine refactoring (3/3) Dec 2, 2024
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch 3 times, most recently from d23dd68 to 3835f2a Compare December 9, 2024 15:09
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from 3835f2a to 43a1e30 Compare December 10, 2024 16:39
@mkeeter mkeeter changed the base branch from main to mkeeter/explicit-stop-state December 11, 2024 14:53
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch 2 times, most recently from 29ee56a to 448a57a Compare December 11, 2024 15:44
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from 448a57a to 958627d Compare December 11, 2024 16:03
@mkeeter mkeeter marked this pull request as ready for review December 12, 2024 18:03
Copy link
Contributor

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, just questions to help me understand.
I reviewed up to is_state_transition_valid() in upstairs/src/client.rs
I wanted to post this much so it would be recorded and I'll continue and either
post more later, or post that I'm done.

DsState::WaitActive => "WAC".to_string(),
DsState::WaitQuorum => "WAQ".to_string(),
DsState::Reconcile => "REC".to_string(),
DsState::Connecting {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about how to present this, but what you have done here is fine for now and if I come up with something I like better, I'll put out a PR for it

self.state = DsState::Reconcile;
info!(self.log, "Transition from {:?} to Reconcile", self.state);
let DsState::Connecting { state, mode } = &mut self.state else {
panic!("invalid state {:?}", self.state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was there before, but could we add the client ID to the panic string here?
Sometimes the log right before a panic does not make it out, and it would be nice to
know which client was in the invalid state.

// Otherwise, use live-repair; `ConnectionMode::Replaced`
// indicates that the address is allowed to change.
UpstairsState::Active | UpstairsState::Deactivating { .. } => {
ConnectionMode::Replaced
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are deactivating here, could we just go to New? Or we need the replaced state to let ourselves know when we restart this client that we should expect a new downstairs?

DsState::Active | DsState::Offline if !can_replay => {
Some(DsState::Faulted)
}
let new_mode = match current {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still tough to track what happens here. I don't know how to make it
any easier to follow though. The new states do help somewhat, but we still
have the can_replay adding into the layers here.

DsState::Active => Some(DsState::Offline),
// If the Downstairs has spontaneously stopped, we will attempt to
// replay jobs when reconnecting
DsState::Active => ConnectionMode::Offline,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might put this right after line 553 with a comment like you have on line 552 about that covering the other cases for DsState::Connecting. This way you have all the cases covered for Active and Connecting in the same area. I was initially confused when I saw DsState::Active again here after seeing it at line 544.

DsState::Stopping(ClientStopReason::Replacing) => {
Some(DsState::Replaced)
// XXX NegotiationFailed could also be hit during reconnection,
// in which case we shouldn't use `ConnectionMode::New` (?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question is, what path would we take that is different if we see NegotiationFailed on initial startup (before we are active) vs. seeing it after we are active.

Looking at all the reasons we might set NegotiationFailed, they appear to be something that will fail if we try again (version mismatch), or receiving some message out of expected order (programmer bug??). If we go to New for this client, as long as we have to go back through the negotiation again I think our downstairs will end up in the right place.

@@ -920,9 +869,10 @@ impl DownstairsClient {
) -> EnqueueResult {
match self.state {
// We never send jobs if we're in certain inactive states
DsState::Faulted
| DsState::Replaced
| DsState::LiveRepairReady
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does having a downstairs DsState::Connecting with state: NegotiationState::LiveRepairReady end up? In the panic below?

| DsState::Offline => {} // Okay

DsState::LiveRepairReady if self.cfg.read_only => {} // Okay
"invalid state transition from {:?} -> {:?} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep the client ID and upstairs ID in the panic message. And, maybe even add the session_id, as a propolis instance with multiple Upstairs could panic here and we might have trouble figuring out which piece hit this panic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants