
Instance start/delete sagas hang while sleds are unreachable #4259

Closed
gjcolombo opened this issue Oct 11, 2023 · 10 comments · Fixed by #5568
gjcolombo (Contributor) commented Oct 11, 2023

Repro environment: Seen on rack3 after a sled trapped into kmdb and became inoperable.

When an instance starts, the start saga calls Nexus::create_instance_v2p_mappings to ensure that every sled in the cluster knows how to route traffic directed to the instance's virtual IPs. This function calls sled_list to get the list of active sleds and then invokes the sled agent's set_v2p endpoint on each one. The calls to set_v2p are wrapped in retry_until_known_result, which treats Progenitor communication errors (including client timeouts) as transient errors that require the operation to be retried. That retry behavior exists because a failure in a saga only undoes steps that previously completed successfully, not the step that produced the failure: consider a request to do X that sled agent receives and begins processing but doesn't finish before Nexus gives up waiting. If that timeout were surfaced as an error that unwinds the saga, X would never be undone. Instance deletion does something similar via delete_instance_v2p_mappings.
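
For illustration only, here is a minimal sketch of the retry-until-known-result pattern described above. The types and names are assumptions for this sketch, not the actual omicron code (the real wrapper works in terms of progenitor_client::Error and a backoff policy):

```rust
use std::time::Duration;

/// Outcome of one attempt to call a sled-agent endpoint (hypothetical
/// type for illustration).
enum AttemptResult<T> {
    /// The sled agent answered definitively (success or a real error),
    /// so the saga node can safely record the outcome.
    Known(Result<T, String>),
    /// Timeout or connection failure: Nexus cannot tell whether the sled
    /// agent acted on the request, so the only safe choice is to retry.
    Unknown,
}

/// Retry `attempt` until it produces a known result. Note that this loop
/// never gives up, which is exactly how a permanently unreachable sled
/// turns into a saga that hangs forever.
async fn retry_until_known_result<T, F, Fut>(mut attempt: F) -> Result<T, String>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = AttemptResult<T>>,
{
    loop {
        match attempt().await {
            AttemptResult::Known(result) => return result,
            AttemptResult::Unknown => {
                // Back off briefly and try the same call again.
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
}
```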

Rack3 has a sled that keeps panicking with symptoms of a known host OS issue. To better identify the problem, we set the sled up to drop into kmdb on panic instead of rebooting, which rendered the sled's agent totally and permanently unresponsive. Since retry_until_known_result treats progenitor_client::Error::CommunicationErrors as transient, every subsequent instance creation and deletion attempt got stuck retrying its V2P mapping update against the sled being debugged, leaving the affected instances stuck in the Creating/Stopped states (soon to be the Starting/Stopped states once #4194 lands).

There are several things to unpack here (probably into their own issues):

  • retry_until_known_result doesn't have a way to bail out after a certain amount of time or number of attempts; even if it did, such a bailout would have to respect the saga undo rules described above.
  • There's no way to recover when an instance gets stuck in a Creating/Starting state: the instance can't be stopped or destroyed, and even if those state transitions were allowed, or there were some other way to "reset" the instance to a stopped state (see #4004, "Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state"), there's no way to cooperatively cancel the instance's ongoing saga.
  • Recovering from the "bad sled" case is very challenging:
    • There's currently no way to mark a sled as unhealthy or out-of-service; there's a time_deleted column in the sleds table, but the datastore's sled_list function doesn't filter on it, so create_instance_v2p_mappings won't ignore deleted sleds (see the filtering sketch after this list).
    • Even if sled_list did ignore unhealthy sleds, there's still a race where create_instance_v2p_mappings decides to start talking to a sled before it's marked as unhealthy and never reconsiders that decision. (This feeds back into the first two items in this list.)
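
As a loose illustration of the time_deleted filtering mentioned above, here is a sketch with hypothetical types; the real sled_list is a database query in the Nexus datastore, not an in-memory filter:

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Simplified stand-in for a sled record; the real omicron model has many
/// more fields. `time_deleted` is the soft-deletion marker discussed above.
struct Sled {
    id: Uuid,
    time_deleted: Option<DateTime<Utc>>,
}

/// Keep only sleds that have not been soft-deleted. In the real datastore
/// this would be the equivalent of a `WHERE time_deleted IS NULL` clause in
/// sled_list, so that create_instance_v2p_mappings never targets a deleted
/// sled in the first place.
fn active_sleds(sleds: Vec<Sled>) -> Vec<Sled> {
    sleds
        .into_iter()
        .filter(|sled| sled.time_deleted.is_none())
        .collect()
}
```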
askfongjojo added this to the MVP milestone Oct 12, 2023
morlandi7 modified the milestones: MVP, 6 Nov 27, 2023
morlandi7 commented Dec 21, 2023

@internet-diglett Is there some alignment here with the other RPW work? (#4715 )

internet-diglett (Contributor) commented:

> @internet-diglett Is there some alignment here with the other RPW work? (#4715 )

@morlandi7 at the moment, I don't believe so unless there has been a decision to move v2p mappings to an RPW based model. @gjcolombo have there been any discussions about this?

askfongjojo commented:
@internet-diglett - Perhaps a useful fix for now is to modify the sled inclusion criteria to consider the time_deleted value. The change seems valid regardless of how we want to handle unresponsive sleds in general.

In situations like an OS panic or a sled-agent restart, we've seen in one customer's case that the saga was able to resume and complete once the problem sled came back up (not ideal but also not too bad). There are cases in which a sled is out indefinitely, but we'll take the necessary time to resolve those in other ways.

morlandi7 modified the milestones: 6, 7 Jan 26, 2024
davepacheco (Collaborator) commented:

Is there anything in the system that sets time_deleted on a sled today? I wouldn't have thought so. I'd suggest we use the policy field proposed in RFD 457 instead. That's basically the same idea, and I think it's a good one, but it has the same problem: I don't think it would help in practice until we actually implement support for sled removal. It'd be tempting to use provision_state, but I don't think that's quite right, because a sled on which provisioning new instances is currently disabled might still have instances on it.
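
For context, a rough sketch of the kind of policy field being discussed; the names here are illustrative assumptions, not the actual RFD 457 or omicron types:

```rust
/// Illustrative only: an operator-set policy distinguishing "still part of
/// the rack" from "permanently removed", roughly in the spirit of RFD 457.
/// The real omicron types may differ.
enum SledPolicy {
    /// The sled is part of the system; new-instance provisioning may still
    /// be independently enabled or disabled.
    InService { provisionable: bool },
    /// An operator has declared the sled permanently gone; it should be
    /// excluded from v2p updates and everything else.
    Expunged,
}

/// Whether a sled should receive v2p mapping updates. A sled with
/// provisioning disabled still needs updates, since it may still be running
/// instances -- which is why provision_state alone isn't quite right.
fn needs_v2p_updates(policy: &SledPolicy) -> bool {
    matches!(policy, SledPolicy::InService { .. })
}
```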


(The rest of this is probably repeating what folks already know but it took me a while to understand the discussion above so I'm summarizing here for myself and others that might be confused.)

I think it's important to distinguish three cases:

  1. sleds that are transiently unavailable,
  2. sleds that are unavailable for an extended period (say, more than a few minutes) but we don't know if they're coming back, and
  3. sleds that are permanently unavailable (which means an operator has told us that it's not coming back)

I can see how if a sled is unreachable for several minutes, we don't want all instance start/stop for instances on that sled to hang, and certainly not all start/stop for all instances. But we also don't want to give up forever. It might still have instances on it, it might come back, and it may need that v2p update, right? So I can see why we're asking about an RPW. I'm not that familiar with create_instance_v2p_mappings, but yeah, it sounds like an RPW may well be a better fit than a saga step. The RPW would do its best to update all sleds that it can. But if it can't reach some, no sweat -- it'll try again the next time the RPW is activated. And we can use the same pattern we use with other RPWs to report status (e.g., to omdb) about which sleds we've been able to keep updated to which set of v2p mappings.
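
To make that shape concrete, here is a minimal sketch of an RPW-style background task under assumed, hypothetical names (this is not the actual omicron RPW machinery): each activation pushes the current v2p mappings to every sled it can reach, records which sleds are behind, and simply tries again on the next activation instead of blocking any saga.

```rust
use std::collections::BTreeMap;
use uuid::Uuid;

/// Hypothetical view of the desired v2p state; illustrative only.
struct V2pMappings {
    generation: u64,
    // ... mapping entries elided ...
}

/// Per-sled status for reporting (e.g., via omdb) which sleds are behind.
struct RpwStatus {
    /// Generation last successfully pushed to each sled.
    last_pushed: BTreeMap<Uuid, u64>,
}

/// One activation of the RPW: best-effort push to every in-service sled.
/// Unreachable sleds are skipped rather than retried inline; the next
/// activation will pick them up.
async fn activate(sleds: &[Uuid], desired: &V2pMappings, status: &mut RpwStatus) {
    for &sled_id in sleds {
        match push_v2p_to_sled(sled_id, desired).await {
            Ok(()) => {
                status.last_pushed.insert(sled_id, desired.generation);
            }
            Err(_) => {
                // Leave the stale generation recorded so status reporting
                // can surface which sleds are out of date.
            }
        }
    }
}

/// Stand-in for the sled-agent call; hypothetical.
async fn push_v2p_to_sled(_sled: Uuid, _mappings: &V2pMappings) -> Result<(), ()> {
    Ok(())
}
```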

davepacheco (Collaborator) commented Jan 26, 2024

@gjcolombo would you object to retitling this Instance start/delete sagas hang while sleds are unreachable?
(edit: confirmed no objection offline)

davepacheco changed the title from "Instance start/delete sagas will spin forever if a sled is permanently unreachable" to "Instance start/delete sagas hang while sleds are unreachable" Jan 26, 2024
hawkw self-assigned this Feb 20, 2024
askfongjojo added the "known issue" (To include in customer documentation and training) label Mar 9, 2024
morlandi7 modified the milestones: 7, 8 Mar 12, 2024
hawkw removed their assignment Mar 28, 2024
davepacheco (Collaborator) commented:

In today's update call, we discussed whether this was a blocker for R8. The conclusion is "no" because this should not be made any worse during sled expungement. The sled we plan to expunge in R8 is not running any instances and so should not need to have its v2p mappings updated as part of instance create/delete sagas. Beyond that, all instances are generally stopped before the maintenance window starts, and when they start again, the sled will be expunged and so not included in the list of sleds to update.

internet-diglett (Contributor) commented:

@morlandi7 this should be resolved, but I left it open until someone verifies the work done in #5568 has actually resolved this issue on dogfood.

askfongjojo commented Jun 28, 2024

Checked the current behavior on rack2: I put sled 23 into A2 and provisioned a bunch of instances. All of them stayed in the starting state (they didn't transition to running after the sled was brought back to A0; that's a different problem to be investigated).

According to https://github.com/oxidecomputer/omicron/blob/main/nexus/src/app/sagas/instance_start.rs#L61-L62, which in turn references #3879, it looks like fixing this requires one (hopefully small) lift.

internet-diglett (Contributor) commented:

@askfongjojo I think that is an old comment that didn't get removed, as that saga node has already been updated (through a series of function calls) to use the NAT RPW. Do you have the instance IDs or any other identifying information so I can check the logs to see what caused it to hang?

askfongjojo commented:
Ah, you are right. I retested just now and had no problem bringing up instances while one of the sleds was offline. I probably ran into an issue related to some bad downstairs when I tested that last time. This time I'm testing with a brand-new disk snapshot to avoid hitting the bad-downstairs problem again.
