Instance start/delete sagas hang while sleds are unreachable #4259
Comments
@internet-diglett Is there some alignment here with the other RPW work? (#4715)
@morlandi7 At the moment, I don't believe so, unless there has been a decision to move v2p mappings to an RPW-based model. @gjcolombo, have there been any discussions about this?
@internet-diglett - Perhaps a useful fix for now is to modify the sled inclusion criteria. In situations like an OS panic or a sled-agent restart, we've seen in one customer's case that the saga was able to resume and complete once the problem sled came back up (not ideal, but also not too bad). There are cases in which sleds are out indefinitely, but we'll take the time needed to solve those through other means.
Is there anything in the system that sets `time_deleted` on sleds? (The rest of this is probably repeating what folks already know, but it took me a while to understand the discussion above, so I'm summarizing here for myself and others who might be confused.) I think it's important to distinguish three cases:
I can see how, if a sled is unreachable for several minutes, we don't want all instance start/stop for instances on that sled to hang, and certainly not start/stop for all instances. But we also don't want to give up forever: the sled might still have instances on it, it might come back, and it may need that v2p update, right? So I can see why we're asking about an RPW. I'm not that familiar with ...
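For concreteness, here is a minimal sketch of what an RPW-style approach to this could look like. This is not omicron's actual RPW machinery; every helper function here is a hypothetical placeholder, and it assumes a tokio runtime. The point is only to illustrate the shape of the idea: a background task periodically reconciles reachable sleds toward the desired mapping state instead of each saga pushing to every sled and waiting.

```rust
use std::time::Duration;

// Hypothetical helpers; a real implementation would read desired mappings
// from the database, read the sled list from inventory, and use the
// sled-agent client to push updates.
async fn desired_v2p_mappings() -> Vec<String> { Vec::new() }
async fn reachable_sleds() -> Vec<String> { Vec::new() }
async fn push_mappings(_sled: &str, _mappings: &[String]) -> Result<(), String> { Ok(()) }

/// An RPW-style reconciliation loop: periodically drive every reachable sled
/// toward the desired v2p state. A sled that is down just misses this pass
/// and is retried on the next one, so no individual instance start/delete
/// saga has to wait on it.
async fn v2p_reconciler() {
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    loop {
        ticker.tick().await;
        let mappings = desired_v2p_mappings().await;
        for sled in reachable_sleds().await {
            if let Err(e) = push_mappings(&sled, &mappings).await {
                // Log and move on; the next pass will try again.
                eprintln!("failed to update v2p mappings on sled {sled}: {e}");
            }
        }
    }
}
```

Under a model like this, an unreachable sled only delays its own updates until a later pass rather than wedging every instance start/delete saga in the fleet.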
@gjcolombo would you object to retitling this ...?
In today's update call, we discussed whether this was a blocker for R8. The conclusion is "no", because this should not be made any worse by sled expungement. The sled we plan to expunge in R8 is not running any instances, so it should not need to have its v2p mappings updated as part of instance create/delete sagas. Beyond that, all instances are generally stopped before the maintenance window starts, and by the time they start again, the sled will have been expunged and therefore not included in the list of sleds to update.
@morlandi7 This should be resolved, but I've left it open until someone verifies that the work done in #5568 has actually resolved this issue on dogfood.
Checked the current behavior on rack2: I put sled 23 into A2 and provisioned a bunch of instances. All of them stayed in the `starting` state. According to https://github.com/oxidecomputer/omicron/blob/main/nexus/src/app/sagas/instance_start.rs#L61-L62, which in turn references #3879, it looks like fixing this requires one (hopefully small) lift.
@askfongjojo I think that is an old comment that didn't get removed, as that saga node has already been updated (through a series of function calls) to use the NAT RPW. Do you have the instance IDs or any other identifying information so I can check the logs to see what caused it to hang?
Ah, you are right. I retested just now and had no problem bringing up instances when one of the sleds is offline. I probably ran into an issue related to some bad downstairs when I tested that last time. This time I'm testing with a brand-new disk snapshot to avoid hitting the bad downstairs problem again.
Repro environment: seen on rack3 after a sled dropped into kmdb and became inoperable.
When an instance starts, the start saga calls `Nexus::create_instance_v2p_mappings` to ensure that every sled in the cluster knows how to route traffic directed to the instance's virtual IPs. This function calls `sled_list` to get the list of active sleds and then invokes the sled agent's `set_v2p` endpoint on each one. The calls to `set_v2p` are wrapped in a `retry_until_known_result` wrapper that treats Progenitor communication errors (including client timeouts) as transient errors requiring the operation to be retried. (Consider, for example, a request to do X that sled agent receives and begins processing but does not finish processing until Nexus has decided not to wait anymore; if this produces an error that unwinds the saga, X will not be undone, because a failure in a saga only undoes steps that previously completed successfully, not the one that produced the failure.) Instance deletion does something similar via `delete_instance_v2p_mappings`.
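To make the retry semantics concrete, here is a minimal sketch of a `retry_until_known_result`-style wrapper. It is not the actual omicron implementation; the error type and names are illustrative and it assumes a tokio runtime. The key property is that a communication failure, where the outcome on the sled is unknown, is retried without bound, which is exactly why an unreachable sled stalls the saga.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a sled-agent call error: either the outcome on
/// the sled is unknown (timeout, connection reset), or the sled gave us a
/// definitive answer we can act on.
#[derive(Debug)]
enum CallError {
    Communication(String),
    Definitive(String),
}

/// Retry `op` until it yields a known result. This mirrors the behavior
/// described above: communication errors are treated as transient and
/// retried forever.
async fn retry_until_known_result_sketch<F, Fut, T>(mut op: F) -> Result<T, CallError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, CallError>>,
{
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            // The sled answered; success or failure, the outcome is known.
            Err(CallError::Definitive(msg)) => return Err(CallError::Definitive(msg)),
            // The outcome is unknown, so the only safe move is to ask again.
            Err(CallError::Communication(msg)) => {
                eprintln!("transient error talking to sled agent: {msg}; retrying");
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
}
```

A bounded variant would need to cap the time or number of attempts and then decide what the saga should do when the cap is hit, while still respecting the undo rules described above.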
Rack3 has a sled that keeps panicking with symptoms of a known host OS issue. To better identify the problem, we set the sled up to drop into kmdb on panicking instead of rebooting. This rendered the sled's agent totally and permanently unresponsive. Since `retry_until_known_result` treats `progenitor_client::Error::CommunicationError`s as transient errors, this caused all subsequent instance creation and deletion attempts to get stuck retrying the same attempt to edit V2P mappings on the sled being debugged, causing the relevant instances to get stuck in the Creating/Stopped states (soon to be the Starting/Stopped states once 4194 lands).

There are several things to unpack here (probably into their own issues):
- `retry_until_known_result` doesn't have a way to bail out after a certain amount of time or number of attempts; even if it did, such a bailout would have to respect the undo rules for sagas described above.
- There's a `time_deleted` column in the sleds table, but the datastore's `sled_list` function doesn't filter on it, so `create_instance_v2p_mappings` won't ignore deleted sleds (see the sketch after this list).
- Even if `sled_list` did ignore unhealthy sleds, there's still a race where `create_instance_v2p_mappings` decides to start talking to a sled before it's marked as unhealthy and never reconsiders that decision. (This feeds back into the first two items in this list.)
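As a rough illustration of the second bullet, the missing filter amounts to skipping sleds whose `time_deleted` is set. The sketch below is in-memory with assumed field names, not the actual datastore code; in omicron the fix would belong in the `sled_list` database query rather than in Rust like this.

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Illustrative stand-in for a row in the sleds table (field names assumed).
#[allow(dead_code)]
struct Sled {
    id: Uuid,
    time_deleted: Option<DateTime<Utc>>,
}

/// The behavior the second bullet asks for: when choosing sleds to push v2p
/// mappings to, skip any sled whose `time_deleted` is set.
fn active_sleds(all_sleds: &[Sled]) -> Vec<&Sled> {
    all_sleds
        .iter()
        .filter(|sled| sled.time_deleted.is_none())
        .collect()
}
```

Even with such a filter, the third bullet still applies: a sled can become unreachable after it has already been selected, which is why some form of periodic reconciliation (the RPW discussed above) is needed for the system to converge eventually.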