503 error when snapshotting disks attached to stopped instances #3289
In this case, it's a functional issue. Taking a snapshot of an attached disk is allowed.
Here's the bug in the snapshot create saga: omicron/nexus/src/app/sagas/snapshot_create.rs Lines 820 to 853 in c2de480
The disk state here is expected to be Detached, but if the disk is attached to a stopped instance this match returns 503. Meanwhile, the part of Nexus that checks whether the Pantry should be used to take a snapshot says to use the Pantry if the instance is stopped: omicron/nexus/src/app/snapshot.rs Lines 106 to 126 in c2de480
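To illustrate the mismatch, here is a minimal, hypothetical sketch of the problematic check (the type and function names are simplified stand-ins, not the actual saga code): the saga only accepts a Detached disk and rejects everything else with a 503, so a disk attached to a stopped instance falls through to the error arm even though the Pantry path should handle it.

```rust
/// Simplified stand-in for the disk states the saga inspects.
#[derive(Debug, PartialEq)]
enum DiskState {
    Detached,
    // `instance_running` is a stand-in for the attached instance's state.
    Attached { instance_running: bool },
}

/// Stand-in for the saga error that surfaces as HTTP 503.
#[derive(Debug, PartialEq)]
enum SagaError {
    ServiceUnavailable,
}

/// Sketch of the buggy check: only Detached is accepted, so a disk
/// attached to a *stopped* instance is also rejected with 503.
fn check_disk_state(state: &DiskState) -> Result<(), SagaError> {
    match state {
        DiskState::Detached => Ok(()),
        // Bug: this arm also catches Attached { instance_running: false },
        // which should be allowed via the Pantry path.
        _ => Err(SagaError::ServiceUnavailable),
    }
}
```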
There needs to be more work here: at minimum, the disk's state changes to Maintenance as part of this saga, and this has to interact with (read: block) the instance starting. This may not be a candidate for FCS, though, since a workaround exists: start the instance.
Thanks for root-causing/sizing this. Let's re-target this to MVP given the effort and impact involved. I'll document this known issue in the release notes because users will likely want to create clean snapshots of stopped instances (so that things such as temporary files and locks created by running applications won't be part of the snapshot).
Want to note that a customer indicated they would like to see a fix for this issue sooner, for the same reason I mentioned above (i.e., their best practice is to create snapshots of stopped instances).
Volumes are "checked out" from Nexus for many reasons, some of which include sending to another service for use in `Volume::construct`. When that service activates the resulting Volume, this will forcibly take over any existing downstairs connections based on the Upstairs' generation number. This is intentional, and was designed so Nexus, in handing out Volumes with increasing generation numbers, can be sure that the resulting Volume works no matter what (for example, even if a previous Upstairs is wedged somehow, even if the service that is running the previous Upstairs is no longer accepting network connections).

Up until now, Nexus wouldn't allow checking out a Volume if there is any chance a Propolis could be running that may use that Volume. This meant restricting certain operations, like creating a snapshot when a disk is attached to an instance that is stopped: any action Nexus would take to attempt a snapshot using a Pantry would race with a user's request to start that instance, and if the Volume checkouts occur in the wrong order the Pantry would take over connections from Propolis, resulting in guest OS errors.

Nexus _can_ do this safely though: it has all the information required to know when a checkout is safe to do, and when it may not be safe. This commit adds checks to the Volume checkout transaction that are based on the reason that checkout is occurring, and requires call sites that are performing a checkout to say why they are. Because these checks are performed inside a transaction, Nexus can say for sure when it is safe to allow a Volume to be checked out for a certain reason.

For example, in the scenario of taking a snapshot of a disk attached to an instance that is stopped, there are two checkout operations that have the possibility of racing:

1) the one that Nexus will send to a Pantry during a snapshot create saga.
2) the one that Nexus will send to a Propolis during an instance start saga.
If 1 occurs before 2, then Propolis will take over the downstairs connections that the Pantry has established, and the snapshot create saga will fail, but the guest OS for that Propolis will not see any errors. If 2 occurs before 1, then the 1 checkout will fail due to one of the conditions added in this commit: the checkout is being performed for use with a Pantry, and a Propolis _may_ exist, so reject the checkout attempt. Fixes #3289.
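The checkout-reason check described in the commit message can be sketched as follows. This is a hypothetical simplification, not the actual omicron types: `VolumeCheckoutReason`, `CheckoutError`, and the `propolis_may_exist` flag are illustrative names, and in the real implementation the instance-state check happens inside the same database transaction that bumps the generation number.

```rust
/// Illustrative reasons a call site might give for checking out a Volume.
#[derive(Debug, Clone, Copy, PartialEq)]
enum VolumeCheckoutReason {
    /// e.g. the snapshot create saga handing the Volume to a Pantry
    Pantry,
    /// e.g. the instance start saga handing the Volume to a Propolis
    InstanceStart,
}

#[derive(Debug, PartialEq)]
enum CheckoutError {
    PropolisMayExist,
}

/// Sketch of the per-reason check: a Pantry checkout is rejected if a
/// Propolis may exist for the attached instance, so the Pantry can never
/// steal downstairs connections out from under a running guest.
fn check_volume_checkout(
    reason: VolumeCheckoutReason,
    propolis_may_exist: bool,
) -> Result<(), CheckoutError> {
    match reason {
        // The Pantry must lose the race: reject rather than take over.
        VolumeCheckoutReason::Pantry if propolis_may_exist => {
            Err(CheckoutError::PropolisMayExist)
        }
        // Instance start (and a safe Pantry checkout) proceed, and the
        // increased generation number ensures the new Upstairs wins.
        _ => Ok(()),
    }
}
```

In this framing, ordering 2-then-1 from the commit message corresponds to the `Pantry` checkout arriving while `propolis_may_exist` is true, which fails fast instead of causing guest OS errors.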
I think I have seen the 503 before but mistook it for a control plane issue:
(Once I started up the VM, I was able to create snapshots for both of the disks attached to this vm on rack2.)
If snapshots are prohibited on disks attached to stopped instances, we should probably prevent the snapshot action with a more explicit error. A 503 response implies that the service is only temporarily unavailable and the user may retry the action.
If snapshots should be allowed on such disks, then it is a functional issue.