
503 error when snapshotting disks attached to stopped instances #3289

Closed
askfongjojo opened this issue Jun 4, 2023 · 4 comments · Fixed by #5221
Labels: known issue (To include in customer documentation and training)

@askfongjojo commented:

I think I have seen this 503 before but mistook it for a control plane issue:

```
$ oxide snapshot create --name loadgen-snap-stoppedvm --disk loadgen --description "disk attached to vm" --project try
error
Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "7c2b9888-1d8f-43c7-8dd7-8bec8167ca8f", "content-length": "133", "date": "Sun, 04 Jun 2023 06:29:02 GMT"}; value: Error { error_code: Some("ServiceNotAvailable"), message: "Service Unavailable", request_id: "7c2b9888-1d8f-43c7-8dd7-8bec8167ca8f" }
```

(Once I started up the VM, I was able to create snapshots for both of the disks attached to this vm on rack2.)

If snapshots are prohibited on disks attached to stopped instances, we should prevent the action with a more explicit error: a 503 response implies that the service is only temporarily unavailable and that the user may retry.

If snapshots should be allowed on such disks, then this is a functional issue.

@askfongjojo askfongjojo added this to the FCS milestone Jun 4, 2023
@jmpesp (Contributor) commented Jun 5, 2023:

In this case, it's a functional issue. Taking a snapshot of an attached disk is allowed.

@askfongjojo askfongjojo added the known issue To include in customer documentation and training label Jul 6, 2023
@jmpesp (Contributor) commented Jul 7, 2023:

Here's the bug in the snapshot create saga:

```rust
match db_disk.state().into() {
    external::DiskState::Detached => {
        info!(log, "setting state of {} to maintenance", params.disk_id);
        osagactx
            .datastore()
            .disk_update_runtime(
                &opctx,
                &authz_disk,
                &db_disk.runtime().maintenance(),
            )
            .await
            .map_err(ActionError::action_failed)?;
    }

    external::DiskState::Finalizing => {
        // This saga is a sub-saga of the finalize saga if the user has
        // specified an optional snapshot should be taken. No state change
        // is required.
        info!(log, "disk {} in state finalizing", params.disk_id);
    }

    _ => {
        // Return a 503 indicating that the user should retry
        return Err(ActionError::action_failed(
            Error::ServiceUnavailable {
                internal_message: format!(
                    "disk is in state {:?}",
                    db_disk.state(),
                ),
            },
        ));
    }
}
```

The disk state here is expected to be Detached, but if the disk is attached to a stopped instance this match will return 503. The part of Nexus that checks whether or not the Pantry should be used to take a snapshot says to use the Pantry if the instance is stopped:
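A minimal sketch of that dispatch (hypothetical, simplified types; not the actual Nexus code) makes the bug easy to see: any state other than Detached or Finalizing, including a disk attached to a stopped instance, falls into the catch-all arm and surfaces to the user as a 503:

```rust
// Hypothetical simplified model of the snapshot saga's disk-state check.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DiskState {
    Detached,
    Finalizing,
    Attached, // attached to an instance, whether running or stopped
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SagaOutcome {
    SetMaintenance,
    NoStateChange,
    ServiceUnavailable,
}

fn snapshot_saga_state_check(state: DiskState) -> SagaOutcome {
    match state {
        // Only a detached disk gets moved to maintenance for the snapshot.
        DiskState::Detached => SagaOutcome::SetMaintenance,
        // Sub-saga of the finalize saga: no state change required.
        DiskState::Finalizing => SagaOutcome::NoStateChange,
        // Everything else, including Attached-to-a-stopped-instance,
        // hits this arm and is returned to the user as a 503.
        _ => SagaOutcome::ServiceUnavailable,
    }
}
```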

```rust
match instance_state {
    // If there's a propolis running, use that
    InstanceState::Running |
    // Rebooting doesn't deactivate the volume
    InstanceState::Rebooting
    => false,

    // If there's definitely no propolis running, then use the
    // pantry
    InstanceState::Stopped | InstanceState::Destroyed => true,

    // If there *may* be a propolis running, then fail: we can't
    // know if that propolis has activated the Volume or not, or if
    // it's in the process of deactivating.
    _ => {
        return Err(
            Error::invalid_request(
                &format!("cannot snapshot attached disk for instance in state {}", instance_state)
            )
        );
    }
}
```

There is more work needed here: at a minimum, the disk's state changes to Maintenance as part of this saga, and that has to block the instance from starting. This may not be a candidate for FCS, though, since a workaround exists: start the instance first.
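A sketch of the guard this implies (hypothetical names and types, not the actual implementation): an instance must not start while any of its attached disks is in Maintenance, i.e. while a Pantry snapshot is in flight.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DiskState {
    Attached,
    Maintenance,
}

// Instance start must be refused while any attached disk is in
// Maintenance, so the Pantry's snapshot cannot race a new Propolis.
fn can_start_instance(attached_disks: &[DiskState]) -> bool {
    !attached_disks
        .iter()
        .any(|state| *state == DiskState::Maintenance)
}
```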

@askfongjojo (Author) commented:

Thanks for root-causing/sizing this. Let's re-target this to MVP given the effort and impact involved. I'll document this known issue in the release notes, because users will likely want to create clean snapshots on stopped instances (so that things such as temporary files and locks created by running applications won't be part of the snapshot).

@askfongjojo askfongjojo modified the milestones: FCS, MVP, 1.0.1 Jul 8, 2023
@askfongjojo (Author) commented:

Want to note that a customer indicated they would like to see a fix for this issue sooner, for the same reason I mentioned above (i.e. their best practice is to create snapshots on stopped instances).

@askfongjojo askfongjojo modified the milestones: MVP, 1.0.2 Aug 14, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.2, 1.0.3 Aug 22, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023
@askfongjojo askfongjojo modified the milestones: 3, 4 Oct 17, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5 Nov 14, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Dec 5, 2023
@leftwo leftwo modified the milestones: 6, MVP+1 Jan 19, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue Mar 7, 2024
Volumes are "checked out" from Nexus for many reasons, some of which
include sending to another service for use in `Volume::construct`. When
that service activates the resulting Volume, this will forcibly take
over any existing downstairs connections based on the Upstairs'
generation number. This is intentional, and was designed so Nexus, in
handing out Volumes with increasing generation numbers, can be sure that
the resulting Volume works no matter what (for example, even if a
previous Upstairs is wedged somehow, even if the service that is running
the previous Upstairs is no longer accepting network connections).

Up until now, Nexus wouldn't allow checking out a Volume if there is any
chance a Propolis could be running that may use that Volume. This meant
restricting certain operations, like creating a snapshot when a disk is
attached to an instance that is stopped: any action Nexus would take to
attempt a snapshot using a Pantry would race with a user's request to
start that instance, and if the Volume checkouts occur in the wrong
order the Pantry would take over connections from Propolis, resulting in
guest OS errors.

Nexus _can_ do this safely though: it has all the information required
to know when a checkout is safe to do, and when it may not be safe.
This commit adds checks to the Volume checkout transaction that are
based on the reason that checkout is occurring, and requires call sites
that are performing a checkout to say why they are. Because these checks
are performed inside a transaction, Nexus can say for sure when it is
safe to allow a Volume to be checked out for a certain reason.

For example, in the scenario of taking a snapshot of a disk attached to
an instance that is stopped, there are two checkout operations that have
the possibility of racing:

1) the one that Nexus will send to a Pantry during a snapshot create
   saga.

2) the one that Nexus will send to a Propolis during an instance start
   saga.

If 1 occurs before 2, then Propolis will take over the downstairs
connections that the Pantry has established, and the snapshot create
saga will fail, but the guest OS for that Propolis will not see any
errors. If 2 occurs before 1, then the 1 checkout will fail due to one
of the conditions added in this commit: the checkout is being performed
for use with a Pantry, and a Propolis _may_ exist, so reject the
checkout attempt.

Fixes oxidecomputer#3289.
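The reason-based check the commit describes could look roughly like this (hypothetical enum and function names; the actual check runs inside Nexus's Volume checkout database transaction):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CheckoutReason {
    // Checkout for use by the Pantry (e.g. snapshot of a detached or
    // stopped-instance disk).
    Pantry,
    // Checkout for use by a Propolis during instance start.
    InstanceStart,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum PropolisPresence {
    DefinitelyNone,
    MaybeRunning,
}

// Performed inside the checkout transaction: a Pantry checkout is rejected
// whenever a Propolis may exist, so the later of two racing checkouts
// fails safely instead of stealing the downstairs connections.
fn checkout_allowed(reason: CheckoutReason, propolis: PropolisPresence) -> bool {
    match (reason, propolis) {
        (CheckoutReason::Pantry, PropolisPresence::MaybeRunning) => false,
        _ => true,
    }
}
```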
@askfongjojo askfongjojo modified the milestones: MVP+1, 8 Mar 9, 2024
jmpesp added a commit that referenced this issue Apr 3, 2024