Skip to content

Commit

Permalink
Add some unit tests for sled-agent Instance creation
Browse files Browse the repository at this point in the history
At time of writing, instance creation roughly looks like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: `Instance::propolis_ensure`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
      - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
        - `RunningZone::install` and `Zones::boot`
        - `illumos_utils::svc::wait_for_service`
        - `self::wait_for_http_server` for propolis-server itself
      - sled-agent: `Instance::ensure_propolis_and_tasks`
        - sled-agent: spawn `Instance::monitor_state_task`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
    - sled-agent: return ok result
- nexus: `handle_instance_put_result`

Or at least, it does in the happy path. omicron#3927 saw propolis zone
creation take longer than the minute nexus's call to sled-agent's
`instance_put_state`. That might've looked something like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: `Instance::propolis_ensure`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
      - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
        - `RunningZone::install` and `Zones::boot`
- nexus: i've been waiting a whole minute for this. connection timeout!
- nexus: `handle_instance_put_result`
    - sled-agent: [...] return... oh, they hung up. :(

To avoid this timeout being implicit at the *Dropshot configuration*
layer (that is to say, we should still have *some* timeout),
we could consider a small refactor to make `instance_put_state` not a
blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band `cpapi_instances_put` calls! That might look
something like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: spawn {
      - sled-agent: `Instance::propolis_ensure`
        - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent: `Instance::setup_propolis_locked` (blocking!)
        - sled-agent: `Instance::ensure_propolis_and_tasks`
          - sled-agent: spawn `Instance::monitor_state_task`
        - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call

(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)

Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.
  • Loading branch information
lif committed Nov 15, 2023
1 parent f686829 commit 8c23c45
Show file tree
Hide file tree
Showing 4 changed files with 526 additions and 3 deletions.
41 changes: 38 additions & 3 deletions sled-agent/src/fakes/nexus.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,18 @@
//! to operate correctly.
use dropshot::{
endpoint, ApiDescription, FreeformBody, HttpError, HttpResponseOk, Path,
RequestContext,
endpoint, ApiDescription, FreeformBody, HttpError, HttpResponseOk,
HttpResponseUpdatedNoContent, Path, RequestContext, TypedBody,
};
use hyper::Body;
use internal_dns::ServiceName;
use omicron_common::api::external::Error;
use omicron_common::api::internal::nexus::UpdateArtifactId;
use omicron_common::api::internal::nexus::{
SledInstanceState, UpdateArtifactId,
};
use schemars::JsonSchema;
use serde::Deserialize;
use uuid::Uuid;

/// Implements a fake Nexus.
///
Expand All @@ -28,6 +33,13 @@ pub trait FakeNexusServer: Send + Sync {
) -> Result<Vec<u8>, Error> {
Err(Error::internal_error("Not implemented"))
}
fn cpapi_instances_put(
&self,
_instance_id: Uuid,
_new_runtime_state: SledInstanceState,
) -> Result<(), Error> {
Err(Error::internal_error("Not implemented"))
}
}

/// Describes the server context type.
Expand All @@ -52,9 +64,32 @@ async fn cpapi_artifact_download(
))
}

#[derive(Deserialize, JsonSchema)]
struct InstancePathParam {
instance_id: Uuid,
}

#[endpoint {
method = PUT,
path = "/instances/{instance_id}",
}]
async fn cpapi_instances_put(
request_context: RequestContext<ServerContext>,
path_params: Path<InstancePathParam>,
new_runtime_state: TypedBody<SledInstanceState>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
let context = request_context.context();
context.cpapi_instances_put(
path_params.into_inner().instance_id,
new_runtime_state.into_inner(),
)?;
Ok(HttpResponseUpdatedNoContent())
}

fn api() -> ApiDescription<ServerContext> {
let mut api = ApiDescription::new();
api.register(cpapi_artifact_download).unwrap();
api.register(cpapi_instances_put).unwrap();
api
}

Expand Down
Loading

0 comments on commit 8c23c45

Please sign in to comment.