Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TOCTOU when adding network interfaces to an instance #1222

Merged
merged 4 commits into from
Jun 22, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 60 additions & 43 deletions nexus/src/app/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,14 @@ use super::MAX_DISKS_PER_INSTANCE;
use crate::app::sagas;
use crate::authn;
use crate::authz;
use crate::authz::ApiResource;
use crate::context::OpContext;
use crate::db;
use crate::db::identity::Resource;
use crate::db::lookup::LookupPath;
use crate::db::model::Name;
use crate::db::queries::network_interface;
use crate::external_api::params;
use omicron_common::api::external;
use omicron_common::api::external::ByteCount;
use omicron_common::api::external::CreateResult;
use omicron_common::api::external::DataPageParams;
Expand Down Expand Up @@ -654,33 +654,24 @@ impl super::Nexus {
instance_name: &Name,
params: &params::NetworkInterfaceCreate,
) -> CreateResult<db::model::NetworkInterface> {
let (.., authz_project, authz_instance, db_instance) =
let (.., authz_project, authz_instance) =
LookupPath::new(opctx, &self.db_datastore)
.organization_name(organization_name)
.project_name(project_name)
.instance_name(instance_name)
.fetch()
.lookup_for(authz::Action::Modify)
.await?;

// TODO-completeness: We'd like to relax this once hot-plug is
// supported.
//
// TODO-correctness: There's a TOCTOU race here. Someone might start the
// instance between this check and when we actually create the NIC
// record. One solution is to place the state verification in the query
// to create the NIC. Unfortunately, that query is already very
// complicated. See
// https://github.com/oxidecomputer/omicron/issues/1134.
let stopped =
db::model::InstanceState::new(external::InstanceState::Stopped);
if db_instance.runtime_state.state != stopped {
return Err(external::Error::invalid_request(
"Instance must be stopped to attach a new network interface",
));
}

// NOTE: We need to lookup the VPC and VPC Subnet, since we need both
// IDs for creating the network interface.
//
// TODO-correctness: There are additional races here. The VPC and VPC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it isn't one, it's another, eh?

(Semi-related to this PR, I really want better testing / analysis to help us suss these issues out. DB-related raciness is an area with a lot of potential for unsafety)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it isn't one, it's another, eh?

Indeed :)

I really want better testing / analysis to help us suss these issues out

Yeah, that'd be nice. I'm not sure how to force ourselves to hit this, since the flow would be: fetch the VPC Subnet successfully, delete the VPC Subnet, then try to run this query. How would we do that? Maybe part of a set of stress tests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I can think of a view options:

  • Random-walk stress testing: ideally, having a "set of operations" and "set of objects/names" that can all be acted upon, and doing some random-walk style operations would be neat.
  • Permutation testing: we could try to integrate something like https://github.com/tokio-rs/loom to invasively test the DB-related ops. There would still be reliance on CRDB, however, so these tests would probably not run as a part of the CI check - they might be delegated to "once daily" jobs.
  • Model checking: we could attempt to model the DB interactions externally (e.g., through TLA+) and try to prove safety properties about them. I'm skeptical about the cost/benefit of this approach; it seems like a more expensive option.

Copy link
Collaborator Author

@bnaecker bnaecker Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the second approach. The first is probably easiest, but I'd be worried about baking in our assumptions about how things should be used. I agree the formal verification approach is likely to be too heavyweight.

// Subnet could both be deleted between the time we fetch them and
// actually insert the record for the interface. The solution is likely
// to make both objects implement `DatastoreCollection` for their
// children, and then use `VpcSubnet::insert_resource` inside the
// `instance_create_network_interface` method. See
// https://github.com/oxidecomputer/omicron/issues/738.
let vpc_name = db::model::Name(params.vpc_name.clone());
let subnet_name = db::model::Name(params.subnet_name.clone());
let (.., authz_vpc, authz_subnet, db_subnet) =
Expand All @@ -699,17 +690,35 @@ impl super::Nexus {
params.identity.clone(),
params.ip,
)?;
let interface = self
.db_datastore
self.db_datastore
.instance_create_network_interface(
opctx,
&authz_subnet,
&authz_instance,
interface,
)
.await
.map_err(network_interface::InsertError::into_external)?;
Ok(interface)
.map_err(|e| {
debug!(
self.log,
"failed to create network interface";
"instance_id" => ?authz_instance.id(),
"interface_id" => ?interface_id,
"error" => ?e,
);
if matches!(
e,
network_interface::InsertError::InstanceNotFound(_)
) {
// Return the not-found message of the authz interface
// object, so that the message reflects how the caller
// originally looked it up
authz_instance.not_found()
} else {
// Convert other errors into an appropriate client error
network_interface::InsertError::into_external(e)
}
})
}

/// Lists network interfaces attached to the instance.
Expand Down Expand Up @@ -794,37 +803,45 @@ impl super::Nexus {
instance_name: &Name,
interface_name: &Name,
) -> DeleteResult {
let (.., authz_instance, db_instance) =
LookupPath::new(opctx, &self.db_datastore)
.organization_name(organization_name)
.project_name(project_name)
.instance_name(instance_name)
.fetch_for(authz::Action::Modify)
.await?;
let (.., authz_instance) = LookupPath::new(opctx, &self.db_datastore)
.organization_name(organization_name)
.project_name(project_name)
.instance_name(instance_name)
.lookup_for(authz::Action::Modify)
.await?;
let (.., authz_interface) = LookupPath::new(opctx, &self.db_datastore)
.instance_id(authz_instance.id())
.network_interface_name(interface_name)
.lookup_for(authz::Action::Delete)
.await?;

// TODO-completeness: We'd like to relax this once hot-plug is supported
// TODO-correctness: There's a race condition here. Someone may start
// the instance after this check but before we actually delete the NIC.
let stopped =
db::model::InstanceState::new(external::InstanceState::Stopped);
if db_instance.runtime_state.state != stopped {
return Err(external::Error::invalid_request(
"Instance must be stopped to detach a network interface",
));
}
self.db_datastore
.instance_delete_network_interface(
opctx,
&authz_instance,
&authz_interface,
)
.await
.map_err(network_interface::DeleteError::into_external)
.map_err(|e| {
debug!(
self.log,
"failed to delete network interface";
"instance_id" => ?authz_instance.id(),
"interface_id" => ?authz_interface.id(),
"error" => ?e,
);
if matches!(
e,
network_interface::DeleteError::InstanceNotFound(_)
) {
// Return the not-found message of the authz interface
// object, so that the message reflects how the caller
// originally looked it up
authz_instance.not_found()
} else {
// Convert other errors into an appropriate client error
network_interface::DeleteError::into_external(e)
}
})
}

/// Invoked by a sled agent to publish an updated runtime state for an
Expand Down
Loading