No external IP addresses available error when creating an instance with no external IPs #4322

Closed
sudomateo opened this issue Oct 23, 2023 · 5 comments

Comments

@sudomateo

Summary

I was previewing an Oxide rack at All Things Open 2023 and tried to load test it by spinning up 125 instances (16 vCPUs and 256 GiB RAM each) when I ran into the following error.

╷
│ Error: Error creating instance
│ 
│   with oxide_disk.ato[68],
│   on main.tf line 76, in resource "oxide_instance" "ato":
│   76: resource "oxide_instance" "ato" {
│ 
│ API error: HTTP 400 (https://ato.sys.r3.oxide-preview.com/v1/instances?project=63adf1cc-91b5-40cd-9f3e-9568d8d07d26) BODY -> {
│   "request_id": "fa980654-08b1-484d-87f7-f107110c2101",
│   "error_code": "InvalidRequest",
│   "message": "No external IP addresses available"
│ }
╵

I spun up two classes of instances, bastion and worker. The bastion instance had an external IP address.

> oxide instance external-ip list --instance example-bastion --project ato-sandbox
success
ExternalIpResultsPage {
    items: [
        ExternalIp {
            ip: 45.154.216.66,
            kind: Ephemeral,
        },
    ],
    next_page: None,
}

The worker instances did not have an external IP address.

> oxide instance external-ip list --instance example-worker-000 --project ato-sandbox
success
ExternalIpResultsPage {
    items: [],
    next_page: None,
}

I would not expect a "No external IP addresses available" error when creating instances that request no external IPs. The error itself seems to come from this function in omicron:

/// Variant of [Self::allocate_external_ip] which may be called from a
/// transaction context.
pub(crate) async fn allocate_external_ip_on_connection(
    conn: &async_bb8_diesel::Connection<DbConnection>,
    data: IncompleteExternalIp,
) -> CreateResult<ExternalIp> {
    let explicit_ip = data.explicit_ip().is_some();
    NextExternalIp::new(data).get_result_async(conn).await.map_err(|e| {
        use diesel::result::Error::NotFound;
        match e {
            NotFound => {
                if explicit_ip {
                    Error::invalid_request(
                        "Requested external IP address not available",
                    )
                } else {
                    Error::invalid_request(
                        "No external IP addresses available",
                    )
                }
            }
            _ => crate::db::queries::external_ip::from_diesel(e),
        }
    })
}

Reproduction

  1. Get an Oxide host and API token.

  2. Create a main.tf with the following content. Update host and token to your Oxide host and token.

    terraform {
      required_providers {
        oxide = {
          source  = "oxidecomputer/oxide"
          version = "0.1.0-beta"
        }
      }
    }
    
    provider "oxide" {
      host  = "https://ato.sys.r3.oxide-preview.com/"
      token = "oxide-token-REDACTED"
    }
    
    data "oxide_project" "ato" {
      name = "ato-sandbox"
    }
    
    data "oxide_image" "ubuntu" {
      name = "jammy-server"
    }
    
    data "oxide_vpc" "ato" {
      name         = "default"
      project_name = data.oxide_project.ato.name
    }
    
    data "oxide_vpc_subnet" "ato" {
      name         = "default"
      project_name = data.oxide_project.ato.name
      vpc_name     = data.oxide_vpc.ato.name
    }
    
    locals {
      # Number of worker instances to create. Increase this until the API returns
      # the InvalidRequest "No external IP addresses available" error.
      num_instances = 125
    }
    
    resource "oxide_disk" "bastion" {
      name            = "example-bastion"
      description     = "Boot disk for example-bastion"
      project_id      = data.oxide_project.ato.id
      size            = data.oxide_image.ubuntu.size
      source_image_id = data.oxide_image.ubuntu.id
    }
    
    resource "oxide_instance" "bastion" {
      name             = "example-bastion"
      description      = "example-bastion"
      project_id       = data.oxide_project.ato.id
      host_name        = "example-bastion"
      memory           = 4294967296
      ncpus            = 2
      start_on_create  = true
      disk_attachments = [oxide_disk.bastion.id]
      network_interfaces = [{
        name        = "net0"
        description = "example-bastion"
        subnet_id   = data.oxide_vpc_subnet.ato.id
        vpc_id      = data.oxide_vpc.ato.id
      }]
      external_ips = [{
        ip_pool_name = "default"
        type         = "ephemeral"
      }]
    }
    
    resource "oxide_disk" "ato" {
      count           = local.num_instances
      project_id      = data.oxide_project.ato.id
      description     = "Boot disk for example-worker-${format("%03d", count.index)}"
      name            = "example-worker-${format("%03d", count.index)}"
      size            = 10737418240
      source_image_id = data.oxide_image.ubuntu.id
    }
    
    resource "oxide_instance" "ato" {
      count            = local.num_instances
      project_id       = data.oxide_project.ato.id
      description      = "example-worker-${format("%03d", count.index)}"
      name             = "example-worker-${format("%03d", count.index)}"
      host_name        = "example-worker-${format("%03d", count.index)}"
      memory           = 274877906944
      ncpus            = 16
      start_on_create  = true
      disk_attachments = [oxide_disk.ato[count.index].id]
      network_interfaces = [{
        name        = "net0"
        description = "example-worker-${format("%03d", count.index)}"
        subnet_id   = data.oxide_vpc_subnet.ato.id
        vpc_id      = data.oxide_vpc.ato.id
      }]
    }
    
    data "oxide_instance_external_ips" "bastion" {
      instance_id = oxide_instance.bastion.id
    }
    
    output "bastion_external_ip" {
      value = data.oxide_instance_external_ips.bastion
    }
    
    output "worker_internal_ips" {
      value = [for ip in oxide_instance.ato[*].network_interfaces[*].ip_address : ip[0]]
    }
  3. Apply the configuration. Increase the num_instances local until you exhaust your external IP address pool.

@bnaecker
Collaborator

This is an unfortunate, but known, limitation in the current implementation of external IPs. For any instance which can reach external networks, we assign a source NAT address, in addition to any requested Ephemeral addresses. SNAT addresses are consumed at a lower rate than Ephemeral addresses, since they use 1/4 of the port range, while Ephemeral addresses take the entire range. But this currently means that one can create fewer instances than the size of the IP pool (assuming they all have external networking of some kind), since each gets roughly 1.25 addresses.
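
The arithmetic in this comment can be sketched roughly as follows. This is a back-of-the-envelope model, not Omicron code: `addresses_needed` and both constants are hypothetical names, and the model assumes that four SNAT blocks pack perfectly onto one address and that an Ephemeral address is never shared with SNAT blocks.

```rust
/// Ports available on one external IP address (the full 16-bit port range).
const PORTS_PER_ADDRESS: u64 = 65_536;

/// Assumed size of one SNAT block: a quarter of the port range, as described
/// in the comment above.
const PORTS_PER_SNAT_BLOCK: u64 = PORTS_PER_ADDRESS / 4;

/// Estimate how many pool addresses a set of instances consumes.
///
/// Hypothetical helper for illustration only. Every instance takes one SNAT
/// block; instances that request an Ephemeral IP take one whole address on
/// top of that.
fn addresses_needed(ephemeral_instances: u64, snat_only_instances: u64) -> u64 {
    let total_instances = ephemeral_instances + snat_only_instances;
    let blocks_per_address = PORTS_PER_ADDRESS / PORTS_PER_SNAT_BLOCK; // 4

    // Round up: a partially filled SNAT address is still a consumed address.
    let snat_addresses = (total_instances + blocks_per_address - 1) / blocks_per_address;

    ephemeral_instances + snat_addresses
}
```

Under these assumptions, four instances that each request an Ephemeral address consume five addresses (1.25 each), and the reproduction above (one bastion with an Ephemeral address plus 125 workers without one) would need `addresses_needed(1, 125)`, which is 33 addresses.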

@sudomateo
Author

Thank you for the information! That explains the behavior I was seeing where sometimes adding one more instance would trigger this error and other times adding a few more instances would trigger this error. In each of those cases there were other users creating and destroying instances with external IPs while I was testing.
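
A toy model may help show why the failure point moves around. It is only a sketch under stated assumptions: `Pool`, its fields, and `workers_until_error` are hypothetical and do not reflect Omicron's actual data model. The model assumes each address dedicated to SNAT holds four blocks and that an Ephemeral IP always takes a whole free address.

```rust
/// Assumed number of SNAT blocks per address (1/4 of the port range each).
const SNAT_SLOTS_PER_ADDRESS: u32 = 4;

/// Toy model of a shared IP pool. Hypothetical, for illustration only.
#[derive(Debug)]
struct Pool {
    /// Addresses with no allocation at all.
    free_addresses: u32,
    /// Unused SNAT blocks on addresses already dedicated to SNAT.
    open_snat_slots: u32,
}

impl Pool {
    /// Take one SNAT block, opening a fresh address only when no block is free.
    fn allocate_snat(&mut self) -> Result<(), &'static str> {
        if self.open_snat_slots == 0 {
            if self.free_addresses == 0 {
                return Err("No external IP addresses available");
            }
            self.free_addresses -= 1;
            self.open_snat_slots = SNAT_SLOTS_PER_ADDRESS;
        }
        self.open_snat_slots -= 1;
        Ok(())
    }

    /// Take one whole address for an Ephemeral IP.
    fn allocate_ephemeral(&mut self) -> Result<(), &'static str> {
        if self.free_addresses == 0 {
            return Err("No external IP addresses available");
        }
        self.free_addresses -= 1;
        Ok(())
    }
}

/// Count how many SNAT-only workers can be created before the pool errors.
fn workers_until_error(mut pool: Pool) -> u32 {
    let mut created = 0;
    while pool.allocate_snat().is_ok() {
        created += 1;
    }
    created
}
```

In this model, a pool left with no free addresses and one open SNAT block accepts exactly one more worker, while a pool with one free address accepts four. If another user takes that last free address for an Ephemeral IP first, the next worker fails immediately.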

I'll leave this issue open so you all can decide what to do with it.

For my own curiosity, are there specific RFD(s) one can read to learn more about how networking for instances works in the Oxide rack?

@bnaecker
Collaborator

You're welcome, I'm glad that was helpful. RFD 63 is probably the best overview of networking in the product that's publicly available. Keep in mind that it describes the intended state of the product, and as you've discovered, we're still very much in the process of building it! Some of the features it describes are not completed, but it certainly captures the high-level design and intent very well.

@karencfv
Contributor

Hey there! Thanks a bunch for such a detailed issue and for trying out our software 😊. Please keep the feedback coming!

I'll transfer this issue to the Omicron repo as this behaviour stems from there.

@karencfv transferred this issue from oxidecomputer/terraform-provider-oxide Oct 24, 2023

@bnaecker
Collaborator

Closing as dup of #4317
