
[DRAFT] Allocate static IPv6 addresses for Propolis zones #801

Closed
wants to merge 9 commits

Conversation

jmpesp
Contributor

@jmpesp jmpesp commented Mar 23, 2022

Add the ability for Nexus to hand out static IPv6 addresses for use by
Propolis zones.

TODO still draft

@jmpesp jmpesp requested a review from smklein March 23, 2022 12:49
@jmpesp
Contributor Author

jmpesp commented Mar 23, 2022

There are a few XXX markers to deal with before merging; I'd like some suggestions on them, please :)

@smklein smklein requested a review from bnaecker March 23, 2022 15:16
common/src/sql/dbinit.sql (two outdated review threads, resolved)
@@ -5,6 +5,12 @@
# Identifier for this instance of Nexus
id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"

[static_v6_allocation_config]
# Subnet to allocate static v6 IPs from
subnet = "fd00:1234::/96"
Collaborator

I have two questions about this subnet:

  1. What differentiates this from the subnet on the underlay assigned to the sled during rack setup?
  2. It seems like this should be a /64, as with other sled-specific subnetworks, but I'm not 100% clear on the context.

nexus/src/db/datastore.rs (outdated review thread, resolved)
Collaborator

@smklein smklein left a comment

heh, lots of ben's comments mirror my own. Left 'em in for posterity though.

@@ -5,6 +5,12 @@
# Identifier for this instance of Nexus
id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"

[static_v6_allocation_config]
# Subnet to allocate static v6 IPs from
subnet = "fd00:1234::/96"
Collaborator

If I'm reading https://rfd.shared.oxide.computer/rfd/0063#_ipv6_addressing correctly:

An availability-zone is represented by an IPv6 /48.
A rack and its resources are represented by an IPv6 /56.
A server and its resources are represented by an IPv6 /64.

It's unclear to me whether this should change the allocation scheme right now. We aren't really using the ULAs yet (though maybe we should), but it makes sense to consider a subsection of the /56 "rack-wide" physical address space for this purpose.


self.pool()
.transaction(move |conn| {
// First, search for released addresses
Collaborator

I added @bnaecker to this review because this definitely looks like it overlaps somewhat with:

fn push_select_next_available_ip_subquery(

@@ -277,3 +280,38 @@ async fn cpapi_artifact_download(

Ok(Response::builder().status(StatusCode::OK).body(body.into())?)
}

/// Endpoint used by Sled Agents to allocate static V6 addresses
Collaborator

I think we could simplify the allocation space by allocating within the context of a sled.

This API implies that we're allocating from an IP space that's basically rack-wide, but if we're allocating within an individual sled's /64 space (which I think we planned to do anyway, according to RFD 63), we'll have less risk of collision, which should make the allocation problem a little easier.

Basically: WDYT about providing an identifier for the sled as an input parameter here, and to the allocator? Then Nexus can basically make a "per-sled" subnet from which to allocate.

Collaborator

That's a nice idea, and I think it fits better with the rack setup service's behavior too. The operator will choose an AZ and rack number, and then each sled will have a further ID within those. That'll all combine to make an fd00::/64 for each sled, so a per-sled allocation here seems good.

It also suggests that the database table might need the sled UUID in it, I believe.
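
For reference, a minimal Rust sketch of how an AZ /48, a rack number, and a sled number could combine into a per-sled /64 prefix, following the /48 → /56 → /64 layout quoted from RFD 63 above. The 8-bit field widths, the helper name, and the example values are illustrative assumptions, not the actual rack setup code:

use std::net::Ipv6Addr;

// Combine an AZ-wide /48 prefix with a rack number and a sled number to
// form the sled's /64 underlay prefix (assumed field widths: 8 bits each).
fn sled_prefix(az_prefix: Ipv6Addr, rack: u8, sled: u8) -> Ipv6Addr {
    let mut octets = az_prefix.octets();
    octets[6] = rack; // bits 48..56: rack within the AZ
    octets[7] = sled; // bits 56..64: sled within the rack
    Ipv6Addr::from(octets)
}

fn main() {
    let az: Ipv6Addr = "fd00:1122:3344::".parse().unwrap();
    // Rack 1, sled 7 -> fd00:1122:3344:107::/64
    println!("{}/64", sled_prefix(az, 0x01, 0x07));
}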

Collaborator

Yeah, that tracks. Being able to predictably access all IP addresses allocated to a sled would also be important when we, for example, decommission a sled (or cope with it failing).

nexus/src/db/schema.rs (outdated review thread, resolved)
nexus/src/db/datastore.rs (two outdated review threads, resolved)
}

Err(_) => {
// XXX should match on primary key conflict
Collaborator

@bnaecker's subnet allocation has an example of this:

// Primary key constraint violation. See notes above.
Some(constraint) if constraint == PRIMARY_KEY_CONSTRAINT => {
    NetworkInterfaceError::DuplicatePrimaryKey(
        interface.identity.id,
    )
}

Collaborator

@jmpesp If you find yourself checking specific database error conditions or messages, let me know. I've been prototyping some tooling to make that kind of thing easier. That hasn't been too valuable yet, with only 2 use sites, but if we add more...

nexus/src/db/datastore.rs (outdated review thread, resolved)
sled-agent/src/instance.rs (outdated review thread, resolved)
Comment on lines +3408 to +3420
self.pool()
.transaction(move |conn| {
Collaborator

I'm not sure how important this is, but it'll definitely be feasible to allocate addresses in a single query, without a transaction. The push_next_available_ip_subquery function is probably a good place to start. One could use a left outer join with the table itself, joining on the candidate address, keeping only rows where no matching address exists, and taking the first.

For example, this query should work OK:

 select 'fd00::'::INET + off as addr from generate_series(0, 10) as off left outer join static_v6_address on (address = 'fd00::'::INET + off) WHERE address is NULL LIMIT 1;

That's basically the same thing we have in the linked function, so it might be nice to factor it out if possible. It takes the first address for which there is no corresponding row already in the static_v6_address table. It's effectively the query you have here in the Rust transaction, just as a single statement. It's still linear in the number of allocated addresses, but requires no transaction or multiple attempts. Hope that helps!

Collaborator

Building on top of this: I wonder how long we could survive with a (simple) linear allocator that just starts at the last allocated address, rather than always starting from the beginning. This would basically let us comb around the full address space, rather than repeatedly checking the first few addresses.

(Could definitely be a punted-until-later optimization, but seems like it would help both our address allocators)
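
As a rough sketch of that "start from the last allocated address" idea, here is an in-memory toy next-fit allocator over a small offset space. It is illustrative only; the real allocator would keep this state in the database rather than in a HashSet:

use std::collections::HashSet;

// Toy "next fit" allocator over `size` offsets within a subnet: the scan
// starts just after the last allocated offset and wraps around, instead of
// rescanning from the lowest address every time.
struct NextFit {
    size: u128,
    last: u128,
    allocated: HashSet<u128>,
}

impl NextFit {
    fn new(size: u128) -> Self {
        Self { size, last: 0, allocated: HashSet::new() }
    }

    fn allocate(&mut self) -> Option<u128> {
        for step in 1..=self.size {
            let candidate = (self.last + step) % self.size;
            if self.allocated.insert(candidate) {
                self.last = candidate;
                return Some(candidate);
            }
        }
        None // address space exhausted
    }
}

fn main() {
    let mut alloc = NextFit::new(16);
    assert_eq!(alloc.allocate(), Some(1));
    assert_eq!(alloc.allocate(), Some(2));
}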

Collaborator

Yeah, that's true. Part of the reason we're using IPv6 is that each sled has an entire /64 to work with. That's...a lot of addresses, so just picking the "next" rather than the lowest unallocated seems OK in the short term. Neither is great in the long term, but the "next after the last allocated address" approach is definitely simpler. The "first fit" one in the linked function makes more sense there, because we're talking about potentially tiny subnets in a customer's VPC Subnet, which can be as small as an IPv4 /26, I believe. That's only 64 addresses, 6 of which we reserve for ourselves!

Collaborator

@bnaecker bnaecker left a comment

This is a great start!

Besides my comments in the source, I'm not sure I understand the whole picture here. Are these just for Propolis zones? Or for any Oxide service that the sled-agent starts and manages? The other main question is about the actual subnet these come from. IIUC, these should be on the underlay network, which means they must be allocated out of the /64 that the sled gets during rack setup. Not that we know what that subnet is, mind you, just that we know the size is /64, not /96.

@smklein
Collaborator

smklein commented Mar 23, 2022

Besides my comments in the source, I'm not sure I understand the whole picture here. Are these just for propolis zones? Or for any Oxide service that the sled-agent starts and manages?

@jmpesp , please correct me if I'm wrong, but I believe this is attempting to address this chunk of code in instance creation:

let running_zone = RunningZone::boot(installed_zone).await?;
let network = running_zone.ensure_address(AddressRequest::Dhcp).await?;
info!(inner.log, "Created address {} for zone: {}", network, zname);

Currently, when we allocate Propolis instances, we are giving them IPv4 addresses assigned via DHCP.

I think we want the underlay to actually be on IPv6, and from chatting with James, I think Propolis has had some difficulties just trying to reach other IPv6 addresses when it doesn't have one of its own. I think this particularly popped up when trying to access Crucible services via IPv6 addresses; our workaround had been "allocate an arbitrary IPv6 address to Propolis, so it can speak to other IPv6 addresses", but that doesn't really scale.

@smklein
Collaborator

smklein commented Mar 23, 2022

I'm brainstorming a bit, but:

The proposal for the "bootstrap network" in RFD 63 suggests combining a sequence of unique bits (e.g., a sled's MAC address) with a well-known prefix to form a unique IPv6 address.

What do you think about doing something similar here?

If we start with:

  • A server and its resources are represented by an IPv6 /64...
  • and we know that MAC addresses are 48 bits long...
  • and we know that each Propolis Zone gets a dedicated VNIC for control-plane communication...

Couldn't we give static addresses to Propolis like the following:

128 bit IPv6 address = 
[ 64 bits of Sled-Specific Underlay Prefix ] [ 48 bits of Propolis Zone VNIC MAC address ] [ 16 bits of whatever we want ]

This has some advantages:

  • The address remains unique within the physical network, because of the prefix
  • The address remains unique on the Sled, because of the VNIC's MAC address
  • We don't need to allocate anything / track any of this state in Nexus

@rmustacc

IPv6 already has a well-formed way of transforming a MAC address into an EUI-64, so what I'd recommend is that, in general, we don't try to allocate a specific ordering of addresses and instead always derive the lower bits from the MAC address. This means the lower 64 bits will always be unique and can simply be appended to the prefix. Now, it still comes down to how the MAC address for an underlying Propolis instance is chosen (as opposed to the guest-visible ones), but that just makes it an AZ-wide random value that we want to be unique.
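
For reference, a minimal Rust sketch of the modified EUI-64 transformation described above: flip the universal/local bit of the MAC's first octet, insert ff:fe in the middle, and append the result to the sled's /64 prefix. The function name and example values are illustrative, not code from this PR:

use std::net::Ipv6Addr;

// Build a modified EUI-64 interface identifier from a 48-bit MAC address
// and append it to a /64 prefix.
fn eui64_addr(prefix: Ipv6Addr, mac: [u8; 6]) -> Ipv6Addr {
    let mut octets = prefix.octets();
    // Lower 64 bits: MAC with ff:fe inserted in the middle and the
    // universal/local bit of the first octet flipped.
    octets[8] = mac[0] ^ 0x02;
    octets[9] = mac[1];
    octets[10] = mac[2];
    octets[11] = 0xff;
    octets[12] = 0xfe;
    octets[13] = mac[3];
    octets[14] = mac[4];
    octets[15] = mac[5];
    Ipv6Addr::from(octets)
}

fn main() {
    let prefix: Ipv6Addr = "fd00:1122:3344:101::".parse().unwrap();
    let vnic_mac = [0xa8, 0x40, 0x25, 0x00, 0x00, 0x01];
    // -> fd00:1122:3344:101:aa40:25ff:fe00:1
    println!("{}", eui64_addr(prefix, vnic_mac));
}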

Add the ability for Nexus to hand out static IPv6 addresses for use by
Propolis zones.

Allocate the address as part of the instance create saga, and the
instance migrate saga.

Send to sled-agent as part of the instance ensure request so it can
attach to the control vnic.

TODO left in internal api endpoints, are they required anymore?
TODO instance destroy should free IP? or through sled agent?
TODO associate IP with instance?
@jmpesp
Contributor Author

jmpesp commented Mar 24, 2022

Took another swing at this today:

  • addresses are allocated as part of instance create and instance migrate sagas
  • they're sent to the sled agent as part of the instance ensure request

Lots of XXXs (things for me to resolve) still.

@jmpesp
Contributor Author

jmpesp commented Mar 24, 2022

Thoughts for tomorrow: I'm pretty sure control_ip is going to have to be a column in the instance, but I was hoping that the static_v6_address table would be a single location for allocation to avoid having to check many places.

@jmpesp
Contributor Author

jmpesp commented Mar 25, 2022

Thoughts for tomorrow: I'm pretty sure control_ip is going to have to be a column in the instance, but I was hoping that the static_v6_address table would be a single location for allocation to avoid having to check many places.

This should have been titled "thoughts for hack night" :)

8aa5773 adds an associated UUID to the row, meaning callers that allocate a static IP now have to supply their UUID. It's difficult to go from this column back to whatever actually allocated the address (read: one has to search every table for the ID), but it means that multiple types of resources can allocate IPs for themselves, and conflicts are localized to one table.

@jmpesp
Contributor Author

jmpesp commented Apr 8, 2022

Obviated by #891, closing

@jmpesp jmpesp closed this Apr 8, 2022
@jmpesp jmpesp deleted the allocate_v6 branch April 8, 2022 20:11
jmpesp added a commit that referenced this pull request Jul 6, 2023
Bump crucible rev to pick up:

- fix build break caused by merging #801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Note that #3456 bumped package-manifest, but this commit bumps both that
and Cargo.toml. All Propolis commits between 04a27573 and 25111a88 are
listed above.
jmpesp added a commit that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging #801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. This will change shortly, so set the MTU of $DATALINK in
the Clickhouse and CockroachDB method scripts. Also remove the code in
`RunningZone::boot` that performs commands on self-assembling zones.
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.

Fixes oxidecomputer#3512
leftwo pushed a commit that referenced this pull request Oct 25, 2024
Crucible changes
Add test and fix for replay race condition (#1519)
Fix clippy warnings (#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510)
Track both jobs and bytes in each IO state (#1507)
Fix new `rustc` and `clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411)
Turn off test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489)
Expand summary and add documentation references to the README. (#1486)
Remove `GuestWorkId` (2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799)
lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795)
add marker trait to help check safety of guest memory reads (#794)
clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782)
PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve (#780)
PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777)
PciPath to Bdf conversion is infallible; prove it and refactor (#774)
instance spec rework: flatten InstanceSpecV0 (#767)
Make PUT /instance/state 503 when waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`
leftwo added a commit that referenced this pull request Oct 30, 2024
Crucible changes
Add test and fix for replay race condition (#1519)
Fix clippy warnings (#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510)
Track both jobs and bytes in each IO state (#1507)
Fix new `rustc` and `clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411)
Turn off test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489)
Expand summary and add documentation references to the README. (#1486)
Remove `GuestWorkId` (2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799)
lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795)
add marker trait to help check safety of guest memory reads (#794)
clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782)
PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve (#780)
PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777)
PciPath to Bdf conversion is infallible; prove it and refactor (#774)
instance spec rework: flatten InstanceSpecV0 (#767)
Make PUT /instance/state 503 when waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`

---------

Co-authored-by: Alan Hanson <[email protected]>