Silo-scoped IP pools #3926
I have been doing some research. The work plan isn't complete, especially around allocation. I will be adding more as I figure it out.

## Work plan

This work has two broad parts:
### What we do during instance create

Right now we have a list of fully global IP pools, each with a list of child IP pool ranges. Each IP pool range has a parent pool ID and a start and end address. When an instance is created, it seems there are at least two external IP address allocations: one for the "source NAT" and one for the instance itself. These two functions are pretty similar. The only real difference:
The updated data model might be a little annoying, but the allocation story here is the complicated part. Or maybe it's not so bad — each of the two things takes a pool identifier. If we can simply determine the appropriate pool in the calling code and pass it in, we shouldn't need to change anything else. The two datastore functions take their pool parameter in stupidly different ways: one takes a pool ID and the other takes a pool name. If they both took an ID, they would be identical. And at that point, they do not need to be two functions at all — all they do is call different constructor methods on the same type.

I thought maybe I had understood the app code wrong because of this comment on the DB query (2022-06-30), which says the query should take the specified pool and pick an IP address from that pool:
But the `where` clause in the query itself says otherwise; the logic was added in this PR (2022-07-25). I have a to-do item to fix this comment.

## Questions
## Answered questions
@david-crespo: On the two "functional" questions,
No, we already settled on the assumption that IP addresses are rack resources that should be managed by fleet admins only (especially if they are the ones who deal with the team that owns corporate firewall configurations).
That was my original proposal in the linked customer ticket, but given the most recent discussion, we won't need something like that until we implement project-level IP pools.
Sorry, I meant to address this one earlier as well:
I think there is no need to change the current logic. The odds of running out of addresses may be lower once ephemeral IP addresses are released from stopped instances some day. By the way, these are not in your list of questions, but it sounds like they could use some feedback:
Yes
There are still use cases that can leverage non-default, fleet-wide pools (e.g. a special pool that has routes to another cloud). I think we can leave the functionality there. Relatedly, we probably want to fix the issue of requiring the user to set the pool name explicitly in API/CLI requests. When the user specifies
Yes, all helpful.
Talked through the plan with @ahl, and we were able to narrow the scope further in a couple of ways:
We also realized we might as well add the
This is quite small, but I'm pretty confident it's a sufficient basis for the rest of the work in #3926, so we might as well review and merge already. A couple of minor pain points I ran into while doing this:

* Needed guidance on version number (3.0.0 it is) — it will be more obvious when there are more, but I added a line to the readme anyway
* Diff output for [`dbinit_equals_sum_of_all_up`](https://github.com/oxidecomputer/omicron/blob/fb16870de8c0dba92d0868a984dd715749141b73/nexus/tests/integration_tests/schema.rs#L601) is terrible — thousands of lines with only a few relevant ones. It turns out the column order matters, so if you add columns to a table in a migration, you have to add them to the _end_ of the `create table` in `dbinit.sql`.
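To make the column-order point concrete, here is a minimal sketch. The table name `widget` and its columns are made up for illustration; only the pattern of "add in a migration, append at the end in `dbinit.sql`" is the point:

```sql
-- Hypothetical migration step: add the new column to an existing table.
ALTER TABLE omicron.public.widget
    ADD COLUMN IF NOT EXISTS silo_id UUID;

-- Matching change in dbinit.sql: the same column has to be appended as the
-- LAST column of the CREATE TABLE, because the schema-comparison test is
-- sensitive to column order.
CREATE TABLE IF NOT EXISTS omicron.public.widget (
    id UUID PRIMARY KEY,
    name STRING(63) NOT NULL,
    silo_id UUID                -- new column goes at the end
);
```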
Closes #3926

A lot going on here. I will update this description as I finish things out.

## Important background

* #3981 added `silo_id` and `project_id` columns to the IP pools table in the hope that that would be sufficient for the rest of the work schema-wise, but it turns out there was more needed.
* Before this change, the concept of a **default** IP pool was implemented through the `name` on the IP pool, i.e., the IP pool named `default` is the one used by default for instance IP allocation if no pool name is specified as part of the instance create POST.
* Before this change, the concept of an **internal** IP pool was implemented through a boolean `internal` column on the `ip_pool` table.

## IP pool selection logic

There are two situations where we pick an IP pool to allocate IPs from.

For an instance's **source NAT**, we always use the default pool. Before this change, with only fleet pools, this simply meant picking the one named `default` (analogous now to the one with `is_default == true`). With the possibility of silo pools, we now pick the most specific default available. That means that if there is a silo-scoped pool marked default _matching the current silo_, we use that. If not, we pick the fleet-level pool marked default, which should always exist (see possible to-dos at the bottom — we might want to take steps to guarantee this).

For instance ephemeral IPs, the instance create POST body takes an optional pool name. If none is specified, we follow the same defaulting logic as above — the most specific pool marked `is_default`. We are leaving pool names globally unique (as opposed to per-scope), which IMO makes the following lookup logic easy to understand and implement: given a pool name, look up the pool by name. (That part we were already doing.) The difference now with scopes is that we need to make sure that the scope of the selected pool (assuming it exists) **does not conflict** with the current scope, i.e., the current silo. In this situation, we do not care about what's marked default, and we are not trying to get an exact match on scope. We just need to disallow an instance from using an IP pool reserved for a different silo. We can revisit this, but as implemented here you can, for example, specify a non-default pool scoped to fleet or silo (if one exists) even if there exists a default pool scoped to your silo.

## DB migrations on `ip_pool` table

There are three migrations here, based on guidance from @smklein drawing on [CRDB docs about limitations to online schema changes](https://www.cockroachlabs.com/docs/stable/online-schema-changes#limitations) and some conversations he had with them. It's possible they could be made into two. I don't think it can be done in one.

* Add `is_default` column and a unique index ensuring there is only one default IP pool per "scope" (unique `silo_id`, including null as a distinct value)
* Populate correct data in new columns
  * Populate `is_default = true` for any IP pools named `default` (there should be exactly one, but nothing depends on that)
  * Set `silo_id = INTERNAL_SILO_ID` for any IP pools marked `internal` (there should be exactly one, but nothing depends on that)
* Drop the `internal` column

## Code changes

- [x] Add [`similar-asserts`](https://crates.io/crates/similar-asserts) so we can get a usable diff when the schema migration tests fail. Without this you could get a 20k+ line diff with 4 relevant lines.
- [x] Use `silo_id == INTERNAL_SILO_ID` everywhere we were previously looking at the `internal` flag (thanks @luqmana for the [suggestion](#3981 (comment)))
- [x] Add `silo_id` and `default` to `IpPoolCreate` (create POST params) and plumb them down the create chain
  * Leave off `project_id` for now; we can add that later
- [x] Fix the index that was failing to prevent multiple `default` pools for a given scope (see comment #3985 (comment))
- [x] Source NAT IP allocation uses the new defaulting logic to get the most specific available default IP pool
- [x] Instance ephemeral IP allocation uses that defaulting logic if no pool name is specified in the create POST; otherwise it looks up the pool by the specified name (but can only get pools matching its scope, i.e., its project, silo, or the whole fleet)

### Limitations that we might want to turn into to-dos

* You can't update `default` on a pool, i.e., you can't make a pool default. You have to delete it and make a new one. This one isn't that hard — I would think of it like image promotion, where it's not a regular pool update, it's a special endpoint for making a pool default that can unset the current default if it's a different pool.
* Project-scoped IP pools endpoints are fake — they just return all IP pools. They were made in anticipation of being able to make them real. I'm thinking we should remove them or make them work, but I don't think we have time to make them work.
* Ensure somehow that there is always a fleet-level default pool to fall back to.
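As a rough illustration of the two pieces described above (the per-scope default uniqueness and the "most specific default" selection), here is a SQL sketch. The table and column names (`ip_pool`, `silo_id`, `is_default`) come from the description, but the index names, placeholders, and exact predicates are illustrative assumptions, not the queries from this PR:

```sql
-- One default pool per silo scope (partial unique index over silo-scoped pools).
CREATE UNIQUE INDEX IF NOT EXISTS one_default_pool_per_silo
    ON omicron.public.ip_pool (silo_id)
    WHERE is_default = true AND silo_id IS NOT NULL;

-- At most one fleet-level default: NULL silo_id is treated as its own scope.
CREATE UNIQUE INDEX IF NOT EXISTS one_default_pool_for_fleet
    ON omicron.public.ip_pool (is_default)
    WHERE is_default = true AND silo_id IS NULL;

-- Defaulting logic: pick the most specific default visible to the current silo
-- ($1). A silo-scoped default wins; otherwise fall back to the fleet default.
SELECT id, name, silo_id
FROM omicron.public.ip_pool
WHERE is_default = true
  AND (silo_id = $1 OR silo_id IS NULL)
ORDER BY silo_id IS NULL ASC   -- silo-scoped (non-NULL) rows sort first
LIMIT 1;

-- Lookup by name for ephemeral IPs: defaults don't matter here; we only need
-- to reject pools reserved for a *different* silo.
SELECT id, name, silo_id
FROM omicron.public.ip_pool
WHERE name = $2
  AND (silo_id = $1 OR silo_id IS NULL);
```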
Details to be added. Discussion in https://github.com/oxidecomputer/customer-support/issues/18.
Currently we have fleet-scoped IP pools with CRUD ops. We want to be able to do the same thing for each silo, so that instances created in a given silo draw only from the IP pools for that silo.
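For concreteness, one way to picture that scoping is a nullable `silo_id` column on the pool table. This is a sketch only, under the assumption that the real table in `dbinit.sql` has more columns and may differ in detail:

```sql
CREATE TABLE IF NOT EXISTS ip_pool (
    id UUID PRIMARY KEY,
    name STRING(63) NOT NULL,
    -- NULL means fleet-scoped: the pool is available to every silo.
    -- A non-NULL value restricts the pool to instances created in that silo.
    silo_id UUID
);
```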