[nexus] Silo IP pools schema change #3981

david-crespo · 2023-08-28T19:41:18Z

This is quite small, but I'm pretty confident it's a sufficient basis for the rest of the work in #3926, so we might as well review and merge already.

A couple of minor pain points I ran into while doing this:

Needed guidance on version number (3.0.0 it is) — it will be more obvious when there are more, but I added a line to the readme anyway
Diff output for dbinit_equals_sum_of_all_up is terrible — thousands of lines with only a few relevant ones. It turns out the column order matters, so if you add columns to a table in a migration, you have to add them to the end of the create table in dbinit.sql.

smklein · 2023-08-28T20:30:13Z

schema/crdb/dbinit.sql

+     * Fields representating association with a silo or project. silo_id must be
+     * non-null if project_id is non-null. When project_id is non-null, silo_id
+     * will (naturally) be the ID of the project's silo. Both must be null if
+     * internal is true, i.e., internal IP pools must be fleet-level pools.


Should we add a CONSTRAINT here validating some of these conditions?

Something like:

CONSTRAINT silo_and_project_id_match_nullability CHECK ( (silo_id IS NULL) = (project_id IS NULL) ), CONSTRAINT internal_ips_have_null_silo_and_project CHECK ( (NOT INTERNAL) OR ((silo_id IS NULL) AND (project_id IS NULL)) ),

Definitely — I didn't know you could do that.

Sorry didn't see this yesterday, was there a specific reason internal pools couldn't be silo or project scoped? We already have a separate silo for control plane services and create project, vpc, etc in it during RSS. We could totally just scope the internal ip pool to that silo/project. And in fact, with that we could even just drop the internal flag on ip pools.

Great idea. I'll try to make that work.

smklein

The mechanics of updating the DB look correct to me 👍

ahl

looks good to me modulo @smklein's suggestion

@smklein

Closes #3926 A lot going on here. I will update this description as I finish things out. ## Important background * #3981 added `silo_id` and `project_id` columns to the IP pools table in the hope that that would be sufficient for the rest of the work schema-wise, but it turns out there was more needed. * Before this change, the concept of a **default** IP pool was implemented through the `name` on the IP pool, i.e., the IP pool named `default` is the one used by default for instance IP allocation if no pool name is specified as part of the instance create POST * Before this change the concept of an **internal** IP pool was implemented through a boolean `internal` column on the `ip_pool` table ## IP pool selection logic There are two situations where we pick an IP pool to allocate IPs from. For an instance's **source NAT**, we always use the default pool. Before this change, with only fleet pools, this simply meant picking the one named `default` (analogous now to the one with `is_default == true`). With the possibility of silo pools, we now pick the most specific default available. That means that if there is a silo-scoped pool marked default _matching the current silo_, we use that. If not, we pick the fleet-level pool marked default, which should always exist (see possible todos at the bottom — we might want to take steps to guarantee this). For instance ephemeral IPs, the instance create POST body takes an optional pool name. If none is specified, we follow the same defaulting logic as above — the most-specific pool marked `is_default`. We are leaving pool names globally unique (as opposed to per-scope) which IMO makes the following lookup logic easy to understand and implement: given a pool name, look up the pool by name. (That part we were already going.) The difference now with scopes is that we need to make sure that the scope of the selected pool (assuming it exists) **does not conflict** with the current scope, i.e., the current silo. In this situation, we do not care about what's marked default, and we are not trying to get an exact match on scope. We just need to disallow an instance from using an IP pool reserved for a different silo. We can revisit this, but as implemented here you can, for example, specify a non-default pool scoped to fleet or silo (if one exists) even if there exists a default pool scoped to your silo. ## DB migrations on `ip_pool` table There are three migrations here based on guidance from @smklein based on [CRDB docs about limitations to online schema changes](https://www.cockroachlabs.com/docs/stable/online-schema-changes#limitations) and some conversations he had with them. It's possible they could be made into two. I don't think it can be done in one. * Add `is_default` column and a unique index ensuring there is only one default IP pool per "scope" (unique `silo_id`, including null as a distinct value) * Populate correct data in new columns * Populate `is_default = true` for any IP pools named `default` (there should be exactly one, but nothing depends on that) * `silo_id = INTERNAL_SILO_ID` for any IP pools marked `internal` (there should be exactly one, but nothing depends on that) * Drop the `internal` column ## Code changes - [x] Add [`similar-asserts`](https://crates.io/crates/similar-asserts) so we can get a usable diff when the schema migration tests fail. Without this you could get a 20k+ line diff with 4 relevant lines. - [x] Use `silo_id == INTERNAL_SILO_ID` everywhere we were previously looking at the `internal` flag (thanks @luqmana for the [suggestion](#3981 (comment))) - [x] Add `silo_id` and `default` to `IpPoolCreate` (create POST params) and plumb them down the create chain * Leave off `project_id` for now, we can add that later - [x] Fix index that is failing to prevent multiple `default` pools for a given scope (see comment #3985 (comment)) - [x] Source NAT IP allocation uses new defaulting logic to get the most specific available default IP Pool - [x] Instance ephemeral IP allocation uses that default logic if no pool name specified in the create POST, otherwise look up pool by specified name (but can only get pools matching its scope, i.e., its project, silo, or the whole fleet) ### Limitations that we might want to turn into to-dos * You can't update `default` on a pool, i.e., you can't make a pool default. You have to delete it and make a new one. This one isn't that hard — I would think of it like image promotion, where it's not a regular update pool, it's a special endpoint for making a pool default that can unset the current default if it's a different pool. * Project-scoped IP pools endpoints are fake — they just return all IP pools. They were made in anticipation of being able to make them real. I'm thinking we should remove them or make them work, but I don't think we have time to make them work. * Ensure somehow that there is always a fleet-level default pool to fall back to

@zephraph

Closes #2148 Closes #4002 Closes #4003 Closes #4006 ## Background #3985 (and followups #3998 and #4007) made it possible to associate an IP pool with a silo so that instances created in that silo would get their ephemeral IPs from said pool by default (i.e., without the user having to say anything other than "I want an ephemeral IP"). An IP pool associated with a silo was not accessible for ephemeral IP allocation from other silos — if a disallowed pool was specified by name at instance create time, the request would 404. However! That was the quick version, and the data model left much to be desired. The relation was modeled by adding a nullable `silo_id` and sort-of-not-really-nullable `is_default` column directly on the IP pool table, which has the following limitations (and there are probably more): * A given IP pool could only be associated with at most one silo, could not be shared * The concept of `default` was treated as a property of the pool itself, rather than a property of the _association_ with another resource, which is quite strange. Even if you could associate the pool with multiple silos, you could not have it be the default for one and not for the other * There is no way to create an IP pool without associating it with either the fleet or a silo * Extending this model to allow association at the project level would be inelegant — we'd have to add a `project_id` column (which I did in #3981 before removing it in #3985) More broadly (and vaguely), the idea of an IP pool "knowing" about silos or projects doesn't really make sense. Entities aren't really supposed to know about each other unless they have a parent-child relationship. ## Changes in this PR ### No such thing as fleet-scoped pool, only silo Thanks to @zephraph for encouraging me to make this change. It is dramatically easier to explain "link silo to IP pool" than it is to explain "link resource (fleet or silo) to IP pool". The way to recreate the behavior of a single default pool for the fleet is to simply associate a pool with all silos. Data migrations ensure that existing fleet-scoped pools will be associated with all silos. There can only be one default pool for a silo, so in the rare case where pool A is a fleet default and pool B is default on silo S, we associate both A and B with S, but only B is made silo default pool. ### API These endpoints are added. They're pretty self-explanatory. ``` ip_pool_silo_link POST /v1/system/ip-pools/{pool}/silos ip_pool_silo_list GET /v1/system/ip-pools/{pool}/silos ip_pool_silo_unlink DELETE /v1/system/ip-pools/{pool}/silos/{silo} ip_pool_silo_update PUT /v1/system/ip-pools/{pool}/silos/{silo} ``` The `silo_id` and `is_default` fields are removed from the `IpPool` response as they are now a property of the `IpPoolLink`, not the pool itself. I also fixed the silo-scoped IP pools list (`/v1/ip-pools`) and fetch (`/v1/ip-pools/{pool}`) endpoints, which a) did not actually filter for the current silo, allowing any user to fetch any pool, and b) took a spurious `project` query param that didn't do anything. ### DB The association between IP pools and fleet or silo (or eventually projects, but not here) is now modeled through a polymorphic join table called `ip_pool_resource`: ip_pool_id | resource_type | resource_id | is_default -- | -- | -- | -- 123 | silo | 23 | true 123 | silo | 4 | false ~~65~~ | ~~fleet~~ | ~~FLEET_ID~~ | ~~true~~ Now, instead of setting the association with a silo or fleet at IP pool create or update time, there are separate endpoints for adding and removing an association. A pool can be associated with any number of resources, but a unique index ensures that a given resource can only have one default pool. ### Default IP pool logic If an instance ephemeral IP or a floating IP is created **with a pool specified**, we simply use that pool if it exists and is linked to the user's silo. If an instance ephemeral IP or a floating IP is created **without a pool unspecified**, we look for a default pool for the current silo. If there is a pool linked with the current silo with `is_default=true`, use that. Otherwise, there is no default pool for the given scope and IP allocation will fail, which means the instance create or floating IP create request will fail. The difference introduced in this PR is that we do not fall back to fleet default if there is no silo default because we have removed the concept of a fleet-scoped pool. ### Tests and test helpers This is the source of a lot of noise in this PR. Because there can no longer be a fleet default pool, we can no longer rely on that for tests. The test setup was really confusing. We assumed a default IP pool existed, but we still had to populate it (add a range) if we had to do anything with it. Now, we don't assume it exists, we create it and add a range and associate it with a silo all in one helper. ## What do customers have to do when they upgrade? They should not _have_ to do anything at upgrade time. If they were relying on a single fleet default pool to automatically be used by new silos, when they create silos in the future they will have to manually associate each new silo with the desired pool. We are working on ways to make that easier or more automatic, but that's not in this change. It is less urgent because silo creation is an infrequent operation. If they are _not_ using the previously fleet default IP pool named `default` and do not want it to exist, they can simply delete any IP ranges it contains, unlink it from all silos and delete it. If they are not using it, there should not be any IPs allocated from it (which means they can delete it). --------- Co-authored-by: Justin Bennett <[email protected]>

david-crespo added 3 commits August 28, 2023 14:20

schema migration adding silo and project ID columns to ip pool table

600f96a

add to diesel schema and DB model struct

db17ab1

add line to readme about picking a version number

952bce6

david-crespo requested review from ahl and smklein August 28, 2023 19:41

comments on IpPool DB model

e81de31

david-crespo force-pushed the silo-ip-pools-schema branch from 9099bd2 to e81de31 Compare August 28, 2023 19:51

smklein reviewed Aug 28, 2023

View reviewed changes

smklein approved these changes Aug 28, 2023

View reviewed changes

ahl approved these changes Aug 28, 2023

View reviewed changes

add some constraints on internal and silo_id and project_id

cec576f

david-crespo force-pushed the silo-ip-pools-schema branch from 11d80a6 to cec576f Compare August 28, 2023 21:00

david-crespo enabled auto-merge (squash) August 28, 2023 23:19

david-crespo merged commit f5d8e09 into main Aug 29, 2023

david-crespo deleted the silo-ip-pools-schema branch August 29, 2023 00:05

david-crespo mentioned this pull request Aug 30, 2023

[nexus] Silo IP pools #3985

Merged

6 tasks

david-crespo mentioned this pull request Oct 11, 2023

IP pools data model and API rework #4261

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[nexus] Silo IP pools schema change #3981

[nexus] Silo IP pools schema change #3981

david-crespo commented Aug 28, 2023

smklein Aug 28, 2023

david-crespo Aug 28, 2023

luqmana Aug 29, 2023

david-crespo Aug 29, 2023

smklein left a comment

ahl left a comment

[nexus] Silo IP pools schema change #3981

[nexus] Silo IP pools schema change #3981

Conversation

david-crespo commented Aug 28, 2023

smklein Aug 28, 2023

Choose a reason for hiding this comment

david-crespo Aug 28, 2023

Choose a reason for hiding this comment

luqmana Aug 29, 2023

Choose a reason for hiding this comment

david-crespo Aug 29, 2023

Choose a reason for hiding this comment

smklein left a comment

Choose a reason for hiding this comment

ahl left a comment

Choose a reason for hiding this comment