Add resource limits #4605

Merged: 53 commits merged into main from add-resource-limits on Dec 13, 2023

Conversation

@zephraph (Contributor) commented Dec 4, 2023

This PR aims to introduce quotas as a concept in Nexus, allowing operators to enforce virtual resource limits at the silo level. The initial implementation is limited to checks during instance start, disk creation, and snapshot creation. We will not be doing advanced quota recalculation as system resources change, and we will not yet enforce intelligent quota caps where the sum of all quotas must be less than the theoretically available system virtual resources.

The implementation of this functionality is shaped by RFD-427 but some desired functionality will be deferred given time/complexity constraints.

Longer term I believe the shape of quotas, and perhaps even their relationship to silos, may change. This PR implements a simplified version that closely matches how the virtual resource provisioning tables are already built out. I know there's some oddness around the shape of the quotas table, with it not having its own ID and otherwise being mildly divergent from other resources, but this was largely to keep the door open for migrating to another solution and to avoid overcomplicating the initial implementation.
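
For orientation, here is a minimal sketch of the quota row shape implied by the description above and by the migration excerpt reviewed later in this thread. The field names come from that excerpt; everything else (types, visibility) is an assumption, not the PR's actual model code:

```rust
use uuid::Uuid;

// Sketch only: the quota row is keyed on the silo itself rather than
// carrying its own ID, which is the "oddness" mentioned above.
pub struct SiloQuotas {
    pub silo_id: Uuid,      // primary key; there is no separate quota ID
    pub cpus: i64,          // virtual CPU threads the silo may provision
    pub memory_bytes: i64,  // virtual memory the silo may provision
    pub storage_bytes: i64, // disk + snapshot storage the silo may provision
}
```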

TODO

  • Add quota creation as a step of silo creation
  • Add initialization checks in CTEs for instance create, etc., to only proceed when the quota is not exceeded (see the sketch after this list)
  • Wire up CTE sentinels in upstream callsites
  • Add backfill migration for existing customers
  • Add tests for quota enforcement
  • Delete the quotas when the silo is deleted
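
As referenced in the CTE items above, here is a minimal sketch (plain Rust rather than the actual diesel CTE) of the check those sentinels encode, reusing the SiloQuotas shape sketched earlier. The provisioned/requested types and the function name are assumptions:

```rust
// Hypothetical request/usage totals; in the real CTE these come from the
// virtual resource provisioning tables.
pub struct ResourceTotals {
    pub cpus: i64,
    pub memory_bytes: i64,
    pub storage_bytes: i64,
}

/// A provision proceeds only if current usage plus the request stays within
/// the silo's quota; returns the first resource that would be exceeded.
pub fn quota_violation(
    quota: &SiloQuotas,
    provisioned: &ResourceTotals,
    requested: &ResourceTotals,
) -> Option<&'static str> {
    if provisioned.cpus + requested.cpus > quota.cpus {
        Some("cpus")
    } else if provisioned.memory_bytes + requested.memory_bytes > quota.memory_bytes {
        Some("memory")
    } else if provisioned.storage_bytes + requested.storage_bytes > quota.storage_bytes {
        Some("storage")
    } else {
        None
    }
}
```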

In RFD-427 we specify that for the initial implementation we want to require that all silos have a quota. In considering the API and implementation more carefully, I decided that, at least temporarily, we should drop create/delete, given that deletions should never happen and creations should only happen when the silo is created (or via some internal process that creates quotas for pre-existing silos).

There will be a follow-up PR to add capacity/utilization to the API at both the silo and system levels. Given that, and the general lack of actionability of quotas to silo users, I've just dropped the quota view from the API.
@zephraph force-pushed the add-resource-limits branch from e34542b to 96877b2 on December 6, 2023 01:48
Comment on lines 49 to 50
return external::Error::InvalidRequest { message: "Insufficient Capacity: Not enough CPUs to complete request. Either stop unused instances to free up resources or contact the rack operator to request a capacity increase.".to_string() }
}
@zephraph (Contributor, Author) Dec 6, 2023

Eventually it would be nice to use @sunshowers's InsufficientCapacity error from #4573

@zephraph (Contributor, Author):

On second thought, maybe not. I think we should distinguish between hitting up against human set capacity limits vs hardware limits.

@sunshowers (Contributor) Dec 9, 2023

@zephraph and I chatted about this, and this is what I suggested post-#4573:

  • Have the InsufficientCapacity variant carry around an enum, e.g. InsufficientCapacityKind::UserQuota and ::System, and have the insufficient_capacity constructor accept that as an argument (forcing callers to specify which InsufficientCapacityKind variant they want); a rough sketch follows this list
  • In the message, include the kind: "Insufficient capacity (user quota)" and "... (system)"
  • Consider returning different HTTP error codes for each variant.
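
A rough sketch of the shape being suggested; the names and fields here are assumptions, not the final post-#4573 API:

```rust
// Which kind of limit the request ran into.
pub enum InsufficientCapacityKind {
    /// The silo's operator-set quota would be exceeded.
    UserQuota,
    /// The underlying hardware cannot satisfy the request.
    System,
}

pub enum Error {
    InsufficientCapacity { kind: InsufficientCapacityKind, message: String },
    // ... other variants elided
}

impl Error {
    // The constructor takes the kind, so callers must say which limit was
    // hit, and the kind shows up in the message.
    pub fn insufficient_capacity(kind: InsufficientCapacityKind, detail: &str) -> Error {
        let label = match kind {
            InsufficientCapacityKind::UserQuota => "user quota",
            InsufficientCapacityKind::System => "system",
        };
        Error::InsufficientCapacity {
            kind,
            message: format!("Insufficient capacity ({label}): {detail}"),
        }
    }
}
```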

cc @askfongjojo who I think has some thoughts about this.

@askfongjojo:

@sunshowers @zephraph: Thanks for looping me in. Once silo quotas are fully in place, users should rarely hit the System capacity error, assuming that the sum of all silo quotas does not exceed the total usable capacity of the rack. But I can think of two situations where the System capacity error would be hit before the UserQuota one:

  1. Some sleds are in maintenance or taken out of the provisioning pool
  2. All sleds are relatively full and the user is trying to provision a very large VM.

As such, perhaps the UserQuota error should be an HTTP 400 (the end user needs to free up some capacity themselves) whereas the System one should be a 507 (the operator needs to migrate some instances, check the sled status, force some silos to reduce their usage temporarily, etc.)?
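
Purely as an illustration of the 400/507 split suggested here, continuing the InsufficientCapacityKind sketch above (the comments below push back on using different status codes, so this is not what the PR does):

```rust
use http::StatusCode;

fn status_for(kind: &InsufficientCapacityKind) -> StatusCode {
    match kind {
        // The end user can free capacity within their own silo.
        InsufficientCapacityKind::UserQuota => StatusCode::BAD_REQUEST,
        // Needs operator action: sleds out of the pool, or the rack is nearly full.
        InsufficientCapacityKind::System => StatusCode::INSUFFICIENT_STORAGE, // 507
    }
}
```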

Contributor:

Useful distinction. I think we should definitely make the difference clear in the error code. I don't know if they should have different status codes, though. I can imagine a situation where the system quota violation is also user-correctable by freeing up enough stuff. Or, like you say, if the instance they're trying to create is simply very large.

@zephraph (Contributor, Author):

For now I'm just using the InsufficientCapacity error wholesale. This isn't the ideal state; it would be good to follow up with a better error, but it's sufficient for the time being.

@zephraph (Contributor, Author):

I created an issue to denote the need for better error messages: #4680

silo_user_list GET /v1/system/users
silo_user_view GET /v1/system/users/{user_id}
silo_view GET /v1/system/silos/{silo}
system_quotas_list GET /v1/system/silo-quotas
@zephraph (Contributor, Author):

I renamed this endpoint from /v1/system/quotas to /v1/system/silo-quotas because it's going under the silos tag now (per @ahl's suggestion) and it seemed to fit better. I will likely delete this endpoint in the capacity and utilization branch in favor of an endpoint that returns both quotas and provisioned counts.

@smklein self-assigned this Dec 12, 2023
Comment on lines 20 to 25
-- ~70% of 128 threads leaving 30% for internal use
90 AS cpus,
-- 708 GiB (~70% of memory leaving 30% for internal use)
760209211392 AS memory_bytes,
-- 850 GiB (total storage / 3.5)
912680550400 AS storage_bytes
@askfongjojo commented Dec 12, 2023

There are ongoing discussions about abandoning the use of "assumed" usable capacity figures. But if we still want to use them, these numbers should be multiplied by the number of sleds and physical disks (excluding the M2s). So a half-rack will provide 90*16 = 1440 threads, and so on.

@zephraph (Contributor, Author):

This is replaced with an arbitrarily high limit, significantly more than can be provisioned on even a full rack.

@smklein (Collaborator) left a comment

This looks great, thank you for taking on the heroic effort of tackling this. The CTE and DB migration code make sense to me -- this cut all the way across Nexus, and looks great.

A few comments below, but the only thing I'd consider a "blocker" is the from_sled_count function, which is IMO kinda misleading. Everything else is just nits!

Resolved review threads (outdated):
  • nexus/types/src/external_api/params.rs (3 threads)
  • schema/crdb/20.0.0/up02.sql
  • nexus/tests/integration_tests/quotas.rs (3 threads)
@smklein removed their assignment Dec 12, 2023
@zephraph enabled auto-merge (squash) December 13, 2023 01:15
@david-crespo (Contributor) left a comment

Unbelievable work. The core diesel bit is wild. Had a couple of comments, nothing important.

storage: silo_quotas.storage.into(),
}
}
}
Contributor:

Why would you need to go from view to model? Is this used?

-- Adds quotas for any existing silos without them.
-- The selected quotas are based on the resources of a half rack
-- with 30% CPU and memory reserved for internal use and a 3.5x tax
-- on storage for replication, etc.
Contributor:

Comment no longer accurate I think 😂

.transaction_async(|conn| async move {
diesel::insert_into(quotas_dsl::silo_quotas)
.values(SiloQuotas::arbitrarily_high_default(
DEFAULT_SILO.id(),
Contributor:

Could use a comment on the (possibly surprising in the future) fact that we do this for the default silo but not the internal silo.

@zephraph merged commit 877a886 into main Dec 13, 2023
21 checks passed
@zephraph deleted the add-resource-limits branch December 13, 2023 06:19
zephraph added a commit that referenced this pull request Dec 15, 2023
This PR is a follow-up to #4605; it adds views into capacity and utilization at both the silo and system levels.

API: 
|op|method|url|
|--|--|--|
|silo_utilization_list|GET|/v1/system/utilization/silos|
|silo_utilization_view|GET|/v1/system/utilization/silos/{silo}|
|utilization_view|GET|/v1/utilization|

I'm not entirely satisfied w/ the silo utilization endpoints. They could
be this instead:

|op|method|url|
|--|--|--|
|silo_utilization_list|GET|/v1/system/silos-utilization|
|silo_utilization_view|GET|/v1/system/silos/{silo}/utilization|

Also take special note of the views

```rust
// For the eyes of end users
/// View of the current silo's resource utilization and capacity
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct Utilization {
    /// Accounts for resources allocated to running instances or storage allocated via disks or snapshots.
    /// Note that CPU and memory resources associated with stopped instances are not counted here,
    /// whereas associated disks will still be counted.
    pub provisioned: VirtualResourceCounts,
    /// The total amount of resources that can be provisioned in this silo
    /// Actions that would exceed this limit will fail
    pub capacity: VirtualResourceCounts,
}

// For the eyes of an operator
/// View of a silo's resource utilization and capacity
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct SiloUtilization {
    pub silo_id: Uuid,
    pub silo_name: Name,
    /// Accounts for resources allocated in the silo, like CPU or memory for running instances and storage for disks and snapshots.
    /// Note that CPU and memory resources associated with stopped instances are not counted here.
    pub provisioned: VirtualResourceCounts,
    /// Accounts for the total amount of resources reserved for silos via their quotas
    pub allocated: VirtualResourceCounts,
}
```

For users in the silo I use `provisioned` and `capacity` as the language; their `capacity` is represented by the quota set by an operator. For the operator, `provisioned` is the same, but `allocated` is used to denote the amount of resources allotted via quotas.
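
A rough illustration (not code from the PR) of how the two views line up: the end user's `capacity` and the operator's `allocated` are both the silo's quota, and `provisioned` is the same counter in both. The function and parameter names here are assumptions:

```rust
fn user_view(provisioned: VirtualResourceCounts, quota: VirtualResourceCounts) -> Utilization {
    // What a silo user sees: usage against the quota set for their silo.
    Utilization { provisioned, capacity: quota }
}

fn operator_view(
    silo_id: Uuid,
    silo_name: Name,
    provisioned: VirtualResourceCounts,
    quota: VirtualResourceCounts,
) -> SiloUtilization {
    // What an operator sees per silo: the same usage, with the quota
    // reported as the amount allocated to that silo.
    SiloUtilization { silo_id, silo_name, provisioned, allocated: quota }
}
```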

---

Note: I had planned to add a full system utilization endpoint to this PR, but that would increase the scope. Instead, we will ship that API as part of the next release. We can calculate some version of the full system utilization on the client by listing all the silos and their utilization.

---------

Co-authored-by: Sean Klein <[email protected]>