Allocate Crucible regions randomly across zpools (#3650)
Nexus currently allocates Crucible regions on the least-used datasets, which
leads to repeated failures (see #3416). This change introduces the concept of
region allocation strategies at the database layer. It replaces the
least-used-dataset approach with a "random" strategy that selects randomly
from datasets with enough capacity for the requested region. We can later
expand this to support multiple configurable allocation strategies.

The random strategy picks 3 distinct datasets from zpools with enough space
to hold a copy of the region being allocated. Datasets are shuffled by
sorting on the md5 of a number appended to the dataset UUID. The number can
be specified as part of the allocation strategy to get a deterministic
allocation, mainly for test purposes; when unspecified, as in production, the
current time in nanoseconds is used. Because the md5 function has a uniformly
random output distribution, sorting on it provides a random shuffling of the
datasets while allowing more control than simply using `RANDOM()`.
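The shuffle idea can be sketched in Rust. This is an illustrative model only
(the commit performs the hash and sort inside the CockroachDB query), and it
substitutes the standard library's `DefaultHasher` for md5 so the sketch stays
dependency-free; the dataset names are hypothetical stand-ins for dataset UUIDs:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sort datasets by hash(seed || id). With a uniformly distributed hash this
// yields a pseudo-random order; a fixed seed makes the order reproducible,
// which is what the allocation strategy's optional seed is for.
fn shuffle_datasets(datasets: &[&str], seed: u64) -> Vec<String> {
    let mut keyed: Vec<(u64, String)> = datasets
        .iter()
        .map(|id| {
            // The real query computes md5(seed_bytes || uuid_bytes) in SQL;
            // DefaultHasher is only a stand-in here.
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            id.hash(&mut h);
            (h.finish(), id.to_string())
        })
        .collect();
    keyed.sort();
    keyed.into_iter().map(|(_, id)| id).collect()
}

fn main() {
    let datasets = ["ds-a", "ds-b", "ds-c", "ds-d"];
    // Same seed, same order: deterministic, as used for tests.
    assert_eq!(
        shuffle_datasets(&datasets, 42),
        shuffle_datasets(&datasets, 42)
    );
    // The allocation would then take the first 3 entries of the shuffle.
    println!("{:?}", &shuffle_datasets(&datasets, 42)[..3]);
}
```

In production the seed is the current time in nanoseconds, so each allocation
gets a fresh shuffle.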

At present, allocation selects 3 distinct datasets from zpools that have enough space for the region. Since there is currently only one Crucible dataset per zpool, this also selects 3 distinct zpools. If a future change to the rack adds additional Crucible datasets to a zpool, the code may select multiple datasets on the same zpool; it will detect this case and return an error rather than perform the allocation. A future change will improve the strategy to pick from 3 distinct sleds, eliminating this problem in the process, but that is not part of this commit.

We will plumb the allocation strategy through more parts of Nexus when moving
to a 3-sled policy, so that the policy can be relaxed to a 1-sled requirement
for development and testing.

Testing whether the allocation distribution is truly uniform is difficult to
do reproducibly in CI. I made some attempts at statistical analysis, but a
fully deterministic region allocation would require allocating all the
dataset UUIDs deterministically, which would mean pulling in a direct
dependency on the chacha crate and hooking that up. Analysis on anything
other than perfectly deterministic data will eventually produce false
failures given enough CI runs; that is simply the nature of measuring whether
data is random. Additionally, a simple chi-squared test isn't quite
appropriate here: the 3 dataset selections for a single region are dependent
on each other, because each dataset can be chosen only once.

I ran 3 sets of 3000 region allocations, each resulting in 9000
dataset selections across 27 datasets. I got these distributions,
counting how many times each dataset was selected.

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.
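As a rough sanity check (not part of the commit), the chi-squared statistic
for the first run can be computed against the uniform expectation of
9000 / 27 ≈ 333.3 selections per dataset. Per the caveat above, the 3
selections per region are not independent, so this is indicative rather than
a proper test:

```rust
// Chi-squared statistic: sum over bins of (observed - expected)^2 / expected,
// with the uniform expectation of total / bins per dataset.
fn chi_squared(observed: &[u64]) -> f64 {
    let total: u64 = observed.iter().sum();
    let expected = total as f64 / observed.len() as f64;
    observed
        .iter()
        .map(|&o| {
            let d = o as f64 - expected;
            d * d / expected
        })
        .sum()
}

fn main() {
    // First distribution from the commit message.
    let run1: [u64; 27] = [
        351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335,
        333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319,
    ];
    assert_eq!(run1.iter().sum::<u64>(), 9000);
    // A value well below ~38.9 (the 5% critical value for 26 degrees of
    // freedom) is consistent with a uniform distribution.
    println!("chi-squared = {:.1}", chi_squared(&run1));
}
```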
faithanalog authored Jul 17, 2023
1 parent fb2c08d commit d986cdc
Showing 10 changed files with 477 additions and 344 deletions.
46 changes: 25 additions & 21 deletions nexus/db-model/src/queries/region_allocation.rs
@@ -47,13 +47,6 @@ table! {
}
}

table! {
candidate_zpools {
id -> Uuid,
total_size -> Int8,
}
}

table! {
candidate_regions {
id -> Uuid,
@@ -69,13 +62,6 @@ table! {
}
}

table! {
zpool_size_delta (pool_id) {
pool_id -> Uuid,
size_used_delta -> Numeric,
}
}

table! {
proposed_dataset_changes {
id -> Uuid,
@@ -92,8 +78,8 @@ table! {
}

table! {
proposed_datasets_fit (fits) {
fits -> Bool,
candidate_zpools (pool_id) {
pool_id -> Uuid
}
}

@@ -136,8 +122,6 @@ table! {
}
}

diesel::allow_tables_to_appear_in_same_query!(candidate_datasets, zpool,);

diesel::allow_tables_to_appear_in_same_query!(
proposed_dataset_changes,
dataset,
@@ -150,12 +134,9 @@ diesel::allow_tables_to_appear_in_same_query!(
zpool,
);

diesel::allow_tables_to_appear_in_same_query!(candidate_zpools, dataset,);

diesel::allow_tables_to_appear_in_same_query!(
old_zpool_usage,
zpool,
zpool_size_delta,
proposed_dataset_changes,
);

@@ -165,3 +146,26 @@ diesel::allow_tables_to_appear_in_same_query!(
inserted_regions,
updated_datasets,
);

diesel::allow_tables_to_appear_in_same_query!(candidate_zpools, dataset,);
diesel::allow_tables_to_appear_in_same_query!(candidate_zpools, zpool,);

// == Needed for random region allocation ==

pub mod cockroach_md5 {
pub mod functions {
use diesel::sql_types::*;
diesel::sql_function!(fn md5(x: Bytea) -> Bytea);
}

pub mod helper_types {
pub type Md5<Expr> = super::functions::md5::HelperType<Expr>;
}

pub mod dsl {
pub use super::functions::*;
pub use super::helper_types::*;
}
}

// == End random region allocation dependencies ==
19 changes: 11 additions & 8 deletions nexus/db-model/src/region.rs
@@ -33,6 +33,9 @@ pub struct Region {
volume_id: Uuid,

block_size: ByteCount,

// These are i64 only so that we can derive a diesel table from them. We
// never expect them to be negative.
blocks_per_extent: i64,
extent_count: i64,
}
@@ -42,16 +45,16 @@ impl Region {
dataset_id: Uuid,
volume_id: Uuid,
block_size: ByteCount,
blocks_per_extent: i64,
extent_count: i64,
blocks_per_extent: u64,
extent_count: u64,
) -> Self {
Self {
identity: RegionIdentity::new(Uuid::new_v4()),
dataset_id,
volume_id,
block_size,
blocks_per_extent,
extent_count,
blocks_per_extent: blocks_per_extent as i64,
extent_count: extent_count as i64,
}
}

@@ -64,11 +67,11 @@ impl Region {
pub fn block_size(&self) -> external::ByteCount {
self.block_size.0
}
pub fn blocks_per_extent(&self) -> i64 {
self.blocks_per_extent
pub fn blocks_per_extent(&self) -> u64 {
self.blocks_per_extent as u64
}
pub fn extent_count(&self) -> i64 {
self.extent_count
pub fn extent_count(&self) -> u64 {
self.extent_count as u64
}
pub fn encrypted(&self) -> bool {
// Per RFD 29, data is always encrypted at rest, and support for
62 changes: 62 additions & 0 deletions nexus/db-queries/src/db/cast_uuid_as_bytea.rs
@@ -0,0 +1,62 @@
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Cast UUID to BYTES
use diesel::expression::ValidGrouping;
use diesel::pg::Pg;
use diesel::query_builder::AstPass;
use diesel::query_builder::QueryFragment;
use diesel::query_builder::QueryId;
use diesel::Expression;
use diesel::SelectableExpression;

/// Wraps an expression which evaluates to a Uuid and casts it to a Bytea.
/// It's that simple!
#[derive(ValidGrouping, QueryId)]
pub struct CastUuidToBytea<E> {
expression: E,
}

impl<E> CastUuidToBytea<E>
where
E: Expression<SqlType = diesel::sql_types::Uuid>,
{
pub const fn new(expression: E) -> Self {
Self { expression }
}
}

impl<E> Expression for CastUuidToBytea<E>
where
E: Expression,
{
type SqlType = diesel::sql_types::Bytea;
}

impl<E, QS> diesel::AppearsOnTable<QS> for CastUuidToBytea<E> where
E: diesel::AppearsOnTable<QS>
{
}

impl<E, T> SelectableExpression<T> for CastUuidToBytea<E> where
E: SelectableExpression<T>
{
}

impl<E> QueryFragment<Pg> for CastUuidToBytea<E>
where
E: QueryFragment<Pg>,
{
fn walk_ast<'a>(
&'a self,
mut out: AstPass<'_, 'a, Pg>,
) -> diesel::QueryResult<()> {
out.push_sql("CAST(");
self.expression.walk_ast(out.reborrow())?;
out.push_sql(" as BYTEA)");

Ok(())
}
}