Allocate Crucible regions randomly across zpools #3650

Merged: 4 commits into main, Jul 17, 2023

Conversation

faithanalog (Contributor):

Currently, Nexus allocates Crucible regions on the least-used datasets. This leads to repeated failures (see #3416). This change introduces the concept of region allocation strategies at the database layer. Currently only one strategy is used: the random strategy. However, this could be expanded.

The random strategy picks 3 distinct datasets from zpools with enough space to hold a copy of the region being allocated. Datasets are shuffled using the MD5 hash of the dataset UUID appended to some value. This value can be specified to get a deterministic allocation, mainly for test purposes, but in prod it simply uses the current time in nanoseconds. Because the MD5 function has a uniformly distributed output, sorting on this hash yields a random shuffling of the datasets while allowing more control than simply using `RANDOM()`.
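As a rough illustration (not the PR's actual query), the shuffle could be written in raw SQL like this, where `<seed>` stands for the caller-supplied value or the time in nanoseconds, and the `dataset` table and column names follow the diesel code quoted later in this thread:

```sql
-- Sketch only, not the PR's actual query. Hashing (uuid || seed) yields a
-- deterministic, uniformly distributed sort key, i.e. a seeded shuffle.
SELECT id, pool_id
  FROM dataset
 WHERE time_deleted IS NULL
   AND size_used IS NOT NULL
 ORDER BY md5(CAST(id AS STRING) || '<seed>')
 LIMIT 3;
```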

Due to SQL limitations, to the best of my knowledge, we cannot make the selection both random and distinct on the zpool or sled IDs, so an allocation may select multiple datasets on the same sled or zpool. But we can detect that this has occurred and fail the query. I have included some code that retries a few times if two datasets on the same zpool are selected, since a fresh shuffle is likely to produce a valid selection. Note that in production we currently never have two Crucible datasets on the same zpool, but it was raised as a future possibility.
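A sketch of that detection step (illustrative; `candidate_datasets` is a stand-in for the three rows produced by the shuffle above):

```sql
-- Sketch: if the three picks share a zpool, the distinct-pool count is
-- less than 3; the caller fails the query and retries with a new seed.
SELECT count(DISTINCT pool_id) = 3 AS picks_are_valid
  FROM candidate_datasets;
```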

I have not done anything to ensure we are spread across 3 sleds. We could attempt to implement that with the same retry approach, but it already feels a bit hacky, and it would feel more so if we relied on it for that as well. Regardless of how we solve the problem of distributing across 3 distinct sleds, we should plumb the allocation strategy through more parts of Nexus when moving to a 3-sled policy, so that we can relax it to a 1-sled requirement for development/testing.

Testing whether the allocation distribution is truly uniform is difficult to do in a reproducible manner in CI. I made some attempts at statistical analysis, but to get a fully deterministic region allocation we would need to allocate all the dataset UUIDs deterministically, which would require pulling in a direct dependency on the chacha crate and hooking that up. Doing analysis on anything other than perfectly deterministic data will eventually produce false failures given enough CI runs; that's just the nature of measuring whether data is random. Additionally, a simple chi-squared analysis isn't quite appropriate here, since the 3 dataset selections for a single region affect each other due to the requirement for distinct zpools.

ANYWAYS. I ran 3 sets of 3000 region allocations, each resulting in 9000 dataset selections across 27 datasets (so roughly 333 selections per dataset expected). I got these distributions, counting how many times each dataset was selected:

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.

nexus/db-model/src/queries/region_allocation.rs (outdated; resolved)
```rust
let volume1_id = Uuid::new_v4();
```
Collaborator:
Just volume_id? I don't see a non-1 volume

nexus/db-queries/src/db/datastore/mod.rs (outdated; resolved)
Comment on lines 138 to 141:

```rust
// can_be_retried(error) will return true. It is highly likely that if
// we simply retry the query with a different shuffle-ordering, we will
// get a valid result. However, we should not retry forever, in the event
// that something else is going on, so we limit it.
```
Collaborator:
I have a pretty strong aversion to this approach. Constructing a query that we know might return incorrect results, and hoping it randomly arrives at a valid selection, is a recipe for significantly increasing the rate of traffic to the DB to perform an allocation.

I particularly have an aversion to this approach because I think it's avoidable:

(Compared with main)

```diff
--- a/nexus/db-queries/src/db/queries/region_allocation.rs
+++ b/nexus/db-queries/src/db/queries/region_allocation.rs
@@ -88,6 +88,7 @@ struct CandidateDatasets {
 impl CandidateDatasets {
     fn new() -> Self {
         use crate::db::schema::dataset::dsl as dataset_dsl;
+        use crate::db::schema::zpool::dsl as zpool_dsl;
 
         Self {
             query: Box::new(
@@ -95,7 +96,12 @@ impl CandidateDatasets {
                     .filter(dataset_dsl::time_deleted.is_null())
                     .filter(dataset_dsl::size_used.is_not_null())
                     .filter(dataset_dsl::kind.eq(DatasetKind::Crucible))
-                    .order(dataset_dsl::size_used.asc())
+                    .inner_join(
+                        zpool_dsl::zpool
+                            .on(zpool_dsl::id.eq(dataset_dsl::pool_id))
+                    )
+                    .distinct_on(zpool_dsl::sled_id)
+                    .order((zpool_dsl::sled_id.asc(), dataset_dsl::size_used.asc()))
                     .limit(REGION_REDUNDANCY_THRESHOLD.try_into().unwrap())
                     .select((dataset_dsl::id, dataset_dsl::pool_id)),
             ),
diff --git a/smf/sled-agent/non-gimlet/config-rss.toml b/smf/sled-agent/non-gimlet/config-rss.toml
index bd02ec95e..0851aab5a 100644
```

Using roughly this raw SQL: https://www.db-fiddle.com/f/kKyiEESGGx5GUaviTAzkuB/1

This is sufficient to select datasets that are unique to each sled -- it basically makes candidate datasets "give me the best dataset each sled could offer, by size".

I do think we would need one more subquery to then order again by space, since the DISTINCT ON clause forces us to sort by sled UUID first, when we really want to sort by least-used-size first.
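Roughly, that shape might look like the following raw SQL (a sketch only, using the `dataset` and `zpool` table and column names from the diff above, untested against the real schema):

```sql
-- Sketch: DISTINCT ON (sled_id) keeps the least-used dataset per sled;
-- the outer query then re-orders those per-sled winners by space used.
SELECT id, pool_id
  FROM (SELECT DISTINCT ON (zpool.sled_id)
               dataset.id, dataset.pool_id, dataset.size_used
          FROM dataset
          JOIN zpool ON zpool.id = dataset.pool_id
         ORDER BY zpool.sled_id ASC, dataset.size_used ASC) AS best_per_sled
 ORDER BY size_used ASC
 LIMIT 3;
```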

@faithanalog (Contributor, Author) commented on Jul 16, 2023:
> This is sufficient to select datasets that are unique to each sled -- it basically makes candidate datasets "give me the best dataset each sled could offer, by size".

Yes, but it also eliminates the random ordering, defeating the point of this PR. We can use `distinct_on`, but we can't do random ordering; we can do random ordering, but we can't use `distinct_on`. Best I can tell, it's one or the other. I would like to be proven wrong about this, because doing the distinction in the SQL would be a lot nicer.

Collaborator:

Following up from a discussion in Meet:

We should be able to do the following:

  1. Use `DISTINCT ON (sled)` with `ORDER BY sled, random()` to pick random (but valid!) allocation targets from each sled, and
  2. Use the output of the subquery from (1) to randomly select sleds to use for provisioning.

Basically: "Randomly pick a valid target from all sleds, then randomly pick one of those sleds."

However, we can do that as a follow-up to this PR.
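As a sketch, that two-step query could look something like this (illustrative, untested; same assumed table names as above):

```sql
-- Sketch of the two-step idea: one random valid dataset per sled, then a
-- random subset of those per-sled picks.
SELECT id, pool_id
  FROM (SELECT DISTINCT ON (zpool.sled_id)
               dataset.id, dataset.pool_id
          FROM dataset
          JOIN zpool ON zpool.id = dataset.pool_id
         ORDER BY zpool.sled_id, random()) AS one_per_sled
 ORDER BY random()
 LIMIT 3;
```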

@faithanalog (Contributor, Author) commented on Jul 17, 2023:

To add on: after talking with Sean, I'm convinced it may be possible to do that in the SQL, and we'll find out for sure once we try it. I'll be opening an issue to track picking across multiple sleds; it's important for reaching the level of redundancy we want. But it's potentially more complicated than this change, since we may need to wire up a way to allocate within just one sled as well to avoid breaking dev workflows, and we may find more tests that need to be updated. This PR is pretty important to get in sooner rather than later to avoid the error conditions it addresses, which is why we're keeping allocation across sleds as future work.

nexus/db-queries/src/db/queries/region_allocation.rs (outdated; resolved)
nexus/db-queries/src/db/datastore/region.rs (outdated; resolved)
nexus/db-queries/src/db/queries/region_allocation.rs (outdated; resolved)
@faithanalog faithanalog enabled auto-merge (squash) July 17, 2023 02:49
@faithanalog faithanalog merged commit d986cdc into main Jul 17, 2023
@faithanalog faithanalog deleted the artemis/random-region-allocation branch July 17, 2023 04:31
faithanalog added a commit that referenced this pull request Aug 9, 2023
PR #3650 introduced the Random region allocation strategy to allocate
regions randomly across the rack. This expands on that with the addition
of the RandomWithDistinctSleds region allocation strategy. This strategy
is the same, but requires the 3 Crucible regions to be allocated on 3
different sleds, to improve resiliency against a whole-sled failure.

The Random strategy still exists and does not require 3 distinct sleds.
This is useful in one-sled environments such as the integration tests
and lab setups. This PR adds the ability to configure the allocation
strategy in the Nexus PackageConfig toml. Anyone running in a one-sled
setup will need to configure that to one-sled mode (as is done for the
integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a
single zpool could be selected. That fix applies to both the old Random
strategy and the new RandomWithDistinctSleds strategy.

`smf/nexus/config-partial.toml` is configured for
RandomWithDistinctSleds, as that is what we want to use on prod.

As I mentioned, the integration tests are not using the distinct sleds
allocation strategy. I attempted to add 2 extra sleds to the simulated
environment but found that this broke more things than I had the
understanding to fix in this PR. It would be nice in the future for the
sim environment to have 3 sleds in it though, not just for this but for
anything else that might have different behaviors in a multi-sled setup.

For now, I have unit tests that verify the allocation behavior
works correctly with CockroachDB, and we can try it out on dogfood.
faithanalog added a commit that referenced this pull request Aug 18, 2023
faithanalog added a commit that referenced this pull request Sep 2, 2023
faithanalog added a commit that referenced this pull request Sep 17, 2023
faithanalog added a commit that referenced this pull request Sep 29, 2023
faithanalog added a commit that referenced this pull request Oct 3, 2023
PR #3650 introduced the Random region allocation strategy to allocate
regions randomly across the rack. This expands on that with the addition
of the RandomWithDistinctSleds region allocation strategy. This strategy
is the same, but requires the 3 Crucible regions to be allocated on 3
different sleds, to improve resiliency against a whole-sled failure.

The Random strategy still exists and does not require 3 distinct sleds.
This is useful in one-sled environments such as the integration tests
and lab setups.

This also fixes a shortcoming of #3650 whereby multiple datasets on a
single zpool could be selected. That fix applies to both the old Random
strategy and the new RandomWithDistinctSleds strategy.

For now, I have unit tests that verify the allocation behavior
works correctly with CockroachDB, and we can try it out on dogfood.

Adds the `-r` / `--rack-topology` command-line argument to `omicron-package target create`. Use this to specify whether you are packaging for a single-sled or multi-sled environment. In single-sled environments, the requirement for 3 distinct sleds is removed.

Fixes #3702

---------

Co-authored-by: iliana etaoin <[email protected]>
Successfully merging this pull request may close these issues.

region allocation can consistently fail when there is one non-functional crucible