Allocate Crucible regions randomly across zpools #3650
Conversation
Currently Nexus allocates Crucible regions on the least-used datasets. This leads to repeated failures (see #3416). This change introduces the concept of region allocation strategies at the database layer. Currently only one strategy is used: the random strategy. However, this could be expanded.

The random strategy picks 3 distinct datasets from zpools with enough space to hold a copy of the region being allocated. Datasets are shuffled by sorting on the md5 of the dataset UUID appended to some value. This value can be specified to get a deterministic allocation, mainly for test purposes; in prod it is simply the current time in nanoseconds. Because md5 output is uniformly distributed, sorting on it yields a random shuffling of the datasets while allowing more control than simply using `RANDOM()`.

Due to SQL limitations, to the best of my knowledge we cannot make the selection both random and distinct on the zpool or sled IDs, so an allocation may select multiple datasets on the same sled or zpool. But we can detect when that has occurred and fail the query. I have included code that retries a few times if two datasets on the same zpool are selected, since a fresh shuffle is likely to produce a valid selection. Note that in production we never currently have two Crucible datasets on the same zpool, but it was raised as a future possibility.

I have not done anything to ensure we are spread across 3 sleds. We could attempt to implement that with the same retry approach, but it feels a bit hacky as-is, and more so if we relied on it for that. Regardless of how we solve distribution across 3 distinct sleds, we should plumb the allocation strategy through more parts of Nexus when moving to a 3-sled policy so that we can relax it to a 1-sled requirement for development/testing.
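The hash-based shuffle can be sketched outside SQL as follows (a minimal illustration, using Rust's `DefaultHasher` as a stand-in for md5; the real query computes the hash in the database):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Deterministic shuffle: sort dataset IDs by hash(id, seed), mirroring the
// query's ORDER BY md5(dataset_id || seed). DefaultHasher stands in for md5.
fn shuffled(ids: &[&str], seed: u64) -> Vec<String> {
    let mut out: Vec<String> = ids.iter().map(|s| s.to_string()).collect();
    out.sort_by_key(|id| {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        seed.hash(&mut h);
        h.finish()
    });
    out
}

fn main() {
    // Same seed => same order, regardless of input order (deterministic,
    // which is what the tests rely on).
    let a = shuffled(&["ds-a", "ds-b", "ds-c"], 42);
    let b = shuffled(&["ds-c", "ds-a", "ds-b"], 42);
    assert_eq!(a, b);
    // In prod, the seed would be the current time in nanoseconds instead.
    println!("{a:?}");
}
```

Because a fixed seed fully determines the sort keys, tests can reproduce an exact allocation order, which plain `RANDOM()` cannot offer.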
Testing whether the allocation distribution is truly uniform is difficult to do reproducibly in CI. I made some attempts at statistical analysis, but a fully deterministic region allocation would require allocating all the dataset UUIDs deterministically, which would mean pulling in a direct dependency on the chacha crate and hooking it up. Analysis on anything less than perfectly deterministic data will eventually produce false failures given enough CI runs; that's just the nature of measuring randomness. Additionally, a simple chi-squared analysis isn't quite appropriate here, since the 3 dataset selections for a single region affect each other due to the distinct-zpool requirement.

ANYWAYS. I ran 3 sets of 3000 region allocations, each resulting in 9000 dataset selections across 27 datasets, and counted how many times each dataset was selected:

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.
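As a rough sanity check (with the caveat above that the 3 selections per region are not fully independent, so this is only indicative), a chi-squared statistic can be computed for the first distribution:

```rust
// First of the three observed distributions from the PR description.
const COUNTS: [f64; 27] = [
    351., 318., 341., 366., 337., 322., 329., 328., 327., 373., 335.,
    322., 330., 335., 333., 324., 349., 338., 346., 314., 337., 327.,
    328., 330., 322., 319., 319.,
];

// Pearson's chi-squared statistic against a uniform expectation:
// sum over bins of (observed - expected)^2 / expected.
fn chi_squared(observed: &[f64]) -> f64 {
    let total: f64 = observed.iter().sum();
    let expected = total / observed.len() as f64;
    observed.iter().map(|o| (o - expected).powi(2) / expected).sum()
}

fn main() {
    let total: f64 = COUNTS.iter().sum();
    let chi2 = chi_squared(&COUNTS);
    // With 26 degrees of freedom, the 5% critical value is ~38.9; a statistic
    // well below that is consistent with uniform selection.
    println!("total = {total}, chi2 = {chi2:.1}");
}
```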
```rust
let logctx = dev::test_setup_log("test_region_allocation");
let mut db = test_setup_database(&logctx.log).await;
let (opctx, datastore) = datastore_test(&logctx, &db).await;
async fn create_test_datasets_for_region_allocation(
```
You don't need to rebase around it, but FYI I made something similar in https://github.com/oxidecomputer/omicron/pull/3526/files#diff-749f37bf13cd957fe3d0d5a1050abab569157462c57ceaab6566db63d8e01d9f
```rust
);
assert_eq!(extent_count, region.extent_count());
}
let volume1_id = Uuid::new_v4();
```
Just `volume_id`? I don't see a non-1 volume.
```rust
// can_be_retried(error) will return true. It is highly likely that if
// we simply retry the query with a different shuffle-ordering, we will
// get a valid result. However, we should not retry forever, in the event
// that something else is going on, so we limit it.
```
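The bounded-retry idea this comment describes can be sketched as follows (illustrative names only, not the actual Nexus code):

```rust
// Retry a fallible, randomized operation a few times -- a re-shuffle will
// likely succeed -- but cap attempts so a persistent failure still surfaces.
fn with_retries<T, E>(
    max_attempts: u32,
    can_be_retried: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(v) => return Ok(v),
            // Retry only retryable errors, and only while under the cap.
            Err(e) if attempt < max_attempts && can_be_retried(&e) => continue,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Fails twice with a "retryable" error, then succeeds on the third try.
    let mut calls = 0;
    let result = with_retries(5, |_: &&str| true, || {
        calls += 1;
        if calls < 3 { Err("duplicate zpool in selection") } else { Ok(calls) }
    });
    assert_eq!(result, Ok(3));
}
```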
I have a pretty strong aversion to this approach. Constructing a query that we know might return incorrect results, and hoping it randomly lands on a valid selection, is a recipe for significantly increasing DB traffic per allocation.

I particularly have an aversion to this approach because I think it's avoidable (compared with main):
```diff
--- a/nexus/db-queries/src/db/queries/region_allocation.rs
+++ b/nexus/db-queries/src/db/queries/region_allocation.rs
@@ -88,6 +88,7 @@ struct CandidateDatasets {
 impl CandidateDatasets {
     fn new() -> Self {
         use crate::db::schema::dataset::dsl as dataset_dsl;
+        use crate::db::schema::zpool::dsl as zpool_dsl;
         Self {
             query: Box::new(
@@ -95,7 +96,12 @@ impl CandidateDatasets {
                 .filter(dataset_dsl::time_deleted.is_null())
                 .filter(dataset_dsl::size_used.is_not_null())
                 .filter(dataset_dsl::kind.eq(DatasetKind::Crucible))
-                .order(dataset_dsl::size_used.asc())
+                .inner_join(
+                    zpool_dsl::zpool
+                        .on(zpool_dsl::id.eq(dataset_dsl::pool_id))
+                )
+                .distinct_on(zpool_dsl::sled_id)
+                .order((zpool_dsl::sled_id.asc(), dataset_dsl::size_used.asc()))
                 .limit(REGION_REDUNDANCY_THRESHOLD.try_into().unwrap())
                 .select((dataset_dsl::id, dataset_dsl::pool_id)),
             ),
diff --git a/smf/sled-agent/non-gimlet/config-rss.toml b/smf/sled-agent/non-gimlet/config-rss.toml
index bd02ec95e..0851aab5a 100644
```
Using roughly this raw SQL: https://www.db-fiddle.com/f/kKyiEESGGx5GUaviTAzkuB/1

This is sufficient to select datasets that are unique to each sled -- it basically makes the candidate datasets "give me the best dataset each sled could offer, by size".

I do think we would need one more subquery to then order again by space, since the `DISTINCT ON` clause forces us to sort by sled UUID first, when we really want to sort by least-used-size first.
> This is sufficient to select datasets that are unique to each sled -- it basically makes the candidate datasets "give me the best dataset each sled could offer, by size".

Yes, but it also eliminates the random ordering, defeating the point of this PR. We can use `distinct_on`, but then we can't do random ordering; we can do random ordering, but then we can't use `distinct_on`. Best I can tell, it's one or the other. I would like to be proven wrong about this, because doing the distinction in the SQL would be a lot nicer.
Following up from a discussion in Meet, we should be able to do the following:

1. Use `DISTINCT ON sled` with `ORDER BY sled, random()` to pick random (but valid!) allocation targets from each sled, and
2. Use the output of the subquery from (1) to randomly select sleds to use for provisioning.

Basically: "Randomly pick a valid target from all sleds, then randomly pick one of those sleds."

However, we can do that as a follow-up to this PR.
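The proposed two-step selection can be simulated outside SQL (an illustrative sketch; a seeded hash stands in for SQL's `random()`, and all names are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Seeded pseudo-random value per key; a stand-in for SQL's random().
fn pseudo_rand(seed: u64, key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    key.hash(&mut h);
    h.finish()
}

// Step 1 mirrors `DISTINCT ON sled ... ORDER BY sled, random()`: keep one
// random candidate dataset per sled. Step 2 randomly orders the sleds and
// takes the first `redundancy` of them.
fn pick_datasets(
    datasets: &[(&str, &str)], // (sled_id, dataset_id)
    redundancy: usize,
    seed: u64,
) -> Vec<(String, String)> {
    let mut per_sled: BTreeMap<&str, &str> = BTreeMap::new();
    for &(sled, ds) in datasets {
        let best = per_sled.entry(sled).or_insert(ds);
        if pseudo_rand(seed, ds) < pseudo_rand(seed, *best) {
            *best = ds;
        }
    }
    let mut candidates: Vec<(String, String)> =
        per_sled.iter().map(|(s, d)| (s.to_string(), d.to_string())).collect();
    candidates.sort_by_key(|(sled, _)| pseudo_rand(seed, sled));
    candidates.truncate(redundancy);
    candidates
}

fn main() {
    let datasets = [
        ("sled-1", "ds-a"), ("sled-1", "ds-b"),
        ("sled-2", "ds-c"), ("sled-2", "ds-d"),
        ("sled-3", "ds-e"), ("sled-4", "ds-f"),
    ];
    let picks = pick_datasets(&datasets, 3, 42);
    // One dataset per sled by construction, so sleds are always distinct.
    assert_eq!(picks.len(), 3);
    println!("{picks:?}");
}
```

The key property is that sled-distinctness holds by construction rather than by detect-and-retry.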
To add on: I'm convinced after talking with Sean that it may be possible to do this in the SQL, and we'll find out for sure once we try it. I'll be opening an issue to track picking across multiple sleds; it's important for reaching the level of redundancy we want. But it's potentially more complicated than this change, since we may need to wire up a way to allocate within just one sled to avoid breaking dev workflows, and we may find more tests that need updating. This PR is important to land sooner rather than later to avoid the error conditions it addresses, which is why we're keeping allocation across sleds as future work.
PR #3650 introduced the Random region allocation strategy, which allocates regions randomly across the rack. This expands on that with a RandomWithDistinctSleds region allocation strategy. The new strategy is the same, but requires that the 3 Crucible regions be allocated on 3 different sleds, improving resiliency against whole-sled failure. The Random strategy still exists and does not require 3 distinct sleds; this is useful in one-sled environments such as the integration tests and lab setups.

This PR adds the ability to configure the allocation strategy in the Nexus PackageConfig toml. Anyone running a one-sled setup will need to configure one-sled mode (as is done for the integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a single zpool could be selected. That fix applies to both the old Random strategy and the new RandomWithDistinctSleds strategy. `smf/nexus/config-partial.toml` is configured for RandomWithDistinctSleds, as that is what we want on prod.

As mentioned, the integration tests do not use the distinct-sleds allocation strategy. I attempted to add 2 extra sleds to the simulated environment but found that this broke more things than I had the understanding to fix in this PR. It would be nice for the sim environment to have 3 sleds in the future, not just for this but for anything else that might behave differently in a multi-sled setup. For now, I have unit tests that verify the allocation behavior works correctly with CockroachDB, and we can try it out on dogfood.
PR #3650 introduced the Random region allocation strategy, which allocates regions randomly across the rack. This expands on that with a RandomWithDistinctSleds region allocation strategy. The new strategy is the same, but requires that the 3 Crucible regions be allocated on 3 different sleds, improving resiliency against whole-sled failure. The Random strategy still exists and does not require 3 distinct sleds; this is useful in one-sled environments such as the integration tests and lab setups.

This also fixes a shortcoming of #3650 whereby multiple datasets on a single zpool could be selected. That fix applies to both the old Random strategy and the new RandomWithDistinctSleds strategy.

For now, I have unit tests that verify the allocation behavior works correctly with CockroachDB, and we can try it out on dogfood.

Adds the `-r` / `--rack-topology` command line argument to `omicron-package target create`. Use this to specify whether you are packaging for a single-sled or multi-sled environment. Under single-sled environments, the requirement for 3 distinct sleds is removed.

Fixes #3702

---------

Co-authored-by: iliana etaoin <[email protected]>
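The relationship between the packaging-time topology flag and the allocation strategy might be sketched like this (illustrative types only, not the actual Nexus or omicron-package code):

```rust
// Hypothetical modeling of the two strategies and the `--rack-topology`
// choice described above; names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RegionAllocationStrategy {
    // Random placement; regions may share a sled (OK for single-sled setups).
    Random,
    // Random placement, but the 3 regions must land on 3 distinct sleds.
    RandomWithDistinctSleds,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum RackTopology {
    SingleSled,
    MultiSled,
}

// Single-sled environments relax the distinct-sleds requirement, since it
// could never be satisfied with only one sled.
fn strategy_for(topology: RackTopology) -> RegionAllocationStrategy {
    match topology {
        RackTopology::SingleSled => RegionAllocationStrategy::Random,
        RackTopology::MultiSled => RegionAllocationStrategy::RandomWithDistinctSleds,
    }
}

fn main() {
    assert_eq!(
        strategy_for(RackTopology::SingleSled),
        RegionAllocationStrategy::Random
    );
    println!("{:?}", strategy_for(RackTopology::MultiSled));
}
```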