Allocate Crucible regions randomly across zpools #3650

Merged: 4 commits into main, Jul 17, 2023

Conversation

faithanalog (Contributor):

Currently, Nexus allocates Crucible regions on the least-used datasets. This leads to repeated failures (see #3416). This change introduces the concept of region allocation strategies at the database layer. Currently only one strategy is used: the random strategy. However, this could be expanded.

The random strategy picks 3 distinct datasets from zpools with enough space to hold a copy of the region being allocated. Datasets are shuffled using the MD5 hash of the dataset UUID appended to some value. This value can be specified to get a deterministic allocation, mainly for test purposes, but in prod it simply uses the current time in nanoseconds. Because the MD5 function has a uniformly distributed output, sorting on this hash yields a random shuffling of the datasets while allowing more control than simply using `RANDOM()`.
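As a rough illustration (not the PR's actual query), the shuffle could be written in raw SQL like this, where `<seed>` stands for the caller-supplied value or the time in nanoseconds, and the `dataset` table and column names follow the diesel code quoted later in this thread:

```sql
-- Sketch only, not the PR's actual query. Hashing (uuid || seed) yields a
-- deterministic, uniformly distributed sort key, i.e. a seeded shuffle.
SELECT id, pool_id
  FROM dataset
 WHERE time_deleted IS NULL
   AND size_used IS NOT NULL
 ORDER BY md5(CAST(id AS STRING) || '<seed>')
 LIMIT 3;
```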

Due to SQL limitations, to the best of my knowledge, we cannot make the selection both random and distinct on the zpool or sled IDs, so an allocation may select multiple datasets on the same sled or zpool. But we can detect that this has occurred and fail the query. I have included some code that retries a few times if two datasets on the same zpool are selected, since a fresh shuffle is likely to produce a valid selection. Note that in production we currently never have two Crucible datasets on the same zpool, but it was raised as a future possibility.
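A sketch of that detection step (illustrative; `candidate_datasets` is a stand-in for the three rows produced by the shuffle above):

```sql
-- Sketch: if the three picks share a zpool, the distinct-pool count is
-- less than 3; the caller fails the query and retries with a new seed.
SELECT count(DISTINCT pool_id) = 3 AS picks_are_valid
  FROM candidate_datasets;
```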

I have not done anything to ensure we are spread across 3 sleds. We could attempt to implement that with the same retry approach, but it already feels a bit hacky, and it would feel more so if we relied on it for that as well. Regardless of how we solve the problem of distributing across 3 distinct sleds, we should plumb the allocation strategy through more parts of Nexus when moving to a 3-sled policy, so that we can relax it to a 1-sled requirement for development/testing.

Testing whether the allocation distribution is truly uniform is difficult to do in a reproducible manner in CI. I made some attempts at statistical analysis, but to get a fully deterministic region allocation we would need to allocate all the dataset UUIDs deterministically, which would require pulling in a direct dependency on the chacha crate and hooking that up. Doing analysis on anything other than perfectly deterministic data will eventually produce false failures given enough CI runs; that's just the nature of measuring whether data is random. Additionally, a simple chi-squared analysis isn't quite appropriate here, since the 3 dataset selections for a single region affect each other due to the requirement for distinct zpools.

ANYWAYS. I ran 3 sets of 3000 region allocations, each resulting in 9000 dataset selections across 27 datasets (so roughly 333 selections per dataset expected). I got these distributions, counting how many times each dataset was selected:

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.

nexus/db-model/src/queries/region_allocation.rs (outdated; resolved)
```rust
let volume1_id = Uuid::new_v4();
```
Collaborator:
Just volume_id? I don't see a non-1 volume

nexus/db-queries/src/db/datastore/mod.rs (outdated; resolved)
Comment on lines 138 to 141:

```rust
// can_be_retried(error) will return true. It is highly likely that if
// we simply retry the query with a different shuffle-ordering, we will
// get a valid result. However, we should not retry forever, in the event
// that something else is going on, so we limit it.
```
Collaborator:
I have a pretty strong aversion to this approach. Constructing a query that we know might return incorrect results, and hoping it randomly arrives at a valid selection, is a recipe for significantly increasing the rate of traffic to the DB to perform an allocation.

I particularly have an aversion to this approach because I think it's avoidable:

(Compared with main)

```diff
--- a/nexus/db-queries/src/db/queries/region_allocation.rs
+++ b/nexus/db-queries/src/db/queries/region_allocation.rs
@@ -88,6 +88,7 @@ struct CandidateDatasets {
 impl CandidateDatasets {
     fn new() -> Self {
         use crate::db::schema::dataset::dsl as dataset_dsl;
+        use crate::db::schema::zpool::dsl as zpool_dsl;
 
         Self {
             query: Box::new(
@@ -95,7 +96,12 @@ impl CandidateDatasets {
                     .filter(dataset_dsl::time_deleted.is_null())
                     .filter(dataset_dsl::size_used.is_not_null())
                     .filter(dataset_dsl::kind.eq(DatasetKind::Crucible))
-                    .order(dataset_dsl::size_used.asc())
+                    .inner_join(
+                        zpool_dsl::zpool
+                            .on(zpool_dsl::id.eq(dataset_dsl::pool_id))
+                    )
+                    .distinct_on(zpool_dsl::sled_id)
+                    .order((zpool_dsl::sled_id.asc(), dataset_dsl::size_used.asc()))
                     .limit(REGION_REDUNDANCY_THRESHOLD.try_into().unwrap())
                     .select((dataset_dsl::id, dataset_dsl::pool_id)),
             ),
diff --git a/smf/sled-agent/non-gimlet/config-rss.toml b/smf/sled-agent/non-gimlet/config-rss.toml
index bd02ec95e..0851aab5a 100644
```

Using roughly this raw SQL: https://www.db-fiddle.com/f/kKyiEESGGx5GUaviTAzkuB/1

This is sufficient to select datasets that are unique to each sled -- it basically makes candidate datasets "give me the best dataset each sled could offer, by size".

I do think we would need one more subquery to then order again by space, since the DISTINCT ON clause forces us to sort by sled UUID first, when we really want to sort by least-used-size first.
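Roughly, that shape might look like the following raw SQL (a sketch only, using the `dataset` and `zpool` table and column names from the diff above, untested against the real schema):

```sql
-- Sketch: DISTINCT ON (sled_id) keeps the least-used dataset per sled;
-- the outer query then re-orders those per-sled winners by space used.
SELECT id, pool_id
  FROM (SELECT DISTINCT ON (zpool.sled_id)
               dataset.id, dataset.pool_id, dataset.size_used
          FROM dataset
          JOIN zpool ON zpool.id = dataset.pool_id
         ORDER BY zpool.sled_id ASC, dataset.size_used ASC) AS best_per_sled
 ORDER BY size_used ASC
 LIMIT 3;
```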

@faithanalog (Contributor, Author) commented on Jul 16, 2023:
> This is sufficient to select datasets that are unique to each sled -- it basically makes candidate datasets "give me the best dataset each sled could offer, by size".

Yes, but it also eliminates the random ordering, defeating the point of this PR. We can use `distinct_on`, but we can't do random ordering; we can do random ordering, but we can't use `distinct_on`. Best I can tell, it's one or the other. I would like to be proven wrong about this, because doing the distinction in the SQL would be a lot nicer.

Collaborator:

Following up from a discussion in Meet:

We should be able to do the following:

  1. Use `DISTINCT ON (sled)` with `ORDER BY sled, random()` to pick random (but valid!) allocation targets from each sled, and
  2. Use the output of the subquery from (1) to randomly select sleds to use for provisioning.

Basically: "Randomly pick a valid target from all sleds, then randomly pick one of those sleds."

However, we can do that as a follow-up to this PR.
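As a sketch, that two-step query could look something like this (illustrative, untested; same assumed table names as above):

```sql
-- Sketch of the two-step idea: one random valid dataset per sled, then a
-- random subset of those per-sled picks.
SELECT id, pool_id
  FROM (SELECT DISTINCT ON (zpool.sled_id)
               dataset.id, dataset.pool_id
          FROM dataset
          JOIN zpool ON zpool.id = dataset.pool_id
         ORDER BY zpool.sled_id, random()) AS one_per_sled
 ORDER BY random()
 LIMIT 3;
```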

@faithanalog (Contributor, Author) commented on Jul 17, 2023:

To add on: after talking with Sean, I'm convinced it may be possible to do that in the SQL, and we'll find out for sure once we try it. I'll be opening an issue to track picking across multiple sleds; it's important for reaching the level of redundancy we want. But it's potentially more complicated than this change, since we may need to wire up a way to allocate within just one sled as well to avoid breaking dev workflows, and we may find more tests that need to be updated. This PR is pretty important to get in sooner rather than later to avoid the error conditions it addresses, which is why we're keeping allocation across sleds as future work.

nexus/db-queries/src/db/queries/region_allocation.rs (outdated; resolved)
nexus/db-queries/src/db/datastore/region.rs (outdated; resolved)
nexus/db-queries/src/db/queries/region_allocation.rs (outdated; resolved)
@faithanalog faithanalog enabled auto-merge (squash) July 17, 2023 02:49
@faithanalog faithanalog merged commit d986cdc into main Jul 17, 2023
@faithanalog faithanalog deleted the artemis/random-region-allocation branch July 17, 2023 04:31
faithanalog added a commit that referenced this pull request Aug 9, 2023
PR #3650 introduced the Random region allocation strategy to allocate
regions randomly across the rack. This expands on that with the addition
of the RandomWithDistinctSleds region allocation strategy. This strategy
is the same, but requires the 3 Crucible regions to be allocated on 3
different sleds, to improve resiliency against a whole-sled failure.

The Random strategy still exists and does not require 3 distinct sleds.
This is useful in one-sled environments such as the integration tests
and lab setups. This PR adds the ability to configure the allocation
strategy in the Nexus PackageConfig toml. Anyone running in a one-sled
setup will need to configure that to one-sled mode (as is done for the
integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a
single zpool could be selected. That fix applies to both the old Random
strategy and the new RandomWithDistinctSleds strategy.

`smf/nexus/config-partial.toml` is configured for
RandomWithDistinctSleds, as that is what we want to use on prod.

As I mentioned, the integration tests are not using the distinct sleds
allocation strategy. I attempted to add 2 extra sleds to the simulated
environment but found that this broke more things than I had the
understanding to fix in this PR. It would be nice in the future for the
sim environment to have 3 sleds in it though, not just for this but for
anything else that might have different behaviors in a multi-sled setup.

For now, I have unit tests that verify the allocation behavior
works correctly with CockroachDB, and we can try it out on dogfood.
faithanalog added a commit that referenced this pull request Aug 18, 2023
faithanalog added a commit that referenced this pull request Sep 2, 2023
faithanalog added a commit that referenced this pull request Sep 17, 2023
faithanalog added a commit that referenced this pull request Sep 29, 2023
faithanalog added a commit that referenced this pull request Oct 3, 2023
PR #3650 introduced the Random region allocation strategy to allocate
regions randomly across the rack. This expands on that with the addition
of the RandomWithDistinctSleds region allocation strategy. This strategy
is the same, but requires the 3 Crucible regions to be allocated on 3
different sleds, to improve resiliency against a whole-sled failure.

The Random strategy still exists and does not require 3 distinct sleds.
This is useful in one-sled environments such as the integration tests
and lab setups.

This also fixes a shortcoming of #3650 whereby multiple datasets on a
single zpool could be selected. That fix applies to both the old Random
strategy and the new RandomWithDistinctSleds strategy.

For now, I have unit tests that verify the allocation behavior
works correctly with CockroachDB, and we can try it out on dogfood.

Adds the `-r` / `--rack-topology` command-line argument to `omicron-package target create`. Use this to specify whether you are packaging for a single-sled or multi-sled environment. In single-sled environments, the requirement for 3 distinct sleds is removed.

Fixes #3702

---------

Co-authored-by: iliana etaoin <[email protected]>
Successfully merging this pull request may close these issues.

region allocation can consistently fail when there is one non-functional crucible