Allocate Crucible regions randomly across zpools (#3650)
Currently Nexus allocates Crucible regions on the least-used datasets. This leads to repeated failures (see #3416).

This change introduces the concept of region allocation strategies at the database layer. It replaces the previous approach of allocating on the least-used dataset with a "random" strategy that selects randomly from datasets with enough capacity for the requested region. We can expand this later to support multiple configurable allocation strategies.

The random strategy picks 3 distinct datasets from zpools with enough space to hold a copy of the region being allocated. Datasets are shuffled by sorting on the md5 hash of a number appended to the dataset UUID (sketched at the end of this message). This number can be specified as part of the allocation strategy to get a deterministic allocation, mainly for test purposes; when unspecified, as in production, the current time in nanoseconds is used. Because md5 output is uniformly distributed, sorting on it provides a random shuffle of the datasets while allowing more control than simply using `RANDOM()`.

At present, allocation selects 3 distinct datasets from zpools that have enough space for the region. Since there is currently only one Crucible dataset per zpool, this selects 3 distinct zpools. If a future change to the rack adds additional Crucible datasets to zpools, the code may select multiple datasets on the same zpool; however, it will detect this and produce an error instead of performing the allocation. A future change will improve the allocation strategy to pick from 3 distinct sleds, eliminating this problem in the process, but that is not part of this commit. We will plumb the allocation strategy through more parts of Nexus when moving to a 3-sled policy so that it can be relaxed to a 1-sled requirement for development/testing.

Testing whether the allocation distribution is truly uniform is difficult to do reproducibly in CI. I made some attempts at statistical analysis, but a fully deterministic region allocation would require allocating all of the dataset UUIDs deterministically, which would mean pulling in a direct dependency on the chacha crate and hooking that up. Doing analysis on anything other than perfectly deterministic data will eventually produce false failures given enough CI runs; that's just the nature of measuring whether data is random. Additionally, a simple chi-squared analysis isn't quite appropriate here: the 3 dataset selections for a single region are dependent on each other, because each dataset can only be chosen once.

I ran 3 sets of 3000 region allocations, each resulting in 9000 dataset selections across 27 datasets, and counted how many times each dataset was selected:

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.
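For illustration, here is a minimal Rust sketch of the hash-based ordering described above. This is not the actual Nexus implementation (which does this work inside the database query); the type, function name, and use of the `md5` crate here are assumptions made for the example, and the distinct-zpool check is omitted since there is currently one Crucible dataset per zpool.

```rust
// Requires the external `md5` crate.
use std::time::{SystemTime, UNIX_EPOCH};

/// A candidate Crucible dataset: its UUID (as a string) and free space in bytes.
struct Dataset {
    id: String,
    free_bytes: u64,
}

/// Select up to `count` datasets with room for `region_bytes`, ordered by
/// md5(seed || dataset id). A caller-supplied seed makes the ordering
/// deterministic (useful in tests); `None` falls back to the current time in
/// nanoseconds, mirroring the production behavior described above.
fn pick_datasets(
    mut candidates: Vec<Dataset>,
    region_bytes: u64,
    seed: Option<u128>,
    count: usize,
) -> Vec<Dataset> {
    let seed = seed.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before UNIX epoch")
            .as_nanos()
    });

    // Keep only candidates with enough free space for a copy of the region.
    candidates.retain(|d| d.free_bytes >= region_bytes);

    // md5 output is uniformly distributed, so sorting on the digest of
    // (seed || dataset id) shuffles the eligible datasets uniformly.
    candidates.sort_by_key(|d| md5::compute(format!("{seed}{}", d.id)).0);

    candidates.truncate(count);
    candidates
}
```

Passing the same seed twice reproduces the same ordering, which is what the deterministic test mode relies on; omitting it gives a different shuffle on each allocation.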