[Sled Agent] Expunged disks are not in use after omicron_physical_disks_ensure #5965

Merged on Jul 15, 2024 (70 commits)
Changes from 1 commit
Commits
cee78bc
Start requiring zone filesystem argument
smklein Jun 20, 2024
3be4b6e
Deprecate the old service format
smklein Jun 20, 2024
8756076
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jun 20, 2024
0aac450
Merge branch 'deprecate-services-migration' into nexus-zone-filesyste…
smklein Jun 20, 2024
3833549
Plumbing through filesystem_pool, still need to make it optional
smklein Jun 21, 2024
aea4bdb
Merge branch 'main' into deprecate-services-migration
smklein Jun 21, 2024
b58352f
review feedback
smklein Jun 21, 2024
a04e9c7
no bail just warn
smklein Jun 21, 2024
4615f1b
Merge branch 'deprecate-services-migration' into nexus-zone-filesyste…
smklein Jun 21, 2024
9f09c32
Merge branch 'main' into deprecate-services-migration
smklein Jun 21, 2024
9db3042
Merge branch 'deprecate-services-migration' into nexus-zone-filesyste…
smklein Jun 21, 2024
a96fc81
optional value
smklein Jun 21, 2024
9858dbf
are we optional yet
smklein Jun 21, 2024
f1e6f7a
lie about filesystem_pools for simulated sled agent
smklein Jun 21, 2024
8a9ade7
Patch test_builder_zones
smklein Jun 24, 2024
d7c462c
Fix test_silos_external_dns_end_to_end
smklein Jun 24, 2024
3c59610
patch v3 schema
smklein Jun 24, 2024
1270098
Patch blueprint edit
smklein Jun 24, 2024
f48fba3
Add schema change
smklein Jun 24, 2024
684932d
fmt
smklein Jun 24, 2024
87b8df9
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jun 24, 2024
52406a6
helios tests
smklein Jun 24, 2024
acaf91f
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jun 24, 2024
b1339d4
Cleanup
smklein Jun 24, 2024
fcea2f1
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jun 25, 2024
53027a3
only pick in-service zpools from reconfigurator - regression test wanted
smklein Jun 25, 2024
ae41399
Merge zpool selection fns
smklein Jun 25, 2024
5b38070
Add colocation test
smklein Jun 25, 2024
f0ab1c2
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jun 26, 2024
b883eec
Ensure expunged disks are not in use after omicron_physical_disks_ensure
smklein Jun 27, 2024
83c7cdf
Fix tests, add comments
smklein Jun 27, 2024
6869d92
Zone bundler
smklein Jun 28, 2024
4292158
Plumb 'PathInPool' structure
smklein Jul 1, 2024
17db428
Destroy instances
smklein Jul 1, 2024
32596df
Remove unused zone code
smklein Jul 1, 2024
d83a553
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jul 1, 2024
d6618e7
Merge branch 'nexus-zone-filesystems-2' into physical_disks_ensure_le…
smklein Jul 1, 2024
2c6eb01
fix helios tests
smklein Jul 1, 2024
e4123a9
Add TODO, re: concurrency safety
smklein Jul 1, 2024
98278d4
Merge branch 'main' into nexus-zone-filesystems-2
smklein Jul 1, 2024
fa91e75
Merge branch 'nexus-zone-filesystems-2' into physical_disks_ensure_le…
smklein Jul 1, 2024
1207c9e
very WIP - adjusting generation
smklein Jul 2, 2024
892a7ca
Stop self-managing disks
smklein Jul 2, 2024
15b8d21
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 2, 2024
c0e8e07
Fix imports
smklein Jul 2, 2024
654a4ce
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 2, 2024
187aea3
generation number unity
smklein Jul 2, 2024
7c5a67f
Merge branch 'main' into stop-self-managing-disks
smklein Jul 2, 2024
b50007b
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 2, 2024
c2ee842
Remove self-managing test too
smklein Jul 2, 2024
a437cc2
imports
smklein Jul 2, 2024
d9ab0e2
Merge branch 'main' into stop-self-managing-disks
smklein Jul 2, 2024
3d91d67
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 2, 2024
c7d4e2e
Merge branch 'main' into stop-self-managing-disks
smklein Jul 2, 2024
7751f12
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 2, 2024
8f2301d
Safe against concurrent updates
smklein Jul 2, 2024
48c3578
Patch firmware tests
smklein Jul 2, 2024
e933a46
Add a bunch of logging
smklein Jul 2, 2024
154a071
review feedback
smklein Jul 3, 2024
691bc85
tx naming
smklein Jul 5, 2024
e360dae
more explicit instance termination
smklein Jul 5, 2024
ec013d9
better handling of oneshot tx in instance manager
smklein Jul 5, 2024
a818de2
use_only_these_disks
smklein Jul 5, 2024
f242e0a
Mark vmm failed
smklein Jul 5, 2024
77931fd
Merge branch 'main' into stop-self-managing-disks
smklein Jul 12, 2024
6babd19
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 12, 2024
d8a5465
Merge branch 'main' into stop-self-managing-disks
smklein Jul 12, 2024
d57ec70
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 12, 2024
9e1d729
Merge branch 'main' into stop-self-managing-disks
smklein Jul 15, 2024
426daf1
Merge branch 'stop-self-managing-disks' into physical_disks_ensure_le…
smklein Jul 15, 2024
24 changes: 12 additions & 12 deletions sled-agent/src/instance_manager.rs
@@ -333,7 +333,10 @@ impl InstanceManager {
     ///
     /// This function looks for transient zone filesystem usage on expunged
     /// zpools.
-    pub async fn only_use_disks(&self, disks: AllDisks) -> Result<(), Error> {
+    pub async fn use_only_these_disks(
+        &self,
+        disks: AllDisks,
+    ) -> Result<(), Error> {
         let (tx, rx) = oneshot::channel();
         self.inner
             .tx

Review comment (Member) on `pub async fn use_only_these_disks(`: <3

@@ -517,7 +520,7 @@ impl InstanceManagerRunner {
                             self.get_instance_state(tx, instance_id).await
                         },
                         Some(OnlyUseDisks { disks, tx } ) => {
-                            self.only_use_disks(disks).await;
+                            self.use_only_these_disks(disks).await;
                             tx.send(Ok(())).map_err(|_| Error::FailedSendClientClosed)
                         },
                         None => {
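
The two hunks above follow a request/acknowledge shape: the public `use_only_these_disks` method creates a reply channel, hands it to the runner together with the disks, and the runner answers once the request has been processed. A minimal sketch of that shape is below. It uses `std::sync::mpsc` and a plain thread in place of the async channels from the diff, and the `Request` enum, `spawn_runner`, the `Vec<String>` disk list, and the `String` error type are hypothetical stand-ins rather than the sled-agent types.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the request type handled by the runner.
enum Request {
    UseOnlyTheseDisks {
        disks: Vec<String>,
        reply: mpsc::Sender<Result<(), String>>,
    },
}

// Spawns a runner thread that owns the state and serves requests in order.
// The join handle yields the last disk set the runner accepted.
fn spawn_runner() -> (mpsc::Sender<Request>, thread::JoinHandle<Vec<String>>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        let mut allowed: Vec<String> = Vec::new();
        // The loop ends once every request sender has been dropped.
        while let Ok(req) = rx.recv() {
            match req {
                Request::UseOnlyTheseDisks { disks, reply } => {
                    allowed = disks;
                    // The caller may have gone away; this sketch ignores
                    // a failed reply instead of reporting it.
                    let _ = reply.send(Ok(()));
                }
            }
        }
        allowed
    });
    (tx, handle)
}

// Client side: send the request, then block until the runner acknowledges it.
fn use_only_these_disks(
    tx: &mpsc::Sender<Request>,
    disks: Vec<String>,
) -> Result<(), String> {
    let (reply, reply_rx) = mpsc::channel();
    tx.send(Request::UseOnlyTheseDisks { disks, reply })
        .map_err(|_| "runner is gone".to_string())?;
    reply_rx
        .recv()
        .map_err(|_| "runner dropped the reply channel".to_string())?
}
```

Because the caller waits on the reply, it knows the runner has finished applying the new disk set before it continues, which appears to be the property the sled agent relies on after `omicron_physical_disks_ensure`.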
@@ -802,23 +805,20 @@ impl InstanceManagerRunner {
         Ok(())
     }
 
-    async fn only_use_disks(
-        &mut self,
-        disks: AllDisks,
-    ) {
+    async fn use_only_these_disks(&mut self, disks: AllDisks) {
         // Consider the generation number on the incoming request to avoid
         // applying old requests.
         let requested_generation = *disks.generation();
         if let Some(last_gen) = self.storage_generation {
             if last_gen >= requested_generation {
                 // This request looks old, ignore it.
-                info!(self.log, "only_use_disks: Ignoring request";
+                info!(self.log, "use_only_these_disks: Ignoring request";
                     "last_gen" => ?last_gen, "requested_gen" => ?requested_generation);
                 return;
             }
         }
         self.storage_generation = Some(requested_generation);
-        info!(self.log, "only_use_disks: Processing new request";
+        info!(self.log, "use_only_these_disks: Processing new request";
             "gen" => ?requested_generation);
 
         let u2_set: HashSet<_> = disks.all_u2_zpools().into_iter().collect();
@@ -830,7 +830,7 @@ impl InstanceManagerRunner {
             let Ok(Some(filesystem_pool)) =
                 instance.get_filesystem_zpool().await
             else {
-                info!(self.log, "only_use_disks: Cannot read filesystem pool"; "instance_id" => ?id);
+                info!(self.log, "use_only_these_disks: Cannot read filesystem pool"; "instance_id" => ?id);
                 continue;
             };
             if !u2_set.contains(&filesystem_pool) {
@@ -839,16 +839,16 @@ impl InstanceManagerRunner {
         }
 
         for id in to_remove {
-            info!(self.log, "only_use_disks: Removing instance"; "instance_id" => ?id);
+            info!(self.log, "use_only_these_disks: Removing instance"; "instance_id" => ?id);
             if let Some((_, instance)) = self.instances.remove(&id) {
                 let (tx, rx) = oneshot::channel();
                 if let Err(e) = instance.terminate(tx).await {
-                    warn!(self.log, "only_use_disks: Failed to request instance removal"; "err" => ?e);
+                    warn!(self.log, "use_only_these_disks: Failed to request instance removal"; "err" => ?e);
                     continue;
                 }
 
                 if let Err(e) = rx.await {
-                    warn!(self.log, "only_use_disks: Failed while removing instance"; "err" => ?e);
+                    warn!(self.log, "use_only_these_disks: Failed while removing instance"; "err" => ?e);
                 }
             }
         }
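
The runner-side `use_only_these_disks` in the hunks above does two things: it drops requests whose generation is not newer than the last one applied, and it removes every instance whose filesystem zpool is missing from the requested U.2 set. The sketch below illustrates both steps under simplifying assumptions. `Runner`, the `u64` generation, the `u32` instance ids, and the string pool names are hypothetical, and instance termination is reduced to removal from a map.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified stand-in for the runner state in the diff.
struct Runner {
    storage_generation: Option<u64>,
    // Instance id -> name of the zpool backing its filesystem.
    instances: HashMap<u32, String>,
}

impl Runner {
    // Returns the removed ids (sorted), or None if the request was stale.
    fn use_only_these_disks(
        &mut self,
        generation: u64,
        u2_zpools: &[&str],
    ) -> Option<Vec<u32>> {
        // Ignore requests that are not strictly newer than the last one.
        if let Some(last_gen) = self.storage_generation {
            if last_gen >= generation {
                return None;
            }
        }
        self.storage_generation = Some(generation);

        // Collect first, then remove, so the map is not mutated while
        // it is being iterated.
        let u2_set: HashSet<&str> = u2_zpools.iter().copied().collect();
        let mut to_remove: Vec<u32> = self
            .instances
            .iter()
            .filter(|(_, pool)| !u2_set.contains(pool.as_str()))
            .map(|(id, _)| *id)
            .collect();
        to_remove.sort_unstable();
        for id in &to_remove {
            self.instances.remove(id);
        }
        Some(to_remove)
    }
}
```

Note that the guard uses `>=`, so a repeated request carrying the same generation is treated as already applied and skipped.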
10 changes: 5 additions & 5 deletions sled-agent/src/probe_manager.rs
@@ -97,7 +97,7 @@ impl ProbeManager {
 
     /// Removes any probes using filesystem roots on zpools that are not
     /// contained in the set of "disks".
-    pub(crate) async fn only_use_disks(&self, disks: &AllDisks) {
+    pub(crate) async fn use_only_these_disks(&self, disks: &AllDisks) {
         let u2_set: HashSet<_> = disks.all_u2_zpools().into_iter().collect();
         let mut probes = self.inner.running_probes.lock().await;
 
@@ -107,13 +107,13 @@ impl ProbeManager {
         if let Some(last_gen) = probes.storage_generation {
             if last_gen >= requested_generation {
                 // This request looks old, ignore it.
-                info!(self.inner.log, "only_use_disks: Ignoring request";
+                info!(self.inner.log, "use_only_these_disks: Ignoring request";
                     "last_gen" => ?last_gen, "requested_gen" => ?requested_generation);
                 return;
             }
         }
         probes.storage_generation = Some(requested_generation);
-        info!(self.inner.log, "only_use_disks: Processing new request";
+        info!(self.inner.log, "use_only_these_disks: Processing new request";
             "gen" => ?requested_generation);
 
         let to_remove = probes
@@ -122,7 +122,7 @@ impl ProbeManager {
             .filter_map(|(id, probe)| {
                 let Some(probe_pool) = probe.root_zpool() else {
                     // No known pool for this probe
-                    info!(self.inner.log, "only_use_disks: Cannot read filesystem pool"; "id" => ?id);
+                    info!(self.inner.log, "use_only_these_disks: Cannot read filesystem pool"; "id" => ?id);
                     return None;
                 };
 
@@ -135,7 +135,7 @@ impl ProbeManager {
             .collect::<Vec<_>>();
 
         for probe_id in to_remove {
-            info!(self.inner.log, "only_use_disks: Removing probe"; "probe_id" => ?probe_id);
+            info!(self.inner.log, "use_only_these_disks: Removing probe"; "probe_id" => ?probe_id);
             self.inner.remove_probe_locked(&mut probes, probe_id).await;
         }
     }
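
The probe selection above uses `filter_map` so that a probe with no known root zpool is skipped rather than removed. The lines between the two hunks are elided in this view, so the sketch below assumes they compare the probe's pool against `u2_set`, as the doc comment on the function describes. `Probe`, `probes_to_remove`, the `u32` ids, and the string pool names are hypothetical.

```rust
use std::collections::{BTreeMap, HashSet};

// Hypothetical probe record; `root_zpool` is None when the pool is unknown.
struct Probe {
    root_zpool: Option<String>,
}

// Returns the ids of probes whose root pool is known and not in `u2_set`.
fn probes_to_remove(
    probes: &BTreeMap<u32, Probe>,
    u2_set: &HashSet<String>,
) -> Vec<u32> {
    probes
        .iter()
        .filter_map(|(id, probe)| {
            // No known pool: `?` yields None, so the probe is left alone.
            let pool = probe.root_zpool.as_ref()?;
            if u2_set.contains(pool) {
                None
            } else {
                Some(*id)
            }
        })
        .collect()
}
```

Collecting the ids into a `Vec` before removing anything mirrors the diff, where removal happens in a separate loop while the lock on the running probes is still held.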
4 changes: 2 additions & 2 deletions sled-agent/src/sled_agent.rs
@@ -864,12 +864,12 @@ impl SledAgent {
         // Ensure that all probes, at least after our call to
         // "omicron_physical_disks_ensure", stop using any disks that
         // may have been in-service from before that request.
-        self.inner.probes.only_use_disks(&latest_disks).await;
+        self.inner.probes.use_only_these_disks(&latest_disks).await;
         info!(self.log, "physical disks ensure: Updated probes");
 
         // Do the same for instances - mark them failed if they were using
         // expunged disks.
-        self.inner.instances.only_use_disks(latest_disks).await?;
+        self.inner.instances.use_only_these_disks(latest_disks).await?;
         info!(self.log, "physical disks ensure: Updated instances");
 
         Ok(disk_result)