[nexus] Support Bundle background task #7063
base: support-bundle-simulated-implementation
Conversation
PR 4 / ???

Adds a background task to manage the support bundle lifecycle and perform collection.

In this PR:

- Made an API abstraction above support bundles so it can be shared by the real/simulated sled agent
- To use this API:
  - Refactored simulated sled agent storage APIs (internal mutability)
  - Changed a bunch of sled agent locks to std mutex
  - Implemented a "nested dataset" API in the simulated sled agent storage
- Started using this simulated sled agent API in the support bundle collector background task tests
A few smaller comments from me, thanks @smklein
@@ -104,6 +104,8 @@ phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
# Disable automatic disk adoption to avoid interfering with tests.
physical_disk_adoption.disable = true
support_bundle_collector.period_secs = 30
support_bundle_collector.disable = true
Nit: Should we comment on why we're disabling the bundle collector here, to be consistent with the surrounding items?
Sure thing, added in 917759d
    );
    return Ok(DatabaseBundleCleanupResult::BadState);
} else {
    anyhow::bail!(
Should we log the error here?
Sure thing, done in 7d166f1
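A minimal sketch of what that logging could look like (the message, log level, and field names are assumptions, not the actual commit):

```rust
} else {
    // Log the unexpected state before bailing, so the failure shows up in
    // the task's log stream as well as in the returned error.
    warn!(
        &opctx.log,
        "SupportBundleCollector: database bundle in unexpected state";
        "bundle" => %bundle.id,
    );
    anyhow::bail!("database bundle in unexpected state");
}
```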
info!(
    &opctx.log,
    "SupportBundleCollector: Concurrent state change activating bundle";
    "bundle" => %bundle.id,
    "err" => ?err,
);
return Ok(Some(report));
}
This could be a connection error; it would be better to match on `UpdateStatus::NotUpdatedButExists` for this, and treat other errors as a real failure.
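A rough sketch of that shape, assuming the datastore update returns omicron's usual update-and-check result (the `update_result` binding and error conversion here are illustrative, not the PR's actual code):

```rust
match update_result {
    // The row exists but a concurrent actor already changed its state;
    // record it in the report instead of treating it as a task failure.
    Ok(result) if result.status == UpdateStatus::NotUpdatedButExists => {
        info!(
            &opctx.log,
            "SupportBundleCollector: Concurrent state change activating bundle";
            "bundle" => %bundle.id,
        );
        return Ok(Some(report));
    }
    Ok(_) => (),
    // Anything else (e.g. a connection error) is a real failure.
    Err(err) => return Err(err.into()),
}
```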
Sounds good, done in bcf6f8c
// as it's being collected.
let dir = tempdir()?;

println!("created tempdir, starting collection");
Leftover debug print?
println!("created tempdir, starting collection");
Removed in 76be221
info!(
    &self.log,
    "Checking if Bundle Collection cancelled";
    "bundle" => %self.bundle.id
);
This seems like it will be quite noisy. I'd be inclined to either demote this to a debug or trace event, or remove it entirely.
I don't think it'll be terribly noisy - it only fires while a bundle is being collected, once every five seconds.
That said, I'm happy to demote this to trace.
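For reference, the demoted form would just swap the macro (a sketch; same structured fields as above):

```rust
trace!(
    &self.log,
    "Checking if Bundle Collection cancelled";
    "bundle" => %self.bundle.id
);
```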
let src = entry.path();

zip.start_file_from_path(dst, opts)?;
let buf = std::fs::read(&src)?;
This could be quite a lot of memory when we're eventually dealing with a big core file, perhaps use a BufReader?
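A sketch of the streaming alternative, assuming the same `zip` writer, `dst`, and `opts` as in the snippet above; `std::io::copy` pushes the file through a fixed-size buffer instead of loading it whole:

```rust
use std::fs::File;
use std::io::BufReader;

zip.start_file_from_path(dst, opts)?;
let mut reader = BufReader::new(File::open(&src)?);
// Stream the file into the zip entry without buffering it all in memory.
std::io::copy(&mut reader, &mut zip)?;
```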
Sure, done in d23826e
{
    report.listed_in_service_sleds = true;

    // NOTE: This could be, and probably should be, done concurrently.
Do we need a follow-up issue for this?
Filed #7314
) -> anyhow::Result<CollectionReport> {
    // Create a temporary directory where we'll store the support bundle
    // as it's being collected.
    let dir = tempdir()?;
Do we set `TMPDIR` for omicron anywhere, or is this going to `/tmp` and using tmpfs? I ask because these bundles can potentially be pretty large one day, and I'm not sure how much memory the system will have for us to use that's not in use by the VMM reservoir or the control plane.
Updated to use `/var/tmp` in 53dcbe6
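Assuming the `tempfile` crate is what's in use here, the change is roughly (a sketch, not the actual commit):

```rust
// Create the scratch directory under /var/tmp, which is disk-backed in the
// zone, instead of the tmpfs-backed /tmp that tempdir() defaults to.
let dir = tempfile::tempdir_in("/var/tmp")?;
```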
    continue;
};

write_command_result_or_error(
Unrelated to this PR, but I wish we had something like `#[must_use]` for the sled-agent things being added that told you to come here and add them to the collection.
Maybe we can talk about this in the support bundle sync? Are you trying to trigger a compiler error in Nexus if a Sled Agent API hasn't been called here?
It might be easier to make a test for this behavior, rather than trying to make it a compile error.
entries.sort_by(|a, b| a.file_name().cmp(&b.file_name()));

for entry in &entries {
    // Remove the "/tmp/..." prefix from the path when we're storing it in the
I guess this answers my question about where the tmpdir is being created. What I still don't understand is if that's the best location for them.
Just dug this up from my system which I based on some discussion in chat:
2024-11-04.21:41:38 zfs create -o atime=off -o sync=disabled rpool/faketmpfs
2024-11-04.21:41:56 zfs set quota=20G rpool/faketmpfs
2024-11-04.21:42:39 zfs set mountpoint=/faketmpfs rpool/faketmpfs
2024-11-04.21:43:24 zfs snapshot rpool/faketmpfs@empty
Wondering if we should create a dataset similar to this on the active M.2 device if it's missing at sled-agent startup, or if it's found, rewind the snapshot to the empty state.
Maybe @jclulow or @davepacheco could weigh in on this (which is where I think I got the above from) vs storing the temporary bundle collection in `/tmp`, where we may have limited memory due to the VMM reservoir and control plane processes. Or perhaps I am being overly cautious...
Sorry, I'm not up to speed on any of this so I'm not sure what to suggest, but I'm happy to talk through it if it's helpful. Generally I avoid using `/tmp`/tmpfs for just about anything. When figuring out where to store stuff, I think about what data's being stored, how big it might be, and what its lifecycle is (i.e., how do we make sure we don't leak it, does it need to survive reboot, can it become incorrect or inconsistent with some other copies, etc.) to figure out where it should go. I'm not sure how helpful that is, but I'm happy to talk through it if that's useful!
This was covered a bit in https://rfd.shared.oxide.computer/rfd/0496#_storage - the RFD proposes storing support bundles within the Nexus zone filesystem during collection, but transferring them into a dataset within a dedicated U.2 while the bundle is active and accessible for download by the Nexus API.
That was my intent, at least - that this data would go through the following steps:
- Collected into the Nexus transient zone filesystem (limited to one bundle at a time, thanks to the structure of this background task).
- Aggregated into a zipfile by the Nexus collection task.
- Transferred to a U.2, which may be running on a separate sled from Nexus entirely. It's only at this point that the bundle is made "active" and is durable.
I do think it's a mistake for this to be using `/tmp` -- this code is running in a Nexus zone, so I think I need to be using `/var/tmp` to actually be using that transient zone storage, which was my goal, rather than using a RAM-backed filesystem.
That all makes sense I think. The only catch with the root filesystem is that it will disappear if the system reboots and then there might be some database-level thing to clean up (e.g., mark the bundle failed? I don't know).
> That all makes sense I think. The only catch with the root filesystem is that it will disappear if the system reboots and then there might be some database-level thing to clean up (e.g., mark the bundle failed? I don't know).
The current implementation of support bundles assigns them to a single Nexus, and it's the responsibility of that individual Nexus to perform collection.

- If the Nexus hasn't been expunged: when it reboots, it'll see the bundle in state `SupportBundleState::Collecting`, and will restart collection.
omicron/nexus/src/app/background/tasks/support_bundle_collector.rs
Lines 336 to 344 in d23826e
let result = self
    .datastore
    .support_bundle_list_assigned_to_nexus(
        opctx,
        &pagparams,
        self.nexus_id,
        vec![SupportBundleState::Collecting],
    )
    .await;
- If either Nexus was expunged, or the support bundle storage was expunged, then the reconfigurator has an execution step which invokes `support_bundle_fail_expunged`:
omicron/nexus/db-queries/src/db/datastore/support_bundle.rs
Lines 219 to 221 in 0603f0b
/// Marks support bundles as failed if their assigned Nexus or backing
/// dataset has been destroyed.
pub async fn support_bundle_fail_expunged(
This will update database records, and possibly re-assign the Nexus which "owns" the bundle so that storage on the final U.2 can be freed:
omicron/nexus/src/app/background/tasks/support_bundle_collector.rs
Lines 229 to 247 in d23826e
// Monitors all bundles that are "destroying" or "failing" and assigned to
// this Nexus, and attempts to clear their storage from Sled Agents.
async fn cleanup_destroyed_bundles(
    &self,
    opctx: &OpContext,
) -> anyhow::Result<CleanupReport> {
    let pagparams = DataPageParams::max_page();
    let result = self
        .datastore
        .support_bundle_list_assigned_to_nexus(
            opctx,
            &pagparams,
            self.nexus_id,
            vec![
                SupportBundleState::Destroying,
                SupportBundleState::Failing,
            ],
        )
        .await;
(this shares some logic for the "bundle was deleted manually by an operator" and "bundle was failed because something was expunged" pathways)
Sorry for the drive-by comment - I don't have much context here, but I skimmed over `support_bundle_fail_expunged` and had a question. It looks like that's checking the current blueprint for expunged zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?
> Sorry for the drive-by comment - I don't have much context here, but I skimmed over `support_bundle_fail_expunged` and had a question. It looks like that's checking the current blueprint for expunged zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?
Filed #7319 - I think this could result in us "missing" marking a bundle as failed if we quickly transition a zone from "expunged" to "pruned" without execution ever completing successfully in-between.
For the tempdir side of this, fixed in 53dcbe6.