[nexus] Support Bundle background task #7063

Open

smklein wants to merge 41 commits into the base branch support-bundle-simulated-implementation.
Conversation

@smklein (Collaborator) commented Nov 14, 2024

PR 4 / ???

Adds a background task to manage the support bundle lifecycle and perform collection

In this PR:

- Added an API abstraction above support bundles so it can be shared
  by the real and simulated sled agents
- To use this API:
    - Refactored the simulated sled agent storage APIs (internal mutability)
    - Changed several sled agent locks to std::sync::Mutex
    - Implemented a "nested dataset" API in the simulated sled agent
      storage
- Started using this simulated sled agent API in the support bundle
  collector background task tests
@smklein smklein changed the base branch from support-bundles-crdb to support-bundle-simulated-implementation December 17, 2024 22:07
@smklein smklein marked this pull request as ready for review December 17, 2024 22:20
@wfchandler (Contributor) left a comment


A few smaller comments from me, thanks @smklein

@@ -104,6 +104,8 @@ phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
# Disable automatic disk adoption to avoid interfering with tests.
physical_disk_adoption.disable = true
support_bundle_collector.period_secs = 30
support_bundle_collector.disable = true
Contributor

Nit: Should we comment on why we're disabling the bundle collector here, to be consistent with the surrounding items?

Collaborator Author

Sure thing, added in 917759d
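
For reference, the added comment might mirror the neighboring config entries; the wording below is illustrative, not the actual contents of 917759d:

# Disable automatic support bundle collection to avoid interfering with tests.
support_bundle_collector.disable = true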

);
return Ok(DatabaseBundleCleanupResult::BadState);
} else {
anyhow::bail!(
Contributor

Should we log the error here?
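
A minimal sketch of such logging, using the same slog style as the surrounding code (message and field names are illustrative, not the actual fix in 7d166f1):

// Record the unexpected state in the Nexus log before returning the error.
error!(
    &opctx.log,
    "SupportBundleCollector: database cleanup reached an unexpected state";
    "bundle" => %bundle.id,
);
anyhow::bail!("unexpected bundle state during cleanup");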

Collaborator Author

Sure thing, done in 7d166f1

Comment on lines 384 to 391
info!(
    &opctx.log,
    "SupportBundleCollector: Concurrent state change activating bundle";
    "bundle" => %bundle.id,
    "err" => ?err,
);
return Ok(Some(report));
}
Contributor

This could be a connection error; it would be better to match on UpdateStatus::NotUpdatedButExists here and treat other errors as real failures.
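
A rough sketch of that shape, assuming the update returns the usual status/record pair (types and surrounding control flow are illustrative, not the actual change in bcf6f8c):

match result {
    Ok(update) => match update.status {
        // Benign: another actor changed the bundle's state concurrently.
        UpdateStatus::NotUpdatedButExists => {
            info!(
                &opctx.log,
                "SupportBundleCollector: Concurrent state change activating bundle";
                "bundle" => %bundle.id,
            );
            return Ok(Some(report));
        }
        UpdateStatus::Updated => (),
    },
    // Anything else (e.g. a connection error) is a real failure.
    Err(err) => return Err(err.into()),
}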

Collaborator Author

Sounds good, done in bcf6f8c

// as it's being collected.
let dir = tempdir()?;

println!("created tempdir, starting collection");
Contributor

Leftover debug print?

Suggested change
println!("created tempdir, starting collection");

Collaborator Author

Removed in 76be221

Comment on lines 433 to 437
info!(
    &self.log,
    "Checking if Bundle Collection cancelled";
    "bundle" => %self.bundle.id
);
Contributor

This seems like it will be quite noisy. I'd be inclined to either demote this to a debug or trace event, or remove it entirely.

Collaborator Author

I don't think it'll be terribly noisy - it only fires while a bundle is being collected, and then only once every five seconds.

That said, I'm happy to demote this to trace
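
The demotion would just swap the macro while keeping the structured fields (sketch only, not the final code):

trace!(
    &self.log,
    "Checking if Bundle Collection cancelled";
    "bundle" => %self.bundle.id
);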

let src = entry.path();

zip.start_file_from_path(dst, opts)?;
let buf = std::fs::read(&src)?;
Contributor

This could use quite a lot of memory once we're eventually dealing with a big core file - perhaps use a BufReader?
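
A minimal sketch of the streaming approach, reusing the surrounding names and assuming zip is a zip::ZipWriter (which implements Write); this is not the actual change in d23826e:

// Stream the file into the zip entry instead of reading it all into memory.
zip.start_file_from_path(dst, opts)?;
let mut reader = std::io::BufReader::new(std::fs::File::open(&src)?);
std::io::copy(&mut reader, &mut zip)?;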

Collaborator Author

Sure, done in d23826e

{
report.listed_in_service_sleds = true;

// NOTE: This could be, and probably should be, done concurrently.
Contributor

Do we need a follow-up issue for this?

Collaborator Author

Filed #7314
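
For illustration only, a concurrent version of the per-sled work might look roughly like this, where sleds and collect_data_from_sled are hypothetical stand-ins for the existing loop variables and per-sled step, and the limit of 8 is arbitrary:

use futures::stream::{self, StreamExt};

let results: Vec<_> = stream::iter(sleds)
    .map(collect_data_from_sled)
    .buffer_unordered(8) // bound concurrency so we don't overwhelm sled agents
    .collect()
    .await;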

smklein added a commit that referenced this pull request Jan 6, 2025
PR 2 / ???

Adds Support Bundles to the database, along with datastore-specific
tests.

These implementations are necessary to implement both the Support Bundle
background task (#7063) and the HTTP interface for support bundles (#7187).

) -> anyhow::Result<CollectionReport> {
// Create a temporary directory where we'll store the support bundle
// as it's being collected.
let dir = tempdir()?;
Contributor

Do we set TMPDIR for omicron anywhere or is this going to /tmp and using tmpfs? I ask because these bundles can potentially be pretty large one day and I am not sure how much memory the system will have for us to use that's not in use by the VMM reservoir or the control plane.

Collaborator Author

Updated to use /var/tmp in 53dcbe6
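
A minimal sketch of that change, assuming the tempfile crate behind the existing tempdir() call:

// Place the scratch directory on disk-backed zone storage (/var/tmp)
// instead of the tmpfs-backed /tmp default.
let dir = tempfile::tempdir_in("/var/tmp")?;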

continue;
};

write_command_result_or_error(
Contributor

Unrelated to this PR but I wish we had something like #[must_use] for the sled-agent things being added that told you to come here and add them to the collection.

Collaborator Author

Maybe we can talk about this in the support bundle sync? Are you trying to trigger a compiler error in Nexus if a Sled Agent API hasn't been called here?

It might be easier to make a test for this behavior, rather than trying to make it a compile error.

entries.sort_by(|a, b| a.file_name().cmp(&b.file_name()));

for entry in &entries {
// Remove the "/tmp/..." prefix from the path when we're storing it in the
Contributor

I guess this answers my question about where the tmpdir is being created. What I still don't understand is if that's the best location for them.

Contributor

Just dug this up from my system which I based on some discussion in chat:

2024-11-04.21:41:38 zfs create -o atime=off -o sync=disabled rpool/faketmpfs
2024-11-04.21:41:56 zfs set quota=20G rpool/faketmpfs
2024-11-04.21:42:39 zfs set mountpoint=/faketmpfs rpool/faketmpfs
2024-11-04.21:43:24 zfs snapshot rpool/faketmpfs@empty

Wondering if we should create a dataset similar to this on the active M.2 device if it's missing at sled-agent startup, or, if it's found, rewind the snapshot to the empty state.

Maybe @jclulow or @davepacheco could weigh in on this (which is where I think I got the above from) vs storing the temporary bundle collection in /tmp where we may have limited memory due to the VMM reservoir and control plane processes. Or perhaps I am being overly cautious...

Collaborator

Sorry, I'm not up to speed on any of this so I'm not sure what to suggest, but I'm happy to talk through it if it's helpful. Generally I avoid using /tmp/tmpfs for just about anything. When figuring out where to store stuff, I think about what data's being stored, how big it might be, and what its lifecycle is (i.e., how do we make sure we don't leak it, does it need to survive reboot, can it become incorrect or inconsistent with some other copies, etc.) to figure out where it should go. I'm not sure how helpful that is, but I'm happy to talk through it if that's useful!

Collaborator Author

This was covered a bit in https://rfd.shared.oxide.computer/rfd/0496#_storage

this RFD proposes storing support bundles within the Nexus zone filesystem during collection, but transferring them into a dataset within a dedicated U.2 while the bundle is active, and accessible for download by the Nexus API.

That was my intent, at least - that this data would go through the following steps:

  1. Collected into the Nexus transient zone filesystem (limited to one bundle at a time, thanks to the structure of this background task).
  2. Aggregated into a zipfile by the Nexus collection task.
  3. Transferred to a U.2, which may be running on a separate sled from Nexus entirely. It's only at this point that the bundle is made "active" and is durable.

I do think it's a mistake for this to be using /tmp -- this code is running in a Nexus zone, so I think I need to be using /var/tmp to actually be using that transient zone storage, which was my goal, rather than using a RAM-backed filesystem.

Collaborator

That all makes sense I think. The only catch with the root filesystem is that it will disappear if the system reboots and then there might be some database-level thing to clean up (e.g., mark the bundle failed? I don't know).

Collaborator Author

That all makes sense I think. The only catch with the root filesystem is that it will disappear if the system reboots and then there might be some database-level thing to clean up (e.g., mark the bundle failed? I don't know).

The current implementation of support bundles assigns them to a single Nexus, and it's the responsibility of that individual Nexus to perform collection.

  • If the Nexus hasn't been expunged: When it reboots, it'll see the bundle in state SupportBundleState::Collecting, and will restart collection.

let result = self
    .datastore
    .support_bundle_list_assigned_to_nexus(
        opctx,
        &pagparams,
        self.nexus_id,
        vec![SupportBundleState::Collecting],
    )
    .await;

  • If either the Nexus or the support bundle storage was expunged, then the reconfigurator has an execution step which invokes support_bundle_fail_expunged:

/// Marks support bundles as failed if their assigned Nexus or backing
/// dataset has been destroyed.
pub async fn support_bundle_fail_expunged(

This will update database records, and possibly re-assign the Nexus which "owns" the bundle so that storage on the final U.2 can be freed:

// Monitors all bundles that are "destroying" or "failing" and assigned to
// this Nexus, and attempts to clear their storage from Sled Agents.
async fn cleanup_destroyed_bundles(
    &self,
    opctx: &OpContext,
) -> anyhow::Result<CleanupReport> {
    let pagparams = DataPageParams::max_page();
    let result = self
        .datastore
        .support_bundle_list_assigned_to_nexus(
            opctx,
            &pagparams,
            self.nexus_id,
            vec![
                SupportBundleState::Destroying,
                SupportBundleState::Failing,
            ],
        )
        .await;

(this shares some logic for the "bundle was deleted manually by an operator" and "bundle was failed because something was expunged" pathways)

Contributor

Sorry for the drive-by comment - I don't have much context here, but I skimmed over support_bundle_fail_expunged and had a question. It looks like that's checking the current blueprint for expunged zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?

Collaborator Author

Sorry for the drive-by comment - I don't have much context here, but I skimmed over support_bundle_fail_expunged and had a question. It looks like that's checking the current blueprint for expunged zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?

Filed #7319 - I think this could result in us "missing" marking a bundle as failed if we quickly transition a zone from "expunged" to "pruned" without execution ever completing successfully in-between.

Collaborator Author

#7325 should fix #7319

Collaborator Author

for the tempdir side of this, fixed in 53dcbe6
