
[repo depot 2/n] sled agent APIs to manage update artifact storage #6764

Merged: 33 commits from iliana/tuf-repo-depot merged into main on Oct 31, 2024

Conversation

@iliana (Contributor) commented Oct 3, 2024

These are the Sled Agent APIs for the TUF Repo Depot (RFD 424). The basic idea is that Nexus will manage a set of TUF repos it's aware of and can use for updates, storing the metadata in CockroachDB and the artifacts in a special dataset present on all sleds.

This implementation diverges slightly from the current determination in the RFD. The "simple replication, take two" option (which is what we're going with) suggests that Sled Agent manages objects at the TUF repo level. I think I disagree with this; Sled Agent should not know or care what a TUF repo is, as the presence of an artifact with some metadata in a TUF repo is knowledge only Nexus needs. This implementation detail doesn't matter a whole lot; from the operator's perspective, Nexus is managing things at the repo level.

There are four new endpoints for managing artifacts, which are all addressed by their sha256 hash:

  • a "PUT artifact" endpoint (which verifies the body with the provided checksum)
  • a "DELETE artifact" endpoint
  • a "copy artifact from another sled" endpoint
  • a "GET artifact" endpoint

(This smells like an object store, although it's not nearly as complex/general-purpose as the one suggested in RFD 424's "mini object store" option; part of the reason to make it content-addressable is to make its purpose bound to storing TUF repo artifacts.)
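
For orientation, here is a rough sketch of what the four endpoints above could look like as Dropshot handlers (ignoring, for the moment, the split of the GET endpoint into a separate Repo Depot API discussed below). The operation names, paths, request/response types, and the unit server context are illustrative assumptions, not the PR's actual API definition.

```rust
// Sketch only: names, paths, and types here are assumptions for illustration,
// not the PR's actual sled-agent or repo-depot API definition.
use dropshot::{
    endpoint, FreeformBody, HttpError, HttpResponseDeleted, HttpResponseOk,
    HttpResponseUpdatedNoContent, Path, RequestContext, StreamingBody, TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;

/// Artifacts are addressed by their sha256 hash.
#[derive(Deserialize, JsonSchema)]
struct ArtifactPathParam {
    sha256: String,
}

/// Hypothetical request body for the copy endpoint: which depot to pull from.
#[derive(Deserialize, JsonSchema)]
struct CopyFromDepotBody {
    depot_base_url: String,
}

/// "PUT artifact": store the streamed body, verifying it against the sha256
/// in the path before committing it.
#[endpoint { method = PUT, path = "/artifacts/{sha256}" }]
async fn artifact_put(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
    _body: StreamingBody,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    unimplemented!()
}

/// "DELETE artifact".
#[endpoint { method = DELETE, path = "/artifacts/{sha256}" }]
async fn artifact_delete(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
) -> Result<HttpResponseDeleted, HttpError> {
    unimplemented!()
}

/// "copy artifact from another sled" (later: from another Repo Depot).
#[endpoint { method = POST, path = "/artifacts/{sha256}/copy-from-depot" }]
async fn artifact_copy_from_depot(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
    _body: TypedBody<CopyFromDepotBody>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    unimplemented!()
}

/// "GET artifact": stream the artifact back to the caller.
#[endpoint { method = GET, path = "/artifacts/{sha256}" }]
async fn artifact_get(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
) -> Result<HttpResponseOk<FreeformBody>, HttpError> {
    unimplemented!()
}
```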

The reason to have the get and copy endpoints is to let us avoid adding endpoints for serving artifacts from Nexus to Sled Agents, so that Nexus doesn't need to manage any local storage beyond the initial repository verification. Nexus can instead instruct Sled Agents to retrieve artifacts it is missing from other Sled Agents. However, this creates a dependency on the Sled Agent API by Sled Agent, which is something we want to try to avoid in order to reason about upgrade ordering.

After some discussion in #oxide-update we seemed to settle on creating a new Repo Depot API definition which only has the "GET artifact" endpoint. The "copy artifact" endpoint becomes "copy artifact from another Repo Depot". In this implementation, Sled Agent is still the binary that spawns the Repo Depot service, but it could be another global zone daemon (or possibly a zone that can be independently updated, if we can share the storage between the global zone and the repo depot zone). This is still a circular dependency of sorts but it's one we can more easily build tools to reason about, and the API surface is deliberately quite small.

The expected flow using these APIs is:

  • An operator uploads a TUF repo to Nexus, which unpacks and validates the repo in its local storage and puts the repository in a "pending" state
  • Nexus PUTs artifacts to some number of sleds to ensure reasonable replication, then puts the repo in "available" state
  • A Nexus background task checks the list of artifacts on each sled and compares it to the list of artifacts in the database. It tells sleds missing artifacts to get them from sleds that have them, and it tells sleds with extra artifacts to delete them. (Note the "list of artifacts" API is not yet present; I expect this will happen in the next PR.)
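
To make the third step concrete, here is a minimal sketch of the reconciliation loop such a background task might run. Everything here — the stub `SledClient`, its method names, and `reconcile` itself — is hypothetical; neither the "list artifacts" API nor the Nexus task exists at this point.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an artifact hash (the real type lives in omicron).
type ArtifactHash = [u8; 32];

/// Hypothetical stand-in for a sled-agent / repo-depot API client.
struct SledClient;

impl SledClient {
    async fn artifact_list(&self) -> BTreeSet<ArtifactHash> { todo!() }
    async fn artifact_copy_from(&self, _sha256: ArtifactHash, _peer: &str) { todo!() }
    async fn artifact_delete(&self, _sha256: ArtifactHash) { todo!() }
}

/// Make each sled's artifact set match `wanted`: copy missing artifacts from a
/// peer that has them, and delete artifacts that should not be there.
async fn reconcile(sleds: &[(String, SledClient)], wanted: &BTreeSet<ArtifactHash>) {
    // Ask every sled what it currently has.
    let mut have: Vec<BTreeSet<ArtifactHash>> = Vec::new();
    for (_, client) in sleds {
        have.push(client.artifact_list().await);
    }
    for i in 0..sleds.len() {
        let client = &sleds[i].1;
        // Missing artifacts: tell this sled to copy them from a peer that has them.
        for &sha256 in wanted.difference(&have[i]) {
            if let Some(j) = (0..sleds.len()).find(|&j| j != i && have[j].contains(&sha256)) {
                client.artifact_copy_from(sha256, &sleds[j].0).await;
            }
        }
        // Extra artifacts: tell this sled to delete them.
        for &sha256 in have[i].difference(wanted) {
            client.artifact_delete(sha256).await;
        }
    }
}
```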

The update artifacts dataset is not created by Sled Agent. (We only want one, and the current facilities in sled-storage can only add datasets across all M.2 or U.2 devices.) When #6229 lands we'll be able to use blueprints to ensure there's (ideally only) one update artifacts dataset per sled.

@iliana iliana requested a review from davepacheco October 3, 2024 17:21
@smklein smklein self-requested a review October 3, 2024 17:29
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Manages update artifacts stored on this sled. The implementation is a very
Collaborator:

Suggested change
//! Manages update artifacts stored on this sled. The implementation is a very
//! Manages TUF artifacts stored on this sled. The implementation is a very

This is probably not the place to suggest this but in general I think it's been really confusing that "update" can refer to both "the process of deploying new software to the system" and "a software artifact [that you might deploy via the update process]". I think maybe we talked about using a term like "software artifact" or "TUF artifact" for the noun, since it exists independent of any ongoing update. Then we can reserve "update" to be a verb (or a noun referring to the process of deploying artifacts, not the artifacts themselves). What do you think?

As an example, I think it's really surprising as a reader that update_dataset_mountpoints() doesn't update anything.

}
}

pub(crate) trait StorageBackend {
Collaborator:

I wanted to suggest a comment and a new name here but I struggled a bit with what this is/does exactly.

It's not really a "backend" because it doesn't do the work of storing or fetching the data. It's more of a source of configuration. It could almost just be one function called async fn artifact_storage_paths(&self) -> Result<Vec<&Utf8Path>, Error>. This would decouple it from the DatasetsConfig stuff as well as the fact that these are mountpoints, which I think is nice, at the cost of duplicating the couple of lines of code that iterates over the datasets, filters by type, and gets the mountpoints. From this view you could call this trait ArtifactStoragePaths.

You could also view this as abstracting over "what kind of sled agent this is"; it's really just a thing that gets the current dataset config for whatever kind of sled agent you have. From that perspective, maybe this is GetDatasetsConfig? If you keep it this way, I'd suggest renaming datasets_config_list() to datasets_config.

Here's another idea and feel free to ignore it: instead of using a trait at all, have the caller of ArtifactStore::new() provide either artifact_storage_paths: tokio::sync::watch::Receiver<Vec<Utf8PathBuf>> or datasets_config: tokio::sync::watch::Receiver<DatasetsConfig> and mountpoint root (whichever of the above interpretations you like better). Personally I think this would be easier to follow but it may be a matter of preference.
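
For concreteness, a minimal sketch of the single-method version of this suggestion; the trait and method names come from the comment above, while the owned return type and the placeholder error are assumptions.

```rust
use camino::Utf8PathBuf;

/// Placeholder error type for the sketch.
#[derive(Debug)]
struct Error;

/// The single-method shape suggested above: callers only learn where artifacts
/// may be stored, with no exposure to DatasetsConfig or mountpoint details.
trait ArtifactStoragePaths {
    async fn artifact_storage_paths(&self) -> Result<Vec<Utf8PathBuf>, Error>;
}

// The trait-free alternative mentioned above would instead have
// ArtifactStore::new() take a tokio::sync::watch::Receiver<Vec<Utf8PathBuf>>
// (or a Receiver<DatasetsConfig> plus a mountpoint root).
```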

Contributor (Author):

Using tokio::sync::watch, that would imply code going into the StorageHandle to update the value in the channel whenever the datasets are written to the ledger I think? I don't have a full understanding of the sled-storage crate yet but it seems like it'd be difficult to write that in a way that ensures the Receiver and the dataset configuration written to the ledger remain in sync.

I did originally have one function but found I was duplicating the code for taking the list of datasets, filtering it by update datasets, and turning it into a set of mountpoints, for each implementation of the trait. Having two separate functions gives a clearer set of operations describing what we need from whatever datasets we have.

A different name that comes to mind is StorageInterface or DatasetsInterface.

@davepacheco (Collaborator)

Also, thank you for the detailed PR description. That made this much easier to review!

@davepacheco (Collaborator)

Oh, I'm also a little alarmed that cargo xtask ls-apis did not bail out when it found a new Progenitor client that's not listed in api-manifest.toml. I'll try to see what's going on there.

@davepacheco (Collaborator)

Well, that uncovered a bunch of problems. Most relevant here is #6828. I will try to get a fix in, but it'd be great if you could update the API manifest (dev-tools/ls-apis/api-manifest.toml) so that running cargo xtask ls-apis apis doesn't produce a warning about the new repo-depot-client package.

@iliana (Contributor, Author) commented Oct 11, 2024

While I'm updating this PR I'm going to go ahead and write the "list all the artifacts I've got" API that Nexus will also need, because I might as well do it now.

@iliana (Contributor, Author) commented Oct 24, 2024

At the office today we discussed this as part of the larger upgrade work and decided to go ahead and store the repo depot on the M.2s, so I'll transform this PR a bit to do that and also set up the datasets with the existing setup code we've got.

@iliana (Contributor, Author) commented Oct 29, 2024

Two questions I'm working through as I move to storing the depot on the M.2s:

  • There aren't any encrypted datasets on the M.2s. What's our threat model here? Today we're not reading the control plane zones from an authenticated source but it seems like that'd be a nice feature to have, and I don't think we need to start any of the zones until we've unlocked the keyshare... But we don't need ZFS encryption to get this as long as we are verifying checksums as we read from this dataset. (Or do we even need to do that?)
  • I'm changing the code to write artifacts to all (nominally 2) artifact datasets; if there's an I/O error it will ignore that dataset but continue writing to the other. Should any service be responsible for periodically trying to ensure it has two copies of every artifact? If so, should that be Nexus or Sled Agent?
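
A minimal sketch of the write behavior described in the second bullet — write to every artifact dataset, skip a dataset on I/O error, and report how many copies landed (which also mirrors the datasets/successful_writes response discussed later in this thread). Function and parameter names are illustrative, not the PR's ArtifactWriter.

```rust
use camino::Utf8PathBuf;

/// Write `data` under each artifact dataset mountpoint. An I/O error on one
/// dataset is logged and skipped; the write as a whole reports how many copies
/// were written, so the caller (ultimately Nexus) can decide whether that is
/// good enough.
fn write_to_all_datasets(
    mountpoints: &[Utf8PathBuf],
    file_name: &str,
    data: &[u8],
) -> (usize, usize) {
    let datasets = mountpoints.len();
    let mut successful_writes = 0;
    for mountpoint in mountpoints {
        match std::fs::write(mountpoint.join(file_name), data) {
            Ok(()) => successful_writes += 1,
            // The real code logs and records the error; here we just skip the dataset.
            Err(err) => eprintln!("failed to write to {mountpoint}: {err}"),
        }
    }
    (datasets, successful_writes)
}
```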

@smklein (Collaborator) commented Oct 29, 2024

  • There aren't any encrypted datasets on the M.2s. What's our threat model here? Today we're not reading the control plane zones from an authenticated source but it seems like that'd be a nice feature to have, and I don't think we need to start any of the zones until we've unlocked the keyshare... But we don't need ZFS encryption to get this as long as we are verifying checksums as we read from this dataset. (Or do we even need to do that?)

This feels a little bit like "trying to solve the end of the chain of measured boot" - we don't need the secrecy of the encrypted partition, but we do want authentication, somehow, even if we aren't baking it in right now.

My two cents: I don't think placement of these zone images prevents us from adding authentication at any point in the future. We could authenticate zones as we read them (if we use an approach more baked into the OS), or add an encrypted partition to the M.2s too.

  • I'm changing the code to write artifacts to all (nominally 2) artifact datasets; if there's an I/O error it will ignore that dataset but continue writing to the other. Should any service be responsible for periodically trying to ensure it has two copies of every artifact? If so, should that be Nexus or Sled Agent?

Is Nexus the entity that'll periodically send requests to sled agents to "please get these TUF artifacts"? If so, it seems like we could re-try placement at that time, which would avoid the need to do self-validation in the Sled Agent.

As long as we can succeed when only one of the two M.2s has the TUF repo, I think we're in a good spot.

If Nexus is re-issuing the "get all TUF artifacts" command, this would also handle the "one sled was offline when the request went out" case.

(Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)

@iliana (Contributor, Author) commented Oct 29, 2024

(Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)

The way I see it this is something only Nexus needs to know, based on information about the uploaded repos in CockroachDB. So if it's Nexus's job to maintain that each sled has 2 copies of each artifact it's supposed to have, we could change the "list artifacts" API to return the count of each artifact that was found, and Nexus could choose to re-send the request later on.

@smklein (Collaborator) commented Oct 29, 2024

(Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)

The way I see it this is something only Nexus needs to know, based on information about the uploaded repos in CockroachDB. So if it's Nexus's job to maintain that each sled has 2 copies of each artifact it's supposed to have, we could change the "list artifacts" API to return the count of each artifact that was found, and Nexus could choose to re-send the request later on.

This sounds great; it's very similar to how our "ensure zones" / "ensure datasets" APIs work in the sled agent today. We have a background task that periodically ensures they are what they should be, and calling them should be idempotent, but the goal is basically "conform to whatever Nexus thinks you should be".

@iliana iliana force-pushed the iliana/tuf-repo-depot branch from ddf375d to 58b9cbe on October 29, 2024 06:27
@iliana (Contributor, Author) commented Oct 29, 2024

I think this is in a re-reviewable state now.

//! it does not have from another Repo Depot that does have them (at Nexus's
//! direction). This API's implementation is also part of this module.
//!
//! POST, PUT, and DELETE operations are handled by the Sled Agent API.
Collaborator:

.... because these are called by Nexus, and not other sled agents, right?

Contributor (Author):

Correct, I'll update the comment to that effect.

/// errors, Nexus should still be aware of the artifacts we think we have.
pub(crate) async fn list(
&self,
) -> Result<BTreeMap<ArtifactHash, usize>, Error> {
Collaborator:

I'm a little confused by the usize value here -- is this the number of times the hash appears?

If this is a content-addressable object store, shouldn't this always be one?

(in other words, why not use a BTreeSet?)

Contributor (Author):

It is the number of times that artifact appears across all of the datasets, and we nominally have 2 datasets, so this number should generally be 2. (If it's 1, Nexus should tell this sled to try writing it again so it has two copies.)

Ok(file) => file,
Err(err) => {
if err.kind() == ErrorKind::AlreadyExists {
return Err(Error::AlreadyInProgress { sha256 });
Collaborator:

If the write gets interrupted, and dropped, presumably we'd unlink the temporary file, so "partial writes" wouldn't be a problem, correct?

On the other hand, if we only wrote to one of the M.2s, would this early exit prevent us from updating the other one? Noticing that this is a returned error, rather than a continue.

Contributor (Author):

In theory, yes, the Drop implementation on the various camino-tempfile structs will unlink that temporary file if any writes are interrupted. (Unless sled agent abruptly stops, but when it's restarted it'll remove all the temporary directories.)

If we take this branch and return an error, the files we've created so far will be dropped and unlinked (as we previously created a Utf8TempPath for them and then used that to create a NamedUtf8TempFile).

&self,
sha256: ArtifactHash,
) -> Result<ArtifactWriter, Error> {
let mut inner = Vec::new();
Collaborator:

nit: s/inner/files?

log_and_store!(last_error, &self.log, "sync", mountpoint, err);
continue;
}
if let Err(err) = temp_dir.sync_all().await {
Collaborator:

Do we actually care about the temp_dir being synced? If we reboot, we'll just clear it during startup anyway

Contributor (Author):

I suppose not!

Comment on lines 611 to 617
if any_success {
info!(
&self.log,
"Wrote artifact";
"sha256" => &self.sha256.to_string(),
);
Ok(())
Collaborator:

Seems a little odd to me that this returns "Ok()" if any files were written. E.g., if we failed to write to both M.2s, but wrote a single file, this would return "success", which seems misleading IMO.

Do you think we should return an error, if any error is returned, and return "Ok()" at a higher level if either write is successfully propagated to one of the two M.2s?

Contributor (Author):

I think you're right; I'm going to verbosely type out the two cases where I intend this API to be used based on the design in my head so far.

  1. When an operator uploads a repo, it's unpacked in Nexus-local storage and verified, then the repository is added to the database in a replicating status. That Nexus starts a saga to replicate all the artifacts across some minimum number of sleds (probably 3); once this is complete, the repository is set to available. If that particular Nexus dies and loses its storage, some other Nexus will need to fail the saga and set the repository to failed, and the operator will need to reupload the repository. (I am not sure if a saga is the right primitive here but we can talk about the larger design I have in chat somewhere.)

    During this saga Nexus probably needs sled agent to return an error in this partial write situation so that Nexus can pick another sled to try and write the artifacts to.

  2. As a background task, Nexus will periodically ask all the sleds about all the artifacts they have, and compare that to the list of artifacts that they should have based on the database. It will then submit requests to sled agents to delete superfluous artifacts or copy missing artifacts from other sleds. In this situation, it doesn't matter whether the sled agent returns an error here or not; if we successfully write to only a single M.2, Nexus will eventually see that and submit another request to rectify that during its background task.

Comment on lines 546 to 548
/// Errors in this method are considered non-fatal errors, but this method
/// will return the most recently-seen error by any method in the write
/// process.
Collaborator:

So, to clarify end-to-end behavior for the PUT operation from Nexus:

  • If we fail to write to either of the M.2s, Nexus will see an error
  • If we successfully write to one of the M.2s, we'll still see an error, but it'll finish the write to one of them?

From Nexus's perspective, it seems like we can't really distinguish between these cases through the PUT API. Do you think this matters?

My concern is mostly "do we keep operating successfully, even in a scenario where we have reduced M.2 capacity".

Collaborator:

FWIW one possible solution here would be to propagate a result back through the API, that indicates:

  • "We wrote to two M.2s successfully"
  • "We wrote to one M.2 successfully, and the other failed (here's the error)"
  • "Neither write completed successfully, here are the errors (or the most recent error)"

Then this decision is punted up to Nexus, and Nexus could decide "at least one write counts as success, but I'll log the error and keep moving".

Contributor (Author):

Nexus could call the list API to see if there was a partial success, I suppose; or maybe it's worth instead returning something like {"datasets": 2, "successful_writes": 1} as a non-4xx error?

Contributor (Author):

I think the actual answer to this question depends on the exact design of how Nexus is going to replicate artifacts across sleds. If Nexus is able to try another sled, maybe this current design is fine. But if it's a saga where the sleds are picked out in advance and there's no retry flow (this seems like a poor design) then it would be better to return OK here; at least there's a copy on this one sled.

@smklein (Collaborator) commented Oct 30, 2024

In a hypothetical world where we have all sleds operating with one M.2 - shouldn't this be able to succeed? We are operating at reduced redundancy, but we do have a copy that got successfully written.

(Agreed that we could make all PUT calls query the list API afterwards? But if that's our inclination, this also seems like it should be part of the error result)

basically, I think it's critical for Nexus to be able to distinguish between the cases of:

  • We successfully wrote to at least one M.2, and
  • Every other possible outcome

Contributor (Author):

I'll implement the {"datasets": 2, "successful_writes": 1} 200 OK response so that Nexus can make a decision with that information. I don't think it matters to return the error since sled agent is logging all I/O errors it runs into.

@smklein (Collaborator) left a comment:

PR LGTM, modulo our discussion about error results!

@smklein (Collaborator) left a comment:

New response looks great!

continue;
}

any_success = true;
successful_writes += 1;
Collaborator:

There's an expectation here that "we're writing a single file" to each M.2, right?

I think that's okay, just clarifying, because I think our expected output is:

  • datasets = 2, successful_writes = 2

And it would be a little confusing to suddenly see:

  • datasets = 2, successful_writes = 4

or something like that, if we started writing / syncing multiple files

Contributor (Author):

Yeah, I don't think you could use ArtifactWriter or any of the endpoints in a way where you're writing more than one artifact; everything is keyed by sha256.

Comment on lines 605 to 606
pub datasets: usize,
pub successful_writes: usize,
Collaborator:

This looks great as a response; I might add some docs indicating what these fields mean to the caller.

Maybe:

Suggested change
pub datasets: usize,
pub successful_writes: usize,
/// The number of valid M.2 artifact datasets we found on the sled.
/// There is typically one of these datasets for each functional M.2.
pub datasets: usize,
/// The number of valid writes to the M.2 artifact datasets. This should
/// be less than or equal to the number of artifact datasets.
pub successful_writes: usize,

@iliana iliana enabled auto-merge (squash) October 31, 2024 19:40
@iliana iliana merged commit d1ff1bb into main Oct 31, 2024
19 checks passed
@iliana iliana deleted the iliana/tuf-repo-depot branch October 31, 2024 20:29