
[repo depot 2/n] sled agent APIs to manage update artifact storage #6764

Merged: 33 commits from iliana/tuf-repo-depot merged into main on Oct 31, 2024

Conversation

@iliana (Contributor) commented Oct 3, 2024

These are the Sled Agent APIs for the TUF Repo Depot (RFD 424). The basic idea is that Nexus will manage a set of TUF repos it's aware of and can use for updates, storing the metadata in CockroachDB and the artifacts in a special dataset present on all sleds.

This implementation diverges slightly from the current determination in the RFD. The "simple replication, take two" option (which is what we're going with) suggests that Sled Agent manages objects at the TUF repo level. I think I disagree with this; Sled Agent should not know or care what a TUF repo is, as the presence of an artifact with some metadata in a TUF repo is knowledge only Nexus needs. This implementation detail doesn't matter a whole lot; from the operator's perspective, Nexus is managing things at the repo level.

There are four new endpoints for managing artifacts, which are all addressed by their sha256 hash:

  • a "PUT artifact" endpoint (which verifies the body with the provided checksum)
  • a "DELETE artifact" endpoint
  • a "copy artifact from another sled" endpoint
  • a "GET artifact" endpoint

(This smells like an object store, although it's not nearly as complex/general-purpose as the one suggested in RFD 424's "mini object store" option; part of the reason to make it content-addressable is to make its purpose bound to storing TUF repo artifacts.)
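
For orientation, here is a rough sketch of what the four endpoints above could look like as Dropshot handlers (ignoring, for the moment, the split of the GET endpoint into a separate Repo Depot API discussed below). The operation names, paths, request/response types, and the unit server context are illustrative assumptions, not the PR's actual API definition.

```rust
// Sketch only: names, paths, and types here are assumptions for illustration,
// not the PR's actual sled-agent or repo-depot API definition.
use dropshot::{
    endpoint, FreeformBody, HttpError, HttpResponseDeleted, HttpResponseOk,
    HttpResponseUpdatedNoContent, Path, RequestContext, StreamingBody, TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;

/// Artifacts are addressed by their sha256 hash.
#[derive(Deserialize, JsonSchema)]
struct ArtifactPathParam {
    sha256: String,
}

/// Hypothetical request body for the copy endpoint: which depot to pull from.
#[derive(Deserialize, JsonSchema)]
struct CopyFromDepotBody {
    depot_base_url: String,
}

/// "PUT artifact": store the streamed body, verifying it against the sha256
/// in the path before committing it.
#[endpoint { method = PUT, path = "/artifacts/{sha256}" }]
async fn artifact_put(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
    _body: StreamingBody,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    unimplemented!()
}

/// "DELETE artifact".
#[endpoint { method = DELETE, path = "/artifacts/{sha256}" }]
async fn artifact_delete(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
) -> Result<HttpResponseDeleted, HttpError> {
    unimplemented!()
}

/// "copy artifact from another sled" (later: from another Repo Depot).
#[endpoint { method = POST, path = "/artifacts/{sha256}/copy-from-depot" }]
async fn artifact_copy_from_depot(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
    _body: TypedBody<CopyFromDepotBody>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    unimplemented!()
}

/// "GET artifact": stream the artifact back to the caller.
#[endpoint { method = GET, path = "/artifacts/{sha256}" }]
async fn artifact_get(
    _rqctx: RequestContext<()>,
    _path: Path<ArtifactPathParam>,
) -> Result<HttpResponseOk<FreeformBody>, HttpError> {
    unimplemented!()
}
```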

The reason to have the get and copy endpoints is to let us avoid adding endpoints for serving artifacts from Nexus to Sled Agents, so that Nexus doesn't need to manage any local storage beyond the initial repository verification. Nexus can instead instruct Sled Agents to retrieve artifacts it is missing from other Sled Agents. However, this creates a dependency on the Sled Agent API by Sled Agent, which is something we want to try to avoid in order to reason about upgrade ordering.

After some discussion in #oxide-update we seemed to settle on creating a new Repo Depot API definition which only has the "GET artifact" endpoint. The "copy artifact" endpoint becomes "copy artifact from another Repo Depot". In this implementation, Sled Agent is still the binary that spawns the Repo Depot service, but it could be another global zone daemon (or possibly a zone that can be independently updated, if we can share the storage between the global zone and the repo depot zone). This is still a circular dependency of sorts but it's one we can more easily build tools to reason about, and the API surface is deliberately quite small.

The expected flow using these APIs is:

  • An operator uploads a TUF repo to Nexus, which unpacks and validates the repo in its local storage and puts the repository in a "pending" state
  • Nexus PUTs artifacts to some number of sleds to ensure reasonable replication, then puts the repo in "available" state
  • A Nexus background task checks the list of artifacts on each sled and compares it to the list of artifacts in the database. It tells sleds missing artifacts to get them from sleds that have them, and it tells sleds with extra artifacts to delete them. (Note the "list of artifacts" API is not yet present; I expect this will happen in the next PR.)
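
To make the third step concrete, here is a minimal sketch of the reconciliation loop such a background task might run. Everything here — the stub `SledClient`, its method names, and `reconcile` itself — is hypothetical; neither the "list artifacts" API nor the Nexus task exists at this point.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an artifact hash (the real type lives in omicron).
type ArtifactHash = [u8; 32];

/// Hypothetical stand-in for a sled-agent / repo-depot API client.
struct SledClient;

impl SledClient {
    async fn artifact_list(&self) -> BTreeSet<ArtifactHash> { todo!() }
    async fn artifact_copy_from(&self, _sha256: ArtifactHash, _peer: &str) { todo!() }
    async fn artifact_delete(&self, _sha256: ArtifactHash) { todo!() }
}

/// Make each sled's artifact set match `wanted`: copy missing artifacts from a
/// peer that has them, and delete artifacts that should not be there.
async fn reconcile(sleds: &[(String, SledClient)], wanted: &BTreeSet<ArtifactHash>) {
    // Ask every sled what it currently has.
    let mut have: Vec<BTreeSet<ArtifactHash>> = Vec::new();
    for (_, client) in sleds {
        have.push(client.artifact_list().await);
    }
    for i in 0..sleds.len() {
        let client = &sleds[i].1;
        // Missing artifacts: tell this sled to copy them from a peer that has them.
        for &sha256 in wanted.difference(&have[i]) {
            if let Some(j) = (0..sleds.len()).find(|&j| j != i && have[j].contains(&sha256)) {
                client.artifact_copy_from(sha256, &sleds[j].0).await;
            }
        }
        // Extra artifacts: tell this sled to delete them.
        for &sha256 in have[i].difference(wanted) {
            client.artifact_delete(sha256).await;
        }
    }
}
```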

The update artifacts dataset is not created by Sled Agent. (We only want one, and the current facilities in sled-storage can only add datasets across all M.2 or U.2 devices.) When #6229 lands we'll be able to use blueprints to ensure there's (ideally only) one update artifacts dataset per sled.

@iliana iliana requested a review from davepacheco October 3, 2024 17:21
@smklein smklein self-requested a review October 3, 2024 17:29
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Manages update artifacts stored on this sled. The implementation is a very
Collaborator:

Suggested change
//! Manages update artifacts stored on this sled. The implementation is a very
//! Manages TUF artifacts stored on this sled. The implementation is a very

This is probably not the place to suggest this but in general I think it's been really confusing that "update" can refer to both "the process of deploying new software to the system" and "a software artifact [that you might deploy via the update process]". I think maybe we talked about using a term like "software artifact" or "TUF artifact" for the noun, since it exists independent of any ongoing update. Then we can reserve "update" to be a verb (or a noun referring to the process of deploying artifacts, not the artifacts themselves). What do you think?

As an example, I think it's really surprising as a reader that update_dataset_mountpoints() doesn't update anything.

}
}

pub(crate) trait StorageBackend {
Collaborator:

I wanted to suggest a comment and a new name here but I struggled a bit with what this is/does exactly.

It's not really a "backend" because it doesn't do the work of storing or fetching the data. It's more of a source of configuration. It could almost just be one function called async fn artifact_storage_paths(&self) -> Result<Vec<&Utf8Path>, Error>. This would decouple it from the DatasetsConfig stuff as well as the fact that these are mountpoints, which I think is nice, at the cost of duplicating the couple of lines of code that iterates over the datasets, filters by type, and gets the mountpoints. From this view you could call this trait ArtifactStoragePaths.

You could also view this as abstracting over "what kind of sled agent this is"; it's really just a thing that gets the current dataset config for whatever kind of sled agent you have. From that perspective, maybe this is GetDatasetsConfig? If you keep it this way, I'd suggest renaming datasets_config_list() to datasets_config.

Here's another idea and feel free to ignore it: instead of using a trait at all, have the caller of ArtifactStore::new() provide either artifact_storage_paths: tokio::sync::watch::Receiver<Vec<Utf8PathBuf>> or datasets_config: tokio::sync::watch::Receiver<DatasetsConfig> and mountpoint root (whichever of the above interpretations you like better). Personally I think this would be easier to follow but it may be a matter of preference.
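
For concreteness, a minimal sketch of the single-method version of this suggestion; the trait and method names come from the comment above, while the owned return type and the placeholder error are assumptions.

```rust
use camino::Utf8PathBuf;

/// Placeholder error type for the sketch.
#[derive(Debug)]
struct Error;

/// The single-method shape suggested above: callers only learn where artifacts
/// may be stored, with no exposure to DatasetsConfig or mountpoint details.
trait ArtifactStoragePaths {
    async fn artifact_storage_paths(&self) -> Result<Vec<Utf8PathBuf>, Error>;
}

// The trait-free alternative mentioned above would instead have
// ArtifactStore::new() take a tokio::sync::watch::Receiver<Vec<Utf8PathBuf>>
// (or a Receiver<DatasetsConfig> plus a mountpoint root).
```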

Contributor (Author):

Using tokio::sync::watch, that would imply code going into the StorageHandle to update the value in the channel whenever the datasets are written to the ledger I think? I don't have a full understanding of the sled-storage crate yet but it seems like it'd be difficult to write that in a way that ensures the Receiver and the dataset configuration written to the ledger remain in sync.

I did originally have one function but found I was duplicating the code for taking the list of datasets, filtering it by update datasets, and turning it into a set of mountpoints, for each implementation of the trait. Having two separate functions gives a clearer set of operations describing what we need from whatever datasets we have.

A different name that comes to mind is StorageInterface or DatasetsInterface.

@davepacheco (Collaborator)

Also, thank you for the detailed PR description. That made this much easier to review!

@davepacheco (Collaborator)

Oh, I'm also a little alarmed that cargo xtask ls-apis did not bail out when it found a new Progenitor client that's not listed in api-manifest.toml. I'll try to see what's going on there.

@davepacheco (Collaborator)

Well, that uncovered a bunch of problems. Most relevant here is #6828. I will try to get a fix in, but it'd be great if you could update the API manifest (dev-tools/ls-apis/api-manifest.toml) so that running cargo xtask ls-apis apis doesn't produce a warning about the new repo-depot-client package.

@iliana (Contributor, Author) commented Oct 11, 2024

While I'm updating this PR I'm going to go ahead and write the "list all the artifacts I've got" API that Nexus will also need, because I might as well do it now.

@iliana (Contributor, Author) commented Oct 24, 2024

At the office today we discussed this as part of the larger upgrade work and decided to go ahead and store the repo depot on the M.2s, so I'll transform this PR a bit to do that and also set up the datasets with the existing setup code we've got.

@iliana (Contributor, Author) commented Oct 29, 2024

Two questions I'm working through as I move to storing the depot on the M.2s:

  • There aren't any encrypted datasets on the M.2s. What's our threat model here? Today we're not reading the control plane zones from an authenticated source but it seems like that'd be a nice feature to have, and I don't think we need to start any of the zones until we've unlocked the keyshare... But we don't need ZFS encryption to get this as long as we are verifying checksums as we read from this dataset. (Or do we even need to do that?)
  • I'm changing the code to write artifacts to all (nominally 2) artifact datasets; if there's an I/O error it will ignore that dataset but continue writing to the other. Should any service be responsible for periodically trying to ensure it has two copies of every artifact? If so, should that be Nexus or Sled Agent?
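
A minimal sketch of the write behavior described in the second bullet — write to every artifact dataset, skip a dataset on I/O error, and report how many copies landed (which also mirrors the datasets/successful_writes response discussed later in this thread). Function and parameter names are illustrative, not the PR's ArtifactWriter.

```rust
use camino::Utf8PathBuf;

/// Write `data` under each artifact dataset mountpoint. An I/O error on one
/// dataset is logged and skipped; the write as a whole reports how many copies
/// were written, so the caller (ultimately Nexus) can decide whether that is
/// good enough.
fn write_to_all_datasets(
    mountpoints: &[Utf8PathBuf],
    file_name: &str,
    data: &[u8],
) -> (usize, usize) {
    let datasets = mountpoints.len();
    let mut successful_writes = 0;
    for mountpoint in mountpoints {
        match std::fs::write(mountpoint.join(file_name), data) {
            Ok(()) => successful_writes += 1,
            // The real code logs and records the error; here we just skip the dataset.
            Err(err) => eprintln!("failed to write to {mountpoint}: {err}"),
        }
    }
    (datasets, successful_writes)
}
```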

@smklein (Collaborator) commented Oct 29, 2024

  • There aren't any encrypted datasets on the M.2s. What's our threat model here? Today we're not reading the control plane zones from an authenticated source but it seems like that'd be a nice feature to have, and I don't think we need to start any of the zones until we've unlocked the keyshare... But we don't need ZFS encryption to get this as long as we are verifying checksums as we read from this dataset. (Or do we even need to do that?)

This feels a little bit like "trying to solve the end of the chain of measured boot" - we don't need the secrecy of the encrypted partition, but we do want authentication, somehow, even if we aren't baking it in right now.

My two cents: I don't think placement of these zone images prevents us from adding authentication at any point in the future. We could authenticate zones as we read them (if we use an approach more baked into the OS), or add an encrypted partition to the M.2s too.

  • I'm changing the code to write artifacts to all (nominally 2) artifact datasets; if there's an I/O error it will ignore that dataset but continue writing to the other. Should any service be responsible for periodically trying to ensure it has two copies of every artifact? If so, should that be Nexus or Sled Agent?

Is Nexus the entity that'll periodically send requests to sled agents to "please get these TUF artifacts"? If so, it seems like we could re-try placement at that time, which would avoid the need to do self-validation in the Sled Agent.

As long as we can succeed when only one of the two M.2s has the TUF repo, I think we're in a good spot.

If Nexus is re-issuing the "get all TUF artifacts" command, this would also handle the "one sled was offline when the request went out" case.

(Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)

@iliana (Contributor, Author) commented Oct 29, 2024

(Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)

The way I see it this is something only Nexus needs to know, based on information about the uploaded repos in CockroachDB. So if it's Nexus's job to maintain that each sled has 2 copies of each artifact it's supposed to have, we could change the "list artifacts" API to return the count of each artifact that was found, and Nexus could choose to re-send the request later on.

@smklein (Collaborator) commented Oct 29, 2024

(Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)

The way I see it this is something only Nexus needs to know, based on information about the uploaded repos in CockroachDB. So if it's Nexus's job to maintain that each sled has 2 copies of each artifact it's supposed to have, we could change the "list artifacts" API to return the count of each artifact that was found, and Nexus could choose to re-send the request later on.

This sounds great; it's very similar to how our "ensure zones" / "ensure datasets" APIs work in the sled agent today. We have a background task that periodically ensures they are what they should be, and calling them should be idempotent, but the goal is basically "conform to whatever Nexus thinks you should be".

@iliana iliana force-pushed the iliana/tuf-repo-depot branch from ddf375d to 58b9cbe on October 29, 2024 06:27
@iliana (Contributor, Author) commented Oct 29, 2024

I think this is in a re-reviewable state now.

//! it does not have from another Repo Depot that does have them (at Nexus's
//! direction). This API's implementation is also part of this module.
//!
//! POST, PUT, and DELETE operations are handled by the Sled Agent API.
Collaborator:

.... because these are called by Nexus, and not other sled agents, right?

Contributor (Author):

Correct, I'll update the comment to that effect.

/// errors, Nexus should still be aware of the artifacts we think we have.
pub(crate) async fn list(
&self,
) -> Result<BTreeMap<ArtifactHash, usize>, Error> {
Collaborator:

I'm a little confused by the usize value here -- is this the number of times the hash appears?

If this is a content-addressable object store, shouldn't this always be one?

(in other words, why not use a BTreeSet?)

Contributor (Author):

It is the number of times that artifact appears across all of the datasets, and we nominally have 2 datasets, so this number should generally be 2. (If it's 1, Nexus should tell this sled to try writing it again so it has two copies.)

Ok(file) => file,
Err(err) => {
if err.kind() == ErrorKind::AlreadyExists {
return Err(Error::AlreadyInProgress { sha256 });
Collaborator:

If the write gets interrupted, and dropped, presumably we'd unlink the temporary file, so "partial writes" wouldn't be a problem, correct?

On the other hand, if we only wrote to one of the M.2s, would this early exit prevent us from updating the other one? Noticing that this is a returned error, rather than a continue.

Contributor (Author):

In theory, yes, the Drop implementation on the various camino-tempfile structs will unlink that temporary file if any writes are interrupted. (Unless sled agent abruptly stops, but when it's restarted it'll remove all the temporary directories.)

If we take this branch and return an error, the files we've created so far will be dropped and unlinked (as we previously created a Utf8TempPath for them and then used that to create a NamedUtf8TempFile).

&self,
sha256: ArtifactHash,
) -> Result<ArtifactWriter, Error> {
let mut inner = Vec::new();
Collaborator:

nit: s/inner/files?

log_and_store!(last_error, &self.log, "sync", mountpoint, err);
continue;
}
if let Err(err) = temp_dir.sync_all().await {
Collaborator:

Do we actually care about the temp_dir being synced? If we reboot, we'll just clear it during startup anyway

Contributor (Author):

I suppose not!

Comment on lines 611 to 617
if any_success {
info!(
&self.log,
"Wrote artifact";
"sha256" => &self.sha256.to_string(),
);
Ok(())
Collaborator:

Seems a little odd to me that this returns "Ok()" if any files were written. E.g., if we failed to write to both M.2s, but wrote a single file, this would return "success", which seems misleading IMO.

Do you think we should return an error, if any error is returned, and return "Ok()" at a higher level if either write is successfully propagated to one of the two M.2s?

Contributor (Author):

I think you're right; I'm going to verbosely type out the two cases where I intend this API to be used based on the design in my head so far.

  1. When an operator uploads a repo, it's unpacked in Nexus-local storage and verified, then the repository is added to the database in a replicating status. That Nexus starts a saga to replicate all the artifacts across some minimum number of sleds (probably 3); once this is complete, the repository is set to available. If that particular Nexus dies and loses its storage, some other Nexus will need to fail the saga and set the repository to failed, and the operator will need to reupload the repository. (I am not sure if a saga is the right primitive here but we can talk about the larger design I have in chat somewhere.)

    During this saga Nexus probably needs sled agent to return an error in this partial write situation so that Nexus can pick another sled to try and write the artifacts to.

  2. As a background task, Nexus will periodically ask all the sleds about all the artifacts they have, and compare that to the list of artifacts that they should have based on the database. It will then submit requests to sled agents to delete superfluous artifacts or copy missing artifacts from other sleds. In this situation, it doesn't matter whether the sled agent returns an error here or not; if we successfully write to only a single M.2, Nexus will eventually see that and submit another request to rectify that during its background task.

Comment on lines 546 to 548
/// Errors in this method are considered non-fatal errors, but this method
/// will return the most recently-seen error by any method in the write
/// process.
Collaborator:

So, to clarify end-to-end behavior for the PUT operation from Nexus:

  • If we fail to write to either of the M.2s, Nexus will see an error
  • If we successfully write to one of the M.2s, we'll still see an error, but it'll finish the write to one of them?

From Nexus's perspective, it seems like we can't really distinguish between these cases through the PUT API. Do you think this matters?

My concern is mostly "do we keep operating successfully, even in a scenario where we have reduced M.2 capacity".

Collaborator:

FWIW one possible solution here would be to propagate a result back through the API, that indicates:

  • "We wrote to two M.2s successfully"
  • "We wrote to one M.2 successfully, and the other failed (here's the error)"
  • "Neither write completed successfully, here are the errors (or the most recent error)"

Then this decision is punted up to Nexus, and Nexus could decide "at least one write counts as success, but I'll log the error and keep moving".

Contributor (Author):

Nexus could call the list API to see if there was a partial success, I suppose; or maybe it's worth instead returning something like {"datasets": 2, "successful_writes": 1} as a non-4xx error?

Contributor (Author):

I think the actual answer to this question depends on the exact design of how Nexus is going to replicate artifacts across sleds. If Nexus is able to try another sled, maybe this current design is fine. But if it's a saga where the sleds are picked out in advance and there's no retry flow (this seems like a poor design) then it would be better to return OK here; at least there's a copy on this one sled.

@smklein (Collaborator) commented Oct 30, 2024

In a hypothetical world where we have all sleds operating with one M.2 - shouldn't this be able to succeed? We are operating at reduced redundancy, but we do have a copy that got successfully written.

(Agreed that we could make all PUT calls query the list API afterwards? But if that's our inclination, this also seems like it should be part of the error result)

basically, I think it's critical for Nexus to be able to distinguish between the cases of:

  • We successfully wrote to at least one M.2, and
  • Every other possible outcome

Contributor (Author):

I'll implement the {"datasets": 2, "successful_writes": 1} 200 OK response so that Nexus can make a decision with that information. I don't think it matters to return the error since sled agent is logging all I/O errors it runs into.

@smklein (Collaborator) left a comment:

PR LGTM, modulo our discussion about error results!

@smklein (Collaborator) left a comment:

New response looks great!

continue;
}

any_success = true;
successful_writes += 1;
Collaborator:

There's an expectation here that "we're writing a single file" to each M.2, right?

I think that's okay, just clarifying, because I think our expected output is:

  • datasets = 2, successful_writes = 2

And it would be a little confusing to suddenly see:

  • datasets = 2, successful_writes = 4

or something like that, if we started writing / syncing multiple files

Contributor (Author):

Yeah, I don't think you could use ArtifactWriter or any of the endpoints in a way where you're writing more than one artifact; everything is keyed by sha256.

Comment on lines 605 to 606
pub datasets: usize,
pub successful_writes: usize,
Collaborator:

This looks great as a response; I might add some docs indicating what these fields mean to the caller.

Maybe:

Suggested change
pub datasets: usize,
pub successful_writes: usize,
/// The number of valid M.2 artifact datasets we found on the sled.
/// There is typically one of these datasets for each functional M.2.
pub datasets: usize,
/// The number of valid writes to the M.2 artifact datasets. This should
/// be less than or equal to the number of artifact datasets.
pub successful_writes: usize,

@iliana iliana enabled auto-merge (squash) October 31, 2024 19:40
@iliana iliana merged commit d1ff1bb into main Oct 31, 2024
19 checks passed
@iliana iliana deleted the iliana/tuf-repo-depot branch October 31, 2024 20:29