[repo depot 2/n] sled agent APIs to manage update artifact storage #6764
Conversation
sled-agent/src/artifact_store.rs
Outdated
```rust
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Manages update artifacts stored on this sled. The implementation is a very
```
Suggested change:
```diff
-//! Manages update artifacts stored on this sled. The implementation is a very
+//! Manages TUF artifacts stored on this sled. The implementation is a very
```
This is probably not the place to suggest this, but in general I think it's been really confusing that "update" can refer to both "the process of deploying new software to the system" and "a software artifact [that you might deploy via the update process]". I think maybe we talked about using a term like "software artifact" or "TUF artifact" for the noun, since it exists independent of any ongoing update. Then we can reserve "update" to be a verb (or a noun referring to the process of deploying artifacts, not the artifacts themselves). What do you think?

As an example, I think it's really surprising as a reader that `update_dataset_mountpoints()` doesn't update anything.
sled-agent/src/artifact_store.rs
Outdated
```rust
    }
}

pub(crate) trait StorageBackend {
```
I wanted to suggest a comment and a new name here, but I struggled a bit with what this is/does exactly.

It's not really a "backend" because it doesn't do the work of storing or fetching the data. It's more of a source of configuration. It could almost just be one function called `async fn artifact_storage_paths(&self) -> Result<Vec<&Utf8Path>, Error>`. This would decouple it from the `DatasetsConfig` stuff as well as the fact that these are mountpoints, which I think is nice, at the cost of duplicating the couple of lines of code that iterate over the datasets, filter by type, and get the mountpoints. From this view you could call this trait `ArtifactStoragePaths`.

You could also view this as abstracting over "what kind of sled agent this is"; it's really just a thing that gets the current dataset config for whatever kind of sled agent you have. From that perspective, maybe this is `GetDatasetsConfig`? If you keep it this way, I'd suggest renaming `datasets_config_list()` to `datasets_config()`.

Here's another idea, and feel free to ignore it: instead of using a trait at all, have the caller of `ArtifactStore::new()` provide either `artifact_storage_paths: tokio::sync::watch::Receiver<Vec<Utf8PathBuf>>` or `datasets_config: tokio::sync::watch::Receiver<DatasetsConfig>` plus a mountpoint root (whichever of the above interpretations you like better). Personally I think this would be easier to follow, but it may be a matter of preference.
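To make the one-function idea concrete, here is a minimal sketch of that shape; the trait name, error type, and use of a native async trait method are illustrative assumptions, not code from this PR:

```rust
use camino::Utf8PathBuf;

/// Stand-in for the artifact store's real error type.
#[derive(Debug)]
pub struct Error;

/// Hypothetical one-method trait: the implementor only reports where
/// artifact storage lives, hiding `DatasetsConfig` and the fact that the
/// returned paths are dataset mountpoints.
pub(crate) trait ArtifactStoragePaths {
    async fn artifact_storage_paths(&self) -> Result<Vec<Utf8PathBuf>, Error>;
}
```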
Using `tokio::sync::watch`, that would imply code going into the `StorageHandle` to update the value in the channel whenever the datasets are written to the ledger, I think? I don't have a full understanding of the sled-storage crate yet, but it seems like it'd be difficult to write that in a way that ensures the Receiver and the dataset configuration written to the ledger remain in sync.

I did originally have one function, but found I was duplicating the code for taking the list of datasets, filtering it by `update` datasets, and turning it into a set of mountpoints, for each implementation of the trait. The two separate functions are a clearer set of operations describing what we need from whatever datasets we have.

A different name that comes to mind is `StorageInterface` or `DatasetsInterface`.
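For contrast, a rough sketch of the two-operation shape described here; the trait and method names are assumed, and the sled-storage types are stubbed out for illustration:

```rust
use camino::Utf8PathBuf;

/// Stubs standing in for the real sled-storage types; fields elided.
pub struct DatasetsConfig;
pub struct DatasetConfig;
#[derive(Debug)]
pub struct Error;

/// Hypothetical two-method trait: one operation fetches whatever dataset
/// configuration this kind of sled agent has; the other resolves a single
/// dataset to its mountpoint. The caller filters for `update` datasets.
pub(crate) trait DatasetsInterface {
    async fn datasets_config(&self) -> Result<DatasetsConfig, Error>;
    fn mountpoint(&self, dataset: &DatasetConfig) -> Utf8PathBuf;
}
```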
Also, thank you for the detailed PR description. That made this much easier to review!
Oh, I'm also a little alarmed that …
Well, that uncovered a bunch of problems. Most relevant here is #6828. I will try to get a fix in, but it'd be great if you could update the API manifest (dev-tools/ls-apis/api-manifest.toml) so that running …
While I'm updating this PR, I'm going to go ahead and write the "list all the artifacts I've got" API that Nexus will also need, because I might as well do it now.
At the office today we discussed this as part of the larger upgrade work and decided to go ahead and store the repo depot on the M.2s, so I'll transform this PR a bit to do that and also set up the datasets with the existing setup code we've got.
Two questions I'm working through as I move to storing the depot on the M.2s:
This feels a little bit like "trying to solve the end of the chain of measured boot" - we don't need the secrecy of the encrypted partition, but we do want authentication, somehow, even if we aren't baking it in right now. My two cents: I don't think placement of these zone images prevents us from adding authentication at any point in the future. We could authenticate zones as we read them (if we use an approach more baked into the OS), or add an encrypted partition to the M.2s too.
Is Nexus the entity that'll periodically send requests to sled agents to "please get these TUF artifacts"? If so, it seems like we could re-try placement at that time, which would avoid the need to do self-validation in the Sled Agent. As long as we can succeed when only one of the two M.2s has the TUF repo, I think we're in a good spot. If Nexus is re-issuing the "get all TUF artifacts" command, this would also handle the "one sled was offline when the request went out" case. (Presumably, the set of "TUF artifacts I should have" would also be stored to storage somewhere?)
The way I see it, this is something only Nexus needs to know, based on information about the uploaded repos in CockroachDB. So if it's Nexus's job to maintain that each sled has 2 copies of each artifact it's supposed to have, we could change the "list artifacts" API to return the count of each artifact that was found, and Nexus could choose to re-send the request later on.
This sounds great; it's very similar to how our "ensure zones" / "ensure datasets" APIs work in the sled agent today. We have a background task that periodically ensures they are what they should be, and calling them should be idempotent, but the goal is basically "conform to whatever Nexus thinks you should be".
Force-pushed from ddf375d to 58b9cbe.
I think this is in a re-reviewable state now.
sled-agent/src/artifact_store.rs
Outdated
```rust
//! it does not have from another Repo Depot that does have them (at Nexus's
//! direction). This API's implementation is also part of this module.
//!
//! POST, PUT, and DELETE operations are handled by the Sled Agent API.
```
... because these are called by Nexus, and not other sled agents, right?
Correct, I'll update the comment to that effect.
```rust
/// errors, Nexus should still be aware of the artifacts we think we have.
pub(crate) async fn list(
    &self,
) -> Result<BTreeMap<ArtifactHash, usize>, Error> {
```
I'm a little confused by the `usize` value here -- is this the number of times the hash appears? If this is a content-addressable object store, shouldn't this always be one? (In other words, why not use a `BTreeSet`?)
It is the number of times that artifact appears across all of the datasets, and we nominally have 2 datasets, so this number should generally be 2. (If it's 1, Nexus should tell this sled to try writing it again so it has two copies.)
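As a sketch of how a caller might act on these counts, assuming a nominal two datasets per sled (the constant, function, and hash stand-in below are hypothetical):

```rust
use std::collections::BTreeMap;

/// Stand-in for omicron's real ArtifactHash newtype.
type ArtifactHash = [u8; 32];

/// Nominal number of update datasets, one per M.2.
const EXPECTED_COPIES: usize = 2;

/// Artifacts present on this sled with fewer copies than expected, i.e.
/// candidates for Nexus to re-send a write request for.
fn under_replicated(list: &BTreeMap<ArtifactHash, usize>) -> Vec<ArtifactHash> {
    list.iter()
        .filter(|&(_, &count)| count < EXPECTED_COPIES)
        .map(|(hash, _)| *hash)
        .collect()
}
```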
```rust
    Ok(file) => file,
    Err(err) => {
        if err.kind() == ErrorKind::AlreadyExists {
            return Err(Error::AlreadyInProgress { sha256 });
```
If the write gets interrupted and dropped, presumably we'd unlink the temporary file, so "partial writes" wouldn't be a problem, correct?

On the other hand, if we only wrote to one of the M.2s, would this early exit prevent us from updating the other one? I notice that this is a returned error, rather than a `continue`.
In theory, yes, the `Drop` implementation on the various camino-tempfile structs will unlink that temporary file if any writes are interrupted. (Unless sled agent abruptly stops, but when it's restarted it'll remove all the temporary directories.)

If we take this branch and return an error, the files we've created so far will be dropped and unlinked (as we previously created a `Utf8TempPath` for them and then used that to create a `NamedUtf8TempFile`).
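A minimal sketch of that Drop-based cleanup, under the assumption that camino-tempfile's `NamedUtf8TempFile` behaves like tempfile's `NamedTempFile` (the function and file layout here are illustrative):

```rust
use camino::Utf8Path;
use camino_tempfile::NamedUtf8TempFile;
use std::io::Write;

/// Hypothetical single-dataset write path: if anything fails before
/// `persist`, dropping `temp` unlinks the temporary file, so no partial
/// artifact is left visible under its final name.
fn write_one_copy(dir: &Utf8Path, name: &str, bytes: &[u8]) -> std::io::Result<()> {
    let mut temp = NamedUtf8TempFile::new_in(dir)?;
    temp.write_all(bytes)?;
    temp.as_file().sync_all()?;
    // Rename into place; only now does the artifact appear at its final path.
    temp.persist(dir.join(name)).map_err(|e| e.error)?;
    Ok(())
}
```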
sled-agent/src/artifact_store.rs
Outdated
```rust
    &self,
    sha256: ArtifactHash,
) -> Result<ArtifactWriter, Error> {
    let mut inner = Vec::new();
```
nit: s/inner/files?
sled-agent/src/artifact_store.rs
Outdated
```rust
    log_and_store!(last_error, &self.log, "sync", mountpoint, err);
    continue;
}
if let Err(err) = temp_dir.sync_all().await {
```
Do we actually care about the `temp_dir` being synced? If we reboot, we'll just clear it during startup anyway.
I suppose not!
sled-agent/src/artifact_store.rs
Outdated
```rust
if any_success {
    info!(
        &self.log,
        "Wrote artifact";
        "sha256" => &self.sha256.to_string(),
    );
    Ok(())
```
Seems a little odd to me that this returns `Ok(())` if any files were written. E.g., if we failed to write to one of the M.2s but wrote a single file to the other, this would return "success", which seems misleading IMO.

Do you think we should return an error if any error is returned, and return `Ok(())` at a higher level if either write is successfully propagated to one of the two M.2s?
I think you're right; I'm going to verbosely type out the two cases where I intend this API to be used based on the design in my head so far.
- When an operator uploads a repo, it's unpacked in Nexus-local storage and verified, then the repository is added to the database in a `replicating` status. That Nexus starts a saga to replicate all the artifacts across some minimum number of sleds (probably 3); once this is complete, the repository is set to `available`. If that particular Nexus dies and loses its storage, some other Nexus will need to fail the saga and set the repository to `failed`, and the operator will need to re-upload the repository. (I am not sure if a saga is the right primitive here, but we can talk about the larger design I have in chat somewhere.) During this saga, Nexus probably needs sled agent to return an error in this partial-write situation so that Nexus can pick another sled to try to write the artifacts to.
- As a background task, Nexus will periodically ask all the sleds about all the artifacts they have, and compare that to the list of artifacts that they should have based on the database. It will then submit requests to sled agents to delete superfluous artifacts or copy missing artifacts from other sleds (see the sketch below). In this situation, it doesn't matter whether the sled agent returns an error here or not; if we successfully write to only a single M.2, Nexus will eventually see that and submit another request to rectify it during its background task.
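The comparison in that background task might look like the following sketch (names, types, and the hash stand-in are assumptions):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for omicron's real ArtifactHash newtype.
type ArtifactHash = [u8; 32];

/// Given the artifacts a sled should have (per CockroachDB) and what it
/// reported via the list API, compute what to copy from other sleds and
/// what to delete.
fn reconcile(
    should_have: &BTreeSet<ArtifactHash>,
    has: &BTreeMap<ArtifactHash, usize>,
) -> (Vec<ArtifactHash>, Vec<ArtifactHash>) {
    let missing = should_have
        .iter()
        .filter(|hash| !has.contains_key(*hash))
        .copied()
        .collect();
    let superfluous = has
        .keys()
        .filter(|hash| !should_have.contains(*hash))
        .copied()
        .collect();
    (missing, superfluous)
}
```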
sled-agent/src/artifact_store.rs
Outdated
```rust
/// Errors in this method are considered non-fatal errors, but this method
/// will return the most recently-seen error by any method in the write
/// process.
```
So, to clarify end-to-end behavior for the `PUT` operation from Nexus:

- If we fail to write to either of the M.2s, Nexus will see an error
- If we successfully write to one of the M.2s, we'll still see an error, but it'll finish the write to one of them?

From Nexus's perspective, it seems like we can't really distinguish between these cases through the PUT API. Do you think this matters?

My concern is mostly "do we keep operating successfully, even in a scenario where we have reduced M.2 capacity".
FWIW one possible solution here would be to propagate a result back through the API that indicates:

- "We wrote to two M.2s successfully"
- "We wrote to one M.2 successfully, and the other failed (here's the error)"
- "Neither write completed successfully, here are the errors (or the most recent error)"

Then this decision is punted up to Nexus, and Nexus could decide "at least one write counts as success, but I'll log the error and keep moving".
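Modeled as a type, that three-way result could look something like this hypothetical enum (names assumed):

```rust
use std::io;

/// Hypothetical three-way PUT result, mirroring the cases above.
enum ArtifactPutOutcome {
    /// Both M.2 writes succeeded.
    BothWritten,
    /// One write succeeded; the other failed with this error.
    OneWritten { other: io::Error },
    /// Neither write completed; the most recently seen error.
    NoneWritten { last_error: io::Error },
}
```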
Nexus could call the `list` API to see if there was a partial success, I suppose; or maybe it's worth instead returning something like `{"datasets": 2, "successful_writes": 1}` as a non-4xx response?
I think the actual answer to this question depends on the exact design of how Nexus is going to replicate artifacts across sleds. If Nexus is able to try another sled, maybe this current design is fine. But if it's a saga where the sleds are picked out in advance and there's no retry flow (this seems like a poor design) then it would be better to return OK here; at least there's a copy on this one sled.
In a hypothetical world where all sleds are operating with one M.2, shouldn't this be able to succeed? We are operating at reduced redundancy, but we do have a copy that got successfully written.

(Agreed that we could make all `PUT` calls query the list API afterwards? But if that's our inclination, this also seems like it should be part of the error result.)

Basically, I think it's critical for Nexus to be able to distinguish between the cases of:

- We successfully wrote to at least one M.2, and
- Every other possible outcome
I'll implement the `{"datasets": 2, "successful_writes": 1}` 200 OK response so that Nexus can make a decision with that information. I don't think it matters to return the error, since sled agent is logging all I/O errors it runs into.
PR LGTM, modulo our discussion about error results!
New response looks great!
```diff
     continue;
 }

-any_success = true;
+successful_writes += 1;
```
There's an expectation here that "we're writing a single file" to each M.2, right?

I think that's okay, just clarifying, because I think our expected output is:

- datasets = 2, successful_writes = 2

And it would be a little confusing to suddenly see:

- datasets = 2, successful_writes = 4

or something like that, if we started writing / syncing multiple files.
Yeah, I don't think you could use `ArtifactWriter` or any of the endpoints in a way where you're writing more than one artifact; everything is keyed by `sha256`.
sled-agent/api/src/lib.rs
Outdated
```rust
pub datasets: usize,
pub successful_writes: usize,
```
This looks great as a response; I might add some docs indicating what these fields mean to the caller.
Maybe:
Suggested change:
```diff
-pub datasets: usize,
-pub successful_writes: usize,
+/// The number of valid M.2 artifact datasets we found on the sled.
+/// There is typically one of these datasets for each functional M.2.
+pub datasets: usize,
+/// The number of valid writes to the M.2 artifact datasets. This should
+/// be less than or equal to the number of artifact datasets.
+pub successful_writes: usize,
```
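To illustrate how a caller might act on those two fields, here is a hypothetical Nexus-side check (the struct mirrors the suggestion above; the helpers and their policy are assumptions):

```rust
/// Mirror of the suggested response type.
pub struct ArtifactPutResponse {
    pub datasets: usize,
    pub successful_writes: usize,
}

/// Hypothetical caller-side policy: at least one durable copy counts as
/// success; anything short of one write per dataset is left to the
/// reconciling background task to repair.
fn put_succeeded(resp: &ArtifactPutResponse) -> bool {
    resp.successful_writes >= 1
}

fn fully_replicated(resp: &ArtifactPutResponse) -> bool {
    resp.successful_writes == resp.datasets
}
```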
These are the Sled Agent APIs for the TUF Repo Depot (RFD 424). The basic idea is that Nexus will manage a set of TUF repos it's aware of and can use for updates, storing the metadata in CockroachDB and the artifacts in a special dataset present on all sleds.
This implementation diverges slightly from the current determination in the RFD. The "simple replication, take two" option (which is what we're going with) suggests that Sled Agent manages objects at the TUF repo level. I think I disagree with this; Sled Agent should not know or care what a TUF repo is, as the presence of an artifact with some metadata in a TUF repo is knowledge only Nexus needs. This implementation detail doesn't matter a whole lot; from the operator's perspective, Nexus is managing things at the repo level.
There are four new endpoints for managing artifacts, all addressed by their sha256 hash.
(This smells like an object store, although it's not nearly as complex/general-purpose as the one suggested in RFD 424's "mini object store" option; part of the reason to make it content-addressable is to make its purpose bound to storing TUF repo artifacts.)
The reason to have the get and copy endpoints is to let us avoid adding endpoints for serving artifacts from Nexus to Sled Agents, so that Nexus doesn't need to manage any local storage beyond the initial repository verification. Nexus can instead instruct Sled Agents to retrieve artifacts it is missing from other Sled Agents. However, this creates a dependency on the Sled Agent API by Sled Agent, which is something we want to try to avoid in order to reason about upgrade ordering.
After some discussion in #oxide-update we seemed to settle on creating a new Repo Depot API definition which only has the "GET artifact" endpoint. The "copy artifact" endpoint becomes "copy artifact from another Repo Depot". In this implementation, Sled Agent is still the binary that spawns the Repo Depot service, but it could be another global zone daemon (or possibly a zone that can be independently updated, if we can share the storage between the global zone and the repo depot zone). This is still a circular dependency of sorts, but it's one we can more easily build tools to reason about, and the API surface is deliberately quite small.
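As a sketch, the Repo Depot API's single endpoint might look something like this in dropshot; the path, names, and types here are assumptions rather than the PR's actual definitions:

```rust
use dropshot::{
    endpoint, FreeformBody, HttpError, HttpResponseOk, Path, RequestContext,
};
use schemars::JsonSchema;
use serde::Deserialize;

#[derive(Deserialize, JsonSchema)]
struct ArtifactPathParams {
    /// Hex-encoded sha256; the real API would use a typed artifact hash.
    sha256: String,
}

/// GET an artifact by its content hash. Sled agents call this on each
/// other's Repo Depots (at Nexus's direction) to fetch missing artifacts.
#[endpoint {
    method = GET,
    path = "/artifact/sha256/{sha256}",
}]
async fn repo_depot_artifact_get(
    rqctx: RequestContext<()>,
    path: Path<ArtifactPathParams>,
) -> Result<HttpResponseOk<FreeformBody>, HttpError> {
    // Locate the file under one of the update datasets and stream it back;
    // elided in this sketch.
    let _ = (rqctx, path);
    unimplemented!()
}
```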
The expected flow using these APIs is:
The update artifacts dataset is not created by Sled Agent. (We only want one, and the current facilities in sled-storage can only add datasets across all M.2 or U.2 devices.) When #6229 lands we'll be able to use blueprints to ensure there's (ideally only) one update artifacts dataset per sled.