[sled-agent] Store service configuration data in duplicate in M.2s #2972
Conversation
## Before this PR

Running on rack2 and calling `omicron-package uninstall` would involve a fatal termination of the connection, as it would delete the `cxgbe0/ll` and `cxgbe1/ll` IP addresses necessary for contacting the sled.

## After this PR

Those addresses are left alone. This is pretty useful for development, as it allows us to run `uninstall` to cleanly wipe a Gimlet, preparing it for future "clean installs".
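As a rough illustration of the new behavior (the non-`cxgbe` addrobj name below is a hypothetical example, and this is a sketch rather than the actual uninstall code), the cleanup can be thought of as filtering out the underlay link-local addresses before deletion:

```rust
// Sketch: decide which addrobjs to delete during `uninstall`, keeping the
// link-local addresses on the uplink NICs so the sled stays reachable.
fn should_delete(addrobj: &str) -> bool {
    // Keep the addresses needed to contact the sled (names from the PR).
    !matches!(addrobj, "cxgbe0/ll" | "cxgbe1/ll")
}

fn main() {
    // "oxControlService0/omicron" is an illustrative stand-in, not a real name.
    let addrs = ["cxgbe0/ll", "cxgbe1/ll", "oxControlService0/omicron"];
    let to_delete: Vec<_> = addrs.iter().filter(|a| should_delete(a)).collect();
    assert_eq!(to_delete, vec![&"oxControlService0/omicron"]);
    println!("deleting: {:?}", to_delete);
}
```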
@smklein Nice work! I like the approach taken very much.
Unfortunately, I have one somewhat major concern. I'm not sure how practical a concern it is, as it depends on hardware behavior. There is the possibility of data loss for a ledger in the following scenario:

- Generation 1 is written to both M.2s (`A` and `B`)
- Generation 2 is written only to `A`; `B` fails to write
- Sled-agent reboots
- `Ledger::new` reads from `B`, but fails to read from `A`
- The ledger now points, incorrectly, to Generation 1.
At this point we have data loss, but things can get even more confusing, depending upon the failure modes of the M.2s. Let's say we continue with the following steps:
- Generation 2 (with different data) is written to `A`.
- Sled-agent reboots
- `Ledger::new` reads both versions at generation 2 and picks `A`, given its current logic, as long as `A` is first in the path list:

```rust
// Return the ledger with the highest generation number.
let ledger = ledgers.into_iter().reduce(|prior, ledger| {
    if ledger.is_newer_than(&prior) {
        ledger
    } else {
        prior
    }
});
```
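To make the tie-breaking concrete, here is a minimal self-contained sketch (with a hypothetical `Ledger` shape, not the real sled-agent type) showing that when generations are equal, `reduce` keeps `prior`, so whichever ledger appears first in the path list wins:

```rust
// Hypothetical, simplified ledger: just a path and a generation number.
#[derive(Debug, Clone)]
struct Ledger {
    path: &'static str,
    generation: u64,
}

impl Ledger {
    fn is_newer_than(&self, other: &Ledger) -> bool {
        self.generation > other.generation
    }
}

// The selection logic quoted above, extracted into a function.
fn pick(ledgers: Vec<Ledger>) -> Option<Ledger> {
    // Return the ledger with the highest generation number.
    ledgers.into_iter().reduce(|prior, ledger| {
        if ledger.is_newer_than(&prior) { ledger } else { prior }
    })
}

fn main() {
    let a = Ledger { path: "A", generation: 2 };
    let b = Ledger { path: "B", generation: 2 };
    // Equal generations: `is_newer_than` is false, so `prior` (the first
    // ledger in the path list) always wins the tie.
    assert_eq!(pick(vec![a.clone(), b.clone()]).unwrap().path, "A");
    assert_eq!(pick(vec![b, a]).unwrap().path, "B");
    println!("tie goes to the first path");
}
```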
This problem is baked into the fact that you can't do consensus with only 2 nodes if those nodes can fail in arbitrary ways. Whether the M.2s can fail in arbitrary ways is unknown to me, but I'd really like to preclude the possibility of doing the wrong thing without relying on hardware behavior if at all possible. I can see two possible ways of going about resolving this issue:
- If we ever fail to read or write from an M.2 we refuse to ever bring that M.2 back online. In short we tolerate the failure by making it permanent.
- We do not bump ledger generation numbers in sled-agent, but instead bump them in Nexus, along with saving the latest ledger configuration in Nexus. We then would know if we read back stale data, which we could go ahead and rewrite based on what was put in CockroachDB via Nexus.
It's unclear to me if either of these is actually feasible, as presumably the zones must come up in order to be able to talk to Nexus in the first place. However, maybe updates can go to Nexus as in option 2.
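The first option could be sketched roughly as follows; this is an illustrative quarantine scheme with invented type names, not code from this PR:

```rust
use std::collections::HashSet;

// Track ledger storage paths, permanently quarantining any device that
// ever fails a read or write ("tolerate the failure by making it permanent").
struct LedgerPaths {
    paths: Vec<String>,
    quarantined: HashSet<String>, // devices we will never touch again
}

impl LedgerPaths {
    fn new(paths: Vec<String>) -> Self {
        Self { paths, quarantined: HashSet::new() }
    }

    /// Record an I/O failure: the device is taken offline for good.
    fn mark_failed(&mut self, path: &str) {
        self.quarantined.insert(path.to_string());
    }

    /// Only paths that have never failed are eligible for reads/writes.
    fn usable(&self) -> Vec<&str> {
        self.paths
            .iter()
            .filter(|p| !self.quarantined.contains(p.as_str()))
            .map(|p| p.as_str())
            .collect()
    }
}

fn main() {
    let mut m2s = LedgerPaths::new(vec!["/a".into(), "/b".into()]);
    m2s.mark_failed("/b");
    // After one failure, /b is never read from or written to again.
    assert_eq!(m2s.usable(), vec!["/a"]);
    println!("usable paths: {:?}", m2s.usable());
}
```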
CC @rmustacc
So, first of all, I totally agree with you about this mismatch being possible. In a longer-term plan, I would like for Nexus to be able to send the request for services / datasets to the sled as: "Here are all the services you should run, with a generation number" (I've updated #732 to include this implementation detail) Such an API means that this ledger becomes a cache of data that's stored in CRDB, and which can get updated when Nexus comes online.
I 100% think this is feasible, and is something we should do. It's easier for "non-dataset services" than "dataset services" due to the existing shape of the internal API, but I think both can be vectorized + generation'd. I've created #2977 and #2978 -- sub-issues of #732 -- for us to track.
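The generation-number handshake described above could look roughly like this; all type and method names are hypothetical stand-ins for the real omicron API, not its actual shape:

```rust
// Sketch: Nexus sends the complete set of services with a generation number;
// the sled-agent's ledger is just a cache, and only strictly newer requests
// are accepted, so a stale ledger read back after a failure gets overwritten
// rather than trusted.

#[derive(Debug)]
struct ServiceEnsureRequest {
    generation: u64,
    services: Vec<String>, // stand-in for real service descriptions
}

struct SledAgent {
    cached_generation: u64,
    cached_services: Vec<String>,
}

impl SledAgent {
    /// Accept a request only if it is newer than what we have cached;
    /// stale requests (e.g. replayed from an old ledger) are rejected.
    fn services_ensure(&mut self, req: ServiceEnsureRequest) -> bool {
        if req.generation > self.cached_generation {
            self.cached_generation = req.generation;
            self.cached_services = req.services;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut agent = SledAgent { cached_generation: 1, cached_services: vec![] };
    // A newer generation is applied...
    assert!(agent.services_ensure(ServiceEnsureRequest {
        generation: 2,
        services: vec!["oximeter".to_string()],
    }));
    // ...but a stale replay at generation 1 is rejected.
    assert!(!agent.services_ensure(ServiceEnsureRequest {
        generation: 1,
        services: vec!["old".to_string()],
    }));
    println!("generation = {}", agent.cached_generation);
}
```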
Awesome! Thank you!
Based on the understanding that Nexus will issue updates for services when online, with the Ledger acting as a cache, I think we should go ahead and merge this once the test bug is fixed.
- `Ledger` structure which makes it easy to write toml-serializable data to and from M.2s.
- `Ledger` structure to store all service configuration information in duplicate on the M.2s.

Fixes #2969