[sled-agent] Store service configuration data in duplicate in M.2s #2972

Merged
smklein merged 31 commits into main from service-ledger
May 2, 2023

Conversation

smklein
Collaborator

@smklein smklein commented May 1, 2023

  • Creates a Ledger structure which makes it easy to write TOML-serializable data to the M.2s and read it back.
  • Uses this Ledger structure to store all service configuration information in duplicate on the M.2s.
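
For readers who have not opened the diff, here is a rough sketch of the idea, not the code from this PR. It assumes a hypothetical `Ledger` that holds a generation number and an already-serialized payload, and mirrors it to every configured directory. The `services.toml` file name and the "succeed if at least one copy lands" policy are assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical stand-in for the PR's Ledger: a generation number plus an
/// already-serialized payload, mirrored to every configured directory.
pub struct Ledger {
    pub generation: u64,
    pub payload: String,
    pub dirs: Vec<PathBuf>,
}

impl Ledger {
    pub fn new(dirs: Vec<PathBuf>) -> Self {
        Ledger { generation: 0, payload: String::new(), dirs }
    }

    /// Bumps the generation and writes the payload to each directory.
    /// Assumed policy: succeed if at least one copy lands, and report
    /// how many copies were written.
    pub fn commit(&mut self, payload: &str) -> io::Result<usize> {
        let next = self.generation + 1;
        let contents = format!("generation = {}\n{}", next, payload);
        let mut written = 0;
        let mut last_err = None;
        for dir in &self.dirs {
            let result = fs::create_dir_all(dir)
                .and_then(|_| fs::write(dir.join("services.toml"), &contents));
            match result {
                Ok(()) => written += 1,
                Err(e) => last_err = Some(e),
            }
        }
        if written == 0 {
            // Either every write failed or no directories were configured.
            return Err(last_err.unwrap_or_else(|| {
                io::Error::new(io::ErrorKind::Other, "no ledger directories configured")
            }));
        }
        self.generation = next;
        self.payload = payload.to_string();
        Ok(written)
    }
}
```

With two directories, one per M.2, a successful `commit` leaves an identical file in each, which is the "in duplicate" part of the title.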

Fixes #2969

@smklein smklein requested a review from andrewjstone May 1, 2023 17:00
@smklein
Collaborator Author

smklein commented May 1, 2023

FWIW, I've tested this on BRM42220026 in rack2, and the configs aren't in /var/oxide anymore.

I am seeing them in /pool/int/0d8f680f-0907-4170-822b-5c49d43a7660/config/ and /pool/int/2acb2cf2-ed09-4009-b4ce-3b651552e166/config/ now.

Contributor

@andrewjstone andrewjstone left a comment


@smklein Nice work! I like the approach taken very much.

Unfortunately, I have one somewhat major concern. I'm not sure how practical a concern it is, as it depends on hardware behavior. There is the possibility of data loss for a ledger in the following scenario:

  1. Generation 1 is written to both M.2s (A and B)
  2. Generation 2 is written only to A; B fails to write
  3. Sled-agent reboots
  4. Ledger::new reads from B, but fails to read from A
  5. The ledger now points, incorrectly, to Generation 1.
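
The five steps above can be replayed against a toy in-memory model. This is only a sketch: `Disk`, `Stored`, `write_all`, and `load` are hypothetical names, and the `readable` / `writable` flags stand in for whatever transient faults a real M.2 might exhibit.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Stored {
    pub generation: u64,
    pub data: &'static str,
}

/// A simulated M.2. `contents` is what is on disk; the two flags model
/// transient read and write faults.
pub struct Disk {
    pub contents: Option<Stored>,
    pub readable: bool,
    pub writable: bool,
}

impl Disk {
    pub fn healthy() -> Self {
        Disk { contents: None, readable: true, writable: true }
    }

    fn write(&mut self, s: &Stored) -> bool {
        if self.writable {
            self.contents = Some(s.clone());
        }
        self.writable
    }

    fn read(&self) -> Option<Stored> {
        if self.readable { self.contents.clone() } else { None }
    }
}

/// Writes to every disk, tolerating individual failures.
pub fn write_all(disks: &mut [Disk], s: &Stored) {
    for d in disks.iter_mut() {
        d.write(s);
    }
}

/// Picks the highest generation among the copies that can be read.
pub fn load(disks: &[Disk]) -> Option<Stored> {
    disks.iter().filter_map(Disk::read).max_by_key(|s| s.generation)
}

/// Replays the scenario with disks[0] as A and disks[1] as B.
pub fn replay_scenario() -> Option<Stored> {
    let mut disks = [Disk::healthy(), Disk::healthy()];
    write_all(&mut disks, &Stored { generation: 1, data: "gen1" }); // step 1
    disks[1].writable = false; // step 2: B fails to write
    write_all(&mut disks, &Stored { generation: 2, data: "gen2" });
    disks[1].writable = true;
    disks[0].readable = false; // steps 3-4: after reboot, A cannot be read
    load(&disks) // step 5: only B's generation 1 is visible
}
```

In this model `replay_scenario()` returns the generation 1 record, even though generation 2 was acknowledged as written.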

At this point we have data loss, but things can get even more confusing, depending upon the failure modes of the M.2s. Let's say we continue with the following steps:

  1. Generation 2 (with different data) is written to A.
  2. Sled-agent reboots
  3. Ledger::new reads both versions at generation 2 and picks A given its current logic as long as A is first in the path list:
        // Return the ledger with the highest generation number.
        let ledger = ledgers.into_iter().reduce(|prior, ledger| {
            if ledger.is_newer_than(&prior) {
                ledger
            } else {
                prior
            }
        });
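
To make the tie-breaking concrete, here is a small standalone sketch built around the same `reduce` call. The `Ledger` type here is a hypothetical reduction of the real one, and it assumes `is_newer_than` is a strict comparison of generation numbers.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Ledger {
    pub generation: u64,
    pub data: &'static str,
}

impl Ledger {
    /// Assumption: "newer" means a strictly greater generation number.
    pub fn is_newer_than(&self, other: &Ledger) -> bool {
        self.generation > other.generation
    }
}

/// Same shape as the snippet above: keep `prior` unless `ledger` is newer.
pub fn pick(ledgers: Vec<Ledger>) -> Option<Ledger> {
    ledgers.into_iter().reduce(|prior, ledger| {
        if ledger.is_newer_than(&prior) {
            ledger
        } else {
            prior
        }
    })
}
```

Because `prior` is kept on a tie, `pick` returns whichever generation 2 copy comes first in the list. Swapping the order of A and B changes the result, even though the two copies hold different data.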

This problem is baked into the fact that you can't do consensus with only 2 nodes if those nodes can fail in arbitrary ways. Whether the M.2s can fail in arbitrary ways is unknown to me, but I'd really like to preclude the possibility of doing the wrong thing without relying on hardware behavior if at all possible. I can see two possible ways of going about resolving this issue:

  1. If we ever fail to read from or write to an M.2, we refuse to ever bring that M.2 back online. In short, we tolerate the failure by making it permanent.
  2. We do not bump ledger generation numbers in sled-agent, but instead bump them in Nexus, along with saving the latest ledger configuration in Nexus. We then would know if we read back stale data, which we could go ahead and rewrite based on what was put in CockroachDB via Nexus.
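
A minimal sketch of what option 1 might look like, assuming a hypothetical `DiskSet` that tracks M.2s by an opaque string id. Nothing here comes from sled-agent.

```rust
use std::collections::BTreeSet;

/// Hypothetical bookkeeping for option 1: once a disk has failed a read
/// or a write, it is retired and never used for the ledger again.
pub struct DiskSet {
    active: BTreeSet<String>,
    retired: BTreeSet<String>,
}

impl DiskSet {
    pub fn new(ids: &[&str]) -> Self {
        DiskSet {
            active: ids.iter().map(|s| s.to_string()).collect(),
            retired: BTreeSet::new(),
        }
    }

    /// Records an I/O failure: the disk leaves the active set for good.
    pub fn record_failure(&mut self, id: &str) {
        if self.active.remove(id) {
            self.retired.insert(id.to_string());
        }
    }

    /// Only disks that have never failed may be read or written.
    pub fn is_usable(&self, id: &str) -> bool {
        self.active.contains(id)
    }

    pub fn is_retired(&self, id: &str) -> bool {
        self.retired.contains(id)
    }
}
```

There is deliberately no method that moves a disk back from `retired` to `active`. For the failure to be permanent across a sled-agent reboot, the retired set would itself have to be stored somewhere durable, which this sketch does not address.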

It's unclear to me if either of these is actually feasible, as presumably the zones must come up in order to be able to talk to Nexus in the first place. However, maybe updates can go to Nexus as in option 2.

CC @rmustacc

@smklein
Collaborator Author

smklein commented May 1, 2023

> This problem is baked into the fact that you can't do consensus with only 2 nodes if those nodes can fail in arbitrary ways. Whether the M.2s can fail in arbitrary ways is unknown to me, but I'd really like to preclude the possibility of doing the wrong thing without relying on hardware behavior if at all possible. I can see two possible ways of going about resolving this issue:

So, first of all, I totally agree with you about this mismatch being possible. In a longer-term plan, I would like for Nexus to be able to send the request for services / datasets to the sled as:

"Here are all the services you should run, with a generation number"

(I've updated #732 to include this implementation detail)

Such an API means that this ledger becomes a cache of data that's stored in CRDB, and which can get updated when Nexus comes online.
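
A rough sketch of how the sled side of such an API could behave, with every name here (`ServiceSet`, `SledCache`, `Outcome`) made up for illustration. It assumes Nexus owns the generation number, so two requests with the same generation carry the same contents.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct ServiceSet {
    pub generation: u64,
    pub services: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The request was newer than the cache and replaced it.
    Applied,
    /// The cache already holds this generation.
    AlreadyCurrent,
    /// The request is older than the cache and was ignored.
    RejectedOlder,
}

/// Hypothetical sled-side view: the on-disk ledger is only a cache of
/// what Nexus has recorded in CRDB.
pub struct SledCache {
    pub current: Option<ServiceSet>,
}

impl SledCache {
    pub fn apply(&mut self, requested: ServiceSet) -> Outcome {
        let current_gen = self.current.as_ref().map(|c| c.generation);
        match current_gen {
            Some(g) if requested.generation < g => Outcome::RejectedOlder,
            Some(g) if requested.generation == g => Outcome::AlreadyCurrent,
            // Empty cache, or a stale one read back from an M.2.
            _ => {
                self.current = Some(requested);
                Outcome::Applied
            }
        }
    }
}
```

If the sled reboots and reads back a stale generation 1 from disk, the next request from Nexus at generation 2 simply overwrites it, so the stale read is corrected once Nexus is reachable.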

> 1. We do not bump ledger generation numbers in sled-agent, but instead bump them in Nexus, along with saving the latest ledger configuration in Nexus. We then would know if we read back stale data, which we could go ahead and rewrite based on what was put in CockroachDB via Nexus.

I 100% think this is feasible, and is something we should do. It's easier for "non-dataset services" than "dataset services" due to the existing shape of the internal API, but I think both can be vectorized + generation'd.

I've created #2977 and #2978 -- sub-issues of #732 -- for us to track.

@andrewjstone
Contributor

> I 100% think this is feasible, and is something we should do. It's easier for "non-dataset services" than "dataset services" due to the existing shape of the internal API, but I think both can be vectorized + generation'd.
>
> I've created #2977 and #2978 -- sub-issues of #732 -- for us to track.

Awesome! Thank you!

Contributor

@andrewjstone andrewjstone left a comment


Based on the understanding that Nexus will issue updates for services when it is online and that the Ledger acts as a cache, I think we should go ahead and merge this in once the test bug is fixed.

Base automatically changed from rss-explicit to main May 2, 2023 21:58
@smklein smklein merged commit a477ac2 into main May 2, 2023
@smklein smklein deleted the service-ledger branch May 2, 2023 22:52

Successfully merging this pull request may close these issues.

Sled agent should move all service-related information into the M.2 partitions, duplicating them on write.
2 participants