Extract storage manager related code from sled-agent #4332
Conversation
@smklein @jgallagher I believe I resolved all your comments. Please take another look. There are two remaining issues that I left open with responses that I'd like to defer. The one about the test method is, I think, relatively minor, and I'm not sure it could be fixed in a clearer manner unless I inlined a cfg into the …
One more thing: I plan to test this on madrid tomorrow.
Tested on madrid: RSS worked, and I was able to add sled 15 to the cluster using the other new code recently merged in. I will also do a cold-boot test to double-check after demos, but I think this is good to go.
Cold boot succeeded.
Damn, this only worked on the initial nodes / scrimlet. Rebooting the newly added node results in a crash because the FSM is already initialized. I need to ignore that error in the cold-boot path. That isn't related to this PR, though, but to already-merged code. I'll open an issue to fix it.
The cold-boot fix has been merged into main and I just merged main into this branch. |
Sorry for the delay on following up here - the changes look great. Left a couple more nits; the only real concern is the oneshot channels interacting with failure (I really like the oneshot channel change in general though!). Happy to spitball how to address that if I understood it correctly and there's a real problem there.
sled-agent/src/bootstrap/server.rs
```rust
underlay_available_tx,
) => {
    let request_id = request.body.id;
    // Extract from an option to satisfy the borrow checker
```
Hmm, is this safe? Two possible issues:

- Cancellation issues: if `start_sled_agent` is cancelled, we don't update `self.state` but have stolen the channel from it. Probably fine to just comment that this can't happen, if that's true.
- If `start_sled_agent` fails, we don't change `self.state`, so we're left in `Bootstrapping` with no channel to take next time.

Can we address both by moving this `take()` down into the `Ok(_)` branch?
This is absolutely not safe. @jgallagher and I talked about this, and it's also non-trivial to make it so that `start_sled_agent` can fail at all and the system still works. We need to rework the persistent state management to allow that, and that is going to be very tricky to get right. This also affects RSS, since the code to add a sled reuses the bulk of the RSS path.

In the short term I'm going to change this error path to panic explicitly, add some comments explaining why, and open an issue for the larger problem of persistent state / idempotence.
Documented and issue opened in cfc3ef7
sled-agent/src/bootstrap/server.rs
```rust
&self.bootstore_handles.node_handle,
&self.managers,
self.long_running_task_handles.clone(),
underlay_available_tx.take().unwrap(),
```
Does this have the same issue as above, where if we fail, we'll panic the next time through this function? Except I don't have a great suggestion here, since we need this channel before we know whether this will succeed.
Documented and issue opened in cfc3ef7
```rust
// The sending side never disappears because we hold a copy
let req = self.rx.recv().await.unwrap();
info!(self.log, "Received {:?}", req);
let should_send_updates = match req {
```
The update organization here makes a lot of sense to me: it's clear, and updates propagate in one spot.
This is a large PR. The changes are in many ways just moving code around,
and at a cursory glance may appear to be a giant waste of time. To justify
these changes, I will first present the goals I believe we should strive to
meet in general sled-agent development. I'll then go into how I believe
these goals can be realized via code patterns. I'll then discuss the
concrete changes in this PR and how they implement the discussed patterns.
Lastly, I'll list a few open questions about the implementation in the PR.
I want to emphasize that a lot of what I talk about below is already done in
sled-agent, at least part way. We can pick the best patterns already in use and
standardize on them to build a more coherent and yet decoupled architecture for
easier testing and understanding.
Goals
This PR is an attempt to start making sled-agent easier to maintain. In order to
make things easier to maintain, I believe we should attempt to realize a few key
goals:

1. Code should be easy to read, such that one doesn't have to
jump around to follow the gist of what's going on.
2. Tasks should be clear in their responsibilities, such that it's easy to see
what state the task is managing and how it gets mutated.
3. New developers should be able to easily discern where new
functionality should live and be able to clearly follow existing patterns such
that their new code doesn't break or cross abstractions that make following the
code harder in the future.
4. Tests should be easy to
write. You shouldn't need to spin up the whole sled-agent in order to test an
algorithm or behavior.
5. It should be possible to simulate sled-agent behavior without code that must
deal with hardware directly.
Patterns
Task/Handle
In my opinion, the most important pattern to follow for async Rust software
is what I call the "task/handle" pattern. This pattern helps code maintenance
by explicitly managing state and making task-driven code easier to test. All
tasks that provide services to other tasks should follow this pattern. In this
pattern, all state is contained in a type owned by the task and inaccessible
directly to other tasks. The task has a corresponding handle which must be
`Send` and `Clone`, and can be used to make requests of the task. The handle
interacts with the task solely via message passing. Mutexes are never used.
A few things fall directly out of this pattern:

- There is no shared state, and so no need to worry about mutexes, lock
ordering, holding locks across await points, etc...
- Much of the task's logic can be unit tested without spawning tasks in many
cases. We just directly construct the state object and call functions,
whether sync or async.
- When we send requests to the task and receive replies,
we know when they have been processed by the task loop,
because our next request will only run after those operations.
This is a direct result of the FIFO channel between handle and task.
This is not possible with state shared outside the task via a mutex.
We must poll that mutex-protected state until it either changes to
the value we expect or we get a timeout. If we expect no change, we
must use a side-channel.
- We can write complex logic, such as that in
maghemite or bootstore, in a deterministic, synchronous style that
allows property based testing and model checking in as straightforward
a manner as possible.
- We can fake out a handle for testing
without changing any of the calling code except the constructor. The real
handle can instead send messages to a fake task. Importantly, this strategy
can also be used to abstract out hardware for unit tests and simulation.
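As a concrete illustration, here is a minimal, self-contained sketch of the task/handle pattern. The names (`Counter`, `CounterHandle`, `Request`) are hypothetical, and OS threads with std channels stand in for tokio tasks and async channels:

```rust
use std::sync::mpsc;
use std::thread;

// Messages the handle can send to the task. Requests that need a
// response embed a reply channel (a oneshot channel in tokio).
enum Request {
    Increment,
    Get(mpsc::Sender<u64>),
}

// All task state lives here, inaccessible to other threads.
struct Counter {
    rx: mpsc::Receiver<Request>,
    value: u64,
}

impl Counter {
    fn run(mut self) {
        while let Ok(req) = self.rx.recv() {
            match req {
                Request::Increment => self.value += 1,
                Request::Get(reply_tx) => {
                    let _ = reply_tx.send(self.value);
                }
            }
        }
    }
}

// The handle is Send + Clone and talks to the task only via messages.
#[derive(Clone)]
struct CounterHandle {
    tx: mpsc::Sender<Request>,
}

impl CounterHandle {
    fn spawn() -> CounterHandle {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || Counter { rx, value: 0 }.run());
        CounterHandle { tx }
    }

    fn increment(&self) {
        self.tx.send(Request::Increment).unwrap();
    }

    fn get(&self) -> u64 {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send(Request::Get(reply_tx)).unwrap();
        // FIFO ordering guarantees prior increments were processed.
        reply_rx.recv().unwrap()
    }
}

fn main() {
    let handle = CounterHandle::spawn();
    handle.increment();
    handle.increment();
    assert_eq!(handle.get(), 2);
    println!("counter = {}", handle.get());
}
```

Note how the FIFO property shows up in the test at the bottom: `get()` is guaranteed to observe both earlier `increment()` calls, with no polling or side-channel needed.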
Beyond all that though, the understanding of how state is mutated and accessed
is clear. State is only mutated inside the task, and can only be retrieved from
code that has a copy of the handle. There is no need for separate `_locked`
methods, and no need to worry about order of locking for different bits of task
state. This can make the task code itself much shorter and easier to read, as
demonstrated by the new `StorageManager` code in `sled_storage`. We can also
maintain internal stats for the task, and all of this can be retrieved from the
handle for reporting in `omdb`.

There is one common question/problem with this pattern: how do you choose a
bounded channel size for handle-to-task messages?
This can be tricky, but in general, you want the channel to act unbounded,
such that sends never fail. This makes the channels reliable, such that we never
drop messages inside the process, and the caller doesn't have to choose what to
do when overloaded. This simplifies things drastically for developers. However,
you also don't want to make the channel actually unbounded, because that can
lead to run-away memory growth and pathological behaviors, such that requests
get slower over time until the system crashes.
After discussions with @jgallagher and reflections from @sunshowers and
@davepacheco on RFD 400, the solution is to choose a large enough bound
that we should never hit it in practice unless we are truly overloaded.
Hitting the bound means that requests will keep building up beyond it and
we will eventually topple over. So when we hit this bound, we just go
ahead and panic.
Picking a channel bound is hard to do empirically, but practically, if requests are
mostly mutating task local state, a bound of 1024 or even 8192 should be plenty.
Tasks that must perform longer-running ops can spawn helper tasks as necessary,
or include their own handles for replies rather than synchronously waiting.
Memory for the queue can be kept small by boxing large messages.
It should also be noted that mutexes also have queues that can build up and
cause pathological slowness. The difference is that these queues are implicit
and hard to track.
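The "act unbounded, but panic if the bound is ever hit" behavior can be sketched with a small helper. This is hypothetical code, not what sled-agent does verbatim, and a std `sync_channel` stands in for a tokio bounded channel:

```rust
use std::sync::mpsc;

// Send a message on a bounded channel sized so that it should never
// fill up in practice. If it does, the process is considered overloaded
// beyond recovery: rather than blocking the caller or silently dropping
// the message, we panic.
fn send_or_panic<T>(tx: &mpsc::SyncSender<T>, msg: T) {
    match tx.try_send(msg) {
        Ok(()) => {}
        Err(mpsc::TrySendError::Full(_)) => {
            panic!("task channel full: system is overloaded");
        }
        Err(mpsc::TrySendError::Disconnected(_)) => {
            panic!("task channel receiver dropped");
        }
    }
}

fn main() {
    // A real bound would be large (e.g. 1024 or 8192); tiny here
    // only to make the behavior easy to see.
    let (tx, rx) = mpsc::sync_channel(2);
    send_or_panic(&tx, 1);
    send_or_panic(&tx, 2);
    assert_eq!(rx.recv().unwrap(), 1);
    println!("sends succeeded below the bound");
}
```

With this helper, callers never have to decide what to do on overload; the channel behaves as if it were unbounded right up until the invariant is violated.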
Isolate subsystems via message-driven decoupling
We have a bunch of managers (`HardwareManager`, `StorageManager`,
`ServiceManager`, `InstanceManager`) that interact with each other to
provide sled-agent services. However, much of that interaction is ad-
hoc, with state shared between tasks via an `Inner` struct protected
by a mutex. It's hard to isolate each of these managers and test
them independently. By ensuring their API is well defined via a `Handle`,
we can fake out different managers as needed for independent testing.
However, we can go even farther, without the need for dependency injection,
by having different managers announce their updates via broadcast or watch
channels.
With this strategy, we can drive tests for a task by using the handle to both
trigger operations and then wait for the outflow of messages that should occur,
rather than mocking out the handle API of another task directly and checking
the interaction via the fake. This is especially useful for lower-level services
that others depend upon, such as `StorageManager`, and we should do this when we
can, rather than having tasks talk directly to each other. This strategy leads
directly to the last pattern I really want to mention: the monitor pattern.

High level interaction via monitors
Remember that a primary goal of these patterns is to make the sled-agent easier
to read and to make it easy to discover what is happening at any given point in
time. This has to happen with the inherent concurrency of all the separate
behaviors occurring in the sled-agent, such as monitoring for hardware and
faults, or handling user requests from Nexus. If our low-level managers announce
updates via channels, we can have high-level monitors that wait for those
updates and then dispatch them to other subsystems or across clients to other
sleds or services. This helps further decouple services for easier testing, and
also allows us to understand all the asynchrony in the system in fewer spots. We
can put RAS code in these monitors and use them as central points to maintain
stats and track the overall load on the system.
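To make the manager/monitor flow concrete, here is a toy sketch. All names are hypothetical, a `Vec` of std channels stands in for a tokio broadcast/watch channel, and a thread stands in for a monitor task: the manager announces updates, and the monitor waits for them and dispatches (here, just records) them.

```rust
use std::sync::mpsc;
use std::thread;

// Updates a low-level manager announces (hypothetical; loosely modeled
// on something like StorageManager's disk notifications).
#[derive(Clone, Debug, PartialEq)]
enum Update {
    DiskAdded(String),
}

// The manager owns a list of subscriber channels and broadcasts each
// update to all of them. No subscriber ever touches manager state.
struct Manager {
    subscribers: Vec<mpsc::Sender<Update>>,
}

impl Manager {
    fn subscribe(&mut self) -> mpsc::Receiver<Update> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    fn announce(&self, update: Update) {
        for sub in &self.subscribers {
            let _ = sub.send(update.clone());
        }
    }
}

fn main() {
    let mut manager = Manager { subscribers: Vec::new() };
    let updates = manager.subscribe();

    // The monitor waits for updates and dispatches them to other
    // subsystems; here it simply records what it saw.
    let monitor = thread::spawn(move || {
        let mut seen = Vec::new();
        while let Ok(update) = updates.recv() {
            seen.push(update);
        }
        seen
    });

    manager.announce(Update::DiskAdded("u2_0".to_string()));
    drop(manager); // closing the channel lets the monitor exit
    let seen = monitor.join().unwrap();
    assert_eq!(seen, vec![Update::DiskAdded("u2_0".to_string())]);
}
```

A test can drive the manager through its handle and then assert on the stream of announced updates, with no mocking of other managers' APIs.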
The punchline
How do the patterns above satisfy the goals? By decoupling tasks and interacting
solely via message passing, we make testing easier (goal 4). By separating out
managers/services into separate tasks and maintaining state locally, we satisfy
goals 2 and 5. By making the tasks clear in their responsibilities, and their
APIs evident via their handles, we help satisfy goal 3. Goal 3 is also satisfied
to a degree by the elimination of shared state and the decoupling via monitors.
Lastly, by combining all the patterns, we can spawn all our top-level tasks and
monitors in one place after initializing any global state. This allows the code
to flow linearly.
Importantly, it's also easy to violate goal 3 by dropping some mutexes into the
middle of a task and oversharing handles between subsystems. I believe we can
prevent this, and also make testing and APIs clearer, by separating subsystems
into their own Rust packages and then adding monitors for them in the sled-agent.
I took a first cut at this with the `StorageManager`, as that was the subsystem
I was most familiar with.
Concrete code changes
Keeping in line with my stated goals and patterns, I abstracted the storage
manager code into its own package called `sled-storage`. The `StorageManager`
uses the task/handle pattern, and there is a monitor for it in sled-agent
that takes notifications and updates Nexus and the zone bundler. There are also a
bunch of unit tests where none existed before. The bulk of the rest of the code
changes fell out of this. In particular, now that we have a separate `sled-storage`
package, all high-level disk management code lives there. Construction and
formatting of partitions still happens in `sled-hardware`, but anything above the
zpool level occurs in `sled-storage`. This allows testing of real storage code on
any illumos-based system using file-backed zpools. Importantly, the key-management
code now lives at this abstraction level, rather than in `sled-hardware`, which
makes more sense since encryption only operates on datasets in ZFS.

I'd like to do similar things to the other managers, and think that's a good way
to continue cleaning up sled-agent.
The other large change in this code base is simplifying the startup of
the bootstrap agent such that all tasks that run for the lifetime of
the sled-agent process are spawned in `long_running_tasks`. There is a
struct called `LongRunningTaskHandles` that allows interaction with all the
"managers" spawned here. This change also has the side effect of removing the
`wait_while_handling_hardware_updates` concurrent future code, since we have
a hardware monitor now running from the beginning of time and can never miss
updates. This also negates the need to start a separate hardware manager
when the sled-agent starts up. Because we now know which tasks are
long-running, we don't have to save their tokio `JoinHandle`s to display
ownership and thus gain clonability.
Open Questions
- `long_running_task_handles`: is there a better name for these?
- Instead of calling `spawn_all_longrunning_tasks`, should we just call the
individual spawn functions directly in `bootstrap/pre_server`? Doing it this
way would allow the `ServiceManager` to also live in the long-running handles
and remove some ordering issues. For instance, we wait for the boot disk inside
`spawn_all_longrunning_tasks`, and we also upsert synthetic disks. Neither
of these is really part of spawning tasks.