
Extract storage manager related code from sled-agent #4332

Merged: 69 commits merged from sled-storage into main on Nov 14, 2023

Conversation

@andrewjstone (Contributor) commented Oct 24, 2023

This is a large PR. The changes are in many ways just moving code around,
and at a cursory glance may appear to be a giant waste of time. To justify
these changes and explain my rationale, I will first present the goals I believe
we should strive to meet in sled-agent development generally. I'll then go into
how I believe these goals can be realized via code patterns, discuss the
concrete changes in this PR and how they implement those patterns, and lastly
list a few open questions about the implementation.

I want to emphasize that a lot of what I talk about below is already done in
sled-agent, at least partway. We can pick the best patterns already in use and
standardize on them to build a more coherent and yet decoupled architecture that
is easier to test and understand.

Goals

This PR is an attempt to start making sled-agent easier to maintain. In order to
make things easier to maintain, I believe we should attempt to realize a few key
goals:

  1. Initial startup and control code should read linearly. You shouldn't have to
    jump around to follow the gist of what's going on.
  2. Tasks, like functions, should live at a certain abstraction level. It must be clear
    what state the task is managing and how it gets mutated.
  3. A developer should be able to come into the codebase, know where a given
    piece of functionality should live, and clearly follow existing patterns, so
    that their new code doesn't break or cross abstractions in ways that make the
    code harder to follow in the future.
  4. Tests for individual abstractions and task driven code should be easy to
    write. You shouldn't need to spin up the whole sled-agent in order to test an
    algorithm or behavior.
  5. Hardware should be abstracted out, and much of the code shouldn't
    deal with hardware directly.

Patterns

Task/Handle

In my opinion, the most important pattern to follow for async Rust software
is what I call the "task/handle" pattern. This pattern helps code maintenance
by explicitly managing state and making task driven code easier to test. All
tasks that provide services to other tasks should follow this pattern. In this
pattern, all state is contained in a type owned by the task and inaccessible
directly to other tasks. The task has a corresponding handle which must be
Send and Clone, and can be used to make requests of the task. The handle
interacts with the task solely via message passing. Mutexes are never used.

A few things fall directly out of this pattern:

  • Code that manipulates state inside and outside the task can be synchronous
  • We don't have to worry about mutex behavior with regards to cancellation,
    await points, etc...
  • Testing becomes easier:
    • We can test a bunch of the code without spawning a task in many
      cases. We just directly construct the state object and call functions,
      whether sync or async.
    • When testing sequential operations that are fire-and-forget, we know when
      they have been processed by the task loop, because our next request will
      only run after those operations. This is a direct result of the FIFO
      channel between handle and task. This is not possible with state shared
      outside the task via a mutex: we must poll that mutex-protected state
      until it either changes to the value we expect or we time out, and if we
      expect no change, we must use a side-channel.
    • We can write complex state handling algorithms, such as those in
      maghemite or bootstore, in a deterministic, synchronous style that
      allows property based testing and model checking in as straightforward
      a manner as possible.
    • We can completely fake the task
      without changing any of the calling code except the constructor. The real
      handle can instead send messages to a fake task. Importantly, this strategy
      can also be used to abstract out hardware for unit tests and simulation.

Beyond all that though, the understanding of how state is mutated and accessed
is clear. State is only mutated inside the task, and can only be retrieved from
code that has a copy of the handle. There is no need for separate _locked
methods, and no need to worry about order of locking for different bits of task
state. This can make the task code itself much shorter and easier to read as
demonstrated by the new StorageManager code in sled_storage. We can also
maintain internal stats for the task, and all of this can be retrieved from the
handle for reporting in omdb.
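
To make the pattern concrete, here is a minimal, self-contained sketch of a
task/handle pair, using a made-up Counter service with illustrative names; this
is not the actual StorageManager code.

```rust
use tokio::sync::{mpsc, oneshot};

// Requests the handle can make of the task. Replies travel over oneshot
// channels embedded in the request itself.
enum Request {
    Increment(u64),
    Get { reply: oneshot::Sender<u64> },
}

// All state lives here, owned by the task; no mutex required.
struct CounterTask {
    rx: mpsc::Receiver<Request>,
    value: u64,
}

impl CounterTask {
    async fn run(mut self) {
        while let Some(req) = self.rx.recv().await {
            match req {
                Request::Increment(n) => self.value += n,
                Request::Get { reply } => {
                    let _ = reply.send(self.value);
                }
            }
        }
    }
}

// The handle is `Send + Clone` and talks to the task solely via messages.
#[derive(Clone)]
struct CounterHandle {
    tx: mpsc::Sender<Request>,
}

impl CounterHandle {
    fn new() -> (CounterHandle, CounterTask) {
        let (tx, rx) = mpsc::channel(1024);
        (CounterHandle { tx }, CounterTask { rx, value: 0 })
    }

    async fn increment(&self, n: u64) {
        self.tx.send(Request::Increment(n)).await.unwrap();
    }

    async fn get(&self) -> u64 {
        let (reply, rx) = oneshot::channel();
        self.tx.send(Request::Get { reply }).await.unwrap();
        rx.await.unwrap()
    }
}

#[tokio::main]
async fn main() {
    let (handle, task) = CounterHandle::new();
    tokio::spawn(task.run());
    handle.increment(2).await;
    // Because the channel is FIFO, the increment has been processed by the
    // time this request is answered.
    assert_eq!(handle.get().await, 2);
}
```

Note that the state-handling logic in CounterTask could also be tested directly
by constructing the struct and calling methods, without spawning anything.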

There is one common question/problem with this pattern: how do you choose a
bounded channel size for handle-to-task messages?

This can be tricky, but in general, you want the channel to act unbounded,
such that sends never fail. This makes the channels reliable, such that we never
drop messages inside the process, and the caller doesn't have to choose what to
do when overloaded. This simplifies things drastically for developers. However,
you also don't want to make the channel actually unbounded, because that can
lead to runaway memory growth and pathological behaviors, such that requests
get slower over time until the system crashes.

After discussions with @jgallagher and reflections from @sunshowers and
@davepacheco on RFD 400, the solution is to choose a large enough bound
such that we should never hit it in practice unless we are truly overloaded.
If we hit the bound, it means requests are building up faster than we can
process them and the system will eventually topple over. So when we hit this
bound, we just go ahead and panic.

Picking a channel bound is hard to do empirically, but practically, if requests are
mostly mutating task-local state, a bound of 1024 or even 8192 should be plenty.
Tasks that must perform longer-running operations can spawn helper tasks as
necessary, or requests can include their own handles for replies rather than
waiting synchronously. Memory for the queue can be kept small by boxing large
messages.
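
A minimal sketch of "treat the bounded channel as if it were unbounded, but
panic if it ever fills," using tokio's try_send; the wrapper name and the bound
are illustrative, not code from this PR.

```rust
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

// Illustrative bound: large enough that it should never fill unless the
// system is truly overloaded.
const QUEUE_BOUND: usize = 1024;

/// Send a request to the task, treating a full queue as a fatal bug rather
/// than forcing callers to handle backpressure.
fn send_or_panic<T>(tx: &mpsc::Sender<T>, msg: T) {
    match tx.try_send(msg) {
        Ok(()) => (),
        Err(TrySendError::Full(_)) => {
            // Requests are piling up faster than the task can drain them;
            // the system would eventually topple over anyway.
            panic!("task queue full ({} messages): overloaded", QUEUE_BOUND);
        }
        Err(TrySendError::Closed(_)) => {
            // The task should live for the lifetime of the process.
            panic!("task queue unexpectedly closed");
        }
    }
}

fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(QUEUE_BOUND);
    send_or_panic(&tx, 7);
    assert_eq!(rx.try_recv().unwrap(), 7);
}
```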

It should be noted that mutexes also have queues that can build up and cause
pathological slowness. The difference is that these queues are implicit and
hard to track.

Isolate subsystems via message-driven decoupling

We have a bunch of managers (HardwareManager, StorageManager,
ServiceManager, InstanceManager) that interact with each other to
provide sled-agent services. However, much of that interaction is ad hoc, with
state shared between tasks via an Inner struct protected by a mutex. It's hard
to isolate each of these managers and test them independently. By ensuring
their API is well defined via a Handle, we can fake out different managers as
needed for independent testing. However, we can go even further, without the
need for dependency injection, by having different managers announce their
updates via broadcast or watch channels.

With this strategy, we can drive tests for a task by using the handle to both
trigger operations, and then wait for the outflow of messages that should occur
rather than mocking out the handle API of another task directly and checking
the interaction via the fake. This is especially useful for lower level services
that others depend upon such as StorageManager, and we should do this when we
can, rather than having tasks talk directly to each other. This strategy leads
directly to the last pattern I really want to mention, the monitor pattern.
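
For illustration, a minimal sketch of this kind of test-by-watching, assuming a
hypothetical manager that publishes its latest snapshot on a tokio watch
channel; DiskSnapshot, FakeStorageManager, and the method names are
illustrative, not the real sled-storage types.

```rust
use tokio::sync::watch;

// Illustrative snapshot type a manager might publish after each change.
#[derive(Clone, Debug, Default)]
struct DiskSnapshot {
    disks: Vec<String>,
}

// The manager owns the sender; anyone holding a receiver (monitors, tests)
// observes updates without calling into the manager directly.
struct FakeStorageManager {
    tx: watch::Sender<DiskSnapshot>,
}

impl FakeStorageManager {
    fn new() -> (Self, watch::Receiver<DiskSnapshot>) {
        let (tx, rx) = watch::channel(DiskSnapshot::default());
        (Self { tx }, rx)
    }

    fn insert_disk(&self, name: &str) {
        // Publish a new snapshot; receivers are notified of the change.
        self.tx.send_modify(|snap| snap.disks.push(name.to_string()));
    }
}

#[tokio::main]
async fn main() {
    let (mgr, mut rx) = FakeStorageManager::new();

    // Drive the operation via the manager, then assert on the outflow of
    // updates rather than mocking another task's handle.
    mgr.insert_disk("u2_0");
    rx.changed().await.unwrap();
    assert_eq!(rx.borrow().disks, vec!["u2_0".to_string()]);
}
```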

High level interaction via monitors

Remember that a primary goal of these patterns is to make the sled-agent easier
to read and discover what is happening at any given point in time. This has to
happen with the inherent concurrency of all the separate behaviors occurring
in the sled-agent such as monitoring for hardware and faults, or handling user
requests from Nexus. If our low-level managers announce updates via channels,
we can have high-level Monitors that wait for those updates and then dispatch
them to other subsystems, or via clients to other sleds and services. This
helps further decouple services for easier testing, and also allows us to
understand all the asynchrony in the system in fewer spots. We can put RAS code
in these monitors and use them as central points to maintain stats and track the
overall load on the system.
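
To sketch what such a monitor might look like, reusing the hypothetical
watch-channel snapshot from the previous section; notify_nexus is a stand-in,
not real sled-agent code.

```rust
use tokio::sync::watch;

#[derive(Clone, Debug, Default)]
struct DiskSnapshot {
    disks: Vec<String>,
}

// Stand-in for pushing an update to Nexus, the zone bundler, etc.
async fn notify_nexus(snapshot: &DiskSnapshot) {
    println!("would notify nexus of {} disks", snapshot.disks.len());
}

// A monitor: the one place that waits on the manager's updates and fans
// them out to other subsystems. Also a natural spot for stats/RAS hooks.
async fn storage_monitor(mut rx: watch::Receiver<DiskSnapshot>) {
    while rx.changed().await.is_ok() {
        // Clone the latest snapshot out of the watch channel so we don't
        // hold the borrow across an await point.
        let snapshot = rx.borrow_and_update().clone();
        notify_nexus(&snapshot).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(DiskSnapshot::default());
    let monitor = tokio::spawn(storage_monitor(rx));

    tx.send_modify(|s| s.disks.push("u2_0".to_string()));

    // Dropping the sender ends the monitor loop once the last unseen
    // update has been observed.
    drop(tx);
    monitor.await.unwrap();
}
```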

The punchline

How do the patterns above satisfy the goals? By decoupling tasks and interacting
solely via message passing we make testing easier (goal 4). By separating out
managers/services into separate tasks and maintaining state locally we satisfy
goals 2 and 5. By making the tasks clear in their responsibilities, and their
APIs evident via their handles, we help satisfy goal 3. Goal 3 is also satisfied
to a degree by the elimination of shared state and the decoupling via monitors.
Lastly, by combining all the patterns, we can spawn all our top level tasks and
monitors in one place after initializing any global state. This allows the code
to flow linearly.

Importantly, it's also easy to violate goal 3 by dropping some mutexes into the
middle of a task and oversharing handles between subsystems. I believe we can
prevent this, and also make testing and APIs clearer, by separating subsystems
into their own Rust packages and then adding monitors for them in the sled-agent.
I took a first cut at this with the StorageManager, as that was the subsystem I was
most familiar with.

Concrete Code changes

Keeping in line with my stated goals and patterns, I abstracted the storage
manager code into its own package called sled-storage. The StorageManager
uses the task/handle pattern, and there is a monitor for it in sled-agent
that takes notifications and updates Nexus and the zone bundler. There are also a
bunch of unit tests where none existed before. The bulk of the rest of the code
changes fell out of this. In particular, now that we have a separate sled-storage
package, all high-level disk management code lives there. Construction and
formatting of partitions still happens in sled-hardware, but anything above the
zpool level occurs in sled-storage. This allows testing of real storage code on
any illumos-based system using file-backed zpools. Importantly, the key-management
code now lives at this abstraction level, rather than in sled-hardware, which
makes more sense since encryption only operates on datasets in ZFS.
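
As a rough illustration of the file-backed zpool approach (not the actual
sled-storage test harness; the pool name, file path, and size are arbitrary),
something along these lines works on an illumos host with sufficient privileges.

```rust
use std::process::Command;

// Create a sparse file-backed vdev and build a zpool on it, so storage code
// can be exercised without real disks.
fn create_file_backed_zpool(pool: &str, backing_file: &str) -> std::io::Result<()> {
    // 1 GiB sparse backing file.
    let status = Command::new("mkfile")
        .args(["-n", "1g", backing_file])
        .status()?;
    assert!(status.success(), "mkfile failed");

    let status = Command::new("zpool")
        .args(["create", pool, backing_file])
        .status()?;
    assert!(status.success(), "zpool create failed");
    Ok(())
}

fn destroy_zpool(pool: &str, backing_file: &str) -> std::io::Result<()> {
    let _ = Command::new("zpool").args(["destroy", pool]).status()?;
    std::fs::remove_file(backing_file)
}

fn main() -> std::io::Result<()> {
    create_file_backed_zpool("testpool", "/var/tmp/testpool.vdev")?;
    // ... exercise storage code against "testpool" here ...
    destroy_zpool("testpool", "/var/tmp/testpool.vdev")
}
```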

I'd like to do similar things to the other managers, and think that's a good way
to continue cleaning up sled-agent.

The other large change in this code base is simplifying the startup of
the bootstrap agent such that all tasks that run for the lifetime of
the sled-agent process are spawned in long_running_tasks. There is a
struct called LongRunningTaskHandles that allows interaction with all the
"managers" spawned here. This change also has the side effect of removing the
wait_while_handling_hardware_updates concurrent future code, since we have
a hardware monitor now running at the beginning of time and can never miss
updates. This also has the effect of negating the need to start a separate
hardware manager when the sled-agent starts up. Because we now know which
tasks are long running, we don't have to save their tokio JoinHandles to display
ownership and thus gain clonability.
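
For orientation, a hedged sketch of the overall shape; the field and function
names are illustrative, and the real LongRunningTaskHandles and
spawn_all_longrunning_tasks in this PR differ in detail.

```rust
use tokio::sync::mpsc;

// Illustrative handle types; in the real code these would be the handles for
// the storage manager, hardware monitor, key manager, bootstore, etc.
#[derive(Clone)]
struct StorageHandle {
    tx: mpsc::Sender<()>,
}

#[derive(Clone)]
struct HardwareMonitorHandle {
    tx: mpsc::Sender<()>,
}

// One clonable struct that gives the rest of sled-agent access to every task
// that lives for the lifetime of the process. No JoinHandles are stored, so
// the struct can derive Clone.
#[derive(Clone)]
struct LongRunningTaskHandles {
    storage: StorageHandle,
    hardware_monitor: HardwareMonitorHandle,
}

// Spawn every long-running task in one place so startup reads linearly.
fn spawn_all_longrunning_tasks() -> LongRunningTaskHandles {
    let (storage_tx, mut storage_rx) = mpsc::channel(1024);
    tokio::spawn(async move { while storage_rx.recv().await.is_some() {} });

    let (hw_tx, mut hw_rx) = mpsc::channel(1024);
    tokio::spawn(async move { while hw_rx.recv().await.is_some() {} });

    LongRunningTaskHandles {
        storage: StorageHandle { tx: storage_tx },
        hardware_monitor: HardwareMonitorHandle { tx: hw_tx },
    }
}

#[tokio::main]
async fn main() {
    let handles = spawn_all_longrunning_tasks();
    // The handles can be cloned and passed to subsystems that need them.
    let _for_bootstrap = handles.clone();
}
```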

Open Questions

  • I'm not really a fan of the name long_running_task_handles. Is there a better
    name for these?
  • Instead of calling spawn_all_longrunning_tasks should we just call the
    individual spawn functions directly in bootstrap/pre_server? Doing it this
    way would allow the ServiceManager to also live in the long running handles
    and remove some ordering issues. For instance, we wait for the bootdisk inside
    spawn_all_longrunning_tasks and we also upsert synthetic disks. Neither
    of these is really part of spawning tasks.

@andrewjstone (Contributor, Author):

@smklein @jgallagher I believe I resolved all your comments. Please take another look.

There are two remaining issues that I left open with responses that I'd like to defer. The one about the test method I think is relatively minor, and I'm not sure it could be fixed in a clearer manner unless I inlined a cfg into the ensure_using_exactly_these_disks production method, which I like even less than what exists now. The other is about a global used for mocks, which I opened an issue for. The global no longer gets compiled into production code, so that takes my concern down a notch at least.

@andrewjstone (Contributor, Author):

One more thing: I plan to test this on madrid tomorrow.

@andrewjstone (Contributor, Author):

Tested on madrid and RSS worked, and I was able to add sled 15 to the cluster using the other new code recently merged in. I will also do a cold-boot test to double-check after demos, but I think this is good to go.

@andrewjstone (Contributor, Author):

Cold boot succeeded.

@andrewjstone (Contributor, Author):

Damn, this only worked on the initial nodes / scrimlet. Rebooting the newly added node results in a crash due to the FSM already being initialized. I need to ignore that error in the cold boot path. That's not related to this PR, though, but to code that's already merged. I'll open an issue to fix it.

@andrewjstone (Contributor, Author):

The cold-boot fix has been merged into main and I just merged main into this branch.

@jgallagher (Contributor) left a comment:

Sorry for the delay on following up here - the changes look great. Left a couple more nits; the only real concern is the oneshot channels interacting with failure (I really like the oneshot channel change in general though!). Happy to spitball how to address that if I understood it correctly and there's a real problem there.

sled-storage/src/manager.rs (outdated review comment, resolved)
sled-storage/src/resources.rs (outdated review comment, resolved)
underlay_available_tx,
) => {
let request_id = request.body.id;
// Extract from an option to satisfy the borrow checker
Contributor:

Hmm, is this safe? Two possible issues:

  1. Cancellation issues (if start_sled_agent is cancelled, we don't update self.state but have stolen the channel from it) - probably fine to just comment that this can't happen if that's true
  2. If start_sled_agent fails, we don't change self.state, so we're left in Bootstrapping with no channel to take next time

Can we address both by moving this take() down into the Ok(_) branch?
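
For illustration only, a minimal sketch of what moving the take() into the Ok
branch could look like, with made-up types (State, try_start) standing in for
the real bootstrap state machine rather than the code under review.

```rust
use tokio::sync::oneshot;

// Purely illustrative state machine; not the actual sled-agent types.
enum State {
    Bootstrapping { underlay_available_tx: Option<oneshot::Sender<u32>> },
    Running,
}

// Stand-in for `start_sled_agent`; it may fail.
async fn try_start() -> Result<u32, &'static str> {
    Ok(42)
}

async fn step(state: &mut State) {
    match try_start().await {
        Ok(value) => {
            // Only steal the channel once we know we are transitioning out
            // of `Bootstrapping`.
            if let State::Bootstrapping { underlay_available_tx } = state {
                let tx = underlay_available_tx
                    .take()
                    .expect("sender consumed twice");
                let _ = tx.send(value);
            }
            *state = State::Running;
        }
        Err(_e) => {
            // `state` (and the sender inside it) is untouched, so a retry
            // of this request can still succeed.
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    let mut state = State::Bootstrapping { underlay_available_tx: Some(tx) };
    step(&mut state).await;
    assert_eq!(rx.await.unwrap(), 42);
}
```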

andrewjstone (Contributor, Author):

This is absolutely not safe. @jgallagher and I talked about this and it's also non-trivial to make it such that start_sled_agent can fail at all and the system still works. We need to rework the persistent state management to allow that to work, and that is going to be very tricky to get right. This also affects RSS since the code to add a sled reuses the bulk of the RSS path.

In the short term I'm going to change this error path to panic explicitly and add some comments around why, and open an issue for the larger problem of persistent state / idempotence.

andrewjstone (Contributor, Author):

Documented and issue opened in cfc3ef7

&self.bootstore_handles.node_handle,
&self.managers,
self.long_running_task_handles.clone(),
underlay_available_tx.take().unwrap(),
Contributor:

Does this have the same issue as above, where if we fail we'll panic the next time through this function? Except I don't have a great suggestion here, since we need this channel before we know whether this will succeed.

andrewjstone (Contributor, Author):

Documented and issue opened in cfc3ef7

sled-storage/src/manager.rs (outdated review comment, resolved)
// The sending side never disappears because we hold a copy
let req = self.rx.recv().await.unwrap();
info!(self.log, "Received {:?}", req);
let should_send_updates = match req {
Collaborator:

This update organization here makes a lot of sense to me - it's clear, and they propagate in one spot.

@andrewjstone enabled auto-merge (squash) November 14, 2023 19:32
@andrewjstone merged commit 2fc0dfd into main on Nov 14, 2023 (21 of 22 checks passed)
@andrewjstone deleted the sled-storage branch November 14, 2023 20:08
mkeeter added a commit to oxidecomputer/crucible that referenced this pull request Nov 20, 2023
Right now, the Crucible upstairs spends a lot of time fighting over the lock on
the single `Mutex<Downstairs>`.

Eventually, I'd like to move data into separate per-downstairs tasks that own
their data and communicate via message-passing (see oxidecomputer/omicron#4332).

This PR is a (mostly) purely mechanical step in that direction: it converts a
bunch of individual `ClientData<T>` into a single
`ClientData<DownstairsClient>`; basically a struct-of-arrays → array-of-structs
transform. Then, functions which only use data within a single
`DownstairsClient` are moved into member functions on that struct.