
Extract storage manager related code from sled-agent #4332

Merged: 69 commits merged from sled-storage into main on Nov 14, 2023

Conversation

@andrewjstone (Contributor) commented Oct 24, 2023

This is a large PR. The changes are in many ways just moving code around,
and at a cursory glance may appear to be a giant waste of time. To justify
these changes and explain my rationale, I will first present the goals I believe
we should strive to meet in sled-agent development generally. I'll then go into
how I believe these goals can be realized via code patterns, discuss the
concrete changes in this PR and how they implement those patterns, and lastly
list a few open questions about the implementation.

I want to emphasize that a lot of what I talk about below is already done in
sled-agent, at least partway. We can pick the best patterns already in use and
standardize on them to build a more coherent and yet decoupled architecture that
is easier to test and understand.

Goals

This PR is an attempt to start making sled-agent easier to maintain. In order to
make things easier to maintain, I believe we should attempt to realize a few key
goals:

  1. Initial startup and control code should read linearly. You shouldn't have to
    jump around to follow the gist of what's going on.
  2. Tasks, like functions, should live at a certain abstraction level. It must be clear
    what state the task is managing and how it gets mutated.
  3. A developer should be able to come into the codebase, know where a given
    piece of functionality should live, and clearly follow existing patterns, so
    that their new code doesn't break or cross abstractions in ways that make the
    code harder to follow in the future.
  4. Tests for individual abstractions and task driven code should be easy to
    write. You shouldn't need to spin up the whole sled-agent in order to test an
    algorithm or behavior.
  5. Hardware should be abstracted out, and much of the code shouldn't
    deal with hardware directly.

Patterns

Task/Handle

In my opinion, the most important pattern to follow for async Rust software
is what I call the "task/handle" pattern. This pattern helps code maintenance
by explicitly managing state and making task driven code easier to test. All
tasks that provide services to other tasks should follow this pattern. In this
pattern, all state is contained in a type owned by the task and inaccessible
directly to other tasks. The task has a corresponding handle which must be
Send and Clone, and can be used to make requests of the task. The handle
interacts with the task solely via message passing. Mutexes are never used.

A few things fall directly out of this pattern:

  • Code that manipulates state inside and outside the task can be synchronous
  • We don't have to worry about mutex behavior with regards to cancellation,
    await points, etc...
  • Testing becomes easier:
    • We can test a bunch of the code without spawning a task in many
      cases. We just directly construct the state object and call functions,
      whether sync or async.
    • When testing sequential operations that are fire-and-forget, we know when
      they have been processed by the task loop, because our next request will
      only run after those operations. This is a direct result of the FIFO
      channel between handle and task. This is not possible with state shared
      outside the task via a mutex: we must poll that mutex-protected state
      until it either changes to the value we expect or we time out, and if we
      expect no change, we must use a side-channel.
    • We can write complex state handling algorithms, such as those in
      maghemite or bootstore, in a deterministic, synchronous style that
      allows property based testing and model checking in as straightforward
      a manner as possible.
    • We can completely fake the task
      without changing any of the calling code except the constructor. The real
      handle can instead send messages to a fake task. Importantly, this strategy
      can also be used to abstract out hardware for unit tests and simulation.

Beyond all that though, the understanding of how state is mutated and accessed
is clear. State is only mutated inside the task, and can only be retrieved from
code that has a copy of the handle. There is no need for separate _locked
methods, and no need to worry about order of locking for different bits of task
state. This can make the task code itself much shorter and easier to read as
demonstrated by the new StorageManager code in sled_storage. We can also
maintain internal stats for the task, and all of this can be retrieved from the
handle for reporting in omdb.
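
To make the pattern concrete, here is a minimal, self-contained sketch of a
task/handle pair, using a made-up Counter service with illustrative names; this
is not the actual StorageManager code.

```rust
use tokio::sync::{mpsc, oneshot};

// Requests the handle can make of the task. Replies travel over oneshot
// channels embedded in the request itself.
enum Request {
    Increment(u64),
    Get { reply: oneshot::Sender<u64> },
}

// All state lives here, owned by the task; no mutex required.
struct CounterTask {
    rx: mpsc::Receiver<Request>,
    value: u64,
}

impl CounterTask {
    async fn run(mut self) {
        while let Some(req) = self.rx.recv().await {
            match req {
                Request::Increment(n) => self.value += n,
                Request::Get { reply } => {
                    let _ = reply.send(self.value);
                }
            }
        }
    }
}

// The handle is `Send + Clone` and talks to the task solely via messages.
#[derive(Clone)]
struct CounterHandle {
    tx: mpsc::Sender<Request>,
}

impl CounterHandle {
    fn new() -> (CounterHandle, CounterTask) {
        let (tx, rx) = mpsc::channel(1024);
        (CounterHandle { tx }, CounterTask { rx, value: 0 })
    }

    async fn increment(&self, n: u64) {
        self.tx.send(Request::Increment(n)).await.unwrap();
    }

    async fn get(&self) -> u64 {
        let (reply, rx) = oneshot::channel();
        self.tx.send(Request::Get { reply }).await.unwrap();
        rx.await.unwrap()
    }
}

#[tokio::main]
async fn main() {
    let (handle, task) = CounterHandle::new();
    tokio::spawn(task.run());
    handle.increment(2).await;
    // Because the channel is FIFO, the increment has been processed by the
    // time this request is answered.
    assert_eq!(handle.get().await, 2);
}
```

Note that the state-handling logic in CounterTask could also be tested directly
by constructing the struct and calling methods, without spawning anything.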

There is one common question/problem with this pattern: how do you choose a
bounded channel size for handle-to-task messages?

This can be tricky, but in general, you want the channel to act unbounded,
such that sends never fail. This makes the channels reliable, such that we never
drop messages inside the process, and the caller doesn't have to choose what to
do when overloaded. This simplifies things drastically for developers. However,
you also don't want to make the channel actually unbounded, because that can
lead to runaway memory growth and pathological behaviors, such that requests
get slower over time until the system crashes.

After discussions with @jgallagher and reflections from @sunshowers and
@davepacheco on RFD 400, the solution is to choose a large enough bound
such that we should never hit it in practice unless we are truly overloaded.
If we hit the bound, it means requests are building up faster than we can
process them and the system will eventually topple over. So when we hit this
bound, we just go ahead and panic.

Picking a channel bound is hard to do empirically, but practically, if requests are
mostly mutating task-local state, a bound of 1024 or even 8192 should be plenty.
Tasks that must perform longer-running operations can spawn helper tasks as
necessary, or requests can include their own handles for replies rather than
waiting synchronously. Memory for the queue can be kept small by boxing large
messages.
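
A minimal sketch of "treat the bounded channel as if it were unbounded, but
panic if it ever fills," using tokio's try_send; the wrapper name and the bound
are illustrative, not code from this PR.

```rust
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

// Illustrative bound: large enough that it should never fill unless the
// system is truly overloaded.
const QUEUE_BOUND: usize = 1024;

/// Send a request to the task, treating a full queue as a fatal bug rather
/// than forcing callers to handle backpressure.
fn send_or_panic<T>(tx: &mpsc::Sender<T>, msg: T) {
    match tx.try_send(msg) {
        Ok(()) => (),
        Err(TrySendError::Full(_)) => {
            // Requests are piling up faster than the task can drain them;
            // the system would eventually topple over anyway.
            panic!("task queue full ({} messages): overloaded", QUEUE_BOUND);
        }
        Err(TrySendError::Closed(_)) => {
            // The task should live for the lifetime of the process.
            panic!("task queue unexpectedly closed");
        }
    }
}

fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(QUEUE_BOUND);
    send_or_panic(&tx, 7);
    assert_eq!(rx.try_recv().unwrap(), 7);
}
```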

It should be noted that mutexes also have queues that can build up and cause
pathological slowness. The difference is that these queues are implicit and
hard to track.

Isolate subsystems via message-driven decoupling

We have a bunch of managers (HardwareManager, StorageManager,
ServiceManager, InstanceManager) that interact with each other to
provide sled-agent services. However, much of that interaction is ad hoc, with
state shared between tasks via an Inner struct protected by a mutex. It's hard
to isolate each of these managers and test them independently. By ensuring
their API is well defined via a Handle, we can fake out different managers as
needed for independent testing. However, we can go even further, without the
need for dependency injection, by having different managers announce their
updates via broadcast or watch channels.

With this strategy, we can drive tests for a task by using the handle to both
trigger operations, and then wait for the outflow of messages that should occur
rather than mocking out the handle API of another task directly and checking
the interaction via the fake. This is especially useful for lower level services
that others depend upon such as StorageManager, and we should do this when we
can, rather than having tasks talk directly to each other. This strategy leads
directly to the last pattern I really want to mention, the monitor pattern.
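
For illustration, a minimal sketch of this kind of test-by-watching, assuming a
hypothetical manager that publishes its latest snapshot on a tokio watch
channel; DiskSnapshot, FakeStorageManager, and the method names are
illustrative, not the real sled-storage types.

```rust
use tokio::sync::watch;

// Illustrative snapshot type a manager might publish after each change.
#[derive(Clone, Debug, Default)]
struct DiskSnapshot {
    disks: Vec<String>,
}

// The manager owns the sender; anyone holding a receiver (monitors, tests)
// observes updates without calling into the manager directly.
struct FakeStorageManager {
    tx: watch::Sender<DiskSnapshot>,
}

impl FakeStorageManager {
    fn new() -> (Self, watch::Receiver<DiskSnapshot>) {
        let (tx, rx) = watch::channel(DiskSnapshot::default());
        (Self { tx }, rx)
    }

    fn insert_disk(&self, name: &str) {
        // Publish a new snapshot; receivers are notified of the change.
        self.tx.send_modify(|snap| snap.disks.push(name.to_string()));
    }
}

#[tokio::main]
async fn main() {
    let (mgr, mut rx) = FakeStorageManager::new();

    // Drive the operation via the manager, then assert on the outflow of
    // updates rather than mocking another task's handle.
    mgr.insert_disk("u2_0");
    rx.changed().await.unwrap();
    assert_eq!(rx.borrow().disks, vec!["u2_0".to_string()]);
}
```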

High level interaction via monitors

Remember that a primary goal of these patterns is to make the sled-agent easier
to read and discover what is happening at any given point in time. This has to
happen with the inherent concurrency of all the separate behaviors occurring
in the sled-agent such as monitoring for hardware and faults, or handling user
requests from Nexus. If our low-level managers announce updates via channels,
we can have high-level Monitors that wait for those updates and then dispatch
them to other subsystems, or via clients to other sleds and services. This
helps further decouple services for easier testing, and also allows us to
understand all the asynchrony in the system in fewer spots. We can put RAS code
in these monitors and use them as central points to maintain stats and track the
overall load on the system.
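
To sketch what such a monitor might look like, reusing the hypothetical
watch-channel snapshot from the previous section; notify_nexus is a stand-in,
not real sled-agent code.

```rust
use tokio::sync::watch;

#[derive(Clone, Debug, Default)]
struct DiskSnapshot {
    disks: Vec<String>,
}

// Stand-in for pushing an update to Nexus, the zone bundler, etc.
async fn notify_nexus(snapshot: &DiskSnapshot) {
    println!("would notify nexus of {} disks", snapshot.disks.len());
}

// A monitor: the one place that waits on the manager's updates and fans
// them out to other subsystems. Also a natural spot for stats/RAS hooks.
async fn storage_monitor(mut rx: watch::Receiver<DiskSnapshot>) {
    while rx.changed().await.is_ok() {
        // Clone the latest snapshot out of the watch channel so we don't
        // hold the borrow across an await point.
        let snapshot = rx.borrow_and_update().clone();
        notify_nexus(&snapshot).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(DiskSnapshot::default());
    let monitor = tokio::spawn(storage_monitor(rx));

    tx.send_modify(|s| s.disks.push("u2_0".to_string()));

    // Dropping the sender ends the monitor loop once the last unseen
    // update has been observed.
    drop(tx);
    monitor.await.unwrap();
}
```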

The punchline

How do the patterns above satisfy the goals? By decoupling tasks and interacting
solely via message passing we make testing easier (goal 4). By separating out
managers/services into separate tasks and maintaining state locally we satisfy
goals 2 and 5. By making the tasks clear in their responsibilities, and their
APIs evident via their handles, we help satisfy goal 3. Goal 3 is also satisfied
to a degree by the elimination of shared state and the decoupling via monitors.
Lastly, by combining all the patterns, we can spawn all our top level tasks and
monitors in one place after initializing any global state. This allows the code
to flow linearly.

Importantly, it's also easy to violate goal 3 by dropping some mutexes into the
middle of a task and oversharing handles between subsystems. I believe we can
prevent this, and also make testing and APIs clearer, by separating subsystems
into their own Rust packages and then adding monitors for them in the sled-agent.
I took a first cut at this with the StorageManager, as that was the subsystem I was
most familiar with.

Concrete Code changes

Keeping in line with my stated goals and patterns, I abstracted the storage
manager code into its own package called sled-storage. The StorageManager
uses the task/handle pattern, and there is a monitor for it in sled-agent
that takes notifications and updates Nexus and the zone bundler. There are also a
bunch of unit tests where none existed before. The bulk of the rest of the code
changes fell out of this. In particular, now that we have a separate sled-storage
package, all high-level disk management code lives there. Construction and
formatting of partitions still happens in sled-hardware, but anything above the
zpool level occurs in sled-storage. This allows testing of real storage code on
any illumos-based system using file-backed zpools. Importantly, the key-management
code now lives at this abstraction level, rather than in sled-hardware, which
makes more sense since encryption only operates on datasets in ZFS.
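
As a rough illustration of the file-backed zpool approach (not the actual
sled-storage test harness; the pool name, file path, and size are arbitrary),
something along these lines works on an illumos host with sufficient privileges.

```rust
use std::process::Command;

// Create a sparse file-backed vdev and build a zpool on it, so storage code
// can be exercised without real disks.
fn create_file_backed_zpool(pool: &str, backing_file: &str) -> std::io::Result<()> {
    // 1 GiB sparse backing file.
    let status = Command::new("mkfile")
        .args(["-n", "1g", backing_file])
        .status()?;
    assert!(status.success(), "mkfile failed");

    let status = Command::new("zpool")
        .args(["create", pool, backing_file])
        .status()?;
    assert!(status.success(), "zpool create failed");
    Ok(())
}

fn destroy_zpool(pool: &str, backing_file: &str) -> std::io::Result<()> {
    let _ = Command::new("zpool").args(["destroy", pool]).status()?;
    std::fs::remove_file(backing_file)
}

fn main() -> std::io::Result<()> {
    create_file_backed_zpool("testpool", "/var/tmp/testpool.vdev")?;
    // ... exercise storage code against "testpool" here ...
    destroy_zpool("testpool", "/var/tmp/testpool.vdev")
}
```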

I'd like to do similar things to the other managers, and think that's a good way
to continue cleaning up sled-agent.

The other large change in this code base is simplifying the startup of
the bootstrap agent such that all tasks that run for the lifetime of
the sled-agent process are spawned in long_running_tasks. There is a
struct called LongRunningTaskHandles that allows interaction with all the
"managers" spawned here. This change also has the side effect of removing the
wait_while_handling_hardware_updates concurrent future code, since we have
a hardware monitor now running at the beginning of time and can never miss
updates. This also has the effect of negating the need to start a separate
hardware manager when the sled-agent starts up. Because we now know which
tasks are long running, we don't have to save their tokio JoinHandles to display
ownership and thus gain clonability.
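
For orientation, a hedged sketch of the overall shape; the field and function
names are illustrative, and the real LongRunningTaskHandles and
spawn_all_longrunning_tasks in this PR differ in detail.

```rust
use tokio::sync::mpsc;

// Illustrative handle types; in the real code these would be the handles for
// the storage manager, hardware monitor, key manager, bootstore, etc.
#[derive(Clone)]
struct StorageHandle {
    tx: mpsc::Sender<()>,
}

#[derive(Clone)]
struct HardwareMonitorHandle {
    tx: mpsc::Sender<()>,
}

// One clonable struct that gives the rest of sled-agent access to every task
// that lives for the lifetime of the process. No JoinHandles are stored, so
// the struct can derive Clone.
#[derive(Clone)]
struct LongRunningTaskHandles {
    storage: StorageHandle,
    hardware_monitor: HardwareMonitorHandle,
}

// Spawn every long-running task in one place so startup reads linearly.
fn spawn_all_longrunning_tasks() -> LongRunningTaskHandles {
    let (storage_tx, mut storage_rx) = mpsc::channel(1024);
    tokio::spawn(async move { while storage_rx.recv().await.is_some() {} });

    let (hw_tx, mut hw_rx) = mpsc::channel(1024);
    tokio::spawn(async move { while hw_rx.recv().await.is_some() {} });

    LongRunningTaskHandles {
        storage: StorageHandle { tx: storage_tx },
        hardware_monitor: HardwareMonitorHandle { tx: hw_tx },
    }
}

#[tokio::main]
async fn main() {
    let handles = spawn_all_longrunning_tasks();
    // The handles can be cloned and passed to subsystems that need them.
    let _for_bootstrap = handles.clone();
}
```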

Open Questions

  • I'm not really a fan of the name long_running_task_handles. Is there a better
    name for these?
  • Instead of calling spawn_all_longrunning_tasks should we just call the
    individual spawn functions directly in bootstrap/pre_server? Doing it this
    way would allow the ServiceManager to also live in the long running handles
    and remove some ordering issues. For instance, we wait for the bootdisk inside
    spawn_all_longrunning_tasks and we also upsert synthetic disks. Neither
    of these is really part of spawning tasks.

@andrewjstone (Contributor, Author):

@smklein @jgallagher I believe I resolved all your comments. Please take another look.

There are two remaining issues that I left open with responses that I'd like to defer. The one about the test method I think is relatively minor, and I'm not sure it could be fixed in a clearer manner unless I inlined a cfg into the ensure_using_exactly_these_disks production method, which I like even less than what exists now. The other is about a global used for mocks, which I opened an issue for. The global no longer gets compiled into production code, so that takes my concern down a notch at least.

@andrewjstone (Contributor, Author):

One more thing: I plan to test this on madrid tomorrow.

@andrewjstone (Contributor, Author):

Tested on madrid and RSS worked, and I was able to add sled 15 to the cluster using the other new code recently merged in. I will also do a cold-boot test to double-check after demos, but I think this is good to go.

@andrewjstone (Contributor, Author):

Cold boot succeeded.

@andrewjstone (Contributor, Author):

Damn, this only worked on the initial nodes / scrimlet. Rebooting the newly added node results in a crash due to the FSM already being initialized. I need to ignore that error in the cold boot path. That's not related to this PR, though, but to code that's already merged. I'll open an issue to fix it.

@andrewjstone (Contributor, Author):

The cold-boot fix has been merged into main and I just merged main into this branch.

@jgallagher (Contributor) left a comment:

Sorry for the delay on following up here - the changes look great. Left a couple more nits; the only real concern is the oneshot channels interacting with failure (I really like the oneshot channel change in general though!). Happy to spitball how to address that if I understood it correctly and there's a real problem there.

sled-storage/src/manager.rs (outdated review comment, resolved)
sled-storage/src/resources.rs (outdated review comment, resolved)
underlay_available_tx,
) => {
let request_id = request.body.id;
// Extract from an option to satisfy the borrow checker
Contributor:

Hmm, is this safe? Two possible issues:

  1. Cancellation issues (if start_sled_agent is cancelled, we don't update self.state but have stolen the channel from it) - probably fine to just comment that this can't happen if that's true
  2. If start_sled_agent fails, we don't change self.state, so we're left in Bootstrapping with no channel to take next time

Can we address both by moving this take() down into the Ok(_) branch?
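
For illustration only, a minimal sketch of what moving the take() into the Ok
branch could look like, with made-up types (State, try_start) standing in for
the real bootstrap state machine rather than the code under review.

```rust
use tokio::sync::oneshot;

// Purely illustrative state machine; not the actual sled-agent types.
enum State {
    Bootstrapping { underlay_available_tx: Option<oneshot::Sender<u32>> },
    Running,
}

// Stand-in for `start_sled_agent`; it may fail.
async fn try_start() -> Result<u32, &'static str> {
    Ok(42)
}

async fn step(state: &mut State) {
    match try_start().await {
        Ok(value) => {
            // Only steal the channel once we know we are transitioning out
            // of `Bootstrapping`.
            if let State::Bootstrapping { underlay_available_tx } = state {
                let tx = underlay_available_tx
                    .take()
                    .expect("sender consumed twice");
                let _ = tx.send(value);
            }
            *state = State::Running;
        }
        Err(_e) => {
            // `state` (and the sender inside it) is untouched, so a retry
            // of this request can still succeed.
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    let mut state = State::Bootstrapping { underlay_available_tx: Some(tx) };
    step(&mut state).await;
    assert_eq!(rx.await.unwrap(), 42);
}
```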

andrewjstone (Contributor, Author):

This is absolutely not safe. @jgallagher and I talked about this and it's also non-trivial to make it such that start_sled_agent can fail at all and the system still works. We need to rework the persistent state management to allow that to work, and that is going to be very tricky to get right. This also affects RSS since the code to add a sled reuses the bulk of the RSS path.

In the short term I'm going to change this error path to panic explicitly and add some comments around why, and open an issue for the larger problem of persistent state / idempotence.

andrewjstone (Contributor, Author):

Documented and issue opened in cfc3ef7

&self.bootstore_handles.node_handle,
&self.managers,
self.long_running_task_handles.clone(),
underlay_available_tx.take().unwrap(),
Contributor:

Does this have the same issue as above, where if we fail we'll panic the next time through this function? Except I don't have a great suggestion here, since we need this channel before we know whether this will succeed.

andrewjstone (Contributor, Author):

Documented and issue opened in cfc3ef7

sled-storage/src/manager.rs (outdated review comment, resolved)
// The sending side never disappears because we hold a copy
let req = self.rx.recv().await.unwrap();
info!(self.log, "Received {:?}", req);
let should_send_updates = match req {
Collaborator:

This update organization here makes a lot of sense to me - it's clear, and they propagate in one spot.

@andrewjstone enabled auto-merge (squash) November 14, 2023 19:32
@andrewjstone merged commit 2fc0dfd into main on Nov 14, 2023 (21 of 22 checks passed)
@andrewjstone deleted the sled-storage branch November 14, 2023 20:08
mkeeter added a commit to oxidecomputer/crucible that referenced this pull request Nov 20, 2023
Right now, the Crucible upstairs spends a lot of time fighting over the lock on
the single `Mutex<Downstairs>`.

Eventually, I'd like to move data into separate per-downstairs tasks that own
their data and communicate via message-passing (see oxidecomputer/omicron#4332).

This PR is a (mostly) purely mechanical step in that direction: it converts a
bunch of individual `ClientData<T>` into a single
`ClientData<DownstairsClient>`; basically a struct-of-arrays → array-of-structs
transform. Then, functions which only use data within a single
`DownstairsClient` are moved into member functions on that struct.