Offload write encryption #1066
Conversation
Why do we need to preserve the ordering of incoming `BlockReq`?
This failed a CI check. Since the GitHub UI makes it easy to lose track of these things, I'm linking it here for reference. Edit: this is #1077, so unrelated to this branch.
This is preserving the order of incoming `BlockReq`. We need to preserve order for correctness: it would be wrong if the guest sent a write to a block followed by a read of that same block, and we reordered that to perform the read before the write.
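To make the correctness argument concrete, here is a minimal std-only model (hypothetical types, not Crucible's real `BlockReq`): a write to block 0 followed by a read of block 0 must observe the written data, while the swapped ordering observes stale data.

```rust
use std::collections::HashMap;

// Toy model of a request stream against an in-memory "disk".
#[derive(Clone)]
enum BlockReq {
    Write { block: u64, data: u8 },
    Read { block: u64 },
}

// Apply requests in sequence; collect the value seen by each read.
fn apply(disk: &mut HashMap<u64, u8>, reqs: &[BlockReq]) -> Vec<u8> {
    let mut reads = Vec::new();
    for req in reqs {
        match req {
            BlockReq::Write { block, data } => {
                disk.insert(*block, *data);
            }
            BlockReq::Read { block } => {
                // Unwritten blocks read back as zero.
                reads.push(disk.get(block).copied().unwrap_or(0));
            }
        }
    }
    reads
}

fn main() {
    let in_order = vec![
        BlockReq::Write { block: 0, data: 0xAA },
        BlockReq::Read { block: 0 },
    ];
    // In-order: the read sees the freshly written byte.
    assert_eq!(apply(&mut HashMap::new(), &in_order), vec![0xAA]);

    // Reordered: the read runs first and sees stale (zero) data.
    let reordered = vec![in_order[1].clone(), in_order[0].clone()];
    assert_eq!(apply(&mut HashMap::new(), &reordered), vec![0x00]);
}
```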
Force-pushed from ee6e51a to aa99cde
This comment was marked as outdated.
Force-pushed from eb1a91d to 63bdfe1
#1064 added new logic to the `BlockReq`: if the notification channel was dropped without a reply having been sent, Crucible would panic. After a bunch of discussion, I no longer think this extra complexity is worth it, and would like to remove it.

## Why did we do this in the first place?

In general, we expect a `BlockReq` to _always_ be answered by Crucible. If one is dropped without a reply being sent, it indicates a programmer error. This behavior was inspired by similar logic in Propolis (https://github.com/oxidecomputer/propolis/blob/85dc6db36709981c1d35b0651a18e20caee0b73f/lib/propolis/src/block/mod.rs#L176-L182).

## Propolis and Crucible are different

The `Request` in Propolis is built around a `device::TrackingMarker`. If it is dropped without a reply, then the listening side will **wait forever**; this is a potential liveness issue, so turning it into a panic is sensible.

The `BlockReq` in Crucible is built around a `tokio::sync::oneshot` channel. If it is dropped without a reply, then the `oneshot::Receiver` will return immediately with an error. In fact, we handle that case in our `BlockReqWaiter`, translating this error into `CrucibleError::RecvDisconnected` (https://github.com/oxidecomputer/crucible/blob/e813da3f/upstairs/src/block_req.rs#L67). In other words, we're _already_ handling the case of a `BlockReq` being dropped without a reply (by reporting an error). There is no potential liveness issue. (This also means that the block comment in `BlockRes::drop`, https://github.com/oxidecomputer/crucible/blob/e813da3f/upstairs/src/block_req.rs#L39-L40, is incorrect.)

It's arguably better to panic upon programmer error, versus going through normal error-handling paths (which defer handling to the caller of the `Guest` API). However...

## False positives during program exit

There is one time when a `BlockReq` can be dropped without a reply: during program exit (or scope exit, for unit tests).
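The key behavioral difference can be demonstrated with a std-library analogue (`std::sync::mpsc` standing in for `tokio::sync::oneshot`): dropping the sending half without ever replying makes the receiver return an error immediately rather than blocking forever, which is why Crucible can surface an error like `CrucibleError::RecvDisconnected` instead of hanging.

```rust
use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // Drop the sender without ever sending a reply.
    drop(tx);

    // recv() does not block forever: it returns RecvError right away,
    // so the waiting side can translate this into a normal error path.
    assert!(rx.recv().is_err());
}
```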
As far as I can tell, this panic **hasn't caught any true programmer errors**. Instead, it's found a bunch of cases where `BlockReq`s are dropped during exit, and we've had to add `Drop` implementations to prevent it from firing. The original PR added `impl Drop for Guest` (https://github.com/oxidecomputer/crucible/blob/e813da3f/upstairs/src/lib.rs#L2424-L2446), but it missed a sneaky case where `BlockReq`s were stored in the `UpstairsState`; a later PR added `impl Drop for Upstairs` (https://github.com/oxidecomputer/crucible/blob/e813da3f/upstairs/src/upstairs.rs#L194-L207) to avoid panics during certain unit tests.

## Future false positives

This check also makes it easy to accidentally create false positives when writing new code. The `offload-write-encryption` branch (#1066) moves some `BlockReq`s into a thread pool to perform encryption outside of the main task. However, because we didn't add special handling for program exit, this caused a panic at shutdown.

## This is hard to do by accident

Remember what we're actually trying to protect against: someone dropping a `BlockReq` without replying. Because we've annotated relevant functions with `#[must_use]`, this would require either ignoring warnings or explicitly discarding the `BlockReq`. It's hard to do by accident!
## Summary

- This `panic!` guards against something that's hard to do by accident
- If we **do** make this mistake, it's reported as a distinct error to the caller
- Unlike Propolis, there's no liveness concern
- So far, 100% of the issues it has discovered are false positives at program exit
- It is easy to introduce similar false positives when writing new code
- New false positives will manifest as panics during program exit
- If we introduce new false positives, we will not discover the issue until runtime (and it may be rare, because Tokio does not destroy tasks in a deterministic order)

## Proposal

- Remove the `panic!`
- Remove the `Drop` implementations which work around it at exit
- Add logging for this case:
  - Fire a DTrace probe if a `BlockReq` is dropped without a reply
  - Log when we see a `CrucibleError::RecvError`
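A minimal sketch of the proposed replacement, using hypothetical names (not Crucible's actual code): the `Drop` impl records the unreplied drop instead of panicking. Here a static counter and an `eprintln!` stand in for the DTrace probe and log line.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counter standing in for a DTrace probe firing.
static DROPPED_WITHOUT_REPLY: AtomicUsize = AtomicUsize::new(0);

// Hypothetical reply handle: tracks whether a reply was ever sent.
struct BlockRes {
    replied: bool,
}

impl BlockRes {
    fn new() -> Self {
        BlockRes { replied: false }
    }

    fn send_ok(mut self) {
        self.replied = true;
        // ... send the actual reply here ...
    }
}

impl Drop for BlockRes {
    fn drop(&mut self) {
        if !self.replied {
            // Previously: panic!("BlockRes dropped without a reply").
            // Proposed: record the event and carry on.
            DROPPED_WITHOUT_REPLY.fetch_add(1, Ordering::Relaxed);
            eprintln!("BlockRes dropped without a reply");
        }
    }
}

fn main() {
    BlockRes::new().send_ok(); // replied: no event recorded
    drop(BlockRes::new()); // dropped unreplied: event recorded, no panic
    assert_eq!(DROPPED_WITHOUT_REPLY.load(Ordering::Relaxed), 1);
}
```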
- ha ha ha types!
- Docstrings, etc
- Move deferred stuff to a separate function
- Try to minimize the diff against `main`
- Fix typos
- Explain is_empty behavior and fix typo
Force-pushed from 63bdfe1 to b83c1ca
upstairs/src/deferred.rs
Outdated
```diff
@@ -0,0 +1,186 @@
+// Copyright 2023 Oxide Computer Company
```
2024!!!!!!!!!!
Fixed in 85b6994, WELCOME TO THE FUTURE
Analogous to #1066, this PR moves read decryption into the rayon thread pool. It uses exactly the same infrastructure (particularly `DeferredQueue`), and is implemented in a very similar way.
This PR moves write encryption out of the main task.

Doing so is harder than it sounds, because we want to preserve ordering of incoming `BlockReq`, but only `BlockReq::Write` and `BlockReq::WriteUnwritten` need this encryption offload.

I turned to `FuturesOrdered`, which allows you to store a set of futures which resolve in FIFO order. For most `BlockReq`, we await a `futures::future::ready(..)`, which simply provides the same `BlockReq` back to us. For writes, we offload the work using `rayon::spawn`, then return the result through a `tokio::sync::oneshot` channel. This strategy is recommended by this blog post by a Tokio maintainer. We can then store `Either<Ready, oneshot::Receiver>` in a `FuturesOrdered`, which works out nicely.

The PR introduces a new generic `DeferredQueue` to organize these futures. (The `DeferredQueue` will also be used to defer read decryption in a subsequent PR.) When a `BlockReq` comes in, we put it into this queue\*; futures being available on the queue are another event source for the `select / apply` loop.

This is a significant speedup for large random writes:

(benchmark table comparing `main` and `offload-write-encryption` not recoverable from this capture)

I'm not sure if the performance delta for 4K writes is significant; if folks are worried, we could special-case sufficiently small writes to be executed in the main task.

\*as an optimization, we execute non-write `BlockOp`s immediately if there is nothing in the queue, to avoid an extra event