-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement randomness server in rust #50
Conversation
The new version of the nitriding framework can proxy http traffic to a separate server running inside the enclave. This allows us to replace the ffi calls from golang with a pure-rust server implementation, which is less error-prone.
Use the `info!` convenience macro with the default logger, which reads the `RUST_LOG` environment variable. So run with: RUST_LOG=tower_http=trace,star_randsrv=debug cargo run for reasonable output. By default it's `ERROR` level only, so there's no convenience info about requests or listening port; would be nice to default to `star_randsrv=info` or something like that.
Define structs for the request and response and plumb them through with a dummy response value.
FIXME: only processes the first point. Note that the `State` argument must come *before* the request, not after!
Needs better error handling.
Use the current version of the ppoprf which cleans up some unneeded dependencies. Bump everything else to the latest compatible version.
This allows multiple concurrent reads to the ppoprf struct for evaluations. We only need exclusive access to reset or puncture.
Make the API docs slightly less sparse
Spawn a background task to advance the current epoch on a fixed schedule. When it runs out of epochs it eventually panics the main thread instead of re-initializing the ppoprf state.
Instead of leaving all epochs puctured and later panicking, create and fresh OPRFServer instance with the same set of epochs and continue with that. Abstract the struct initialization into OPRFServer::new() so it can be called from both locations, Move the `future_epochs` initialization inside the background loop so the same code can be used each time a new state is initialized.
Define a response type for errors and convert various unwrap() and expect() calls to return that instead. Replace the iterator chain in the randomness handler with an explicit loop so we can return early if there's an error in the request data.
Turns out enum variants with inner values don't just look like functions, they actually work as a FnOnce! Also stonger camelcasing.
The previous code reset the `future_epochs` vector before the empty check that was supposed to trigger re-initialization. So OprfServer.epoch was updated, but the actual set of viable epochs was punctured, resulting in perpetual "No prefix found" errors. Instead, assume we'll use contiguous blocks of epoch numbers and store them as a Range, which is smaller and more convenient. `EpochRange` never needs to change and can be passed to `OprfServer::new` each time the key is rotated. Note that because `Range` and `InclusiveRange` are different types, we can't both keep the range as `u8` and include epoch 255. I though code simplicity was more important here.
Instead of linking over ffi from the go nitriding framework, build it as a separate proxy daemon which runs inside the the same container as star-randsrv.
Having `nitriding` as a submodule complicates copying just that subtree to the go-builder, so we copy everything, which is slow if there are local build files.
Build and lint the local application. Assume the nitriding repo takes care of itself.
Rather than installing golangci-lint so the default Makefile target in nitriding/cmd can run it, just build the `nitriding` target directly. This is a faster build and follows our earlier policy of treating the submodule as a separate unit for testing.
Copy `star-randsrv` from the correct path so both executables are present in the final container.
This is 5x larger than the alpine containers, but provides the expected execution environment.
Go back to using alpine linux for the runtime, which requires building with CGO disabled, and on the rust:alpine build container to avoid dependencies on glibc. This is about 25% the size of the debian:slim runtime container.
Which to a revision of `nitriding` that builds a static executable by default so we can invoke the Makefile but still use it with our alpine runtime container.
Doesn't return the timestamp at the end of the epoch. We don't support returning the zk proofs, so returning the public key isn't useful, but neither does the current golang version, and I wanted to match its api. I expect that to change when we implement batch proofs anyway.
Instead of accepting any unpunctured epoch, and letting the ppoprf eval fail for punctured ones, filter out non-current epochs when the request is processed. Also enforce the MAX_POINTS limit on requests. By this point the point vec has already been constructed, so this doesn't affect the amount of memory consumed, but it still limits the cpu cost of each request, making any denial-of-service mitigation based on network flows more effective.
Set up a testing framework inside the main implementation to check the handlers without the overhead of a network socket. Based on the `testing` example from the axum repository. Test that the `/` route returns some text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a bunch of comments. I'm not in a good position to review the correctness of the Rust code. For that, @DJAndries is probably the best fit.
This commit makes two changes: 1. Invoke kaniko with the flag '--custom-platform linux/amd64'. This is necessary when building star-randsrv on non-Linux, non-amd64 platforms like macOS. 2. Use an intermediate build layer to add start.sh. If we don't do this, we may end up with a build layer that contains inconsistent file permissions from the host operating system. With the above two changes, it's now possible to arrive at identical image IDs, even when building star-randsrv on Linux (amd64) and macOS (arm64).
Make star-randsrv build reproducibly.
If this happens it indicates an internal problem with the state management, and the server likely needs to be restarted. Signal this so the more serious error case is obvious. Other error variants should be the result of bad client input and don't affect the usability of the service.
The ppoprf epoch tag is a `u8`. When using the full epoch range 0..=255, the final increment would overflow. In debug builds this would panic. In release builds, it would roll over unchecked, then panic at the end of the next interval when it tried to puncture the already-punctured starting epoch value. In either case, the panic happened while the write lock was held, poisoning it. The tokio thread runner would catch the panic, but with the `RwLock` poisoned, it couldn't respond to further queries. Instead, check for overflow and use that to trigger key rotation, just as we do when the epoch is out of the configure range. This is more localized than promoting the epoch counter and range to u16 and narrowing at the ppoprf calls.
Stumbled upon one more thing: we should modify (or remove) .github/workflows/makefile.yaml, which is based on the old Go code. |
Addresses a clippy lint.
Don't we still need a golang environment to build the nitriding daemon? The default makefile target runs |
Do we really need these lines when simply running |
The default makefile target just builds, tests, lints and audits the rust application, so we no longer need a go environment for this job.
Ah, you're right. The default makefile target doesn't build the nitriding daemon. so this is unnecessary. Removed in e75c152. |
|
||
# Build the web server application itself. | ||
# Use the -alpine variant so it will run in a alpine-based container. | ||
FROM public.ecr.aws/docker/library/rust:1.68.2-alpine as rust-builder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the pin on the minor version needed here? Normally we'd ask to use 1.68-alpine
here in order to get any minor fixes / patches, but maybe there's attestation requirements that mean we need to pin to the full version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume we start an enclave based on x.y-alpine. Shortly after, x.y.z-alpine comes out. A user is trying to verify our deployment based on the newly-release x.y.z-alpine and notices that it fails because the enclave still runs x.y-alpine.
We could work around that by updating the enclave whenever a new patch comes out.
An even better solution would be to get rid of alpine altogether and only have the statically-compiled nitriding and star-randsrv in the image. For now, we only need alpine because of the shell script that starts nitriding and star-randsrv — a very heavy dependency for a simple task. I filed brave/nitriding#58 for this improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully tls-alpn means we don't need a trusted ca certificate package either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fmarier Generally I agree that specifying a particular point release tag bitrots quickly and misses out on useful fixes. Because in the particular application reproducible builds are important, tracking the specific versions of our dependencies, as we do with go.sum
and Cargo.lock
is a worthwhile trade-off.
Hopefully we can use renovatebot or similar to make sure we bump regularly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let's lean on Renovate for now, and then hopefully eliminate the alpine image entirely later.
Configure Renovate
I think we're ready to merge here. Asking for final review; remaining issues should be follow-ups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments for a couple minor changes. Otherwise looks good!
This was for testing with the deployable container build pipeline during development. Remove before merging into the default branch since it will no longer be needed there.
I'm told this is no longer necessary as the pipeline is using a different solution.
Security ci job failure looks like problems with the ubuntu package mirror. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
ci failure resolved by brave/security-action#164. Green now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. please address #54 soon as it will act as an important fail safe for crashes, shouldn't require much effort, just one small func.
Eliminate the C ABI ffi bindings between the golang nitriding daemon and the rust ppoprf implementation by serving the /randomness and /info endpoints over http from rust and using the nitriding v2 proxy to terminate tls and forward requests from the public interface.
/randomness
endpoint for ppoprf evaluation/info
endpoint with current epoch and zk public keys/info
endpointnitriding
daemonResolves #49, #16, and #14.