
Re-implement randomness server in rust #50
Merged · 93 commits · Apr 24, 2023

Conversation

@rillian (Contributor) commented Feb 17, 2023

Eliminate the C ABI FFI bindings between the golang nitriding daemon and the Rust ppoprf implementation by serving the /randomness and /info endpoints over HTTP from Rust, and by using the nitriding v2 proxy to terminate TLS and forward requests from the public interface.

  • Implement a local webserver in Rust
  • Port /randomness endpoint for ppoprf evaluation
  • Port /info endpoint with current epoch and zk public keys
  • Port next epoch timestamp to /info endpoint
  • Replace old go webservice with the nitriding daemon
  • Command line switches for configuration
  • Tests

Resolves #49, #16, and #14.
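
For orientation only, here is a minimal sketch of that shape, assuming an axum 0.6-era API; the routes' methods, port, and handler bodies are placeholders, not the code in this PR:

    use axum::{
        routing::{get, post},
        Router,
    };
    use std::net::SocketAddr;

    #[tokio::main]
    async fn main() {
        // Placeholder routes and methods; the real handlers evaluate the
        // ppoprf and report epoch info. nitriding terminates TLS on the
        // public interface and proxies requests to this local listener.
        let app = Router::new()
            .route("/randomness", post(|| async { "randomness placeholder" }))
            .route("/info", get(|| async { "info placeholder" }));

        // The listen address/port here is an assumption.
        let addr = SocketAddr::from(([127, 0, 0, 1], 8080));
        axum::Server::bind(&addr)
            .serve(app.into_make_service())
            .await
            .expect("server error");
    }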

renovate bot and others added 27 commits November 3, 2022 17:41
The new version of the nitriding framework can proxy http traffic
to a separate server running inside the enclave. This allows us
to replace the ffi calls from golang with a pure-rust server
implementation, which is less error-prone.
Use the `info!` convenience macro with the default logger,
which reads the `RUST_LOG` environment variable. So run with:

    RUST_LOG=tower_http=trace,star_randsrv=debug cargo run

for reasonable output.

By default it's `ERROR` level only, so there's no convenience
info about requests or listening port; would be nice to default
to `star_randsrv=info` or something like that.
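
A possible way to get that default, assuming the `tracing_subscriber` crate with its env-filter support (a sketch, not what the commit actually ships):

    use tracing_subscriber::EnvFilter;

    fn init_logging() {
        // Respect RUST_LOG when set; otherwise fall back to a default
        // filter so request and listen-address info is still visible.
        // The filter string here is only a suggestion.
        let filter = EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("star_randsrv=info,tower_http=info"));
        tracing_subscriber::fmt().with_env_filter(filter).init();
    }
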
Define structs for the request and response and plumb them
through with a dummy response value.
FIXME: only processes the first point.

Note that the `State` argument must come *before* the request,
not after!
Needs better error handling.
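
To illustrate the ordering constraint, a hedged sketch of a handler signature; the request/response types and the `OprfState` alias are hypothetical:

    use axum::{extract::State, Json};
    use serde::{Deserialize, Serialize};
    use std::sync::Arc;

    #[derive(Deserialize)]
    struct RandomnessRequest {
        points: Vec<String>,
    }

    #[derive(Serialize)]
    struct RandomnessResponse {
        points: Vec<String>,
    }

    // Stand-in for the shared ppoprf state.
    type OprfState = Arc<()>;

    // `State` must come *before* the body-consuming `Json` extractor;
    // axum only accepts the request-body extractor as the final argument.
    async fn randomness(
        State(_state): State<OprfState>,
        Json(request): Json<RandomnessRequest>,
    ) -> Json<RandomnessResponse> {
        // Dummy response: echo the request points back.
        Json(RandomnessResponse {
            points: request.points,
        })
    }
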
Use the current version of the ppoprf which cleans up some
unneeded dependencies. Bump everything else to the latest
compatible version.
This allows multiple concurrent reads of the ppoprf struct for
evaluations. We only need exclusive access to reset or puncture.
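
A minimal sketch of that locking pattern, assuming the state lives behind `Arc<std::sync::RwLock<…>>`; the struct and function names are illustrative:

    use std::sync::{Arc, RwLock};

    struct OprfServer {
        epoch: u8,
        // ... ppoprf key material would live here
    }

    type OprfState = Arc<RwLock<OprfServer>>;

    fn evaluate(state: &OprfState) -> u8 {
        // Many evaluation requests can hold read locks concurrently.
        let server = state.read().expect("lock poisoned");
        server.epoch
    }

    fn rotate(state: &OprfState) {
        // Puncturing an epoch or rotating the key needs exclusive access.
        let mut server = state.write().expect("lock poisoned");
        server.epoch = server.epoch.wrapping_add(1); // placeholder work
    }
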
Make the API docs slightly less sparse
Spawn a background task to advance the current epoch on a fixed
schedule. When it runs out of epochs it eventually panics the
main thread instead of re-initializing the ppoprf state.
Instead of leaving all epochs punctured and later panicking, create
a fresh OPRFServer instance with the same set of epochs and
continue with that.

Abstract the struct initialization into OPRFServer::new() so it
can be called from both locations. Move the `future_epochs`
initialization inside the background loop so the same code
can be used each time a new state is initialized.
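
Roughly, the background loop could be shaped like the sketch below; the tokio task, epoch length, and names are assumptions, not the actual implementation:

    use std::ops::Range;
    use std::sync::{Arc, RwLock};
    use std::time::Duration;

    struct OprfServer {
        epoch: u8,
    }

    impl OprfServer {
        fn new(epochs: &Range<u8>) -> Self {
            OprfServer { epoch: epochs.start }
        }
    }

    async fn epoch_loop(
        state: Arc<RwLock<OprfServer>>,
        epochs: Range<u8>,
        epoch_len: Duration,
    ) {
        loop {
            // `future_epochs` is rebuilt on every (re)initialization, so
            // start-up and key rotation share the same code path.
            let mut future_epochs: Vec<u8> = epochs.clone().rev().collect();
            future_epochs.pop(); // the first epoch is already active

            while let Some(next) = future_epochs.pop() {
                tokio::time::sleep(epoch_len).await;
                let mut server = state.write().expect("lock poisoned");
                // A real implementation would puncture the expired epoch here.
                server.epoch = next;
            }

            // Out of epochs: rotate to a fresh OprfServer with the same
            // epoch range instead of panicking.
            tokio::time::sleep(epoch_len).await;
            let mut server = state.write().expect("lock poisoned");
            *server = OprfServer::new(&epochs);
        }
    }
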
Define a response type for errors and convert various unwrap()
and expect() calls to return that instead.

Replace the iterator chain in the randomness handler with an
explicit loop so we can return early if there's an error in
the request data.
Turns out enum variants with inner values don't just look like
functions, they actually work as an FnOnce!

Also stronger camelcasing.
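
For example, with purely illustrative types, a tuple-variant constructor can stand in for a closure:

    #[derive(Debug)]
    enum Error {
        BadPoint(std::num::ParseIntError),
    }

    // `Error::BadPoint` is itself an `FnOnce(ParseIntError) -> Error`,
    // so it can replace a closure like `|e| Error::BadPoint(e)`.
    fn parse_point(input: &str) -> Result<u32, Error> {
        input.parse::<u32>().map_err(Error::BadPoint)
    }
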
The previous code reset the `future_epochs` vector before the
empty check that was supposed to trigger re-initialization.
So OprfServer.epoch was updated, but the actual set of viable
epochs was punctured, resulting in perpetual "No prefix found"
errors.

Instead, assume we'll use contiguous blocks of epoch numbers
and store them as a Range, which is smaller and more convenient.
`EpochRange` never needs to change and can be passed to
`OprfServer::new` each time the key is rotated.

Note that because `Range` and `RangeInclusive` are different
types, we can't both keep the range as `u8` and include epoch
255. I thought code simplicity was more important here.
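
A small demonstration of that trade-off between `Range<u8>` and `RangeInclusive<u8>`:

    use std::ops::{Range, RangeInclusive};

    fn main() {
        // A half-open Range<u8> such as 0..255 covers epochs 0 through 254.
        let epochs: Range<u8> = 0..255;
        assert!(epochs.contains(&254));
        assert!(!epochs.contains(&255));

        // Including epoch 255 would need a different type, e.g.
        // RangeInclusive<u8>, or widening to Range<u16>.
        let all: RangeInclusive<u8> = 0..=255;
        assert!(all.contains(&255));
    }
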
Instead of linking over ffi from the go nitriding framework,
build it as a separate proxy daemon which runs inside the
same container as star-randsrv.
Having `nitriding` as a submodule complicates copying just
that subtree to the go-builder, so we copy everything, which
is slow if there are local build files.
Build and lint the local application. Assume the nitriding
repo takes care of itself.
Rather than installing golangci-lint so the default Makefile
target in nitriding/cmd can run it, just build the `nitriding`
target directly. This is a faster build and follows our earlier
policy of treating the submodule as a separate unit for testing.
Copy `star-randsrv` from the correct path so both executables
are present in the final container.
This is 5x larger than the alpine containers, but provides
the expected execution environment.
Go back to using alpine linux for the runtime, which requires
building with CGO disabled, and on the rust:alpine build
container to avoid dependencies on glibc.

This is about 25% the size of the debian:slim runtime container.
Switch to a revision of `nitriding` that builds a static executable
by default so we can invoke the Makefile but still use it with
our alpine runtime container.
Doesn't return the timestamp at the end of the epoch.

We don't support returning the zk proofs, so returning the public
key isn't useful, but the current golang version doesn't support
them either, and I wanted to match its api. I expect that to change
when we implement batch proofs anyway.
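
The /info payload sketched below is a guess at the shape implied here; the field names only loosely mirror the golang API and are not the actual schema:

    use serde::Serialize;

    #[derive(Serialize)]
    #[serde(rename_all = "camelCase")]
    struct InfoResponse {
        current_epoch: u8,
        // Not populated yet: when the current epoch ends.
        next_epoch_time: Option<String>,
        // Returned only to match the existing golang API; not useful
        // until zero-knowledge proofs are supported.
        public_key: String,
    }
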
Instead of accepting any unpunctured epoch, and letting the
ppoprf eval fail for punctured ones, filter out non-current
epochs when the request is processed.

Also enforce the MAX_POINTS limit on requests. By this point
the point vec has already been constructed, so this doesn't
affect the amount of memory consumed, but it still limits
the cpu cost of each request, making any denial-of-service
mitigation based on network flows more effective.
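
In outline, the request checks described above could look like this; `MAX_POINTS` and the error variants are placeholders:

    const MAX_POINTS: usize = 1024; // assumed limit, not the real value

    #[derive(Debug, PartialEq)]
    enum RequestError {
        WrongEpoch,
        TooManyPoints,
    }

    fn check_request(
        requested_epoch: u8,
        current_epoch: u8,
        point_count: usize,
    ) -> Result<(), RequestError> {
        // Reject non-current epochs up front instead of letting the
        // ppoprf evaluation fail on a punctured epoch.
        if requested_epoch != current_epoch {
            return Err(RequestError::WrongEpoch);
        }
        // Bound the per-request cpu cost even though the points have
        // already been parsed by this stage.
        if point_count > MAX_POINTS {
            return Err(RequestError::TooManyPoints);
        }
        Ok(())
    }
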
Set up a testing framework inside the main implementation to check
the handlers without the overhead of a network socket. Based on
the `testing` example from the axum repository.

Test that the `/` route returns some text.
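
That pattern drives the `Router` in-process with `tower::ServiceExt::oneshot`; the sketch below follows the axum example's approach, but the router constructor and assertions are illustrative, not the repository's actual tests:

    #[cfg(test)]
    mod tests {
        use axum::{
            body::Body,
            http::{Request, StatusCode},
            routing::get,
            Router,
        };
        use tower::ServiceExt; // for `oneshot`

        fn app() -> Router {
            // Stand-in for the real router constructor.
            Router::new().route("/", get(|| async { "STAR randomness server" }))
        }

        #[tokio::test]
        async fn index_returns_text() {
            let response = app()
                .oneshot(Request::builder().uri("/").body(Body::empty()).unwrap())
                .await
                .unwrap();

            assert_eq!(response.status(), StatusCode::OK);
            let body = hyper::body::to_bytes(response.into_body()).await.unwrap();
            assert!(!body.is_empty());
        }
    }
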
@rillian self-assigned this Feb 17, 2023
@NullHypothesis (Contributor) left a comment

Left a bunch of comments. I'm not in a good position to review the correctness of the Rust code. For that, @DJAndries is probably the best fit.

Review threads (resolved): Cargo.toml, Dockerfile, start.sh, src/main.rs
Philipp Winter and others added 3 commits April 11, 2023 11:39
This commit makes two changes:

1. Invoke kaniko with the flag '--custom-platform linux/amd64'.  This is
   necessary when building star-randsrv on non-Linux, non-amd64
   platforms like macOS.

2. Use an intermediate build layer to add start.sh.  If we don't do
   this, we may end up with a build layer that contains inconsistent
   file permissions from the host operating system.

With the above two changes, it's now possible to arrive at identical
image IDs, even when building star-randsrv on Linux (amd64) and macOS
(arm64).
Make star-randsrv build reproducibly.
If this happens it indicates an internal problem with the state
management, and the server likely needs to be restarted. Signal
this so the more serious error case is obvious.

Other error variants should be the result of bad client input
and don't affect the usability of the service.
@rillian mentioned this pull request Apr 18, 2023
The ppoprf epoch tag is a `u8`. When using the full epoch range
0..=255, the final increment would overflow. In debug builds this
would panic. In release builds, it would roll over unchecked, then
panic at the end of the next interval when it tried to puncture
the already-punctured starting epoch value. In either case, the
panic happened while the write lock was held, poisoning it.

The tokio thread runner would catch the panic, but with the
`RwLock` poisoned, it couldn't respond to further queries.

Instead, check for overflow and use that to trigger key rotation,
just as we do when the epoch is out of the configured range. This
is more localized than promoting the epoch counter and range to
u16 and narrowing at the ppoprf calls.
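
A minimal sketch of the overflow-aware advance (illustrative only; the real code may structure this differently):

    use std::ops::RangeInclusive;

    // Returns the next epoch, or None when the key should be rotated:
    // either the configured range is exhausted or the u8 counter would
    // overflow (the 255 -> 256 step with a 0..=255 range).
    fn next_epoch(current: u8, epochs: &RangeInclusive<u8>) -> Option<u8> {
        current.checked_add(1).filter(|next| epochs.contains(next))
    }

    fn main() {
        let epochs = 0..=255u8;
        assert_eq!(next_epoch(17, &epochs), Some(18));
        // At 255, checked_add yields None, triggering rotation instead of
        // a panic while the write lock is held.
        assert_eq!(next_epoch(255, &epochs), None);
    }
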
@NullHypothesis (Contributor) commented:

Stumbled upon one more thing: we should modify (or remove) .github/workflows/makefile.yaml, which is based on the old Go code.

Addresses a clippy lint.
@rillian (Contributor, Author) commented Apr 18, 2023

Stumbled upon one more thing: we should modify (or remove) .github/workflows/makefile.yaml, which is based on the old Go code.

Don't we still need a golang environment to build the nitriding daemon?

The default makefile target runs cargo test, cargo clippy, and cargo audit so it should give adequate coverage for the new rust code, although it's just using the default rust toolchain in the image.

@NullHypothesis (Contributor) commented:

Don't we still need a golang environment to build the nitriding daemon?

Do we really need these lines when simply running make?

The default makefile target just builds, tests, lints and
audits the rust application, so we no longer need a go
environment for this job.
@rillian (Contributor, Author) commented Apr 19, 2023

Do we really need these lines when simply running make?

Ah, you're right. The default makefile target doesn't build the nitriding daemon, so this is unnecessary. Removed in e75c152.


# Build the web server application itself.
# Use the -alpine variant so it will run in a alpine-based container.
FROM public.ecr.aws/docker/library/rust:1.68.2-alpine as rust-builder
@fmarier (Member) left a comment

Is the pin on the minor version needed here? Normally we'd ask to use 1.68-alpine here in order to get any minor fixes / patches, but maybe there are attestation requirements that mean we need to pin to the full version?

(Contributor) left a comment

Assume we start an enclave based on x.y-alpine. Shortly after, x.y.z-alpine comes out. A user is trying to verify our deployment based on the newly-released x.y.z-alpine and notices that it fails because the enclave still runs x.y-alpine.

We could work around that by updating the enclave whenever a new patch comes out.

An even better solution would be to get rid of alpine altogether and only have the statically-compiled nitriding and star-randsrv in the image. For now, we only need alpine because of the shell script that starts nitriding and star-randsrv — a very heavy dependency for a simple task. I filed brave/nitriding#58 for this improvement.

@rillian (Contributor, Author) left a comment

Hopefully tls-alpn means we don't need a trusted ca certificate package either.

@rillian (Contributor, Author) left a comment

@fmarier Generally I agree that specifying a particular point release tag bitrots quickly and misses out on useful fixes. But because reproducible builds are important in this particular application, tracking the specific versions of our dependencies, as we do with go.sum and Cargo.lock, is a worthwhile trade-off.

Hopefully we can use renovatebot or similar to make sure we bump regularly.

@fmarier (Member) commented Apr 20, 2023

Sounds good. Let's lean on Renovate for now, and then hopefully eliminate the alpine image entirely later.

Review threads (resolved): Dockerfile
@rillian (Contributor, Author) commented Apr 20, 2023

I think we're ready to merge here. Asking for final review; remaining issues should be follow-ups.

NullHypothesis previously approved these changes Apr 20, 2023
@NullHypothesis (Contributor) left a comment

Left comments for a couple minor changes. Otherwise looks good!

Review thread (resolved): start.sh
This was for testing with the deployable container build pipeline
during development. Remove before merging into the default branch
since it will no longer be needed there.
I'm told this is no longer necessary as the pipeline is using
a different solution.
@rillian (Contributor, Author) commented Apr 20, 2023

The security CI job failure looks like a problem with the Ubuntu package mirror.

@NullHypothesis (Contributor) left a comment

👌

@rillian (Contributor, Author) commented Apr 20, 2023

CI failure resolved by brave/security-action#164. Green now.

@DJAndries (Collaborator) left a comment

lgtm. Please address #54 soon, as it will act as an important failsafe for crashes; it shouldn't require much effort, just one small func.

Successfully merging this pull request may close these issues:
Support nitriding v2?

4 participants