Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to v4 WIP #2

Merged
merged 950 commits into from
Mar 24, 2024
Merged

Update to v4 WIP #2

merged 950 commits into from
Mar 24, 2024

Conversation

kotval
Copy link
Member

@kotval kotval commented Jan 29, 2024

This is still WIP and not ready to merge. While a relatively major refactor in terms of files changed, i.e literally every file, though semantically it is actually fairly straightforward. There were minimal changes to curve25519-dalek itself, though the structure of the backends changed and they moved to using Cargo workspaces. An acceptance test that this is ready to merge is at a minimum passing the wycheproof test in xous-core. Care should also be taken to avoid breaking hosted mode.

rozbb and others added 30 commits November 26, 2022 22:00
Also fixes CI not running on all branches

Co-authored-by: Anthony Ramine <[email protected]>
GitHub Actions runners are not guaranteed to have the necessary CPU
features in order for these tests to work.

Uses a `--target x86_64-unknown-linux-gnu` directive when compiling so
the `target_feature` flags don't apply to build scripts.
* Docs unlink from dalek.rs

* Link katex assets to jsdelivr

Co-authored-by: Michael Rosenberg <[email protected]>
…raphy#231)

Rust editions 2018+ do not require `extern crate` except for linking
`alloc` and `std`.
As suggested in dalek-cryptography#453 it is sometimes feasible to
select the backend bits via an override.

This change provides `cfg(curve25519_dalek_bits)`
to override the bits used in serial or fiat target backend.
As proposed in dalek-cryptography#442 this makes `rand_core` an
optional feature that is not covered by the
SemVer public API stability guarantees.

Co-authored-by: Michael Rosenberg <[email protected]>
As proposed in dalek-cryptography#442 this makes `digest` an
optional feature that is not covered by the
SemVer public API stability guarantees.

Co-authored-by: Michael Rosenberg <[email protected]>
All of the existing usages of `std` can be replaced with `alloc`.

They are legacy usages from before when liballoc was stabilized.
Gated random() construtors on cfg(test)
…tography#451)

Addresses issue dalek-cryptography#448 that Scalar::bits may leave unzeroed bits on the stack
…-cryptography#461)

Previously `cargo test --no-default-features` would succeed but with
warnings. This commit fixes all of those warnings and tests
`--no-default-features` in CI to ensure that in perpetuity.
* Restructure README and CHANGELOG
* Explain semver policy
* Specify feature flags and backends more explicitly
* Remove nightly from the CI bc that didn't belong there
* Add @pinkforest to thankyou list

Co-authored-by: pinkforest <[email protected]>
…raphy#455)

Crate features are intended to be additive, whereas only 1-of-N possible
backends can be selected.

Features can also be activated by transitive dependencies, which leads
to a problem of different dependences selecting conflicting backends.
Using `--cfg` instead moves all backend selection control to the
toplevel executable.

This commit switches to the following RUSTFLAGS to enable backends:

- `--cfg curve25519_dalek_backend="fiat"`: uses `fiat-crypto`
- `--cfg curve25519_dalek_backend="simd"`: uses nightly-only SIMD
…graphy#465)

build.rs was using cfg(target) but it has to evaluate this from env TARGET
as build.rs cfg(target) in build context is the builder host and not the target.

This change fixes curve25519_dalek_bits lottery to determine the correct
automatic curve25119_dalek_bits with the help of platforms crate.

As discussed in dalek-cryptography#456 this also prepares for well known defaults for wasm and
arm serial backend via cfg(curve25519_dalek_bits = "64")

If the wasm32 or armv7 are going to be u64 serial by default these will be
followed up on later.
dalek-cryptography#236)

curve25519-dalek:

- Enables `digest` and `rand_core` features
- Removes transitive `nightly`, `simd_backend`, and `std` features

ed25519:

- `AsRef` impl for `Signature` has been removed; uses `to_bytes`
- Uses `try_from` for `InternalSignature` conversion
This is a convenience/marker trait for types which impl `CryptoRng` +
`RngCore` which makes the type signatures a little more readable.

It was introduced in `rand_core` v0.6.4 (now pinned as the minimum
version)
…tography#470)

For the field element types `FieldElement` and `Scalar`, use inherent
constants instead of (non-const) functions to return these constant
values.

It's likely the original functions predate support for inherent
constants, but now that they're available, they're a better fit for
these sort of constant values.
…ptography#472)

This is helpful for implementing `ff::PrimeField::from_repr`.
Also changes `Scalar::is_canonical` to return `Choice`.
@pinkforest
Copy link

Note: it seems they broke the ability to use fiat backends....

The backends can be overriden via cfg curve25519_dalek_backend

Features are additive and problem was signalling them across dependency chain where often backends could not be used because transient dependencies didn't set them correctly and there could have been conflict when two backends were enabled by transient dependencies leaving worst a mis-compile.

See rationale from this long discussion: dalek-cryptography#414

EDIT: I see a reply here oops - dalek-cryptography#632

@kotval
Copy link
Member Author

kotval commented Mar 1, 2024

@pinkforest I apologize I missed the discussion from dalek-cryptography#414 when I filed dalek-cryptography#632. I was mistaken about how cfg flags are interpreted by cargo.

I was trying to keep this fork as close to upstream as possible so I introduced the cfg flags we needed and switched from features for specifying dependencies. I was unable to get our optional dependencies to be included when gate-ting them with cfg, but it was that I was misspecifying them. We're using this to cross compile for riscv so I needed to put rustflags under a target.riscv32imac-unknown-xous-elf directive in our .cargo/config to get the CARGO_CFG_CURVE25519_DALEK_BACKEND env variable set. Thanks for pointing this out, so I can keep this closer to upstream.

@kotval
Copy link
Member Author

kotval commented Mar 10, 2024

@bunnie Okay. I could use some help finishing this. I believe that I had set everything up so that we can select our backend by using cfg flags now. For example, if you add

[target.riscv32imac-unknown-xous-elf]
rustflags = ["--cfg",'curve25519_dalek_backend="u32e_backend"']

To the .cargo/config in the xous-core repo, then the correct backend will be selected without the need to use features. (I also modified the build.rs to default to u32e when building for xous. This shouldn't really be necessary since we can use rustflags, but at the time I did it I thought that if we could upstream we'd need to do something like that. If we are never going to try to upstream, we can simplify the build.rs.) I am stuck however in that while everything works in renode if I select the u32 backend, with the u32e backend root-keys panics fairly early on in renode. I have tried to debug this to no avail. Can you take a look at this?

@bunnie
Copy link
Member

bunnie commented Mar 11, 2024

sure thing, I'll have a look at it soon!

@bunnie
Copy link
Member

bunnie commented Mar 11, 2024

I'd like to sanity check my environment against how you're testing.

This is what I'm doing from a base xous-core image. Let's assume all paths are relative to ./xous-core

  1. I've checked out kotval:update_to_v4 in a directory that is located at ../curve25519-dalek
  2. In the top level Cargo.toml file, I've added this patch:
[patch.crates-io.curve25519-dalek]
path = "../curve25519-dalek/curve25519-dalek"
  1. I've modified ./.cargo/config to have this line (so the config inside xous-core):
[target.riscv32imac-unknown-xous-elf]
rustflags = ["--cfg", 'curve25519_dalek_backend="u32e_backend"']

I then run cargo xtask renode-image --feature benchmarks

In order to get this to build, I had to add the following to your patch under the [target.'cfg(curve25519_dalek_backend = "u32e_backend")'.dependencies] region:

zeroize = { version = "1", default-features = false }

I also get some warnings like this:

note: the lint level is defined here
  --> F:\code\curve25519-dalek\curve25519-dalek\src\lib.rs:36:5
   |
36 |     unused_qualifications
   |     ^^^^^^^^^^^^^^^^^^^^^
help: remove the unnecessary path segments
   |
48 -         pub type FieldElement = backend::serial::u32e::field::Engine25519;
48 +         pub type FieldElement = Engine25519;
   |

Does that look consistent with your environment?

@bunnie
Copy link
Member

bunnie commented Mar 11, 2024

OK, that was exciting.

The good news is, I have it working in Renode.

The also probably good news is, I think some performance problems that face the crate are overcome.

The bad news is, we have an API problem to solve.

So, the original problem that the engine25519 crate solves is "how do you share the hardware accelerator". The solution it adopted was to map the hardware into that crate, and have it exclusively owned there.

The problem with that was that it was totally incompatible with integrating into the curve25519-dalek crate. The problem @kotval hit in integration was exactly that -- how do you map the registers of the engine into the userspace of the caller using the engine.

This rework does the following:

  • On the first call via curve25519 crate, map the hardware by requesting it from the kernel
  • The user must request the hardware to unmap explicitly, otherwise, other users of the crate will fail

This is terrible for obvious reasons, but let's first get into why I'm doing this.

Previously, the the basic check test took 1250ms per iteration. With this optimization, it now takes....48.9ms. 25x speedup.

The diffie-hellman benchmark goes from 12.3ms per iteration to 8.36ms per iteration. 47% speedup. Not as impressive, but that's because a lot of effort went in to hand-code the DH case to speed it up specifically.

Why is it so much faster? it's faster because if the hardware can assume it exclusively owns the hardware, you don't have to use any syscalls to access it. You just deref a pointer, instead of, say, passing memory messages around to a helper in another process space to copy your data to a pointer deref.

So, I think this is a Good Thing, and generally, because we have only one application at a time really using Curve25519, the hardware resource lock problem is probably something we can manage.

I don't think the current solution on the table should be the final one -- we'll definitely trip into panics. But, it at least sets the stage for this next discussion: how should we implement the locks?

I see two obvious approaches:

  • We could scope the ownership of the hardware to match the lifetime of the Curve25519 object. This will be less performant if we're doing lots of small operations, because we would be incurring a syscall cost on every Drop. But, I suspect we actually almost never want to use Curve25519 for small ops, so maybe this is the way to go.
  • We could make it so that we have a Xous server that, by convention, programs have to request a lock from when they want to use the curve25519 crate. This gives us some additional benefits, namely, we can prevent the system from going into suspend during a hardware accelerated operation, and we can also program in some behaviors to notify the requestor when the hardware is free to use.

So the basic tension is, a "clean API" where we're using implicit hardware free onDrop, and lock on allocation of the object -- but this has the problem of not locking the system from a suspend/resume (which would corrupt any on-going curve25519 op), and also, if you have two separate processes that mutually depend on a curve25519 result to move ahead, you have a deadlock situation (if you implement a policy of busy-wait until hardware is free; alternatively, we could have a "try for hardware then fallback to software" policy, which would require a bit more work on the implementation to re-introduce the software fallback paths if the hardware is unavailable).

A "dirty API" makes it less nice to use the crate, but we have more explicit control over external factors like suspend and resume, and locking policies.

My current inclination is maybe to try for the "clean API" method, and:

  • have a "try for hardware then fail to software" policy
  • maybe try to introduce something where we attempt to detect if a suspend happened during the hardware path, and if it did happen, we automatically re-try the operation? The hardware does have a flag to indicate if a suspend was un-clean, so it would mean re-implementing the basic job scheduler to check that on completion, and if it comes out on the "unhappy path", retry the op.

Either way, I would propose perhaps to get @kotval moving forward, we do the following:

  1. @kotval confirms that the current patch allows him to move ahead on the Signal port
  2. we merge things as-is, with the ugliness of the manual lock management
  3. I open an issue to resolve this according to one of the two paths suggested above
  4. I close the issue, ideally, fairly soon, but I can't get to it today.

Thoughts?

@bunnie
Copy link
Member

bunnie commented Mar 11, 2024

@xobs would love to have your feedback on the above comment.

@xobs
Copy link
Member

xobs commented Mar 11, 2024

I'm not following how the API is implemented, and I only see curve25519-dalek-loader using engine25519. However, I do see that it loads microcode programs and manipulates the register file directly, and I wonder if that's the correct approach.

Isn't the list of usecases of Engine25519 relatively limited? Would it make more sense to bundle the microcode programs into engine25519 and have API calls for things like DifferentialAddAndDouble or ToAffine? That way you can have the server decide whether it's better to fallback to software. Or perhaps it makes sense to have a nonfatal error that curve25519-dalek can intercept and know to fallback on its own.

The current approach looks like it invokes the allocator on every operation, which is known to not be very fast. A common pattern is to allocate an array on the stack and copy onto that.

I'm not sure how long these operations will be in-use, and it sounds like the operation is heavyweight enough that a process will have exclusive access to the hardware for an extended period of time. If you'd like to transfer ownership of the actual hardware, which it sounds like you'd prefer, you might consider passing it via a MutableBorrow message. Have a message you can send to the engine25519 server and include a response server address. Have the engine25519 server mutably lend the hardware block to the target process. Prevent suspend as long as that lend is active, or perhaps let the process know that the operation will need to be retried. Return the memory block on Drop.

@bunnie
Copy link
Member

bunnie commented Mar 11, 2024

Hmm, I think maybe I was not clear.

The recommendations you laid out match up with what we were doing prior to these changes. However, it's pretty inefficient and heavyweight to do it that way. So, previously, this is how it went:

  1. engine-25519 claims exclusive access to the hardware CSR and memory region
  2. Calls to engine25519 are shimmed into curve25519-dalek, making engine-25519 a dependency
  3. client programs would initiate a curve operation. The library attempts to gain a lock on engine-25519
  4. client programs then build a message that contain the code to accelerate + arguments to the engine
  5. this would get sent to engine-25519 as a mutable lend
  6. engine-25519 would execute the microcode and return the result
  7. lock is released

Steps 3-7 repeat.

The problem is that this is very heavy weight, and in reality, only one operation (Scalar Multiplies) were ever accelerated -- everything else ran locally in software because by the time you did all the locking, lending, executing, returning, most 25519 ops could be executed on the CPU -- except the super-heavyweight ones like a Scalar Multipy.

The new proposal is this:

  1. the curve25519 accelerator is not mapped into any process space at boot
  2. The only dependency for curve25519-dalek is xous. Not even xous-ipc or xous-names is needed
  3. The local process memory and CSR space are stored in an Option in a static mut. This is "mostly OK" because the code is structured so that multiple attempts to remap the CSR just return an existing mapping; the main danger is if during the map operation there's a context-switch and another map operation is initiated, but I could add an Atomic lock around that structure to make it more thread safe
  4. A thread creates a Curve25519-dalek object. When an accelerated function is encountered, it checks in the Option is Some and if so moves ahead; if not, it will attempt to map and retry/yield until it gets the mapping
  5. The microcode and registers are locally loaded (very fast), and waits until the computation is done
  6. More operations may be performed on the object, and they will all go the "fast path" as long as they are in the same process space. Multiple threads will share the same static mut mapping.
  7. A user may explicitly make a call to unmap the hardware. This is safe, because if the mapping is None it will just try to remap it. Alternatively, the object will unmap onDrop automatically

In this case, a single thread can have multiple processes share the same map, and everything will run fast. This is similar to how we track CID count on objects, except we don't blow up if disconnect (unmap) is called, because we can always just get the mapping back.

The bugaboo is if two separate processes both want accelerated Curve25519 operations, and it is coded to keep the mapping live for the entire duration, and both of the operations depended upon each other (like a client-server talking to each other from different process spaces). In this case, you can deadlock, but that can be avoided by manual calls to unmap the hardware after a message has been sent (for example).

But overall this is very fast because mapping memory to CSR and hard-wired hardware memory windows is fast: you aren't searching the allocation table, the address is fully specified, so it's just updating the PTE. The memory isn't even zeroized because it's not main memory.

It could even be viable to map/unmap after every arithmetic operation -- this would entirely avoid the possibility of deadlock because of objects living too long, I think, but at the price of doing the map/unmap on every op.

Maybe that's something I should just try and see how much it impacts the benchmark....if it's relatively cheap, then, it's the safest way to solve the problem because then you have no shared state, you just keep retrying MapMemory with a yield of your quantum if you fail, so that the other process can make progress and release the engine for you to take...

@xobs
Copy link
Member

xobs commented Mar 11, 2024

I believe this is roughly what exists now, correct?

image

Though the ScalarMultiply call looks like it allocates memory for every call using MapMemory which is not optimized at all.

The proposed solution is this:

image

As you say, this can be made faster by fully-specifying the virtual and physical arguments to MapMemory. I worry about the brittleness of the system, but given how rare (?) Montgomery calls are, perhaps that's fine.

As an asside, there are a few changes I'd make to engine-25519:

  1. Remove the ability to run arbitrary ENGINE programs -- it seems like this is largely deprecated already?
  2. Depending on the amount of time it takes to do a Montgomery operation, either the do_yield() call at https://github.com/betrusted-io/xous-core/blob/cc02f3ab4fa85c8d5ebccfe45fa5acdbef09f208/services/engine-25519/src/main.rs#L903-L906 and replace it with a Message that gets responded to (like how we do Sleep in the ticktimer), or just busywait without a context switch
  3. Remove the use of rkyv and replace it with a 192-byte buffer allocated on the stack

With this, you could hardcode the syscalls using asm!() and get away with not even needing xous as a dependency.

I suspect that would improve performance, but benchmarks would be needed. However, for expedience the approach you listed can work as well, and if performance is so much better then perhaps it's the correct one.

@bunnie
Copy link
Member

bunnie commented Mar 12, 2024

The current approach is almost correct, except that engine25519 is entirely removed. After this PR is merged and the main repo is cleaned up, we would actually completely eliminate that code from the code base.

The low level driver is now a part of the curve25519-dalek backend implementation. This also makes it a little easier to maintain the backend under the theory that it's all self-contained, similar to how we eliminated engine-sha2 and rolled all the drivers into the sha2 crate.

In the current implementaion, MapMemory never allocates. It only maps an existing, hard-wired memory region for the code segment and the register windows into the user's process space. Thus, MapMemory always works with explicit physical addresses. There is no case where we would need rkyv either, that dependency would be eliminated.

The only reason xous is needed is to implement the syscall. As you note, we could hard-code it in assembly, but I think xous is stable enough and we're fully committed to that being a public crate that it's not a problem. engine25519 as a dependency is much more of a problem because it's integrated into the xous-core repo, and is awkward as a dependency as a result, because it doesn't have a crates.io version that's kept in lock-step, and I'd prefer we didn't go there.

@bunnie
Copy link
Member

bunnie commented Mar 12, 2024

btw, the current approach also has the advantage of trimming engine25519 as a separate process, freeing up another PID for use for user programs, and I think it should also reduce net code space. Even though it means we're including curve25519 drivers in every app that uses the dalek crate, the net back-end driver code is fairly small compared to the overhead of a full stand-alone server. The bulk of the code duplication is that every process space now gets a copy of the assembly routines loaded into the accelerator, but it's just a couple kiB in the end, versus ~200kiB minimum for a stand-alone process.

@kotval
Copy link
Member Author

kotval commented Mar 17, 2024

Okay this works for me, and I can keep working on signal.

I like this approach of not having engine-25519 for the same reasons that I liked getting rid of engine-sha2, i.e. not forcing other applications to be updated because of cargo patching's transitivity, but the question of locking is interesting, and something I hadn't thought about when I started this. I really like the idea of just mapping/unmapping to acquire a lock on hardware, I am very curious how that performs, and how you would go about measuring it.

It is interesting to think about perhaps, having two types of drivers. One type is a process which arbitrates use of hw via a message passing api, and the other is that a process and directly include a driver each process uses locks to ensure exclusive access to the hardware. How would you pick between the two for a particular driver? Presumably, if your hardware has some state, like a ring buffer of messages, that state should be managed by only one process, but for things like this and sha2, relying on explicit unlocking seems fine. I still need to understand how we do scheduling and syscalls better to really understand the difference.

For what it is worth, not having engine-25519, means we don't have to patch curve25519-dalek to build apps, which means that each service would in principal have a different version of curve25519-dalek. While we shouldn't be flippant at letting cargo put multiple versions in a build as that'd take up lots of space, being that the main use of hw acceleration for curve25519 is relatively limited, we could have things which don't need acceleration depend directly on the mainline version, and things which want the acceleration can depend on this version. It's not quite as good as a "try hw and fall back to software", but it should work now if needed.

This function should be able to be enabled by feature
selection.
@bunnie bunnie marked this pull request as ready for review March 24, 2024 22:54
@bunnie bunnie merged commit 81d5c53 into betrusted-io:v4.1.1 Mar 24, 2024
0 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.