Asynchronous PSA crypto #7020

DemiMarie · 2023-02-02T06:49:22Z

Suggested enhancement

The PSA crypto API should support asynchronous operation.

Justification

Mbed TLS needs this because synchronous APIs cannot use hardware symmetric cryptography accelerators effectively. For hardware-accelerated symmetric cryptography to be beneficial, one must have a very large number of requests in-flight simultaneously so that the deep hardware pipelines can be kept full. This typically requires an event-driven or otherwise asynchronous programming style: One uses an interface to queue jobs for execution on the accelerator, and is notified of job completion via a callback.
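For illustration, the queue-and-callback style described above might look like the following minimal C sketch. Every name in it (`accel_queue_t`, `accel_submit`, `accel_drain`, `QUEUE_DEPTH`) is hypothetical and not part of PSA or Mbed TLS, and the "accelerator" is simulated in software with a placeholder XOR.

```c
#include <stddef.h>

#define QUEUE_DEPTH 8  /* hypothetical hardware queue depth */

/* Completion callback, invoked once per finished job. */
typedef void (*job_done_cb)(void *ctx, int result);

typedef struct {
    int input;         /* stand-in for a buffer to process */
    job_done_cb done;
    void *ctx;
} job_t;

typedef struct {
    job_t jobs[QUEUE_DEPTH];
    size_t head, count;
} accel_queue_t;

/* Queue a job. Returns 0 on success, -1 if the queue is full. */
int accel_submit(accel_queue_t *q, int input, job_done_cb done, void *ctx)
{
    if (q->count == QUEUE_DEPTH)
        return -1;
    size_t tail = (q->head + q->count) % QUEUE_DEPTH;
    q->jobs[tail] = (job_t){ input, done, ctx };
    q->count++;
    return 0;
}

/* Simulates the completion path: finish every queued job and notify
 * its submitter. Returns the number of jobs completed. */
size_t accel_drain(accel_queue_t *q)
{
    size_t n = 0;
    while (q->count > 0) {
        job_t j = q->jobs[q->head];
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->count--;
        j.done(j.ctx, j.input ^ 0x5a);  /* placeholder for real crypto */
        n++;
    }
    return n;
}

/* Example callback: counts completions. */
void count_done(void *ctx, int result)
{
    (void)result;
    (*(int *)ctx)++;
}

/* Submits more jobs than the queue holds, then drains it.
 * Returns the number of completion callbacks that ran. */
int accel_demo(void)
{
    accel_queue_t q = { 0 };
    int completed = 0;
    for (int i = 0; i < QUEUE_DEPTH + 2; i++)
        (void)accel_submit(&q, i, count_done, &completed); /* last 2 rejected */
    accel_drain(&q);
    return completed;
}
```

The point of the sketch is only that the submitter never blocks: it keeps the queue full and learns about results through the callback.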

gilles-peskine-arm · 2023-02-02T22:54:18Z

There are many different ways to implement asynchronicity. You can have functions that return a promise (a.k.a. continuation) plus a mechanism to notify when a promise is fulfilled, or use multiple threads. And different levels (application, service, driver) can use different mechanisms. We aren't going to support every possible combination.

Our first priority in this area is for Mbed TLS to make the PSA key store thread-safe, which has very wide applicability. This will allow multithreaded applications to keep running while one thread is waiting for an accelerator.

There are environments where multiple threads may not be desirable, for example in a trusted execution environment where all cryptography is performed by a single-threaded cryptography service to minimize the attack surface. There, we should support asynchronicity at the level of the driver interface, by allowing a driver request to return a promise. This requires integrating with a notification mechanism. I personally think this is an architecture that should exist, but I'm not sure we'll make the architecture an official standard, and the asynchronicity management is largely outside the scope of Mbed TLS.

At the moment, I don't see a strong case for offering asynchronicity at the application interface level, either as a PSA standard or as an Mbed TLS extension. This would have to be tied to the event mechanism of a runtime environment, so I think the interface would be specific to that particular runtime environment.

Do you have a specific scenario or architecture in mind? What interfaces would it require in the crypto/keystore library?

DemiMarie · 2023-02-04T00:00:30Z

> Do you have a specific scenario or architecture in mind? What interfaces would it require in the crypto/keystore library?

I can think of at least two scenarios:

1. Smart cards and TPMs can take several hundred milliseconds to perform a private key operation. Blocking an event loop for this amount of time is not acceptable. Using a thread is possible but would be wasteful.
2. To keep a hardware accelerator busy, my understanding is that one must typically have significantly more operations in flight than there are CPUs on the system, perhaps by an order of magnitude or more. Having that many threads is a significant waste of resources, and the associated overhead may defeat the purpose of using an accelerator in the first place.

As far as interfaces go, the most natural one I can think of is something like:

enum Status {
    /// The operation completed successfully, and the result is available
    PSA_SUCCESS,

    /// The operation was successfully queued.  The provided promise will
    /// be fulfilled when the operation completes.
    PSA_PENDING,

    /// The operation was not queued because the device’s queues were full.
    /// The code should fall back to a software implementation or
    /// try again once some operations have completed.
    PSA_BUSY,

    /// The operation failed because a MAC or signature is invalid.
    PSA_INTEGRITY_CHECK_ERROR,

    /// The parameters were incorrect.  This typically indicates a bug in
    /// the calling code.
    PSA_INVALID_PARAMETER,
};

PSA_PENDING and PSA_BUSY can only be returned if a promise is provided. The method by which the caller is informed of promise resolution is operating system (or lack thereof) dependent, but there should be standards that define the API on various platforms. For instance, kernel code should be notified by a callback in interrupt context, while code running under a POSIX-like OS will likely be notified by file descriptor readiness. On Linux, using io_uring to indicate completion is also possible once Linux gets support for that.
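To make the calling convention concrete, here is a rough C sketch of how a caller might handle these status codes. It is only an illustration under stated assumptions: the codes are renamed `ASYNC_SUCCESS`, `ASYNC_PENDING` and `ASYNC_BUSY` to avoid clashing with real PSA macros, and `promise_t`, `device_t`, `device_op`, `device_complete_all` and `submit_or_fallback` are all hypothetical names with a software-simulated device. No such interface exists in PSA or Mbed TLS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the proposed success/pending/busy codes (hypothetical names). */
typedef enum { ASYNC_SUCCESS, ASYNC_PENDING, ASYNC_BUSY } async_status_t;

/* Hypothetical promise: filled in when the operation completes. */
typedef struct {
    bool fulfilled;
    int result;
} promise_t;

#define DEVICE_SLOTS 2  /* hypothetical device queue depth */

typedef struct {
    promise_t *inflight[DEVICE_SLOTS];
    int inputs[DEVICE_SLOTS];
    size_t used;
} device_t;

/* Placeholder for the actual cryptographic operation. */
int software_op(int x) { return x * 2; }

/* Without a promise the call is synchronous and never returns
 * ASYNC_PENDING or ASYNC_BUSY. With a promise it is queued if there
 * is room, and `output` is not used. */
async_status_t device_op(device_t *dev, int input, int *output,
                         promise_t *promise)
{
    if (promise == NULL) {
        assert(output != NULL);
        *output = software_op(input);
        return ASYNC_SUCCESS;
    }
    if (dev->used == DEVICE_SLOTS)
        return ASYNC_BUSY;
    promise->fulfilled = false;
    dev->inflight[dev->used] = promise;
    dev->inputs[dev->used] = input;
    dev->used++;
    return ASYNC_PENDING;
}

/* Simulates the device finishing: fulfil every outstanding promise. */
void device_complete_all(device_t *dev)
{
    for (size_t i = 0; i < dev->used; i++) {
        dev->inflight[i]->result = software_op(dev->inputs[i]);
        dev->inflight[i]->fulfilled = true;
    }
    dev->used = 0;
}

/* Caller policy: when the device is busy, fall back to software. */
async_status_t submit_or_fallback(device_t *dev, int input,
                                  promise_t *promise)
{
    async_status_t st = device_op(dev, input, NULL, promise);
    if (st == ASYNC_BUSY) {
        promise->result = software_op(input);
        promise->fulfilled = true;
        return ASYNC_SUCCESS;
    }
    return st;
}
```

How `device_complete_all` would actually be triggered (interrupt callback, file descriptor readiness, and so on) is exactly the platform-dependent part discussed above and is left out of the sketch.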

gilles-peskine-arm · 2023-02-04T14:54:01Z

> To keep a hardware accelerator busy

That's a concern on high-performance systems which are not the main target of PSA or Mbed TLS. (Although we do have some features for such systems, in particular MBEDTLS_SSL_ASYNC_PRIVATE which is specifically designed for high-performance systems where threads are expensive and private key operations are offloaded to an HSM with high latency.) Our primary target is low-end systems where a hardware accelerator that's not busy is a good thing: it gets turned off to save energy.

> Having that many threads is a significant waste of resources

That depends on how costly your threads are, on a scale from Erlang to Java.

> The method by which the caller is informed of promise resolution is operating system (or lack thereof) dependent, but there should be standards that define the API on various platforms

Unfortunately, not only is this very platform-dependent, but also many platforms have multiple ways of reporting events. And if the application is waiting on multiple events, it needs all of them to use the same notification mechanism, or else it needs to use forwarding threads which run against optimization efforts.

This is a can of worms that we are unlikely to open in the short or medium term.
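As a small illustration of the "same notification mechanism" point, the following C sketch waits on two event sources through a single POSIX `poll()` call. It assumes, hypothetically, that crypto completion is signalled by a readable file descriptor; `wait_for_events` and `poll_demo` are made-up names, and pipes stand in for both a network socket and the completion channel.

```c
#include <poll.h>
#include <unistd.h>

/* Waits on both sources with one poll() call.
 * Returns a bitmask (bit 0: network fd readable, bit 1: crypto
 * completion fd readable), 0 on timeout, or -1 on error. */
int wait_for_events(int net_fd, int crypto_fd, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = net_fd,    .events = POLLIN },
        { .fd = crypto_fd, .events = POLLIN },
    };
    if (poll(fds, 2, timeout_ms) < 0)
        return -1;
    int ready = 0;
    if (fds[0].revents & POLLIN)
        ready |= 1;
    if (fds[1].revents & POLLIN)
        ready |= 2;
    return ready;
}

/* Demo: only the simulated crypto completion is signalled. */
int poll_demo(void)
{
    int net[2], crypto[2];
    if (pipe(net) != 0)
        return -1;
    if (pipe(crypto) != 0) {
        close(net[0]);
        close(net[1]);
        return -1;
    }
    char byte = 1;
    int ready = -1;
    /* The "accelerator" signals completion by writing one byte. */
    if (write(crypto[1], &byte, 1) == 1)
        ready = wait_for_events(net[0], crypto[0], 100);
    close(net[0]);
    close(net[1]);
    close(crypto[0]);
    close(crypto[1]);
    return ready;
}
```

If the completion were instead delivered through a different mechanism, such as a callback on another thread, the application would need something like a forwarding thread to turn it into a descriptor event, which is the overhead mentioned above.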

DemiMarie · 2023-02-04T18:08:08Z

> > To keep a hardware accelerator busy
>
> That's a concern on high-performance systems which are not the main target of PSA or Mbed TLS. (Although we do have some features for such systems, in particular MBEDTLS_SSL_ASYNC_PRIVATE which is specifically designed for high-performance systems where threads are expensive and private key operations are offloaded to an HSM with high latency.) Our primary target is low-end systems where a hardware accelerator that's not busy is a good thing: it gets turned off to save energy.

How low-end are you talking about? For me, the motivating use-case is embedded networking gear that provides WireGuard or IPsec VPN services. Keeping the accelerator busy is the only way one can achieve high VPN throughput.

> > Having that many threads is a significant waste of resources
>
> That depends on how costly your threads are, on a scale from Erlang to Java.

How costly should they be? C threads will never be as cheap as Erlang threads unless one is running in kernel space or on a microkernel.

tom-cosgrove-arm added the enhancement and size-l (Estimated task size: large (2w+)) labels · Feb 3, 2023

gilles-peskine-arm mentioned this issue Feb 7, 2023

Asynchronous APIs ARM-software/psa-api#52

Open
