Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hardware-accelerated AES CryptoServiceProvider #865

Merged
merged 25 commits into from
Nov 28, 2023

Conversation

zybexXL
Copy link
Contributor

@zybexXL zybexXL commented Aug 30, 2021

This feature reduces CPU usage dramatically, allowing higher SFTP performance on slower machines. It does so by using the AesCryptoServiceProvider for AES CBC/CTR/ECB encryption/decryption, which taps into AES-NI accelerated OS functions.

STFP download performance testcases:

Test case A: 4 year-old i5-7300U 2-core/4-threads laptop, 250Mbps internet connection:

  • Before: 13.15MB/sec, one CPU thread fully saturated
  • After: 24.50 MB/sec (+86%), limited by internet connection, main thread at around 20-30% (5-10% in Task Manager)

Test case B: 4 year old I7-7700 4-core/8-threads, 500Mbps internet connection:

  • Before: 40.85 MB/sec, one CPU-thread fully saturated
  • After: 66.70 MB/sec (+63%), limited by internet connection, main thread at around 30-40% (5-10% in Task Manager)

Note: all values above were taken with PR #866 also merged in. Tested with sftpClient.DownloadFile()

SSH.NET vs AES-NI benchmarks

SSH.NET implements AES as pure managed code. CPUs nowadays have hardware acceleration for AES via the AES-NI instruction set, and .NET provides APIs to the OS functions that make use of it. There are 3 .NET providers for AES:

AesManaged : older managed-only implementation, not accelerated. Still faster than the current SSH.Net
AesCryptoServiceProvider: call into the OS and makes use of AES-NI if available
AesCNG (CryptoNewGen): newer Crypto API, available since 2018 in Win10. Also makes use of AES-NI and has less overhead

Here are some benchmarks (executed on the dual-core laptop...):

Provider    Mode  Iterations                                  Average
----------  ----  ---------------------------------------     ------------
SSH.NET     CTR     29.44   29.75   29.41   29.81   28.69  =>   29.42 MB/s
SSH.NET     CBC     24.72   28.47   28.34   28.59   28.81  =>   27.79 MB/s
SSH.NET     ECB     33.75   33.50   25.34   27.56   23.53  =>   28.74 MB/s
SSH.NET     CFB     27.81   26.41   27.41   27.69   27.56  =>   27.38 MB/s
SSH.NET     OFB     27.97   28.06   27.75   27.69   27.75  =>   27.84 MB/s
           
AesManaged  CBC     41.19   41.50   41.72   41.78   25.66  =>   38.37 MB/s
AesCSP      CBC    717.06  777.25  828.03  815.72  812.75  =>  790.16 MB/s   (AES-NI)
AesCNG      CBC    830.16  847.25  852.72  853.06  832.78  =>  843.19 MB/s   (AES-NI)

As you can see, the accelerated APIs are about 25x faster than the current managed code!

Since AesCNG is too new I chose to use AesCryptoServiceProvider. This supports CBC and ECB natively, but not CFB or OFB (which are not listed in the SSH.Net protocol list anyway). For CTR (used by AWS SFTP Transfer Family), my code uses ECB followed by XOR. This would be extremely fast on C/C++ (or using unsafe tag in C# to allow memory pointers), but on pure C# I had to do a few memory Blockcopy to convert between byte[] and uint[], so the result is not ideal - however, it's still about 10x faster than the previous code.

Here's the new AES benchmarks with this PR:

Provider    Mode  Iterations                                  Average
----------  ----  ---------------------------------------     ------------
SSH.NET+    CTR    316.53  317.69  315.72  305.00  308.53  =>  312.69 MB/s (1063%)
SSH.NET+    CBC    768.03  744.16  772.19  781.16  761.41  =>  765.39 MB/s (2754%)
SSH.NET+    ECB   2368.78 2370.50 2393.44 2368.03 2344.53  => 2369.06 MB/s (8243%)
SSH.NET+    CFB     27.31   27.78   26.91   27.81   26.66  =>   27.29 MB/s (100%)
SSH.NET+    OFB     28.41   27.91   28.25   28.38   27.97  =>   28.18 MB/s (101%)

These CTR and CBC values are now the new maximum theoretical SFTP performance of SSH.NET 😎

@jjxtra
Copy link

jjxtra commented Aug 30, 2021

Well that's pretty good!

@IgorMilavec
Copy link
Collaborator

CTRArrayXOR can be optimized on supported platforms with the use of Vector<T>. It's faster and allocation free:

Method Size Mean Error StdDev Gen 0 Allocated
Original 128 123.07 ns 2.455 ns 3.823 ns 0.0968 304 B
Vector 128 21.87 ns 0.101 ns 0.090 ns - -
Original 256 184.98 ns 1.866 ns 1.745 ns 0.1783 560 B
Vector 256 38.55 ns 0.096 ns 0.085 ns - -
Original 512 349.29 ns 10.298 ns 30.363 ns 0.3414 1,072 B
Vector 512 81.09 ns 0.251 ns 0.235 ns - -

Here is the code I used:

byte[] CTRArrayXOR(byte[] counter, byte[] data, int offset, int length)
{
    for (int loopOffset = 0; length > 0; length -= Vector<byte>.Count)
    {
        var v = new Vector<byte>(counter, loopOffset) ^ new Vector<byte>(data, offset + loopOffset);
        if (length >= Vector<byte>.Count)
        {
            v.CopyTo(counter, loopOffset);
            loopOffset += Vector<byte>.Count;
        }
        else
        {
            for (int i = 0; i < length; i++)
            {
                counter[loopOffset++] = v[i];
            }
        }
    }
    return counter;
}

@zybexXL
Copy link
Contributor Author

zybexXL commented Sep 9, 2021

I've tested your snippet by replacing the XOR on my AES-CTR benchmark, which processes chunks of 32KB of random data (similar to what happens when downloading a file)
My current code: ~300MB/sec
Your code: ~150MB/sec

Note that my code copies the byte[] to uint[] so that the XOR is not done byte by byte. Does your benchmark take that into account? Is the "original" my PR code (accelerated), or the existing SSH-NET code?

You mention that vector<T> doesn't need to allocate memory... I think that's not exactly true. That FOR loop instantiates 2 Vector<byte> objects, each holding 4 bytes. Due to the limited scope the objects are immediately discarded after each loop. So it's still allocating 16384 objects in the stack in that FOR loop to XOR the two 32KB arrays. Instantiating objects is not free.

I see that vector<T> is likely optimized by the compiler to use CPU registers. Even so, it seems slower than my current code.

@zybexXL
Copy link
Contributor Author

zybexXL commented Sep 9, 2021

I recompiled the benchmark for NetCore 5.0 and added 3 versions of CTR:

  • BlockCopy (current code in this PR, copying byte[] to uint[] before XOR)
  • Vector<T> (IgorMilavec's suggestion)
  • Span<T> (allowing cast from byte[] to uint[] without memcopy)

Running it as Release instead of Debug also kicked performance by a lot, and changes the ranking. So here are the results:

Benchmark Average
SSH.NET CTR 2020.0.1 49.06 MB/s
SSH.NET CTR BlockCopy 586.66 MB/s
SSH.NET CTR Span<T> 671.85 MB/s
SSH.NET CTR Vector<T> 731.97 MB/s
AesCSP CBC 785.79 MB/s

So Vector is indeed the fastest, but available only since NetStandard 2.1 (like Span). Blockcopy is not that bad considering it's available for all platforms, and it's still 13x faster than base. We can add a separate PR with Vector/Span later on if needed.

Here's the snippet for SpanXOR - it assumes the arrays are 4-byte aligned for simplification:

        private byte[] SpanXOR(byte[] data, int offset, byte[] output)
        {
            int uOffset = offset / 4;
            Span<uint> uData = MemoryMarshal.Cast<byte, uint>(data);
            Span<uint> uOut = MemoryMarshal.Cast<byte, uint>(output);
            for (int i = 0; i < uOut.Length; i++)
                uOut[i] = uOut[i] ^ uData[i + uOffset];
            return output;
        }

Copy link
Collaborator

@IgorMilavec IgorMilavec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drieseng This looks fine. I've been using it in production for a couple of months without any side effects. With 10x performance improvement, I propose to merge this before the next release.

@A9G-Data-Droid
Copy link

I have tested this and it works!

Pedro Fonseca added 5 commits November 1, 2023 15:07
Reduces CPU usage dramatically, allowing more performance on slower machines
Fix padding for non-AES blockciphers
Fix IV exception for non-AES blockciphers
It looks like the legacy code doesn't correctly remove padding, so this code needs to do the same.
restructure AES CSP code into its own class
@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 1, 2023

I've rebased this PR, please review and consider merging.
I have been using this code since 2021 and I still see the huge performance gains this provides, together with #866

@Rob-Hague
Copy link
Collaborator

I'm very much in favour of this change, but I think the design needs some work. In fact, I think it can be much simpler as a starting point: I think we can just replace the implementation of AesCipher with an ICryptoTransform in ECB mode and no padding.

That way, we still use the given CipherMode object (rather than subverting it) which is more predictable, testable and yet we should still get the majority of the performance gains. The CipherModes can easily be optimized as a next step, and then further down the line (if we are still not satisfied) we can think about increasing the complexity in order to delegate to the BCL for the cipher modes too.

If you have any thoughts on this approach, or would like to work on it, let me know. Otherwise, I will play with it.

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 5, 2023

All AES tests pass (including the new ones from #1232). The failed test is unrelated, seems spurious:

Failed CanWriteShouldReturnFalse [9 ms]
  Error Message:
   Initialization method Renci.SshNet.Tests.Classes.Sftp.SftpFileStreamTest_CanWrite_Closed_FileAccessReadWrite.SetUp threw exception. System.ArgumentOutOfRangeException: Cannot be less than or equal to zero. (Parameter 'bufferSize').

Is there a way to force re-run of the test suite without doing a dummy commit?

@WojciechNagorski
Copy link
Collaborator

I've merged #1232 can you update this PR and check if all tests passed?

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 5, 2023

DST messed up the order of comments, my comment above was actually made after Wojciech's.
AES tests passed.

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 5, 2023

I'm very much in favour of this change, but I think the design needs some work. In fact, I think it can be much simpler as a starting point: I think we can just replace the implementation of AesCipher with an ICryptoTransform in ECB mode and no padding.

That way, we still use the given CipherMode object (rather than subverting it) which is more predictable, testable and yet we should still get the majority of the performance gains. The CipherModes can easily be optimized as a next step, and then further down the line (if we are still not satisfied) we can think about increasing the complexity in order to delegate to the BCL for the cipher modes too.

If you have any thoughts on this approach, or would like to work on it, let me know. Otherwise, I will play with it.

@Rob-Hague
ECB by itself is not enough. My main goal with this code was to improve performance with AWS SFTP, which uses CTR.
When I added this PR 3 years ago, SSH.Net still included many legacy frameworks which did not support the accelerated AESCryptoProvider API, so I left the legacy/slow ECB code in place and surrounded the new code with the FEATURE_AES_CSP conditional.

With the removal of the legacy frameworks perhaps we can get rid of the legacy code and just use this new one. However, I think we should first merge this in, and then rework and remove the legacy code in a separate PR.

@Rob-Hague
Copy link
Collaborator

With the removal of the legacy frameworks perhaps we can get rid of the legacy code and just use this new one. However, I think we should first merge this in, and then rework and remove the legacy code in a separate PR.

Let's get it right in this PR.

My biggest concern is the subversion of logic. When passing a CipherMode object, my expectation is that this object is going to be used in the cipher process (as it always has done). But with this change, the object is only used as a type check and the implementation is not touched. This is unexpected.

I have run some experiments with the following changes to the benchmarks:

diff --git a/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs b/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs
index ff414cc4..b91e3b8a 100644
--- a/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs
+++ b/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs
@@ -15,7 +15,7 @@ namespace Renci.SshNet.Benchmarks.Security.Cryptography.Ciphers
         {
             _key = new byte[32];
             _iv = new byte[16];
-            _data = new byte[256];
+            _data = new byte[32 * 1024];
 
             Random random = new(Seed: 12345);
             random.NextBytes(_key);
@@ -34,5 +34,29 @@ namespace Renci.SshNet.Benchmarks.Security.Cryptography.Ciphers
         {
             return new AesCipher(_key, new CbcCipherMode(_iv), null).Decrypt(_data);
         }
+
+        [Benchmark]
+        public byte[] Encrypt_CFB()
+        {
+            return new AesCipher(_key, new CfbCipherMode(_iv), null).Encrypt(_data);
+        }
+
+        [Benchmark]
+        public byte[] Decrypt_CFB()
+        {
+            return new AesCipher(_key, new CfbCipherMode(_iv), null).Decrypt(_data);
+        }
+
+        [Benchmark]
+        public byte[] Encrypt_CTR()
+        {
+            return new AesCipher(_key, new CtrCipherMode(_iv), null).Encrypt(_data);
+        }
+
+        [Benchmark]
+        public byte[] Decrypt_CTR()
+        {
+            return new AesCipher(_key, new CtrCipherMode(_iv), null).Decrypt(_data);
+        }
     }
 }

Results on develop branch 826222f:

Method Mean Error StdDev Gen0 Allocated
Encrypt_CBC 320.4 μs 0.47 μs 0.36 μs 15.6250 32.41 KB
Decrypt_CBC 352.3 μs 2.37 μs 1.85 μs 15.6250 32.41 KB
Encrypt_CFB 335.6 μs 0.43 μs 0.34 μs 15.6250 32.45 KB
Decrypt_CFB 367.9 μs 4.23 μs 3.95 μs 15.6250 32.45 KB
Encrypt_CTR 310.4 μs 0.38 μs 0.31 μs 15.6250 32.45 KB
Decrypt_CTR 315.3 μs 3.96 μs 3.70 μs 15.6250 32.45 KB

Results on my suggestion to replace the AesCipher implementation with the BCL Rob-Hague@8619062

Method Mean Error StdDev Gen0 Allocated
Encrypt_CBC 221.2 μs 2.01 μs 1.88 μs 15.8691 32.7 KB
Decrypt_CBC 220.0 μs 2.73 μs 2.56 μs 15.8691 32.71 KB
Encrypt_CFB 251.3 μs 0.33 μs 0.27 μs 15.6250 32.75 KB
Decrypt_CFB 255.3 μs 0.67 μs 0.60 μs 15.6250 32.75 KB
Encrypt_CTR 201.3 μs 0.41 μs 0.38 μs 15.8691 32.75 KB
Decrypt_CTR 209.3 μs 0.22 μs 0.19 μs 15.8691 32.75 KB

Results on your branch zybexXL@a9f68fb

Method Mean Error StdDev Gen0 Allocated
Encrypt_CBC 29.637 μs 0.1278 μs 0.1067 μs 16.1133 33.14 KB
Decrypt_CBC 6.579 μs 0.0160 μs 0.0142 μs 16.1285 33.14 KB
Encrypt_CFB 333.761 μs 0.1710 μs 0.1335 μs 15.6250 32.52 KB
Decrypt_CFB 333.899 μs 0.2090 μs 0.1955 μs 15.6250 32.52 KB
Encrypt_CTR 24.630 μs 0.0327 μs 0.0273 μs 78.1250 161.2 KB
Decrypt_CTR 24.544 μs 0.0817 μs 0.0683 μs 78.1250 161.2 KB

So my suggestion gets moderate gains but clearly we can do a lot better with an approach to delegate the entire encryption to the BCL (rather than block-by-block).

So here is what I propose:

We keep the behaviour of the existing constructor on AesCipher as is. That is, if you pass a CipherMode and/or CipherPadding object then they will be used as expected, with encryption happening block-by-block using the BCL as in my branch.

We add a new constructor which will allow delegating to the BCL for the entire encryption process. This will achieve the much better performance and separation from the existing behaviour.

public AesCipher(
    System.Security.Cryptography.CipherMode cipherMode,
    System.Security.Cryptography.PaddingMode paddingMode,
    bool ctrMode = false)
{ }

Since S.S.C.CipherMode does not have a CTR value, there is a ctrMode parameter. The constructor would throw if ctrMode is true and cipherMode != ECB. A bit awkward, but it should work fine. Other ideas would be to define our own CipherMode enum which does contain a CTR value, or to allow passing e.g. (S.S.C.CipherMode)-1 which we interpret as CTR.

In terms of the design, we define nested classes in AesCipher which represent the different implementations. Essentially, AesCipher becomes:

public sealed class AesCipher : BlockCipher
{
    private readonly BlockCipher _impl;

    public AesCipher(byte[] key, CipherMode mode, CipherPadding padding)
        : base(key, 16, mode, padding)
    {
        _impl = new SshNetCipherModeImpl(key, mode, padding);
    }

    public AesCipher(
        byte[] key,
        byte[] iv,
        System.Security.Cryptography.CipherMode cipherMode,
        System.Security.Cryptography.PaddingMode paddingMode,
        bool ctrMode = false)
    {
        _impl = new BclImpl(/* ... */);
    }

    // AesCipher just forwards all implementation to _impl

    public override int EncryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
    {
        return _impl.EncryptBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
    }

    public override int DecryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
    {
        return _impl.DecryptBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
    }

    public override byte[] Encrypt(byte[] input, int offset, int length)
    {
        return _impl.Encrypt(input, offset, length);
    }

    public override byte[] Decrypt(byte[] input)
    {
        return _impl.Decrypt(input);
    }

    public override byte[] Decrypt(byte[] input, int offset, int length)
    {
        return _impl.Decrypt(input, offset, length);
    }

    // Implementations

    // The block-by-block implementation using instantiated SshNet.CipherMode, SshNet.CipherPadding objects
    private sealed class SshNetCipherModeImpl : BlockCipher
    {
        private readonly Aes _aes;
        private ICryptoTransform _encryptor;
        private ICryptoTransform _decryptor;

        public SshNetCipherModeImpl(byte[] key, CipherMode mode, CipherPadding padding)
            : base(key, 16, mode, padding)
        {
            // Initialise _aes in ECB mode
        }

        public override int EncryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            _encryptor ??= _aes.CreateEncryptor();

            return _encryptor.TransformBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
        }

        public override int DecryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            _decryptor ??= _aes.CreateDecryptor();

            return _decryptor.TransformBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
        }
    }

    private sealed class BclImpl : BlockCipher
    {
        // This overrides Encrypt/Decrypt for full BCL perf.

        public override byte[] Decrypt(byte[] input)
        {
            throw new NotImplementedException();
        }

        public override byte[] Decrypt(byte[] input, int offset, int length)
        {
            throw new NotImplementedException();
        }

        public override int DecryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            throw new NotImplementedException();
        }

        public override byte[] Encrypt(byte[] input, int offset, int length)
        {
            throw new NotImplementedException();
        }

        public override int EncryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            throw new NotImplementedException();
        }
    }
}

Comments appreciated. Most important to me is that we do not subvert provided SshNet.CipherMode instances.

@Rob-Hague
Copy link
Collaborator

Most important to me is that we do not subvert provided SshNet.CipherMode instances.

One other thought: because the proposed new constructor is a bit awkward for CTR mode, I am in principle OK with subverting our own SshNet.CipherMode instances with an exact type check e.g. mode.GetType() == typeof(CtrCipherMode) instead of an is check (which would pass also for derived types which we definitely should not subvert). Then we can stick with one constructor which decides on the implementation, and the awkward logic is not exposed publicly.

@WojciechNagorski
Copy link
Collaborator

@zybexXL I will review (and merge if OK) this pull request as soon as you update this pull request.

I merged #1235 PR before this because it reduced the diff for this one.

@WojciechNagorski
Copy link
Collaborator

@zybexXL are you going to update this PR. I would see this PR as one of the most important for the next release.

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 19, 2023

@WojciechNagorski Yes, I've just been busy this week. I'll update it today or tomorrow.

zybexXL and others added 4 commits November 20, 2023 17:04
Factor out the implementations and re-add the existing constructor
Add tests for stream cipher state preservation
@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 20, 2023

@WojciechNagorski PR is now updated, ready to merge.

Copy link
Collaborator

@Rob-Hague Rob-Hague left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. My only "important" feedback is on the naming of AesCipherMode. The others are not necessary to address.

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 20, 2023

Appveyor failed on unrelated test (ForwardedPortShouldAcceptNewConnections).

I've noticed that some tests always fail if there are 2 or more Appveyor sessions running at the same time, usually because some port is already in use. Maybe there's a way to queue the appveyor sessions so that only one runs at any given time?

Edit - consider adding this to appveyor.yml:

# Maximum number of concurrent jobs for the project
max_jobs: 1

@Rob-Hague
Copy link
Collaborator

The appveyor open source licence only allows 1 concurrent job, and I don't see any overlapping runs in the history. The failed test is not using any sockets. It is probably a timing issue like #1185

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 21, 2023

The appveyor open source licence only allows 1 concurrent job, and I don't see any overlapping runs in the history. The failed test is not using any sockets. It is probably a timing issue like #1185

I'm not talking about this particular failure. The fact is, whenever I push to multiple branches/PRs, Appveyor testing fails. I've seen port-related issues at least twice.

@WojciechNagorski
Copy link
Collaborator

To be honest, I prefer GitHub Actions but we can't use it because there is a limit.

@zybexXL
Copy link
Contributor Author

zybexXL commented Nov 27, 2023

@WojciechNagorski Please merge this PR, I think all issues are resolved.

@drieseng
Copy link
Member

@WojciechNagorski, can this one go in for you?

Copy link
Collaborator

@WojciechNagorski WojciechNagorski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is an amazing contribution.

@Rob-Hague I know you had some comments about AesCipherMode, but to be honest, it doesn't block merging. Please feel free to open the next PR. We still have some time until the next release.

@WojciechNagorski WojciechNagorski merged commit 4ce18d3 into sshnet:develop Nov 28, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants