
interesting hashes / macs / ciphers / checksums #45

Open
ThomasWaldmann opened this issue Jun 5, 2015 · 66 comments

@ThomasWaldmann
Member

ThomasWaldmann commented Jun 5, 2015

https://github.com/Cyan4973/xxHash - not a cryptographic hash fn, not for HMAC! So maybe we could use it as a crc32 replacement (if we keep the crc32(header+all_data) approach). borg uses xxh64 in some places.

siphash - cryptographic hash fn (used internally by python >= 3.4), but: only a 64-bit return value. A 128-bit version is "experimental".

libsodium also has some hashes / macs, but it is not yet widespread in Linux distributions.

Last but not least: sha512-256 is faster than sha256 on 64-bit CPUs.
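For illustration, a minimal sketch of using xxh64 as a plain (non-cryptographic) integrity checksum, assuming the third-party xxhash Python package:

# Sketch only: xxh64 as a fast checksum, analogous to crc32(header + all_data).
# It provides no authentication, so it is not a MAC replacement.
import xxhash

def segment_checksum(header: bytes, data: bytes) -> int:
    h = xxhash.xxh64()
    h.update(header)
    h.update(data)
    return h.intdigest()        # 64-bit integer digest

print(hex(segment_checksum(b"HEADER", b"payload bytes")))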

@ThomasWaldmann ThomasWaldmann changed the title from "use xxhash?" to "interesting hash functions" Jun 5, 2015
This was referenced Jun 5, 2015
@ThomasWaldmann ThomasWaldmann changed the title from "interesting hash functions" to "interesting hashes / macs / ciphers" Jun 5, 2015
@namelessjon

If/when you do feel you can use libsodium, that also opens up two things which could be useful:

  • scrypt as a PBKDF. This should be (much) less vulnerable to brute-force password guessing.
  • secretbox with XSalsa20 and Poly1305. This would solve worries over counter re-use, as you can randomly generate a 24-byte nonce for each new encryption with negligible chance of reuse (see the sketch below).
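A minimal sketch of that random-nonce secretbox idea, assuming the pysodium binding (crypto_secretbox is XSalsa20-Poly1305 in libsodium; key management is left out):

# Sketch only: encrypt each message under a fresh random 24-byte nonce.
import os
import pysodium

key = os.urandom(pysodium.crypto_secretbox_KEYBYTES)          # 32 bytes

def seal(plaintext: bytes) -> bytes:
    nonce = os.urandom(pysodium.crypto_secretbox_NONCEBYTES)  # 24 bytes, random per message
    return nonce + pysodium.crypto_secretbox(plaintext, nonce, key)

def open_(blob: bytes) -> bytes:
    n = pysodium.crypto_secretbox_NONCEBYTES
    return pysodium.crypto_secretbox_open(blob[n:], blob[:n], key)

assert open_(seal(b"chunk data")) == b"chunk data"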

@ThomasWaldmann
Member Author

@namelessjon interesting, thanks. About the counter re-use issue: there is also the idea of creating per-backup(-thread) random session keys, starting the counter from 0, encrypting the session keys with the master key and storing them with the backup.
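A rough sketch of that session-key idea, assuming the cryptography package for AES-CTR (authentication and the real key/header formats are omitted; the helper names are made up for illustration):

# Sketch only: per-backup random session key, data counter starting at 0,
# session key wrapped under the master key and stored with the backup.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def new_session(master_key: bytes):
    session_key = os.urandom(32)
    wrap_nonce = os.urandom(16)
    wrap = Cipher(algorithms.AES(master_key), modes.CTR(wrap_nonce)).encryptor()
    wrapped = wrap_nonce + wrap.update(session_key) + wrap.finalize()
    return session_key, wrapped           # store `wrapped` with the backup

def encrypt_chunk(session_key: bytes, counter: int, data: bytes) -> bytes:
    iv = counter.to_bytes(16, "big")      # starts at 0 for each session
    enc = Cipher(algorithms.AES(session_key), modes.CTR(iv)).encryptor()
    return enc.update(data) + enc.finalize()

master_key = os.urandom(32)
session_key, wrapped = new_session(master_key)
blob = encrypt_chunk(session_key, 0, b"first chunk")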

@namelessjon

@ThomasWaldmann That sounds less fragile than the current implementation at least (I think).

The secretbox option from libsodium, despite the larger nonce, adds the same 40-byte overhead to the files, because the Poly1305 MAC is only 16 bytes. I guess if you're now including a pointer to the encrypted session key with each encrypted blob, secretbox would have less overhead, but that shouldn't be that significant anyway?

@ThomasWaldmann
Member Author

I had a look at libsodium yesterday; it seems pretty nice and it is also in some stable Linux distributions now.

It would be useful to get some comparative performance values:

  • aes256-ctr + hmac-sha256 vs. aes256-gcm (hw accel.) vs. xsalsa20 + poly1305
  • the same for sha256 against some faster hash from libsodium

For interfacing, we have 2 options: either via Cython (like we already do for OpenSSL) or via some existing Python wrapper for libsodium.

@namelessjon

namelessjon commented Nov 6, 2015

https://www.imperialviolet.org/2014/02/27/tlssymmetriccrypto.html < not a perfect comparison (it's ChaCha20, not XSalsa20, but I believe the performance is supposed to be similar), and it's from one or more Intel processor generations ago, but there it is. However, I think libsodium has since picked up an assembly version of Salsa20, which should be faster.

@ThomasWaldmann
Member Author

pysodium crypto_generichash (256bit) is 2.8 times faster than sha256 from python stdlib.
pysodium crypto_generichash (512bit) is 1.8 times faster than sha512 from python stdlib.

note: sha256 eats most of the cpu time for borgbackup currently (when using hw accel. aes and lz4).

But: no AES256-CTR in libsodium yet. jedisct1/libsodium#317
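The kind of micro-benchmark behind those ratios could look like this (a sketch assuming the pysodium binding; crypto_generichash is BLAKE2b, and absolute numbers will of course differ per machine):

# Sketch only: hashlib sha256/sha512 vs. libsodium generichash throughput.
import hashlib
import time
import pysodium

data = b"x" * (256 * 1024 * 1024)

def _time(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def bench(label, fn, repeat=3):
    best = min(_time(fn) for _ in range(repeat))
    print(f"{label}: {len(data) / best / 1e6:.0f} MB/s")

bench("sha256             ", lambda: hashlib.sha256(data).digest())
bench("generichash 256 bit", lambda: pysodium.crypto_generichash(data, outlen=32))
bench("sha512             ", lambda: hashlib.sha512(data).digest())
bench("generichash 512 bit", lambda: pysodium.crypto_generichash(data, outlen=64))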

@namelessjon

namelessjon commented Nov 7, 2015

Judging from how that thread has evolved, it seems unlikely that aes-ctr will be added to libsodium. I think I agree with the reasoning, too. The nice thing about libsodium is that it inherits "hard to mess up" from NaCl.

@namelessjon

Though it does complicate migrating to different algorithms without adding more dependencies.

@ThomasWaldmann
Member Author

ThomasWaldmann commented Dec 10, 2015

openssl 1.1.0 is scheduled for an April 2016 release. Update: borg now uses openssl 1.1.x.
https://www.openssl.org/news/changelog.html#x0

  • chacha20 / poly1305
  • ocb mode

@infectormp
Contributor

infectormp commented Mar 3, 2016

Hash news from Google:

  • siphash
  • highwayhash

@enkore
Contributor

enkore commented Apr 23, 2016

I have a branch where I worked on the LoggedIO write performance and managed to double it when processing large files (45 MB/s => 90 MB/s, vmlocked input, output onto tmpfs; dd does ~500 MB/s here), mainly by scheduling syncs in a way that gives the kernel a chance to do them when it wants to, without compromising transactionality (and indeed, syncs no longer make a significant appearance in the profile).

Adding a none64 encryption using SHA512-256 moved it to ~110 MB/s.

Profiled it there (with Cython profiling enabled):

  1. 40 % Chunker.__next__
  2. 35 % _hashlib.openssl_sha512
  3. 10 % CRC32 (which matches very well with CRC32 stand-alone giving me approx 1 GB/s)
  4. 5 % buffered IO writes
  5. 4 % the bytes-join in Plaintext64Key.encrypt
  6. 3.5 % compression (none was enabled)

So it seems to me that the Chunker is the next big target for optimization, i.e. mainly seeing what the compiler does there and whether there is anything left to optimize.

Btw, I'm currently using that branch in production, no issues so far. So a PR for that will probably come this weekend.

Extraction is basically 70 % SHA-512, 20 % CRC-32 and 10 % IO+misc (for ~210 MB/s). Normal plaintext w/ SHA-256 is 160 MB/s or so. I'd say extraction speed is acceptable for my CPU (which is old and has 'AMD' lasered onto the lid).

@ThomasWaldmann
Member Author

ThomasWaldmann commented May 16, 2016

As Debian stable and Ubuntu LTS now have libsodium, I've begun working on a Cython-based libsodium binding for borg. It gives us chacha20-poly1305 as a new AEAD cipher and blake2b as a new hash.

Strange, I am seeing less than expected speedup:

speedup sha256 -> sha512: 1.4616587679833115
speedup sha256 -> blake2b: 1.5823430737200959
speedup sha512 -> blake2b: 1.0825666758755845

I first thought this was maybe caused by a slow blake2b in libsodium 1.0.8 on Ubuntu, so I manually installed 1.0.10 (which has "record speed avx2 blake2b") - but it doesn't get faster. https://blake2.net/ says blake2b should be about 3x faster than sha512, so what's going wrong here?

@ThomasWaldmann ThomasWaldmann changed the title from "interesting hashes / macs / ciphers" to "interesting hashes / macs / ciphers / checksums" May 21, 2016
@ThomasWaldmann
Member Author

ThomasWaldmann commented May 21, 2016

Quote from python docs: "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly."

Quote from stackexchange: "Do note that Adler32 is almost useless for short runs of data. Up to about 180 bytes, it produces numerous collisions."

>>> from zlib import *
>>> data=b'x'*1000000000
>>> # dt computes the runtime of given function in seconds
>>> dt(lambda: crc32(data))
1.1496269702911377
>>> dt(lambda: adler32(data))
0.49709367752075195
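The dt helper is not shown above; a minimal version (an assumption: plain wall-clock timing of a zero-argument callable) would be:

# Sketch only: wall-clock runtime of a callable, in seconds.
import time

def dt(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0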

@enkore
Contributor

enkore commented May 21, 2016

CRC32 is already around 1 GB/s (even on my older CPUs), and should be [much] faster on CPUs with CLMUL (although I'm not sure whether zlib makes use of that; if it doesn't, getting an implementation that does, or nudging Python into using one, would make sense and comes for free, except for the hassle).

For 2.0 it would make sense to switch to something as fast as CRC32 (blake) but with much higher integrity guarantees. E.g. 128+ bit blake checksums on the Repository layer.
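For instance, a 128-bit BLAKE2b checksum is a one-liner with hashlib (Python >= 3.6); a sketch:

# Sketch only: 128-bit BLAKE2b digest as a stronger repository-layer checksum.
import hashlib

def repo_checksum(data: bytes) -> bytes:
    return hashlib.blake2b(data, digest_size=16).digest()   # 128 bits

print(repo_checksum(b"segment entry").hex())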

@enkore
Contributor

enkore commented May 24, 2016

All tests were made with openssl speed -evp <algorithm>. (Note: AES-NI always comes together with CLMUL, so I don't mention it separately.)

AMD K10 'Thuban', 3.3 GHz, no AES-NI, OpenSSL master (to-be 1.1)

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           32807.44k    80855.72k   153107.35k   191174.64k   203030.09k   202454.73k
sha512           22194.31k    88633.71k   174282.49k   272886.07k   312806.06k   313252.22k
blake2s256       30131.61k   121603.55k   248297.91k   334311.34k   345933.32k   338420.01k
blake2b512       29113.06k   115848.20k   340357.74k   517671.73k   574694.83k   582163.52k


aes-256-cbc      65277.71k    71006.95k    73144.51k   184525.14k   185352.90k   186289.92k
aes-256-gcm      50956.21k    56946.58k    57792.17k    60589.06k    59232.56k    58938.56k
aes-256-ocb      61807.33k    65769.66k    66718.91k    66782.89k    66816.68k    67196.32k
chacha20-poly1305   168435.67k   321127.98k   360382.55k   381305.75k   384996.69k   382865.04k

Intel Xeon E3-1231v3, 3.4 GHz, AES-NI

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           55129.16k   145301.59k   311774.38k   428385.09k   468527.79k   476840.16k
sha512           36564.32k   148308.25k   301326.17k   534995.48k   683207.34k   697860.10k
blake2s256       42168.03k   163007.85k   341910.18k   473521.49k   532892.33k   540114.94k
blake2b512       41371.63k   163340.95k   474446.42k   712455.17k   835794.26k   846603.47k

aes-256-cbc     571487.46k   600316.16k   608165.89k   609246.21k   611030.36k   613452.03k
aes-256-ctr     420332.94k  1373547.81k  2840697.54k  3595480.49k  3863664.23k  3909044.31k
aes-256-gcm     373152.65k  1071874.69k  2080868.78k  2579107.16k  2893392.55k  2940644.01k
aes-256-ocb     345922.01k  1456611.33k  2691367.08k  3528726.53k  3820748.80k  3855936.17k
chacha20-poly1305   282856.40k   509213.58k  1095028.99k  1905031.85k  2016478.61k  2010589.87k

powermac G5, dual core, 2 GHz, OpenSSL master (to-be 1.1), configured for ppc64.

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256            8032.07k    24662.84k    57131.78k    86498.65k   102213.69k   102951.59k
sha512            7186.19k    28637.80k    66281.56k   118863.87k   155130.18k   157947.22k
blake2s256        8173.06k    32977.38k    69935.79k    98463.06k   112172.86k   112831.15k
blake2b512        7160.29k    28813.66k    87564.20k   152836.62k   195205.22k   198838.20k

aes-256-cbc      50287.13k    58998.33k    63115.43k    64246.10k    64804.47k    64823.65k
aes-256-gcm      33388.54k    37288.54k    38743.81k    39149.57k    39439.41k    39436.67k
aes-256-ocb      37836.55k    43085.70k    44221.53k    44683.95k    44921.75k    44772.01k
chacha20-poly1305    63010.86k   122622.37k   218386.26k   239075.57k   246295.21k   247820.33k

X200, Intel P8600, 2.4 GHz, no AES-NI, OpenSSL git 38e19eb96

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           37834.40k    82205.76k   142650.64k   172888.69k   188077.40k   190355.78k
sha512           25130.82k   101252.47k   156723.29k   219927.55k   250352.98k   252177.07k
blake2s256       23435.04k    92726.06k   184286.46k   246177.13k   273435.31k   275256.66k
blake2b512       21004.03k    82610.52k   246223.54k   382414.17k   457061.72k   460215.64k


aes-256-cbc     128602.15k   149870.77k   155655.25k   157678.25k   156516.85k   157543.08k
aes-256-gcm      40657.10k    46472.35k   120275.88k   129553.07k   131236.39k   131164.84k
aes-256-ocb     107630.54k   123701.66k   125957.38k   129277.61k   130329.51k   128839.82k
chacha20-poly1305   143071.17k   253906.65k   395730.28k   412351.74k   423469.06k   420422.21k

X201, Intel i5-520M (1st gen), AES-NI, 2.5 GHz, OpenSSL master (to-be 1.1)

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           27347.57k    70701.40k   132937.71k   180204.41k   198897.91k   200478.58k
sha512           18721.99k    74486.12k   138151.43k   203135.40k   254793.19k   259274.05k
blake2s256       29862.39k   120343.92k   213276.18k   268852.92k   276566.88k   283360.73k
blake2b512       22243.32k    92160.97k   260590.30k   400471.62k   474613.69k   479858.77k

aes-256-cbc     411130.34k   451687.22k   469638.53k   471725.87k   472540.95k   471177.45k
aes-256-gcm     198351.23k   435388.37k   570959.46k   620392.81k   632221.72k   630816.99k
aes-256-ocb     191981.68k   603404.07k  1036441.86k  1254013.08k  1342182.23k  1355308.67k
chacha20-poly1305   151109.65k   275360.32k   468868.72k   495142.83k   503227.96k   504269.83k

Odroid-C2, ARM Cortex-A53 (NEON acceleration), 2 GHz, AArch64 mode, 2G RAM, OpenSSL master (to-be-1.1)

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           10262.06k    29336.60k    62571.67k    87548.95k    99074.37k   100018.35k
sha512            7540.40k    29637.11k    68571.21k   119539.12k   152745.54k   155860.28k
blake2s256        6914.23k    27511.27k    45578.94k    54736.87k    58282.55k    58521.26k
blake2b512        5008.73k    20017.75k    51662.25k    67957.66k    74815.75k    75377.29k

aes-256-cbc      40206.14k    47970.14k    50627.05k    51336.65k    51544.50k    51555.35k
aes-256-gcm      23389.15k    26157.46k    27123.45k    27378.10k    27464.90k    27478.46k
aes-256-ocb      35397.02k    40917.59k    42634.34k    43232.19k    43095.87k    43238.57k
chacha20-poly1305    51664.38k   106305.89k   208580.15k   235916.72k   247165.12k   247984.32k

A modern ARM core with NEON performs quite well for AES, and extremely well for ChaCha20-Poly1305 (at ~250 MB/s). SHA-2 is faster than BLAKE2 since AArch64 includes instructions for SHA.

As expected, the chacha20-poly1305 scheme is by far the fastest in software[1]. AES-OCB is faster than GCM but doesn't quite get "nearly as fast" as CBC.

A test on HW with AES-NI and CLMUL would be interesting to see how GCM and OCB compare there.

Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish) OCB is more than twice as fast as GCM.

Update: Added results for a Haswell desktop CPU. The ratios almost exactly match Thomas' results as one would expect (both are Haswell).

Update: Added results for ARM Cortex-A53 (amlogic s905), AArch64

[1] but I still find it surprisingly fast even on the G5.

@ThomasWaldmann
Member Author

we don't need to compare gcm and cbc modes: cbc does not have auth, so the comparison would be gcm vs. cbc+auth (hmac or whatever).
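For clarity, the two shapes being compared, sketched with the cryptography package (keys, headers and counter management are simplified):

# Sketch only: AEAD (AES-256-GCM, auth built in) vs. a manual
# encrypt-then-MAC composition (AES-256-CTR + HMAC-SHA256).
import hashlib
import hmac
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

data = b"chunk data"

# AEAD: one key, tag is part of the output
gcm_key = os.urandom(32)
nonce = os.urandom(12)
sealed_gcm = nonce + AESGCM(gcm_key).encrypt(nonce, data, None)

# encrypt-then-MAC: separate encryption and MAC keys, composed by hand
enc_key, mac_key = os.urandom(32), os.urandom(32)
iv = (0).to_bytes(16, "big")
enc = Cipher(algorithms.AES(enc_key), modes.CTR(iv)).encryptor()
ct = enc.update(data) + enc.finalize()
sealed_ctr_hmac = hmac.new(mac_key, iv + ct, hashlib.sha256).digest() + iv + ct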

i am a bit unsure about ocb. although the patent situation seems unproblematic by now, it hindered wide usage until recently, so one could suspect ocb is far less tested in practice than gcm.

also, i am not convinced that we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which is already available and packaged (but adds an extra dependency).

@ThomasWaldmann
Member Author

ThomasWaldmann commented May 24, 2016

i5-4200u with aes-ni, openssl 1.0.2:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha256           36164.96k    96595.22k   207142.23k   288822.61k   327669.08k
sha512           26040.90k   104421.67k   221632.00k   371365.89k   470551.21k
aes-256-cbc     386736.39k   410446.76k   416114.01k   417011.03k   417680.04k
aes-256-gcm     289575.60k   828874.30k  1474338.99k  1708658.35k  1785738.58k

openssl 1.1.0 git master:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           36976.78k    98577.73k   211887.62k   292908.18k   327972.18k   332016.30k
sha512           24729.87k   100302.29k   216779.00k   368207.19k   467856.04k   477997.74k
blake2s256       25418.42k   107899.43k   224348.93k   315508.10k   353039.58k   372823.38k
blake2b512       27473.39k   108173.14k   317090.85k   488826.20k   580064.60k   587563.01k

aes-256-cbc     386987.58k   410620.12k   416009.47k   418933.81k   417838.42k   412663.81k
aes-256-gcm     272809.14k   721350.98k  1485755.59k  1755807.06k  1979094.36k  2008274.26k
aes-256-ocb     258501.45k   993106.39k  1864350.38k  2342698.67k  2612682.75k  2639790.08k
chacha20-poly1305   190241.36k   332894.29k   709114.61k  1299729.07k  1386345.81k  1397286.70k

@enkore
Contributor

enkore commented May 24, 2016

also, i am not convinced that we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which is already available and packaged (but adds an extra dependency).

I used OpenSSL here mainly because it's a convenient way to test: while on x86 I don't expect performance differences between *ssl and NaCl/libsodium, re-testing should be done with the library actually used in the end, to ensure it has the performance level we expect(ed).

@ThomasWaldmann
Member Author

Somewhat embarrassing that we can encrypt+auth 4-8 times faster than we can compute any easily and separately available hash.

@enkore
Contributor

enkore commented May 24, 2016

AES is cheating with its dedicated per-round instructions :D We could use a hash/MAC constructed from AES, but in my perception they all have many more caveats than typical MACs.

Another thing to consider is that more recent ARM chips also include acceleration for AES. Newer Raspis (at least the v3) are running on an A53 core that includes that.

@enkore
Contributor

enkore commented Jun 4, 2016

I added another set of results above, for a 1st gen i5 (and also some for the previous Core2 processor). Generally in line with other observations, except...

Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish) OCB is more than twice as fast as GCM.

@ThomasWaldmann
Member Author

Wow, that's a surprising result. It's just a pity that it will likely take quite some time until aes-ocb (openssl 1.1) is widely available and packaged - and by then many of these 1st gen Core i machines might be gone anyway.

@ThomasWaldmann
Member Author

About AES-GCM, see Black Hat 2016; the paper "Nonce-Disrespecting Adversaries" is public on the IACR eprint archive:

https://twitter.com/tqbf/status/760907360319660032

@ThomasWaldmann
Member Author

ThomasWaldmann commented Jul 30, 2021

LibreSSL 2.8.3 on macOS 12.0b4, Apple M1 CPU:

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha256               67477.74k   154138.47k   262916.43k   319208.81k   340417.22k
sha512               57932.16k   231396.08k   350646.38k   481570.49k   546502.26k
aes-256-cbc         228120.47k   237817.67k   234978.42k   239252.78k   240820.57k
aes-256-gcm         142843.72k   145801.57k   143040.33k   142580.08k   142298.70k
aes-256-ctr         251498.14k   268501.38k   270107.85k   271982.03k   273654.51k
chacha20 poly1305    43378.82k   169592.00k   287290.88k   345544.80k   369688.69k

Note: no aes-ocb, no blake2b.

@Maryse47

Why is sha256 (significantly) slower than sha512?

@ThomasWaldmann
Member Author

@Maryse47 that is expected for 64bit platforms, where sha512 is usually faster than sha256.

so nowadays it is kind of pointless to use a software implementation of sha256, because one could just use sha512 (and throw away half of the result if 256 bits are wanted). the only exception (see above) is CPU hw accelerated sha256, which might be faster again if sha512 is not hw accelerated.

borg uses sha256 mostly due to historical reasons, but we also have the fast blake2b algo (fast in software, there is no hw acceleration for that).
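A sketch of the truncation trick with hashlib (note: the standardized SHA-512/256 uses different initial values than plain truncated SHA-512, so the two produce different digests):

# Sketch only: 256-bit digest obtained from sha512, which is faster in software
# on 64-bit CPUs than sha256.
import hashlib

data = b"some chunk data"
digest = hashlib.sha512(data).digest()[:32]          # 256 bits, truncated sha512
# the standardized variant, where the OpenSSL build provides it:
digest_std = hashlib.new("sha512_256", data).digest()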

@enkore
Contributor

enkore commented Aug 2, 2021

OpenSSL 1.1.1k on a Ryzen 5600X locked at 4.6 GHz all cores (if you're thinking "Hey, that seems way better than Zen 2 CPUs which almost never hit their advertised clocks even on the best core with light loads" you'd be right. Zen 3 parts always hit their advertised clocks because (1) AMD did not bullshit this time (2) the default GFL is 50 MHz above the advertised clock. These will always hit 4650 MHz on pretty much any core, even under load.)

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          293473.80k   731471.10k  1500114.58k  2005120.14k  2254721.35k  2250309.63k
sha512           93705.30k   378856.55k   635270.24k   927596.54k  1054145.53k  1064222.72k
blake2s256       59266.68k   228436.29k   443893.58k   599372.60k   665122.79k   665390.10k
blake2b512       43775.31k   175761.00k   511203.16k   813573.12k   997690.03k  1023606.78k

aes-256-cbc     975053.72k  1182504.63k  1231511.72k  1238711.03k  1236388.52k  1240951.47k
aes-256-gcm     579415.38k  1531545.54k  3248525.31k  4580585.19k  5483003.90k  5563645.95k
aes-256-ctr     625653.43k  2292196.01k  5539421.87k  8358842.09k 10065625.00k 10040295.42k
aes-256-ocb     470562.27k  1772509.65k  4480814.42k  6982971.63k  8479542.53k  8470701.18k
chacha20-poly1305   363805.49k   648929.49k  1576549.03k  2895763.59k  3068497.39k  3116335.10k

Note the massive improvement in ChaCha20-Poly1305 over Zen 2 (almost +50 %), and in all other pipelinable modes (GCM, CTR, OCB). Zen 3 has more SIMD EUs and seems to have gained another EU capable of AES-NI. The higher AES-CBC performance is likely due to much higher sustained clocks under load compared to my 3900X above.

Also note how even all the hashes see massively improved performance.

During these benchmarks the active core pulls around 4-6 W. The whole CPU is running at around 40 W; 3/4 of that is uncore - the MCM / chiplet architecture is a "gas guzzler".

@ThomasWaldmann
Member Author

10 GB/s AES in counter mode, whoa! 3 GB/s chacha is also quite fast.

@ThomasWaldmann ThomasWaldmann added this to the helium milestone Feb 22, 2022
@ThomasWaldmann
Member Author

Had a quick test with the PyPI blake3 package on an Apple MBA, macOS 12, M1 CPU:

hmac-sha256  1GB        0.681s
blake2b-256  1GB        2.417s
blake3-256   1GB        1.070s

Notable:

  • sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3
  • blake3 much faster than blake2
  • blake3 pypi even has wheels for macOS arm64
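For reference, 256-bit keyed digests of the three kinds in that table could be computed like this (a sketch; it assumes the third-party blake3 package's keyed mode and simplifies key handling):

# Sketch only: three ways to get a 256-bit keyed digest / MAC.
import hashlib
import hmac
import os
import blake3

key = os.urandom(32)
data = b"x" * (1024 * 1024)

tag_hmac_sha256 = hmac.new(key, data, hashlib.sha256).digest()
tag_blake2b_256 = hashlib.blake2b(data, key=key, digest_size=32).digest()
tag_blake3_256 = blake3.blake3(data, key=key).digest()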

@ThomasWaldmann
Member Author

Would be cool if PR #6463 could get some review.

@ThomasWaldmann
Member Author

ThomasWaldmann commented Mar 24, 2022

About adding blake3 support via https://github.com/oconnor663/blake3-py to borg:

How many platform / compile / installation and packaging issues would we likely run into by doing so?

  • due to it using rust? there is also C code, but "experimental".
  • for pip installs, if no platform wheel is available?
  • esp. considering the less widespread platforms, like NetBSD, OpenBSD, OpenIndiana?

Other options for blake3 support?

Didn't find a libb(lake)3(-dev) package on ubuntu, debian, fedora.

Issue on the python tracker: https://bugs.python.org/issue39298

@ThomasWaldmann
Member Author

https://lwn.net/Articles/681616/ old, but partly still relevant I guess.

@ThomasWaldmann
Member Author

I played around a bit with blake3:

  • the portable (generic) blake3 C code is only a bit faster than the blake2b code we already use
  • didn't get the NEON accelerated C code to compile
  • code from the blake3-py (rust-based) pypi package (macOS ARM wheel) showed about double the blake2b performance

@py0xc3
Contributor

py0xc3 commented Apr 1, 2022

Had a quick test with pypi blake3 package on Apple MBA, macOS 12, M1 CPU:

hmac-sha256  1GB        0.681s
blake2b-256  1GB        2.417s
blake3-256   1GB        1.070s

Notable:

  • sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3
  • blake3 much faster than blake2
  • blake3 pypi even has wheels for macOS arm64

This is even more impressive given the fact that HMAC runs SHA256 twice. It would be interesting to compare SHA256 with the SHA extensions against Blake2 with AVX2 (the M1 does not have AVX2), although I do not know if hashlib's Blake2 implementation already makes use of AVX2. Unfortunately, I currently have neither the SHA extensions nor AVX2. Maybe I can add a benchmark at some point when I get a new machine. Maybe someone else already has the possibility?

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.3.x, 1.3.0a1 Apr 8, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.3.x, 1.3.0b1 Apr 9, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b1, 2.x Jul 27, 2022
@infectormp
Contributor

"The many flavors of hashing": an article about different types of hash functions and algorithms.

@enkore
Contributor

enkore commented Sep 3, 2022

This is even more impressive given the fact that HMAC runs SHA256 twice.

The hash function is invoked twice in HMAC, yes, but the message is only hashed once. The outer hash fn invocation only processes the outer key and inner hash.
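That structure is easy to see when HMAC is spelled out by hand; a sketch for SHA-256 (64-byte block size, assuming a key no longer than one block):

# Sketch only: HMAC-SHA256 written out. The message is hashed exactly once,
# in the inner call; the outer call only hashes 64 + 32 bytes.
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    block = 64                                    # SHA-256 block size
    key = key.ljust(block, b"\x00")               # keys > 64 bytes would be hashed first
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5c for b in key)
    inner = hashlib.sha256(ipad + msg).digest()   # the only pass over msg
    return hashlib.sha256(opad + inner).digest()

assert hmac_sha256(b"key", b"message") == hmac.new(b"key", b"message", hashlib.sha256).digest()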

Alder Lake results look basically the same as the Zen 3 results above, just with equivalent performance at lower clocks and lower power.

@rugk
Contributor

rugk commented Jan 30, 2023

An interesting new encryption algorithm is AEGIS, which is based on AES but, from my understanding, builds on top of what has been learned from AES block cipher modes / encryption schemes…

https://datatracker.ietf.org/doc/draft-irtf-cfrg-aegis-aead/00/

@enkore
Contributor

enkore commented Jun 29, 2023

Some benchmarks again, in a roughly historical order.

Intel Xeon Gold 6230 CPU (Cascade Lake = Skylake, 14nm), OpenSSL 1.0.2k-fips 26 Jan 2017 (=RHEL 7.9)

type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              85504.16k   239891.14k   499248.12k   687352.49k   775319.37k
sha256           62643.81k   163584.06k   338441.59k   454458.71k   505399.82k
sha512           45636.55k   184678.49k   367671.74k   600479.76k   738538.84k

aes-256-cbc     969298.83k  1045541.12k  1059651.13k  1063793.53k  1069484.71k
aes-256-gcm     627073.36k  1413484.99k  2554710.39k  3718083.38k  4323027.63k

The remainder are OpenSSL 1.1.1k FIPS 25 Mar 2021 (RHEL 8)

Intel Xeon Platinum 8358 CPU (Ice Lake, 10nm / Intel 7)

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  82856.64k   223238.38k   453702.49k   610063.36k   680492.18k   681306.79k
sha256               96116.06k   299662.45k   699202.90k  1039871.32k  1233153.54k  1243474.60k
sha512               42121.62k   170274.01k   321573.55k   501962.75k   599378.60k   610334.14k
blake2s256           64056.02k   255429.01k   386962.94k   450678.44k   479584.26k   483678.69k
blake2b512           53140.32k   215689.87k   544194.30k   701760.85k   775902.55k   787325.17k

aes-256-cbc         905916.56k  1120170.35k  1162550.02k  1171404.29k  1169083.05k  1169053.01k
aes-256-gcm         524914.40k  1490840.90k  3162930.69k  4295296.43k  5126362.45k  5202695.51k
aes-256-ocb         513380.86k  1833849.56k  4120666.03k  5847052.67k  6601149.10k  6706571.95k
chacha20-poly1305   283249.45k   574282.20k  1824270.34k  3251743.74k  3762856.84k  3789111.30k

Intel Gold 5318N CPU (also Ice Lake, different segment)

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  65826.20k   184671.12k   374096.74k   500161.19k   556992.98k   561110.90k
sha256               77679.44k   241026.69k   569760.71k   858738.73k  1008492.74k  1017107.80k
sha512               33756.66k   141953.56k   265739.18k   412468.91k   491859.74k   499503.78k
blake2s256           51620.99k   212150.25k   319860.10k   371007.19k   394276.30k   394657.79k
blake2b512           44016.77k   178105.90k   450999.31k   578862.41k   636781.42k   641788.59k

aes-256-cbc         785405.33k   929806.40k   954328.27k   958396.87k   959707.87k   956410.54k
aes-256-gcm         467454.45k  1249941.55k  2595831.89k  3565374.50k  4200558.96k  4268459.41k
aes-256-ocb         414959.50k  1497825.82k  3194263.47k  4780095.02k  5413427.03k  5500996.49k
chacha20-poly1305   209454.20k   451417.77k  1462585.71k  2586606.53k  2970342.49k  2987840.85k

AMD EPYC 9454 (Zen 4, 5nm) @ 3.8 GHz. Zen 4 has VAES instructions, but it's unclear to me whether these are supposed to double the AES throughput or are just a different encoding of the existing AES-NI instructions. In any case, the OpenSSL version used in RHEL 8 is too old to know about VAES.

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                 111832.52k   285128.70k   551276.22k   717113.71k   783878.83k   792075.99k
sha256              150723.91k   449164.61k  1048627.37k  1571731.44k  1837775.20k  1860225.11k
sha512               68714.20k   274766.41k   494378.72k   753474.27k   879861.76k   895667.80k
blake2s256           85720.28k   342488.17k   486073.51k   550913.37k   570242.39k   573834.53k
blake2b512           72041.86k   288009.01k   727135.91k   919075.96k   993648.50k   999615.79k

aes-256-cbc         906857.08k  1021701.54k  1052213.42k  1064056.21k  1066768.27k  1067009.37k
aes-256-gcm         714170.81k  1763196.91k  3377087.10k  4257207.40k  4673672.53k  4730570.40k
aes-256-ocb         626893.67k  2318789.95k  5100248.79k  6748232.70k  7473972.57k  7559254.30k
chacha20-poly1305   296382.31k   556920.14k  1814487.55k  3516808.13k  3887621.82k  3900347.73k

What do we learn from this? Well, in terms of the SHA and AES-NI extensions, x86 CPUs are very, very uniform these days, especially in server parts, where Intel cores typically have more FP resources than in client parts. If you normalize to clock speed, they're all pretty much the same.

Zen 3 to 4 has no changes at all here, unless VAES makes a difference.

Re-test with OpenSSL 3.1.1

VAES does seem to make a difference. A 2x difference. OpenSSL uses VAES for AES-GCM and AES-MB (multi-buffer, which interleaves encryption/decryption of independent streams and is not used here). It's also in a few stitches of AES-CBC and various SHAs, but not in AES-CTR or AES-OCB. Build flags:

version: 3.1.1
built on: Thu Jun 29 10:06:15 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG

gcc 8.5.0 (the RHEL 8 patchfest)

VAES (AVX512F) seems to just perform one encryption/decryption round on four independent blocks,
while VAES (AVX512VL) does the same, but on an AVX2 register with two blocks.

However, I'm not sure if the results below are actually VAES' doing and if this actually uses VAES with larger than 128 bit registers, because as far as I can tell the code generator uses xmm registers with VAESENC, which would use the AVX512VL encoding and hence should be equivalent to the traditional AES-NI in terms of performance.

So maybe it's just a better implementation in OpenSSL 3.x compared to the old 1.1.x series.

In any case, despite being a somewhat terrible construction, AES-GCM just doesn't seem to be able to stop winning. Almost 11 GB/s at just 3.8 GHz is impeccable performance (that's 0.35 cpb). AES-CTR is quite a bit slower at just 8.6 GB/s. The 128 bit variants are not much faster; 12.5 GB/s and 9.8 GB/s, respectively.

The Ice Lake Xeon performs even a bit better than the Zen 4 EPYC still at just below 0.3 cpb.

AMD EPYC 9454

CPUINFO: OPENSSL_ia32cap=0x7efa320b078bffff:0x415fdef1bf97a9
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                 104800.85k   271984.80k   537514.55k   711293.35k   782983.17k   791643.10k
sha256              147387.25k   442508.86k  1041297.72k  1565552.50k  1830996.65k  1860800.47k
sha512               63776.48k   254963.31k   477623.57k   743322.63k   880691.88k   895717.12k
blake2s256           81892.92k   326048.83k   513686.77k   611124.91k   650505.08k   653639.41k
blake2b512           66261.11k   264983.76k   680883.63k   925331.26k  1036629.33k  1048839.02k

AES-256-CBC         907505.92k  1022436.75k  1052622.42k  1064306.90k  1066902.52k  1063479.98k
AES-256-GCM         735854.55k  2656460.46k  5900279.78k  6939782.94k 10419991.89k 10804415.10k
AES-256-OCB         697220.41k  2601784.53k  5363791.96k  6891523.42k  7504102.14k  7545640.28k
ChaCha20-Poly1305   296290.55k   555132.90k  1836174.68k  3647579.78k  3896452.20k  3916316.67k

Intel Xeon Platinum 8358 CPU (the Xeon Gold 5318N behaves the same way and has the same CPUID flags)

CPUINFO: OPENSSL_ia32cap=0x7ffef3f7ffebffff:0x40417f5ef3bfb7ef
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  80569.01k   218127.89k   449260.99k   607212.20k   679779.83k   683201.88k
sha256               92481.01k   285519.68k   680692.36k  1033920.85k  1230120.58k  1241890.82k
sha512               42249.83k   170846.44k   323417.00k   504486.79k   599127.38k   608157.70k
blake2s256           62037.24k   247927.21k   397370.62k   473488.70k   501844.65k   506309.44k
blake2b512           50842.18k   205802.25k   538162.01k   716786.64k   797903.53k   805612.20k

AES-256-CBC        1025099.80k  1140883.63k  1162179.58k  1171024.48k  1168703.49k  1168632.49k
AES-256-GCM         654373.75k  2529894.21k  4690537.90k  6565636.10k 11066534.44k 11562456.41k
AES-256-OCB         566158.67k  1972074.24k  4249955.57k  5881506.47k  6632761.02k  6708450.65k
ChaCha20-Poly1305   278457.26k   569466.01k  1851996.13k  3319006.21k  3773874.18k  3813877.38k

@ThomasWaldmann
Member Author

Also interesting: while sha512 used to be faster than sha256 in pure software implementations, it's the other way around with sha2 hw acceleration, and hw-accelerated sha256 is also faster than pure sw blake2 (as expected).

@infectormp
Contributor

might be interesting
https://github.com/Blosc/c-blosc2
Blosc (c-blosc2) is a high-performance compressor focused on binary data for efficient storage of large binary data-sets in-memory or on-disk and helping to speed-up memory-bound computations.

@ThomasWaldmann
Member Author

@infectormp IIRC, a talk from a blosc developer or user was the first time I heard about lz4 (and how they use it to get data into the CPU cache faster than reading uncompressed memory). But blosc has quite a lot more stuff than we need.
