interesting hashes / macs / ciphers / checksums #45
If/when you do feel you can use libsodium, that also opens up two things which could be useful:
|
@namelessjon interesting, thanks. about the counter re-use issue: there is also this idea of creating per-backup(-thread) random session keys, start counter from 0, encrypt the keys with the master key and store them with the backup. |
@ThomasWaldmann That sounds less fragile than the current implementation at least (I think). The secretbox option from libsodium, despite the larger nonce, adds the same 40 byte overhead to the files, because the Poly-1305 MAC is only 16 bytes. I guess if you're now including a pointer to the encrypted session key with each encrypted blob, secretbox would have less overhead, but that shouldn't be that significant anyway? |
I had a look at libsodium yesterday, seems pretty nice and it also is in some stable linux distributions now. It would be useful to get some comparative performance values: For interfacing, we have 2 options: either via cython (like we use it for openssl right now) or using some python wrapper for libsodium. |
https://www.imperialviolet.org/2014/02/27/tlssymmetriccrypto.html < not a perfect comparison (it's ChaCha20, not XSalsa20, but I believe the performance is supposed to be similar), and that's one or more intel processor generations ago, but there's that. However, I think since then libsodium has picked up an assembly version of Salsa20 which should be faster. |
pysodium crypto_generichash (256bit) is 2.8 times faster than sha256 from python stdlib. note: sha256 eats most of the cpu time for borgbackup currently (when using hw accel. aes and lz4). But: no AES256-CTR in libsodium yet. jedisct1/libsodium#317 |
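A rough way to reproduce this kind of comparison without pysodium is sketched below. It is only an approximation of the measurement above: it uses the `hashlib` BLAKE2b from the Python stdlib rather than libsodium's `crypto_generichash`, and the helper names (`throughput_mb_s`, `benchmark`) are made up for this example. Absolute numbers depend heavily on CPU and on how Python and OpenSSL were built.

```python
import hashlib
import time

# Candidate hashes; blake2b is truncated to 256 bits to match the sha256 output size.
ALGOS = {
    "sha256": hashlib.sha256,
    "sha512": hashlib.sha512,
    "blake2b-256": lambda data: hashlib.blake2b(data, digest_size=32),
}


def throughput_mb_s(make_hash, data, repeats=3):
    """Best-of-N throughput in MB/s for hashing `data` in one call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        make_hash(data).digest()
        best = min(best, time.perf_counter() - start)
    return len(data) / best / 1e6


def benchmark(data, repeats=3):
    """Return {algorithm name: MB/s} for every entry in ALGOS."""
    return {name: throughput_mb_s(fn, data, repeats) for name, fn in ALGOS.items()}


if __name__ == "__main__":
    payload = bytes(8 * 1024 * 1024)  # 8 MiB of zeros; content does not affect speed
    for name, speed in benchmark(payload).items():
        print(f"{name:12s} {speed:8.1f} MB/s")
```

On CPUs with SHA extensions, sha256 may well come out ahead, so the ratio is worth re-measuring on the target machine.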
Seems unlikely aes-ctr will be added in libsodium from how that thread has evolved. I think I agree with the why, too. The nice thing about libsodium is it inherits "hard to mess up" from nacl. |
Though it does complicate a migration to different algorithms without adding more dependencies |
openssl 1.1.0 is scheduled for april 2016 release. update: borg uses openssl 1.1.x chacha20 / poly1305 |
Hash news from Google :
|
I have a branch where I worked on the LoggedIO write performance and managed to double it when processing large files (45 MB/s => 90 MB/s, vmlocked input, output onto tmpfs; dd does ~500 MB/s here), mainly by managing syncs in a way that gives the kernel a chance to do them when it wants to, without compromising transactionality (and indeed, syncs don't make a significant appearance in the profile anymore). Adding a none64 encryption using SHA512-256 moved it to ~110 MB/s. Profiled it there (with Cython profiling enabled):
So it seems to me that the Chunker is the next big target for optimization. i.e. mainly see what the compiler does there and if there is anything left to optimize. Btw. using that branch in production currently, nada issues so far. So a PR for that will probably come this weekend. Extraction is basically 70 % SHA-512, 20 % CRC-32 and 10 % IO+misc (for ~210 MB/s). Normal plaintext w/ SHA-256 is 160 MB/s or so. I'd say extraction speed is acceptable for my CPU (which is old and has 'AMD' lasered onto the lid). |
As debian stable and ubuntu lts now have libsodium, I've begun working on a cython-based libsodium binding for borg. It gives us chacha20-poly1305 as a new aead cipher and blake2b as a new hash. Strange, I am seeing less speedup than expected:
I first thought this is maybe caused by a slow blake2b 1.0.8 in ubuntu and I manually installed 1.0.10 (which has "record speed avx2 blake2b") - but it doesn't get faster. https://blake2.net/ says blake2b should be about 3x faster than sha512, so what's going wrong here? |
Quote from python docs: "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly." Quote from stackexchange: "Do note that Adler32 is almost useless for short runs of data. Up to about 180 bytes, it produces numerous collisions."
|
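The short-input weakness of Adler-32 quoted above can be checked empirically. The sketch below counts checksum collisions among distinct random 8-byte messages for `zlib.adler32` and `zlib.crc32`; the message length, sample size, and helper names are arbitrary choices for this example, not anything borg does.

```python
import random
import zlib


def short_messages(count, length, seed=0):
    """Return `count` distinct pseudo-random messages of `length` bytes each."""
    rng = random.Random(seed)
    messages = set()
    while len(messages) < count:
        messages.add(bytes(rng.getrandbits(8) for _ in range(length)))
    return list(messages)


def collisions(checksum, messages):
    """Number of messages whose checksum repeats one seen for another message."""
    return len(messages) - len({checksum(m) for m in messages})


if __name__ == "__main__":
    msgs = short_messages(20_000, 8)
    print("adler32 collisions:", collisions(zlib.adler32, msgs))
    print("crc32 collisions:  ", collisions(zlib.crc32, msgs))
```

For inputs this short, both Adler-32 sums stay far below their modulus, so only a small part of the 32-bit output space is reachable and collisions show up much earlier than with CRC32.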
CRC32 is already around 1 GB/s (even on my older CPUs), and should be [much] faster on CPUs with CLMUL (although I'm not sure whether zlib makes use of that - if it doesn't getting an implementation or nudging Python into using one that does would make sense and comes for free (except the hassle)). For 2.0 it would make sense to switch to something as fast as CRC32 (blake) but with much higher integrity guarantees. E.g. 128+ bit blake checksums on the Repository layer. |
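As a minimal sketch of what a 128-bit checksum could look like, here is a CRC32 helper next to a truncated BLAKE2b from `hashlib`. The function names and the 16-byte size are assumptions for illustration; this is not borg's Repository format.

```python
import hashlib
import hmac
import zlib


def crc32_checksum(data: bytes) -> bytes:
    """32-bit CRC as 4 big-endian bytes."""
    return zlib.crc32(data).to_bytes(4, "big")


def blake2b_checksum(data: bytes, size: int = 16) -> bytes:
    """Unkeyed BLAKE2b truncated to `size` bytes (16 bytes = 128 bits)."""
    return hashlib.blake2b(data, digest_size=size).digest()


def verify(data: bytes, expected: bytes, checksum=blake2b_checksum) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return hmac.compare_digest(checksum(data), expected)


if __name__ == "__main__":
    segment = b"header" + b"payload" * 1000
    stored = blake2b_checksum(segment)
    print(len(stored) * 8, "bit checksum, intact:", verify(segment, stored))
    print("after corruption:", verify(segment[:-1] + b"X", stored))
```

BLAKE2b takes the output length as a parameter, so a 16-byte digest is a properly parameterized hash rather than a cut-off 64-byte one.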
All tests made with AMD K10 'Thuban', 3.3 GHz, no AES-NI, OpenSSL master (to-be 1.1)
Intel Xeon E3-1231v3, 3.4 GHz, AES-NI
powermac G5, dual core, 2 GHz, OpenSSL master (to-be 1.1), configured for ppc64.
X200, Intel P8600, 2.4 GHz, no AES-NI, OpenSSL git 38e19eb96
X201, Intel i5-520M (1st gen), AES-NI, 2.5 GHz, OpenSSL master (to-be 1.1)
Odroid-C2, ARM Cortex-A53 (NEON acceleration), 2 GHz, AArch64 mode, 2G RAM, OpenSSL master (to-be-1.1)
A modern ARM core with NEON performs quite well for AES, and extremely well for ChaCha20-Poly1305 (at ~250 MB/s). SHA-2 is faster than Blake since AArch64 includes instructions for SHA. As expected, the chacha20-poly1305 scheme is by far the fastest in software[1]. AES-OCB is faster than GCM but doesn't quite get "nearly as fast" as CBC.
Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish), OCB is more than twice as fast as GCM.
Update: Added results for a Haswell desktop CPU. The ratios almost exactly match Thomas' results, as one would expect (both are Haswell).
Update: Added results for ARM Cortex-A53 (amlogic s905), AArch64.
[1] but I still find it surprisingly fast even on the G5. |
we don't need to compare gcm and cbc modes, cbc does not have auth, so the comparison would be gcm and cbc+auth (hmac or whatever). i am a bit unsure about ocb. although the patent stuff seems unproblematic meanwhile, it hindered wide usage until recently, so one could suspect ocb is way less practically tested than gcm. also, i am not convinced whether we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which already is available and packaged (but adds extra dependency). |
i5-4200u with aes-ni, openssl 1.0.2:
openssl 1.1.0 git master:
|
I used OpenSSL here mainly because it's a convenient way to test it: While on x86 I don't expect performance differences between *ssl and NaCl/libsodium, re-testing should be done with the library actually used in the end to ensure it has the performance level we expect(ed). |
Somehow embarrassing that we can encrypt+auth 4-8 times faster than compute any easily and separately available hash. |
AES is cheating with its dedicated per-round instructions :D Could use a hash/mac constructed from AES, but they all have many more caveats than typical MACs in my perception. Another thing to consider is that more recent ARM chips also include acceleration for AES. Newer Raspis (at least the v3) are running on an A53 core that includes that. |
I added another set of results above, for a 1st gen i5 (and also some for the previous Core2 processor). Generally in line with other observations, except...
|
Wow, that's a surprising result. It's just a pity that it likely will take quite some time until aes-ocb (openssl 1.1) is widely available and packaged - and by then many of these 1st gen Core-i machines might be gone anyway. |
About AES-GCM, see Black Hat 2016, paper is public on iacr: "nonce disrespecting adversaries" |
LibreSSL 2.8.3 on macOS 12.0b4, Apple M1 CPU
Note: no aes-ocb, no blake2b. |
why is sha256 (significantly) slower than sha512? |
@Maryse47 that is expected for 64bit platforms, where sha512 is usually faster than sha256. so nowadays it is kind of stupid to use sha256 as a software implementation because one could just use sha512 (and throw away half of the result if 256bits are wanted). only exception (see above) is CPU hw accelerated sha256, that might be faster again if sha512 is not hw accelerated. borg uses sha256 mostly due to historical reasons, but we also have the fast blake2b algo (fast in software, there is no hw acceleration for that). |
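The "use sha512 and throw away half of the result" idea can be sketched as follows. The function names are made up for this example. One caveat: the standardized SHA-512/256 uses different initial values than SHA-512, so the plain truncation below does not produce the same digest as the algorithm `hashlib` exposes as `sha512_256` (where the linked OpenSSL provides it).

```python
import hashlib


def sha512_truncated_256(data: bytes) -> bytes:
    """SHA-512 of `data`, keeping only the first 32 bytes (256 bits)."""
    return hashlib.sha512(data).digest()[:32]


def standard_sha512_256(data: bytes):
    """Standardized SHA-512/256 if this Python/OpenSSL build offers it, else None."""
    if "sha512_256" not in hashlib.algorithms_available:
        return None
    try:
        return hashlib.new("sha512_256", data).digest()
    except ValueError:
        return None


if __name__ == "__main__":
    msg = b"abc"
    print("truncated sha512:", sha512_truncated_256(msg).hex())
    std = standard_sha512_256(msg)
    print("sha512_256      :", std.hex() if std is not None else "not available")
```

Either variant gets the 64-bit speed advantage; which one to pick mostly matters for interoperability with other tools.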
OpenSSL 1.1.1k on a Ryzen 5600X locked at 4.6 GHz all cores (if you're thinking "Hey, that seems way better than Zen 2 CPUs which almost never hit their advertised clocks even on the best core with light loads" you'd be right. Zen 3 parts always hit their advertised clocks because (1) AMD did not bullshit this time (2) the default GFL is 50 MHz above the advertised clock. These will always hit 4650 MHz on pretty much any core, even under load.)
Note massive improvement in ChaCha20-Poly1305 over Zen 2 (almost +50 %), and all other pipelinable modes (GCM, CTR, OCB). Zen 3 has more SIMD EUs and seems to have gained another EU capable of AES-NI. Higher AES-CBC performance likely due to much higher sustained clocks under load compared to my 3900X above. Also note how even all the hashes see massively improved performance. During these benchmarks the active core pulls around 4-6 W. Whole CPU is running at around 40 W, 3/4 of that is uncore - MCM / chiplet architecture is a "gas guzzler". |
10GB/s AES in counter mode, woah! 3GB/s chacha also quite fast. |
Had a quick test with pypi
Notable:
|
Would be cool if PR #6463 could get some review. |
About adding blake3 support via https://github.com/oconnor663/blake3-py to borg: How many platform / compile / installation and packaging issues would we likely get by doing so?
Other options for blake3 support? Didn't find a libb(lake)3(-dev) package on ubuntu, debian, fedora. Issue on the python tracker: https://bugs.python.org/issue39298 |
https://lwn.net/Articles/681616/ old, but partly still relevant I guess. |
I played a bit around with blake3:
|
This is even more impressive given the fact that HMAC runs SHA256 twice. It would be interesting to compare SHA256 with the SHA extensions against Blake2 with AVX2 (M1 does not have AVX2), although I do not know if hashlib's Blake2 implementation already makes use of AVX2. Unfortunately, I currently have neither the SHA extensions nor AVX2. Maybe I can add a benchmark once I get a new machine. Maybe someone else already has the possibility? |
"The many flavors of hashing": an article about different types of hash functions and algorithms |
The hash function is invoked twice in HMAC, yes, but the message is only hashed once. The outer hash fn invocation only processes the outer key and inner hash. Alder Lake results look basically the same as Zen 3 above, except equivalent performance at lower clocks and lower power. |
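To make the "invoked twice, message hashed once" point concrete, here is a hand-written HMAC-SHA256 following the RFC 2104 construction, checked against the stdlib `hmac` module. It is for illustration only (`hmac_sha256_manual` is a name made up here); real code should simply call `hmac.new`.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 spelled out: H((K ^ opad) || H((K ^ ipad) || message))."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()  # overlong keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")    # then zero-padded to one block
    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)
    # Inner invocation: the only one that processes the (possibly large) message.
    inner = hashlib.sha256(inner_pad + message).digest()
    # Outer invocation: one key block plus the 32-byte inner digest.
    return hashlib.sha256(outer_pad + inner).digest()


if __name__ == "__main__":
    key, msg = b"secret key", b"x" * 1_000_000
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    print("matches stdlib hmac:", hmac_sha256_manual(key, msg) == expected)
```

The outer call always hashes 96 bytes regardless of message size, so for large chunks HMAC costs about the same as one plain hash pass.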
A new interesting encryption algorithm is AEGIS, which is based on AES but, from my understanding, builds on top of what has been learned from AES block cipher modes/encryption schemes… https://datatracker.ietf.org/doc/draft-irtf-cfrg-aegis-aead/00/ |
Some benchmarks again, in a roughly historical order. Intel Xeon Gold 6230 CPU (Cascade Lake = Skylake, 14nm), OpenSSL 1.0.2k-fips 26 Jan 2017 (=RHEL 7.9)
The remainder are OpenSSL 1.1.1k FIPS 25 Mar 2021 (RHEL 8) Intel Xeon Platinum 8358 CPU (Ice Lake, 10nm / Intel 7)
Intel Gold 5318N CPU (also Ice Lake, different segment)
AMD EPYC 9454 (Zen 4, 5nm) @ 3.8 GHz. Zen 4 has VAES instructions, but it's unclear to me if this is supposed to double the AES throughput or just a different encoding for the existing AES-NI instructions. In any case, the OpenSSL version used in RHEL 8 is too old to know about VAES.
What do we learn from this? Well, in terms of SHA and AES-NI extensions, x86 CPUs are very, very uniform these days. Especially in server parts, where Intel cores typically have more FP resources than in client parts. If you normalize to clock speed, they're all pretty much the same. Zen 3 to 4 has no changes at all here, unless VAES makes a difference.
Re-test with OpenSSL 3.1.1: VAES does seem to make a difference. A 2x difference. OpenSSL uses VAES for AES-GCM and AES-MB (multi-buffer, which interleaves encryption/decryption of independent streams and is not used here). It's also in a few stitches of AES-CBC and various SHAs, but not in AES-CTR or AES-OCB. Build flags:
VAES (AVX512F) seems to just perform one encryption/decryption round on four independent blocks. However, I'm not sure if the results below are actually VAES' doing and if this actually uses VAES with larger than 128 bit registers, because as far as I can tell the code generator uses xmm registers with VAESENC, which would use the AVX512VL encoding and hence should be equivalent to the traditional AES-NI in terms of performance. So maybe it's just a better implementation in OpenSSL 3.x compared to the old 1.1.x series. In any case, despite being a somewhat terrible construction, AES-GCM just doesn't seem to be able to stop winning. Almost 11 GB/s at just 3.8 GHz is impeccable performance (that's 0.35 cpb). AES-CTR is quite a bit slower at just 8.6 GB/s. The 128 bit variants are not much faster; 12.5 GB/s and 9.8 GB/s, respectively. The Ice Lake Xeon performs even a bit better than the Zen 4 EPYC still, at just below 0.3 cpb.
AMD EPYC 9454
Intel Xeon Platinum 8358 CPU (the Xeon Gold 5318N behaves the same way and has the same CPUID flags)
|
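The cycles-per-byte (cpb) figures in the comment above follow directly from clock speed and throughput. A small sketch of the arithmetic, assuming the core actually runs at the stated 3.8 GHz during the benchmark and rounding "almost 11 GB/s" up to 11.0 (`cycles_per_byte` is a helper name made up here):

```python
def cycles_per_byte(clock_ghz: float, throughput_gb_s: float) -> float:
    """cpb = (cycles per second) / (bytes per second); the 1e9 factors cancel."""
    return clock_ghz / throughput_gb_s


if __name__ == "__main__":
    # Throughput values quoted for the EPYC 9454 at 3.8 GHz.
    for label, gb_s in [("AES-256-GCM", 11.0), ("AES-256-CTR", 8.6)]:
        print(f"{label}: {cycles_per_byte(3.8, gb_s):.2f} cpb")
    # AES-256-GCM: 0.35 cpb
    # AES-256-CTR: 0.44 cpb
```

Normalizing to cpb is what makes results from CPUs with different clocks comparable, which is the basis for the "they're all pretty much the same" observation.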
Also interesting: while sha512 used to be faster than sha256 in a pure sw implementation, it's vice versa with the sha2 hw acceleration and it is faster than pure sw blake2 (as expected). |
might be interesting |
@infectormp IIRC, a talk from a blosc developer or user was the first time I heard about lz4 (and how they use it to get data faster into cpu cache than reading uncompressed memory). But blosc has quite a lot more stuff than we need. |
https://github.com/Cyan4973/xxHash - not a cryptographic hash fn, not for HMAC! So, maybe we could use it as a crc32 replacement (if we keep the crc32(header+all_data) approach). borg uses xxh64 at some places
siphash - cryptographic hash fn (internally used by python >= 3.4), but: only 64bits return value. a 128bit version is "experimental".
libsodium has some hashes / macs also. but not yet widespread on linux dists.
last but not least: sha512-256 is faster on 64bit CPUs than sha256.