Faster (by a magnitude) OpenCL AES code #5594
I had a closer look at our current code now. It really is bitsliced, and here's how:

```c
/*
* This constant-time implementation is "bitsliced": the 128-bit state is
* split over eight 32-bit words q* in the following way:
*
* -- Input block consists in 16 bytes:
* a00 a10 a20 a30 a01 a11 a21 a31 a02 a12 a22 a32 a03 a13 a23 a33
* In the terminology of FIPS 197, this is a 4x4 matrix which is read
* column by column.
*
* -- Each byte is split into eight bits which are distributed over the
* eight words, at the same rank. Thus, for a byte x at rank k, bit 0
* (least significant) of x will be at rank k in q0 (if that bit is b,
* then it contributes "b << k" to the value of q0), bit 1 of x will be
* at rank k in q1, and so on.
*
* -- Ranks given to bits are in "row order" and are either all even, or
* all odd. Two independent AES states are thus interleaved, one using
* the even ranks, the other the odd ranks. Row order means:
* a00 a01 a02 a03 a10 a11 a12 a13 a20 a21 a22 a23 a30 a31 a32 a33
*
* Converting input bytes from two AES blocks to bitslice representation
* is done in the following way:
* -- Decode first block into the four words q0 q2 q4 q6, in that order,
* using little-endian convention.
* -- Decode second block into the four words q1 q3 q5 q7, in that order,
* using little-endian convention.
* -- Call aes_ct_ortho().
*
* Converting back to bytes is done by using the reverse operations. Note
* that aes_ct_ortho() is its own inverse.
*/
/*
* The AES S-box, as a bitsliced constant-time version. The input array
* consists in eight 32-bit words; 32 S-box instances are computed in
* parallel. Bits 0 to 7 of each S-box input (bit 0 is least significant)
* are spread over the words 0 to 7, at the same rank.
 */
```
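To make that layout concrete, here's a minimal C sketch of the two-block load described above (the dec32le() helper is mine; the aes_ct_ortho() prototype is an assumption based on the quoted comment, which looks like BearSSL's constant-time AES):

```c
#include <stdint.h>

/* Assumed to exist elsewhere (referenced in the comment above); it
 * moves each bit to its even/odd rank and is its own inverse. */
extern void aes_ct_ortho(uint32_t *q);

static uint32_t dec32le(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
	     | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Load two 16-byte blocks into the bitsliced state q[0..7]:
 * block 0 goes to the even words (q0 q2 q4 q6), block 1 to the
 * odd words (q1 q3 q5 q7), exactly as the comment describes. */
static void load_two_blocks(uint32_t q[8],
                            const uint8_t blk0[16], const uint8_t blk1[16])
{
	int i;

	for (i = 0; i < 4; i++) {
		q[2 * i]     = dec32le(blk0 + 4 * i);
		q[2 * i + 1] = dec32le(blk1 + 4 * i);
	}
	aes_ct_ortho(q);
}
```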
Since it computes 32 S-box instances in parallel, it can encrypt or decrypt two blocks in parallel. I tried it now with the KeePass format. The kernel makes one AES-ECB call of 32 bytes per iteration. If I instead make two calls of one block at a time, speed more than halves, e.g. from 11636 to 5314 c/s on a GTX 1080 at cost 60000. Some cipher modes can't utilize this though: for example, CBC encryption can't do two blocks in parallel because of its design, while CBC decryption can. I'm currently playing with putting parts in local memory, just to see how fast I can make it before having a closer look at the CUDA code mentioned above.
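A rough sketch of what that call-granularity difference amounts to (the aes_ecb_encrypt() name and key type are illustrative stand-ins, not the actual shared OpenCL AES API):

```c
#include <stdint.h>

/* Hypothetical prototype standing in for the shared OpenCL AES code;
 * only the call granularity matters here. */
typedef struct { uint32_t rk[60]; } aes_key_t;
void aes_ecb_encrypt(const aes_key_t *key, uint8_t *buf, uint32_t nbytes);

void kdf_round_slow(const aes_key_t *key, uint8_t buf[32])
{
	/* Two single-block calls: the bitsliced core only ever sees one
	 * block, so half of its two-block state does no useful work. */
	aes_ecb_encrypt(key, buf, 16);
	aes_ecb_encrypt(key, buf + 16, 16);
}

void kdf_round_fast(const aes_key_t *key, uint8_t buf[32])
{
	/* One 32-byte call lets both blocks share the same bitsliced
	 * S-box evaluations (the >2x difference reported above). */
	aes_ecb_encrypt(key, buf, 32);
}
```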
Ensure AES_cbc_decrypt() utilizes the parallel capability of the bitsliced AES code when more than one block is decrypted at once. For encryption this is not possible as the IV for block n is unknown until block n-1 has been encrypted. The potential boost is great (over 2x seen with AES-ECB) but none of our current formats use AES-CBC enough for this to make a significant difference, at least not with their (often small) test vectors. See openwall#5594.
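A minimal sketch of the CBC-decrypt parallelism mentioned in that commit message (aes_decrypt_2blocks() is a hypothetical stand-in for the bitsliced two-block core):

```c
#include <stdint.h>

/* Hypothetical two-block ECB-style decrypt; the bitsliced core runs
 * both blocks through the same pass. */
void aes_decrypt_2blocks(const void *key, const uint8_t in[32], uint8_t out[32]);

/* CBC decryption of two blocks at once: both block-cipher calls can run
 * in parallel because the chaining value for block n is just ciphertext
 * block n-1, which is already known. CBC *encryption* cannot do this,
 * since the input to block n depends on the ciphertext of block n-1.
 * ct and pt must not overlap. */
void cbc_decrypt_2blocks(const void *key, const uint8_t iv[16],
                         const uint8_t ct[32], uint8_t pt[32])
{
	int i;

	aes_decrypt_2blocks(key, ct, pt);     /* both blocks, one bitsliced pass */
	for (i = 0; i < 16; i++) {
		pt[i]      ^= iv[i];          /* P0 = D(C0) ^ IV */
		pt[16 + i] ^= ct[i];          /* P1 = D(C1) ^ C0 */
	}
}
```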
BTW according to https://hashcat.net/forum/thread-8237.html, hashcat's keepass apparently decrypts at 42 GB/s on a 2080ti. That should mean about 30 GB/s on a 1080, so hashcat is ~50% faster.
Reading it again, I realized they're talking gigabits per second, not gigabytes. So they got 878.6 Gbps on a 2070S, which is 109 GB/s. That GPU is a tad slower than our old 1080, so it should correspond to 118 GB/s, give or take. Still, that is 5.75x our current speed of 20.5 GB/s or 164 Gb/s.
Tezcan's work is under a Creative Commons Attribution-NonCommercial-NoDerivatives license though. So we can't use it (unless we get permission to).
Furthermore, assuming AES256 is 40% slower than AES128, 109 GB/s of AES128 corresponds to 65 GB/s of AES256. That makes our current code about 1/3 as fast as that whitepaper, and hashcat's about 45% of it. Looks like it's going to be hard to achieve a magnitude :)
Quoting from the whitepaper: "The GPU architecture allows limited number of registers per thread block and 100% occupancy can be achieved on modern GPUs only when a thread uses 64 or less 32-bit registers. However, since a bitsliced implementation requires at least 128 registers for every bit of the 128-bit block, a bitsliced implementation cannot fully occupy the GPU. These bitsliced implementations can provide better results on future GPUs if they come with more registers per thread block". Not sure if that has changed with later GPUs. Also, the fact that the bitsliced code can sometimes compute two blocks in parallel matters here.
This revised version pushes 1.4 Tbps of AES-128 decryption (axcrypt) or 914 Gbps of AES-256 encryption (keepass) on a 4070ti. The bitsliced code we defaulted to before is really good but it's register-hungry. It has some merits when two or more blocks are encrypted/decrypted at once (does two in parallel) but is still slower than table-based now. We can still opt in to use it. See openwall#5594
Closes openwall#5594

-test -format:@aes,+opencl

Relbench best five (GTX 1080, ancient driver):
Ratio: 3.12824 real, 3.13188 virtual  axcrypt-opencl:Raw
Ratio: 2.16188 real, 2.16309 virtual  axcrypt2-opencl, AxCrypt 2.x:Raw
Ratio: 1.67273 real, 1.67273 virtual  pgpwde-opencl, PGP Whole Disk Encryption:Raw
Ratio: 1.32885 real, 1.32876 virtual  o5logon-opencl, Oracle O5LOGON protocol:Raw
Ratio: 1.23904 real, 1.24523 virtual  strip-opencl, Password Manager:Raw
This revised version pushes 1.4 Tbps of AES-128 decryption (axcrypt) or 914 Gbps of AES-256 encryption (keepass) on a 4070ti. The bitsliced code we defaulted to before is really good but it's register-hungry. It has some merits when two or more blocks are encrypted/decrypted at once (does two in parallel) but is still slower than table-based now. We can still opt in to use it. Note: This commit switches all formats to table-based AES without actually enabling the copying to local until the next commit, where all formats are adapted to use it. This very commit thus makes for a performance regression. See openwall#5594
Enable local memory for table-based AES. Closes openwall#5594. Bitlocker format is not affected as it has its own implementation, but AES performance is insignificant for it anyway.
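For context, the local-memory idea amounts to something like the following OpenCL sketch (table name, kernel shape and contents are illustrative assumptions, not the actual shared opencl_aes code):

```c
/* First Te0 entry shown; the remaining 255 entries are omitted here. */
__constant uint Te0_global[256] = { 0xc66363a5u /* , ... */ };

__kernel void crack(__global const uchar *in /*, ... */)
{
	__local uint Te0[256];
	uint lid = get_local_id(0);
	uint lws = get_local_size(0);

	/* Each work-item copies a stripe of the table, then all wait. */
	for (uint i = lid; i < 256; i += lws)
		Te0[i] = Te0_global[i];
	barrier(CLK_LOCAL_MEM_FENCE);

	/* ... the table-based AES rounds now index Te0[] in local memory,
	 * avoiding repeated constant/global memory traffic ... */
}
```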
A paper related to bitslice AES (which I think we still have, even if now disabled by default): https://eprint.iacr.org/2024/1996
#5591 (comment)
The default KeePass benchmark includes old, low-iteration test vectors. To see modern, real-life cost, edit opencl_keepass_fmt_plug.c, change the define of KEEPASS_REAL_COST_TEST_VECTORS to 1, rebuild, and benchmark with
-cost=31250000
(it will take a looong time). Each KDF-AES round is 32 bytes of AES-256, so at this cost it will encrypt exactly 1 GB with AES-256 per candidate - great for testing AES performance. Our shared OpenCL AES comes in two flavors. Using the "bitsliced AES" (default for GPU - I kinda doubt bitsliced is a correct description but it does some things in parallel), Super's 1080 does 20.5 GB/s, which I believe could be improved with ~~at least 10x~~ over 5x (and forcing it to run our older, even worse, vanilla AES, it's down to 1.15 GB/s, which is essentially useless for anything but decrypting a short verifier). The faster GPU code (CUDA) is at https://www.github.com/cihangirtezcan/CUDA_AES and there's also a freely available PDF.
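For reference, the edit mentioned above amounts to roughly this (the exact surrounding context in opencl_keepass_fmt_plug.c is an assumption):

```c
/* opencl_keepass_fmt_plug.c: switch to the high-iteration test vectors
 * (the default value and exact location are assumptions) */
#define KEEPASS_REAL_COST_TEST_VECTORS 1  /* 1 enables real-life cost vectors */
```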