
Faster (by a magnitude) OpenCL AES code #5594

Closed
magnumripper opened this issue Nov 30, 2024 · 7 comments · Fixed by #5613

@magnumripper
Member

magnumripper commented Nov 30, 2024

#5591 (comment)

The default KeePass benchmark includes old low-iteration test vectors. To see modern, real-life cost, edit opencl_keepass_fmt_plug.c, change the define of KEEPASS_REAL_COST_TEST_VECTORS to 1, rebuild, and benchmark with -cost=31250000 (it will take a looong time).

Each KDF-AES round is 32 bytes of AES-256, so at this cost it will encrypt exactly 1 GB with AES-256 per candidate - great for testing AES performance. Our shared OpenCL AES comes in two flavors. Using the "bitsliced AES" (default for GPU - I kinda doubt bitsliced is a correct description, but it does some things in parallel), Super's 1080 does 20.5 GB/s, which I believe could be improved by at least 5x, maybe 10x. Forcing it to run our older, even worse, vanilla AES, it's down to 1.15 GB/s, which is essentially useless for anything but decrypting a short verifier.
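As a quick sanity check of that 1 GB figure (a sketch; only the numbers from the paragraph above go in):

```python
rounds = 31_250_000        # the -cost value used for benchmarking above
bytes_per_round = 32       # each KDF-AES round is 32 bytes (two AES-256 blocks)

total_bytes = rounds * bytes_per_round
print(total_bytes)         # 1000000000 bytes, i.e. exactly 1 GB per candidate
```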

The faster GPU code (CUDA) is at https://www.github.com/cihangirtezcan/CUDA_AES and there's also a freely available PDF.

@magnumripper magnumripper self-assigned this Nov 30, 2024
@magnumripper magnumripper changed the title Faster OpenCL AES code Faster (by a magnitude) OpenCL AES code Nov 30, 2024
@magnumripper
Member Author

magnumripper commented Dec 4, 2024

Our shared OpenCL AES comes in two flavors. Using the "bitsliced AES" (default for GPU - I kinda doubt bitsliced is a correct description

I had a closer look at our current code now. It really is bitsliced, and here's how:

/*
 * This constant-time implementation is "bitsliced": the 128-bit state is
 * split over eight 32-bit words q* in the following way:
 *
 * -- Input block consists in 16 bytes:
 *    a00 a10 a20 a30 a01 a11 a21 a31 a02 a12 a22 a32 a03 a13 a23 a33
 * In the terminology of FIPS 197, this is a 4x4 matrix which is read
 * column by column.
 *
 * -- Each byte is split into eight bits which are distributed over the
 * eight words, at the same rank. Thus, for a byte x at rank k, bit 0
 * (least significant) of x will be at rank k in q0 (if that bit is b,
 * then it contributes "b << k" to the value of q0), bit 1 of x will be
 * at rank k in q1, and so on.
 *
 * -- Ranks given to bits are in "row order" and are either all even, or
 * all odd. Two independent AES states are thus interleaved, one using
 * the even ranks, the other the odd ranks. Row order means:
 *    a00 a01 a02 a03 a10 a11 a12 a13 a20 a21 a22 a23 a30 a31 a32 a33
 *
 * Converting input bytes from two AES blocks to bitslice representation
 * is done in the following way:
 * -- Decode first block into the four words q0 q2 q4 q6, in that order,
 * using little-endian convention.
 * -- Decode second block into the four words q1 q3 q5 q7, in that order,
 * using little-endian convention.
 * -- Call aes_ct_ortho().
 *
 * Converting back to bytes is done by using the reverse operations. Note
 * that aes_ct_ortho() is its own inverse.
 */

/*
 * The AES S-box, as a bitsliced constant-time version. The input array
 * consists in eight 32-bit words; 32 S-box instances are computed in
 * parallel. Bits 0 to 7 of each S-box input (bit 0 is least significant)
 * are spread over the words 0 to 7, at the same rank.
 */

but it does some things in parallel)

Since it computes 32 S-box instances in parallel, it can encrypt or decrypt two blocks in parallel. I tried it now with the KeePass format. The kernel makes one AES-ECB call of 32 bytes per iteration. If I instead make two calls of one block at a time, speed more than halves, e.g. from 11636 to 5314 c/s on a GTX 1080 at cost 60000.
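For illustration, here is a naive Python sketch of the layout described in the quoted comments: two 16-byte blocks packed into eight 32-bit words, bit i of every byte going into word q[i], with even/odd rank interleaving in row order. This is a readable reference only (the function names are mine); the real aes_ct_ortho() produces its layout with a handful of swap/mask steps rather than a per-bit loop:

```python
def bitslice(block0, block1):
    """Pack two 16-byte blocks into eight 32-bit words, bit i -> q[i]."""
    q = [0] * 8
    for blk, parity in ((block0, 0), (block1, 1)):
        for j, byte in enumerate(blk):
            r, c = j % 4, j // 4              # input is column-major (FIPS 197)
            rank = 2 * (4 * r + c) + parity   # row-order rank, even/odd interleave
            for i in range(8):
                q[i] |= ((byte >> i) & 1) << rank
    return q

def unbitslice(q):
    """Inverse of bitslice(): recover the two 16-byte blocks."""
    blocks = [bytearray(16), bytearray(16)]
    for parity in (0, 1):
        for j in range(16):
            r, c = j % 4, j // 4
            rank = 2 * (4 * r + c) + parity
            b = 0
            for i in range(8):
                b |= ((q[i] >> rank) & 1) << i
            blocks[parity][j] = b
    return bytes(blocks[0]), bytes(blocks[1])
```

Round-tripping through both functions recovers the original two blocks, and one S-box evaluation on q0..q7 then operates on all 32 bytes (both blocks) at once.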

Some cipher modes can't utilize this, though. For example, CBC encryption can't do two blocks in parallel because of its chaining design, while CBC decryption can (our current code didn't, until fixed with #5606).
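The data-dependence difference can be sketched with a toy block cipher (the byte-shift "cipher" and the names below are mine, just to show the structure; AES slots in the same way): CBC encryption must chain block by block, while CBC decryption can run all block-cipher calls independently and apply the XORs afterwards.

```python
BLOCK = 16

def toy_enc(b):  # stand-in for one AES block encryption (NOT a real cipher)
    return bytes((x + 1) & 0xFF for x in b)

def toy_dec(b):  # inverse of toy_enc
    return bytes((x - 1) & 0xFF for x in b)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(iv, pt):
    # Serial by design: block n needs the ciphertext of block n-1 as its IV
    out, prev = [], iv
    for i in range(0, len(pt), BLOCK):
        prev = toy_enc(xor(pt[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)

def cbc_decrypt_parallel(iv, ct):
    # All block-cipher calls are independent: D(C_i) can run in parallel;
    # the XOR with C_{i-1} (or the IV) is a cheap fixup afterwards
    blocks = [ct[i:i + BLOCK] for i in range(0, len(ct), BLOCK)]
    decrypted = [toy_dec(b) for b in blocks]   # parallelizable part
    prevs = [iv] + blocks[:-1]
    return b"".join(xor(d, p) for d, p in zip(decrypted, prevs))
```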

I'm currently playing with putting parts in local memory, just to see how fast I can make it before having a closer look at the CUDA code mentioned above.

magnumripper added a commit to magnumripper/john that referenced this issue Dec 4, 2024
Ensure AES_cbc_decrypt() utilizes the parallel capability of the bitsliced
AES code when more than one block is decrypted at once.  For encryption this
is not possible as the IV for block n is unknown until block n-1 has been
encrypted.

The potential boost is great (over 2x seen with AES-ECB) but none of our
current formats use AES-CBC enough for this to make a significant
difference, at least not with their (often small) test vectors.

See openwall#5594.
solardiz pushed a commit that referenced this issue Dec 5, 2024
Ensure AES_cbc_decrypt() utilizes the parallel capability of the bitsliced
AES code when more than one block is decrypted at once.  For encryption this
is not possible as the IV for block n is unknown until block n-1 has been
encrypted.

The potential boost is great (over 2x seen with AES-ECB) but none of our
current formats use AES-CBC enough for this to make a significant
difference, at least not with their (often small) test vectors.

See #5594.
@magnumripper
Member Author

BTW, according to https://hashcat.net/forum/thread-8237.html, hashcat's KeePass format apparently decrypts at 42 GB/s on a 2080 Ti. That should mean about 30 GB/s on a 1080, so hashcat is ~50% faster.

@magnumripper
Member Author

The faster GPU code (CUDA) is at https://www.github.com/cihangirtezcan/CUDA_AES and there's also a freely available PDF.

Reading it again, I realized they're talking gigabits per second, not gigabytes. So they got 878.6 Gbps on a 2070S, which is 109 GB/s. That GPU is a tad slower than our old 1080, so it should correspond to 118 GB/s, give or take. Still, that is 5.75x our current speed of 20.5 GB/s, or 164 Gbps.

@magnumripper
Member Author

The faster GPU code (CUDA) is at https://www.github.com/cihangirtezcan/CUDA_AES and there's also a freely available PDF.

Tezcan's work is under a Creative Commons Attribution-NonCommercial-NoDerivatives license though. So we can't use it (unless we get permission to).

@magnumripper
Member Author

Reading it again, I realized they talk gigabits/s and not gigabytes/s. So they got 878.6 Gbps on a 2070S, which is 109 GB/s. That GPU is a tad slower than our old 1080 so it should correspond to 118 GB/s, give or take. Still, that is 5.75x our current speed of 20.5 GB/s or 164 Gb/s.

Furthermore, assuming AES-256 is 40% slower than AES-128, 109 GB/s of AES-128 corresponds to 65 GB/s of AES-256. That makes our current code about 1/3 as fast as that whitepaper, and hashcat's about 45% of it. Looks like it's going to be hard to achieve an order of magnitude :)
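Writing the conversions out (the 40% AES-256 penalty is the assumption stated above):

```python
paper_gbps = 878.6               # AES-128 on a 2070S, from the paper (gigabits/s)
paper_gbytes = paper_gbps / 8    # gigabits/s -> gigabytes/s, ~109.8
aes256_est = paper_gbytes * 0.6  # assume AES-256 runs at 60% of AES-128 speed
ours = 20.5                      # our current AES-256 GB/s on the 1080

# ~109.8 GB/s AES-128, ~65.9 GB/s AES-256, and our ratio ~0.31 (about 1/3)
print(round(paper_gbytes, 1), round(aes256_est, 1), round(ours / aes256_est, 2))
```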

@magnumripper
Member Author

magnumripper commented Dec 6, 2024

Quoting from the whitepaper: "The GPU architecture allows limited number of registers per thread block and 100% occupancy can be achieved on modern GPUs only when a thread uses 64 or less 32-bit registers. However, since a bitsliced implementation requires at least 128 register for every bit of the 128-bit block, a bitsliced implementation cannot fully occupy the GPU. These bitsliced implementations can provide better results on future GPUs if they come with more registers per thread block". Not sure if that has changed with later GPUs. Also, the fact that the bitsliced code can sometimes compute two blocks in parallel matters here.

magnumripper added a commit to magnumripper/john that referenced this issue Dec 16, 2024
This revised version pushes 1.4 Tbps of AES-128 decryption (axcrypt) or
914 Gbps of AES-256 encryption (keepass) on a 4070ti.

The bitsliced code we defaulted to before is really good but it's register
hungry. It has some merits when two or more blocks are encrypted/decrypted
at once (does two in parallel) but still is slower than table based now. We
can still opt in to use it.

See openwall#5594
magnumripper added a commit to magnumripper/john that referenced this issue Dec 16, 2024
Closes openwall#5594

-test -format:@aes,+opencl

Relbench best five (GTX 1080, ancient driver):
Ratio:	3.12824 real, 3.13188 virtual	axcrypt-opencl:Raw
Ratio:	2.16188 real, 2.16309 virtual	axcrypt2-opencl, AxCrypt 2.x:Raw
Ratio:	1.67273 real, 1.67273 virtual	pgpwde-opencl, PGP Whole Disk Encryption:Raw
Ratio:	1.32885 real, 1.32876 virtual	o5logon-opencl, Oracle O5LOGON protocol:Raw
Ratio:	1.23904 real, 1.24523 virtual	strip-opencl, Password Manager:Raw
magnumripper added a commit to magnumripper/john that referenced this issue Dec 17, 2024
This revised version pushes 1.4 Tbps of AES-128 decryption (axcrypt) or
914 Gbps of AES-256 encryption (keepass) on a 4070ti.

The bitsliced code we defaulted to before is really good but it's register
hungry. It has some merits when two or more blocks are encrypted/decrypted
at once (does two in parallel) but still is slower than table based now. We
can still opt in to use it.

Note: This commit switches all formats to table-based AES without actually
enabling the copying to local until next commit where all formats are adapted
to use it.  This very commit thus makes for a performance regression.

See openwall#5594
magnumripper added a commit to magnumripper/john that referenced this issue Dec 17, 2024
Enable local memory for table-based AES.  Closes openwall#5594

Bitlocker format is not affected as it has its own implementation, but
AES performance is insignificant for it anyway.
magnumripper added a commit that referenced this issue Dec 17, 2024
This revised version pushes 1.4 Tbps of AES-128 decryption (axcrypt) or
914 Gbps of AES-256 encryption (keepass) on a 4070ti.

The bitsliced code we defaulted to before is really good but it's register
hungry. It has some merits when two or more blocks are encrypted/decrypted
at once (does two in parallel) but still is slower than table based now. We
can still opt in to use it.

Note: This commit switches all formats to table-based AES without actually
enabling the copying to local until next commit where all formats are adapted
to use it.  This very commit thus makes for a performance regression.

See #5594
magnumripper added a commit that referenced this issue Dec 17, 2024
Enable local memory for table-based AES.  Closes #5594

Bitlocker format is not affected as it has its own implementation, but
AES performance is insignificant for it anyway.
@solardiz
Member

A paper related to bitslice AES (which I think we still have, even if now disabled by default): https://eprint.iacr.org/2024/1996
"A Framework for Generating S-Box Circuits with Boyar-Peralta Algorithm-Based Heuristics, and Its Applications to AES, SNOW3G, and Saturnin"
"Our results include depth improvements by about 40% [...] compared to the existing AES S-box"
