Inefficient integer multiplications #16
AFAIK, this is only true for very old CUDA compute versions, thus irrelevant for mining.
V_MUL_LO_U32, V_MUL_HI_U32 = 16 clock cycles on AMD Vega and RX GPUs
This is an intentional tradeoff for simplicity. Note that for mining latency doesn't matter; it's all about throughput.

AMD Vega and Polaris have V_MUL_LO_U32 and V_MUL_U32_U24. The 32-bit multiply has 4x longer latency than the 24-bit multiply; I haven't been able to find what the throughput difference is. Nvidia Pascal has XMAD, a 16-bit-input, 32-bit-output multiply-add, and a 3-instruction XMAD sequence is used to do a full 32-bit multiply.

There's no full-speed, single-instruction common denominator between the two. Doing either 24-bit or 16-bit multiplies would be full speed on one architecture and require extra emulation code on the other. Using normal 32-bit multiplies is a good compromise that's simple and works reasonably on both architectures.

This is one of the small optimizations that a theoretical ProgPoW ASIC could do to be 10-20% more efficient than existing GPUs. The inner ProgPoW loop expands to about 200 instructions on Pascal; this optimization would on average save fewer than 10 of them.
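To make the XMAD point concrete, here is a rough sketch (mine, not from the thread) of how the low 32 bits of a 32x32-bit multiply decompose into 16-bit-input multiply-adds. The function name is illustrative and this is plain C-level CUDA, not the actual SASS sequence the compiler emits:

```cuda
// Illustrative only: a 32x32-bit low multiply built from 16-bit partial
// products, roughly what a Pascal XMAD sequence has to do.
__device__ unsigned int mul32_via_16bit(unsigned int a, unsigned int b)
{
    unsigned int a_lo = a & 0xFFFFu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xFFFFu, b_hi = b >> 16;

    // Three 16x16 partial products are enough for the low 32 bits;
    // the a_hi * b_hi term only affects the high 32 bits and is dropped.
    unsigned int p0 = a_lo * b_lo;   // bits 0..31
    unsigned int p1 = a_lo * b_hi;   // contributes to bits 16..31
    unsigned int p2 = a_hi * b_lo;   // contributes to bits 16..31

    return p0 + ((p1 + p2) << 16);
}
```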
Digging into the profiling data behind our Understanding ProgPoW blog post, I just noticed that Turing appears to have a full-speed IMAD instruction. This allows the inner loop to be about 150 instructions, mostly because the *33 from merge() takes a single IMAD instead of 2 XMADs. So a ProgPoW ASIC with this optimization would simply match a GPU you can buy today.
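For context, a minimal sketch of the merge step being discussed, assuming the `(a * 33) + b` form from the ProgPoW spec (the exact case selection and rotation variants are omitted here). On Turing this maps to a single integer multiply-add (IMAD); on Pascal the 32-bit multiply is expanded into the XMAD sequence shown above:

```cuda
// Sketch of one ProgPoW merge case (assumed form): multiply-by-33 then add.
__device__ unsigned int merge_mul33(unsigned int a, unsigned int b)
{
    return (a * 33u) + b;   // one IMAD on Turing; multiple XMADs on Pascal
}
```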
Closing issue since this isn't going to be changed.
An ASIC will be 4 times more efficient with these two operations, because a and b are 32-bit integers:
32-bit integer multiplications are inefficient on GPUs because GPUs only have a 24-bit wide data path for multiplication. A 32-bit MUL is 4 times slower than a 24-bit MUL. It's better to use mul24 here.
Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.
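As a hedged illustration of the suggestion (not code from the repo), the contrast between the 24-bit and 32-bit multiply paths in CUDA looks like this. `__umul24` returns the low 32 bits of a 24x24-bit product, so it is only correct when both operands fit in 24 bits, which is exactly why it cannot simply replace the full 32-bit multiplies ProgPoW uses:

```cuda
// Fast path: valid only when a, b < 2^24; the top 8 bits of each input are ignored.
__device__ unsigned int mul_lo_24(unsigned int a, unsigned int b)
{
    return __umul24(a, b);
}

// What ProgPoW actually needs: a full 32-bit multiply on arbitrary 32-bit operands.
__device__ unsigned int mul_lo_32(unsigned int a, unsigned int b)
{
    return a * b;
}
```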