Inefficient integer multiplications #16
AFAIK, this is only true for very old CUDA compute versions, thus irrelevant for mining.
V_MUL_LO_U32, V_MUL_HI_U32 = 16 clock cycles on AMD Vega and RX GPUs
This is an intentional tradeoff for simplicity. Note that for mining latency doesn't matter; it's all about throughput.

AMD Vega and Polaris have V_MUL_LO_U32 and V_MUL_U32_U24. The 32-bit multiply has 4x longer latency than the 24-bit multiply; I haven't been able to find what the throughput difference is. Nvidia Pascal has XMAD, a 16-bit-input, 32-bit-output multiply-add, and a 3-instruction XMAD sequence is used to do a full 32-bit multiply.

There's no full-speed, single-instruction common denominator between the two. Doing either 24-bit or 16-bit multiplies would be full speed on one architecture and require extra emulation code on the other. Using normal 32-bit multiplies is a good compromise that's simple and works reasonably on both architectures.

This is one of the small optimizations that a theoretical ProgPoW ASIC could do to be 10-20% more efficient than existing GPUs. The inner ProgPoW loop expands to about 200 instructions on Pascal; this optimization would on average save fewer than 10 of them.
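To make the XMAD point concrete, here is a rough sketch (mine, not from the thread) of how the low 32 bits of a 32x32-bit multiply decompose into 16-bit-input multiply-adds. The function name is illustrative and this is plain C-level CUDA, not the actual SASS sequence the compiler emits:

```cuda
// Illustrative only: a 32x32-bit low multiply built from 16-bit partial
// products, roughly what a Pascal XMAD sequence has to do.
__device__ unsigned int mul32_via_16bit(unsigned int a, unsigned int b)
{
    unsigned int a_lo = a & 0xFFFFu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xFFFFu, b_hi = b >> 16;

    // Three 16x16 partial products are enough for the low 32 bits;
    // the a_hi * b_hi term only affects the high 32 bits and is dropped.
    unsigned int p0 = a_lo * b_lo;   // bits 0..31
    unsigned int p1 = a_lo * b_hi;   // contributes to bits 16..31
    unsigned int p2 = a_hi * b_lo;   // contributes to bits 16..31

    return p0 + ((p1 + p2) << 16);
}
```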
Digging into the profiling data behind our Understanding ProgPoW blog post, I just noticed that Turing appears to have a full-speed IMAD instruction. This allows the inner loop to be about 150 instructions, mostly because the *33 from merge() takes a single IMAD instead of 2 XMADs. So a ProgPoW ASIC with this optimization would simply match a GPU you can buy today.
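For context, a minimal sketch of the merge step being discussed, assuming the `(a * 33) + b` form from the ProgPoW spec (the exact case selection and rotation variants are omitted here). On Turing this maps to a single integer multiply-add (IMAD); on Pascal the 32-bit multiply is expanded into the XMAD sequence shown above:

```cuda
// Sketch of one ProgPoW merge case (assumed form): multiply-by-33 then add.
__device__ unsigned int merge_mul33(unsigned int a, unsigned int b)
{
    return (a * 33u) + b;   // one IMAD on Turing; multiple XMADs on Pascal
}
```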
Closing issue since this isn't going to be changed.
An ASIC will be 4 times more efficient with these two operations, because a and b are 32-bit integers:
32-bit integer multiplications are inefficient on GPUs because GPUs only have a 24-bit wide data path for multiplication. A 32-bit MUL is 4 times slower than a 24-bit MUL. It's better to use mul24 here.
Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.
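As a hedged illustration of the suggestion (not code from the repo), the contrast between the 24-bit and 32-bit multiply paths in CUDA looks like this. `__umul24` returns the low 32 bits of a 24x24-bit product, so it is only correct when both operands fit in 24 bits, which is exactly why it cannot simply replace the full 32-bit multiplies ProgPoW uses:

```cuda
// Fast path: valid only when a, b < 2^24; the top 8 bits of each input are ignored.
__device__ unsigned int mul_lo_24(unsigned int a, unsigned int b)
{
    return __umul24(a, b);
}

// What ProgPoW actually needs: a full 32-bit multiply on arbitrary 32-bit operands.
__device__ unsigned int mul_lo_32(unsigned int a, unsigned int b)
{
    return a * b;
}
```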