2304 thread blocks on 68 SMs.
Problem description here.
Baseline approach with a memory access pattern that touches many cache lines while using only a small part of each, which leads to poor memory transaction coalescing (source).
4 thread blocks on 4 SMs.
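The problem itself is not spelled out here, so the following is only a minimal sketch of what a poorly coalesced baseline can look like, not the kernel from the linked source. It assumes a hypothetical task (summing each row of a row-major matrix) and hypothetical names (`row_sums_strided`, `run_row_sums_strided`). Each thread walks its own row, so at any loop step the threads of a warp read addresses that are `width` floats apart.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical baseline kernel: one thread sums one row of a row-major matrix.
// At loop step c, thread r reads in[r * width + c], so neighbouring threads of
// a warp touch addresses that are `width` floats apart.
__global__ void row_sums_strided(const float* in, float* out, int height, int width) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= height) return;
    float acc = 0.0f;
    for (int c = 0; c < width; ++c) {
        acc += in[r * width + c];
    }
    out[r] = acc;
}

// Host helper (error checking omitted for brevity): fills the matrix with
// ones, runs the kernel, and returns one sum per row.
std::vector<float> run_row_sums_strided(int height, int width) {
    std::vector<float> h_in(static_cast<size_t>(height) * width, 1.0f);
    std::vector<float> h_out(height, 0.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 128;
    int blocks = (height + threads - 1) / threads;
    row_sums_strided<<<blocks, threads>>>(d_in, d_out, height, width);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The result is correct, but each warp-wide load is spread over many separate memory segments, which is the behaviour the baseline caption describes.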
Slightly adjusted access pattern in which the threads of each warp access consecutive memory addresses, leading to fewer, wider memory transactions (source).
4 thread blocks on 4 SMs.
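As a hedged illustration of the coalesced variant, the sketch below assumes a hypothetical task (summing each column of a row-major matrix) and hypothetical names (`col_sums_coalesced`, `run_col_sums_coalesced`). It is not the linked source. Each thread owns one column, so at any loop step the threads of a warp read neighbouring floats from the same row.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical coalesced kernel: one thread sums one column of a row-major
// matrix. At loop step r, thread c reads in[r * width + c], so the threads of
// a warp read consecutive addresses.
__global__ void col_sums_coalesced(const float* in, float* out, int height, int width) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= width) return;
    float acc = 0.0f;
    for (int r = 0; r < height; ++r) {
        acc += in[r * width + c];
    }
    out[c] = acc;
}

// Host helper (error checking omitted for brevity): element (r, c) holds the
// value c, so the sum of column c is expected to be height * c.
std::vector<float> run_col_sums_coalesced(int height, int width) {
    std::vector<float> h_in(static_cast<size_t>(height) * width);
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            h_in[static_cast<size_t>(r) * width + c] = static_cast<float>(c);
    std::vector<float> h_out(width, 0.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 128;
    int blocks = (width + threads - 1) / threads;
    col_sums_coalesced<<<blocks, threads>>>(d_in, d_out, height, width);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The only change relative to a strided layout is which index varies across threads: here the fastest-varying memory index is tied to `threadIdx.x`.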
Reduced number of memory accesses by reusing data held in registers (source). The input data has been copied and transposed so that both row-wise and column-wise accesses follow a linear memory access pattern.
1 thread block on 1 SM.
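A minimal sketch of both ideas, under stated assumptions: the task (counting how often a value exceeds its predecessor along a scan line) and the names `count_rises`, `scan_rises`, and `transpose_copy` are hypothetical and stand in for whatever the linked source computes. The kernel keeps the previous element in a register instead of reloading it, and a transposed copy of the input lets the same coalesced kernel serve the row-wise direction.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: thread c scans column c from top to bottom and counts
// positions where the value exceeds its predecessor. `prev` lives in a
// register, so each element is loaded from global memory only once.
__global__ void count_rises(const int* in, int* out, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= cols) return;
    int prev = in[c];  // element (0, c)
    int rises = 0;
    for (int r = 1; r < rows; ++r) {
        int cur = in[r * cols + c];  // consecutive across the warp
        rises += (cur > prev) ? 1 : 0;
        prev = cur;                  // reuse instead of re-reading in[(r-1)*cols + c]
    }
    out[c] = rises;
}

// Host-side transposed copy: result has `cols` rows and `rows` columns.
std::vector<int> transpose_copy(const std::vector<int>& m, int rows, int cols) {
    std::vector<int> t(m.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            t[static_cast<size_t>(c) * rows + r] = m[static_cast<size_t>(r) * cols + c];
    return t;
}

// Host helper (error checking omitted): runs the column scan on a rows x cols
// matrix and returns one count per column.
std::vector<int> scan_rises(const std::vector<int>& m, int rows, int cols) {
    std::vector<int> h_out(cols, 0);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, m.size() * sizeof(int));
    cudaMalloc(&d_out, h_out.size() * sizeof(int));
    cudaMemcpy(d_in, m.data(), m.size() * sizeof(int), cudaMemcpyHostToDevice);
    int threads = 128;
    int blocks = (cols + threads - 1) / threads;
    count_rises<<<blocks, threads>>>(d_in, d_out, rows, cols);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `scan_rises(m, rows, cols)` covers the column-wise direction, and `scan_rises(transpose_copy(m, rows, cols), cols, rows)` covers the row-wise direction with the same linear access pattern, at the cost of storing the input twice.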
Buffering memory accesses through shared memory (source).
1 thread block on 1 SM.
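To make the shared-memory idea concrete, here is a generic sketch that is not taken from the linked source: a tiled matrix transpose, chosen only because it is a compact, well-known case of staging data in shared memory. The names `transpose_tiled` and `run_transpose_tiled` and the tile size of 32 are assumptions. Each block loads a tile with coalesced reads, synchronizes, and then writes the tile back transposed, again with coalesced writes.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed tile edge, one warp wide

// Hypothetical kernel: `in` is height x width (row-major), `out` is
// width x height. The tile is padded by one column to avoid shared-memory
// bank conflicts when it is read column-wise.
__global__ void transpose_tiled(const float* in, float* out, int height, int width) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    }
    __syncthreads();  // every thread reaches this barrier
    int tx = blockIdx.y * TILE + threadIdx.x;  // output column (= input row)
    int ty = blockIdx.x * TILE + threadIdx.y;  // output row (= input column)
    if (tx < height && ty < width) {
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }
}

// Host helper (error checking omitted): element i of the input holds the
// value i; returns the transposed matrix.
std::vector<float> run_transpose_tiled(int height, int width) {
    std::vector<float> h_in(static_cast<size_t>(height) * width);
    for (size_t i = 0; i < h_in.size(); ++i) h_in[i] = static_cast<float>(i);
    std::vector<float> h_out(h_in.size(), 0.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, height, width);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The irregular part of the access pattern is moved into shared memory, which is on-chip and far less sensitive to access order than global memory.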