Why isn't there an a_small*b_small term in the equation? Is it always small enough in every case that it can be safely omitted? 3xTF32 looks awesome.
CUTLASS 2.8 was released on 11/19, its anniversary, and tagged recently. This release brings several exciting new features.
As announced at GTC, we released 3xTF32 GEMM, complex GEMM, and conv2d kernels. 3xTF32 is a technique to emulate FP32 accuracy but with 2x performance. The trick is to split each FP32 operand into a TF32 "big" part and a TF32 "small" residual, so that one FP32 MMA becomes three TF32 MMAs: a_big*b_big + a_big*b_small + a_small*b_big. It is useful for HPC/DL when FP32 is too slow or TF32 is not accurate enough. Feel free to try the SDK examples, which can check the accuracy and performance on your own problems.
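The figure with the decomposition did not carry over, so here is a minimal NumPy sketch of the 3xTF32 idea. This is an illustration of the math only, not the CUTLASS implementation, and `to_tf32` truncates the mantissa where real TF32 hardware rounds; the dropped a_small*b_small term is on the order of FP32 rounding error, which is why it can be safely omitted.

```python
import numpy as np

def to_tf32(x):
    """Emulate TF32 by zeroing the low 13 mantissa bits of FP32.
    TF32 keeps 10 of FP32's 23 mantissa bits; real hardware rounds,
    this sketch simply truncates."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFFE000)).view(np.float32)

def gemm_3xtf32(a, b):
    """One FP32 GEMM emulated with three TF32 GEMMs and FP32 accumulation.
    a_small @ b_small is dropped: both factors are ~2^-10 relative to the
    'big' parts, so the product is comparable to FP32 rounding error."""
    a_big = to_tf32(a)
    a_small = to_tf32(a - a_big)   # residual is exactly representable
    b_big = to_tf32(b)
    b_small = to_tf32(b - b_big)
    return a_big @ b_big + a_big @ b_small + a_small @ b_big
```

Comparing against an FP64 reference shows the three-product sum recovering most of the accuracy that a single TF32 product loses, at the cost of three tensor-core MMAs instead of one.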
3xTF32 kernels are supported in the CUTLASS profiler. The CMake command
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s1688gemm_1*,cutlass_tensorop_s1688gemm_2*,cutlass_tensorop_s1688gemm_6*
enables all 3xTF32 GEMM kernels. Changing s1688gemm to c1688gemm, s1688fprop, s1688dgrad, or s1688wgrad enables the 3xTF32 complex GEMM, fprop, dgrad, and wgrad kernels in the profiler.
Group GEMM is similar to batched GEMM, but places no restriction on the M/N/K dimensions between batches. Imagine its use in Transformer models. The SDK example provides profiling utilities.
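To make the distinction from batched GEMM concrete, here is a toy NumPy sketch of the grouped-GEMM concept (an illustration only; the CUTLASS kernel schedules all groups in one launch rather than looping on the host, and the shapes below are made up):

```python
import numpy as np

def grouped_gemm(problems):
    """Grouped GEMM sketch: each (A, B) pair may have its own M/N/K,
    unlike batched GEMM where every problem shares one shape."""
    return [a @ b for a, b in problems]

# Two groups with different M/N/K — illegal for batched GEMM,
# fine for grouped GEMM.
rng = np.random.default_rng(0)
problems = [
    (rng.standard_normal((8, 4)),   rng.standard_normal((4, 16))),   # M=8,  K=4,  N=16
    (rng.standard_normal((32, 64)), rng.standard_normal((64, 2))),   # M=32, K=64, N=2
]
outs = grouped_gemm(problems)
```

In a Transformer, for example, variable sequence lengths across requests naturally produce per-problem shapes like this.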
Mainloop scale+bias+ReLU fusion for fprop and wgrad. These per-channel elementwise operations are applied in the mainloop, before the MMA.
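A sketch of what the fused per-channel operation computes, in NumPy. This shows the math only; in CUTLASS it is applied to tiles inside the mainloop as data is loaded, not to whole tensors, and the NHWC layout here is an assumption for illustration:

```python
import numpy as np

def scale_bias_relu(x, scale, bias):
    """Per-channel scale + bias + ReLU on NHWC activations.
    x: (N, H, W, C); scale, bias: (C,), broadcast over the channel dim.
    Applied elementwise before the tensor-core MMA."""
    return np.maximum(x * scale + bias, 0.0)
```

Fusing this into the mainloop avoids a separate elementwise kernel and an extra round trip through global memory.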
The back-to-back conv-conv fusion example can now stage the result of the first conv in shared memory on Turing. This relaxes tile size selection and results in better performance. This paper also provides some explanation and performance results.
Just a reminder: in the previous release, CUTLASS open sourced super fast strided dgrad (https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/conv/kernel/default_conv2d_dgrad.h#L678-L796), per-channel bias broadcast epilogue fusion, and per-channel reduction epilogue fusion.