CUDA_cutlass_benchmark

This CUTLASS benchmark searches for the GEMM shape (M, N, K) that achieves peak performance on a given GPU, and can then use that shape to stress-test the device.

Usage (this script will boost your CPU and GPU clocks):

make

# Find best MNK:             ./all_gemm --gemmType=N --findBest
# Find best MNK + stress:    ./all_gemm --gemmType=N --autoStress
# Stress with a fixed shape: ./all_gemm --gemmType=N --stress --mn=xxx --k=yyy

Example result:

./all_gemm --gemmType=4 --findBest  [Tensor Core FP16 (FP16 accumulation), time and TFLOPS]

   m     n     k   Time (msec)   TFLOPS
1024, 1024, 1024,     0.06304,    34.07,
1024, 1024, 2048,     0.05139,    83.57,
1024, 1024, 3072,     0.0729,     88.38,
1024, 1024, 4096,     0.1659,     51.78,
1024, 1024, 5120,     0.2053,     52.31,
2048, 2048, 1024,     0.09267,    92.69,
2048, 2048, 2048,     0.1679,    102.3,
2048, 2048, 3072,     0.2458,    104.9,
2048, 2048, 4096,     0.3221,    106.7,
[Peak TFLOPS]=106.7, m=n=2048, k=4096
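The TFLOPS column follows from the standard GEMM operation count: a GEMM of shape (m, n, k) performs 2·m·n·k floating-point operations (one multiply and one add per inner-product element). A minimal sketch of that conversion (the helper name `gemm_tflops` is my own, not from the repo):

```python
def gemm_tflops(m, n, k, time_ms):
    """TFLOPS for a GEMM of shape (m, n, k) that ran in time_ms milliseconds.

    A GEMM does 2*m*n*k flops: one multiply and one add per
    inner-product term of the m*n output elements, each of depth k.
    """
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12

# Reproduce the peak entry from the table above:
print(round(gemm_tflops(2048, 2048, 4096, 0.3221), 1))  # ≈ 106.7
```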

GEMM with Tensor Cores (input -> accumulation type):
INT8_Tensor    (INT8 -> INT32 accumulation)
FP16_Tensor    (FP16 -> FP16 accumulation)
FP16_32_Tensor (FP16 -> FP32 accumulation)

GEMM without Tensor Cores:
HGEMM -> FP16 GEMM
SGEMM -> FP32 GEMM
DGEMM -> FP64 GEMM

Theoretical peak FLOPS from FMA throughput:
(number of cores * peak frequency at graphics clock * instructions per clock)

Core type    Cores (streaming processors)   Peak freq, GHz     Instr/clock   GFLOPS     TFLOPS
Tensor Core  640 x 64 FMA = 40960           1.53 (SXM2-V100)   2             125337.6   125.34
FP16         5120                           1.53 (SXM2-V100)   4             31334.4    31.33
FP32         5120                           1.53 (SXM2-V100)   2             15667.2    15.67
FP64         2560                           1.53 (SXM2-V100)   2             7833.6     7.83
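The table rows can be reproduced directly from the formula above. A small sketch (the function name `theoretical_peak_gflops` is my own; the V100 figures are taken from the table):

```python
def theoretical_peak_gflops(cores, freq_ghz, flops_per_clock):
    # peak GFLOPS = number of cores * peak graphics clock (GHz)
    #             * flops issued per core per clock
    return cores * freq_ghz * flops_per_clock

# SXM2-V100 Tensor Cores: 640 cores, each doing 64 FMA lanes per clock,
# at 1.53 GHz boost, with 2 flops (multiply + add) per FMA.
print(theoretical_peak_gflops(640 * 64, 1.53, 2))  # ≈ 125337.6 GFLOPS
```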
