[QST] Why did I get a wrong result from GemmGrouped? #1924
This is a simple test case to reproduce the problem:
The binary tensor files 'input.bin' and 'weight.bin' can be generated with NumPy.
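The test-case code itself is not reproduced in this excerpt. As a hypothetical sketch only (the `load_bin` helper, the fp16 element type, and the shapes are assumptions taken from the problem description later in this issue), reading those two files into cutlass::half_t buffers could look like this:

```cpp
// Hypothetical loader for the binary tensors mentioned above -- assumes
// 'input.bin' holds a (2035 x 3584) row-major fp16 matrix A and 'weight.bin'
// a (3584 x 48) row-major fp16 matrix B.
#include <cstdio>
#include <vector>
#include "cutlass/half.h"

std::vector<cutlass::half_t> load_bin(const char* path, size_t count) {
  std::vector<cutlass::half_t> host(count);
  if (FILE* f = std::fopen(path, "rb")) {
    size_t read = std::fread(host.data(), sizeof(cutlass::half_t), count, f);
    std::fclose(f);
    if (read != count) {
      host.clear();  // short read: signal failure with an empty vector
    }
  }
  return host;
}

// Usage (shapes taken from the problem description further down):
//   auto A = load_bin("input.bin",  size_t(2035) * 3584);
//   auto B = load_bin("weight.bin", size_t(3584) * 48);
```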
Your align1 could be the problem. Since you are doing A:row x B:row -> C:row, your leading dimensions are A: K = 3584, B: N = 48, C: N = 48, so you can just use alignment = 8. Also, your tile sizes are not common ones; you could start from a more typical configuration (an illustrative sketch follows below).
If that works, then switch to smaller tile sizes like the ones you tried.
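The concrete snippet that the original comment pointed to is not preserved here. A commonly used SM80 configuration along those lines (half_t, row-major A/B/C, alignment 8, 128x128x32 threadblock tile), mirroring CUTLASS's grouped-GEMM example, is sketched below; it is illustrative, not the exact code from the comment:

```cpp
// Illustrative starting point only: half_t row-major A/B/C with alignment 8
// and the common 128x128x32 / 64x64x32 / 16x8x16 tiling for SM80, following
// the structure of CUTLASS's grouped GEMM example.
#include "cutlass/gemm/kernel/default_gemm_grouped.h"
#include "cutlass/gemm/device/gemm_grouped.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"
#include "cutlass/epilogue/thread/linear_combination.h"

using GemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped<
    cutlass::half_t, cutlass::layout::RowMajor,
    cutlass::ComplexTransform::kNone, 8,                    // A, alignment 8
    cutlass::half_t, cutlass::layout::RowMajor,
    cutlass::ComplexTransform::kNone, 8,                    // B, alignment 8
    cutlass::half_t, cutlass::layout::RowMajor,             // C
    float,                                                  // accumulator
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,                 // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,                   // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,                    // mma instruction shape
    cutlass::epilogue::thread::LinearCombination<
        cutlass::half_t, 8, float, float>,                  // alignment-8 epilogue
    cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle,
    4                                                       // pipeline stages
>::GemmKernel;

using GemmGrouped = cutlass::gemm::device::GemmGrouped<GemmKernel>;
```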
@hwu36
It also works, but I can't make the tile size any smaller; doing so causes incorrect results. The 16 columns in the middle of the output matrix are incorrect. Anyway, thanks very much for your help!
I'm using GemmGrouped in this way:
With num_problems = 1, M = 2035, N = 48, K = 3584
all_problems[0] = cutlass::gemm::GemmCoord(2035, 48, 3584)
ptr_X[0] = matrix A, ptr_W[0] = matrix B and ptr_Y[0] = matrix C
ld_X[0] = K, ld_W[0] = N, ld_Y[0] = N
cutlass_t = cutlass::half_t
The GPU I used is an A100.
cutlass version: 3.5.0
nvcc version: 12.4
The calculation I'm expecting is C = matmul(A, B), so I set alpha = 1.0 and beta = 0.0 for the epilogue_op.
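For concreteness, the setup described above corresponds roughly to the sketch below. It is modeled on the device-level GemmGrouped API, not the exact code from this issue; the `_dev`-suffixed names are hypothetical placeholders for device-resident copies of the arrays, and `GemmGrouped` is a device-level grouped GEMM type such as the one sketched earlier in this thread.

```cpp
// Device-resident metadata (placeholders; in real code these are allocated
// with cudaMalloc and filled via cudaMemcpy with the values described above):
cutlass::gemm::GemmCoord* all_problems_dev;   // {GemmCoord(2035, 48, 3584)}
cutlass::half_t** ptr_X_dev;                  // {A}
cutlass::half_t** ptr_W_dev;                  // {B}
cutlass::half_t** ptr_Y_dev;                  // {C}
int64_t* ld_X_dev;                            // {3584}
int64_t* ld_W_dev;                            // {48}
int64_t* ld_Y_dev;                            // {48}

cutlass::gemm::GemmCoord problem(2035, 48, 3584);   // M, N, K

// Number of threadblocks to launch, as computed by the device-level API.
int threadblock_count = GemmGrouped::sufficient(&problem, /*problem_count=*/1);

// alpha = 1.0, beta = 0.0  =>  C = matmul(A, B)
typename GemmGrouped::EpilogueOutputOp::Params epilogue_op(1.0f, 0.0f);

typename GemmGrouped::Arguments args(
    all_problems_dev,        // device array of GemmCoord (a single problem here)
    /*problem_count=*/1,
    threadblock_count,
    epilogue_op,
    ptr_X_dev,               // A pointers
    ptr_W_dev,               // B pointers
    ptr_Y_dev,               // C pointers (source; unused since beta == 0)
    ptr_Y_dev,               // D pointers (destination)
    ld_X_dev,                // lda = K = 3584 (row-major A)
    ld_W_dev,                // ldb = N = 48  (row-major B)
    ld_Y_dev,                // ldc = N = 48  (row-major C)
    ld_Y_dev,                // ldd = N = 48
    &problem);               // optional host copy of the problem sizes

GemmGrouped gemm;
cutlass::Status status = gemm.initialize(args);
if (status == cutlass::Status::kSuccess) {
  status = gemm.run();
}
```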
The shape of C is (2035, 48); however, only the elements in the first 16 columns were correct, while all other elements of C were incorrect.
I spent a lot of time tracing the execution with cuda-gdb, and I found that something goes wrong when loading the warp fragments of matrix B.
The loading is done by the code below:
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h#L395-L415
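For context on what the linked code does: the 'ldsm' mentioned below is CUTLASS's wrapper around the PTX ldmatrix instruction. A stripped-down illustration is sketched here; it is not the CUTLASS implementation, and the per-lane shared-memory address is left abstract because CUTLASS derives it from its swizzled layout.

```cpp
// Illustrative only: an ldmatrix.x4 loads four 8x8 b16 tiles from shared
// memory into a warp. Each lane supplies the shared-memory address of one
// 16-byte row fragment (lanes 0-7 address the first 8x8 tile, lanes 8-15 the
// second, and so on); afterwards every lane holds four 32-bit registers with
// its share of the operand. Requires sm_75 or newer.
__device__ void ldsm_x4(unsigned smem_addr, unsigned (&frag)[4]) {
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
      : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
      : "r"(smem_addr));
}

// A lane's smem_addr is typically obtained from a generic pointer such as the
// 'source_byte_ptr' discussed below, e.g.:
//   unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(source_byte_ptr));
```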
Before the 'ldsm' here, I printed the values at the address 'source_byte_ptr' for block 0, thread 0; the values and the 'source_byte_ptr' of several threads are shown below:
It looks like CUTLASS has automatically applied padding and swizzling to matrix B, and I think the memory layout of B is correct. The 'source_byte_ptr' of the 32 threads in Warp 0 (threads 0-31) all pointed to B[threadIdx.x][0:8]; however, in Warp 1, thread 34 got the address of 8 zeros... and, from my point of view, many threads in Warps 1-3 got wrong memory addresses.
As a result, after the 'ldsm' here, thread 0 got four 32-bit values consisting of {B[0][0], B[1][0], B[8][0], B[9][0], B[0][8], B[1][8], B[8][8], B[9][8]}.
According to the figure above, which shows the element layout of an m16n8k16 mma instruction, the threads in Warp 0 got correct fragments, but the threads in the other warps got wrong fragments; for example, thread 33 got all zeros.
I think that's the reason why only the part belonging to Warp 0 has correct values in matrix C, but I still don't know why the fragment loading goes wrong.