You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When I was running the (code example)[https://github.com/user-attachments/files/17388059/sgemm_sm80_tmp.txt] provided by @ccecka in another issue, I got cudaDeviceSynchronize(): cudaErrorLaunchFailure: unspecified launch failure when running under the "TN" set up. With a bit of investigation, I found that the problem seems to occur due to the Swizzle layout used by the code.
By changing the swizzle_atom to composition(Swizzle<3,3,3>{}, Layout<Shape <_8, _32>, Stride<_32, _1>>{}) and bK to _32, the code runs without the previous mentioned error. However, I am not able to find any other setup that also works. And what it seems to be strange to me is that the value of bK also affect the correctness of the code, when I change bK to _64 while keeping swizzle_atom to be composition(Swizzle<3,3,3>{}, Layout<Shape <_8, _32>, Stride<_32, _1>>{}), the code also failed to work. Shouldn't tile_to_shape handle this kind of situation automatically?
The text was updated successfully, but these errors were encountered:
The original example that I provided gave me no errors. In that thread, we discuss all kinds of changes that are possible including swapping out the MMA, changing the SMEM layouts, and modifying the copy patterns.
You can inspect the Tensors/Layouts pre-partitioning and post-partitioning with print, print_tensor, and print_latex to sanity check any concerns. Yes, tile_to_shape should be working fine (and you can verify that by inspecting the output), but we would also need to see the configuration of the TiledMMAs, the TiledCopys, etc. Many static assertions in the original example code to catch common incompatibilities.
Despite our best efforts to check everything statically that can be checked, out-of-bounds accesses and run-time failures can still occur. This can often be due to problem sizes that are not a multiple of the tile sizes -- the example provided does not perform any predication.
Swizzle layouts are primarily a bank access pattern optimization, so if you believe that Swizzle is the problem, then test it for correctness without a Swizzle first. Only then, optimize any SMEM bank conflicts by using a swizzle pattern (and verifying the post-partitioned layouts still make sense).
When I was running the (code example)[https://github.com/user-attachments/files/17388059/sgemm_sm80_tmp.txt] provided by @ccecka in another issue, I got
cudaDeviceSynchronize(): cudaErrorLaunchFailure: unspecified launch failure
when running under the "TN" set up. With a bit of investigation, I found that the problem seems to occur due to theSwizzle
layout used by the code.By changing the
swizzle_atom
tocomposition(Swizzle<3,3,3>{}, Layout<Shape <_8, _32>, Stride<_32, _1>>{})
andbK
to_32
, the code runs without the previous mentioned error. However, I am not able to find any other setup that also works. And what it seems to be strange to me is that the value ofbK
also affect the correctness of the code, when I changebK
to_64
while keepingswizzle_atom
to becomposition(Swizzle<3,3,3>{}, Layout<Shape <_8, _32>, Stride<_32, _1>>{})
, the code also failed to work. Shouldn'ttile_to_shape
handle this kind of situation automatically?The text was updated successfully, but these errors were encountered: