[BUG] TMA Cooperative GeMM with Stream-K scheduler hangs for specific gemm shapes #1801
Thanks for reporting. This is due to a bug in the CUTLASS 3.x implementation of "separate reduction." For the time being, you can circumvent it with the following change, which got this problem size working for me.

```diff
diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index 36888a29..46adb3ed 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -1047,11 +1047,7 @@ struct PersistentTileSchedulerSm90StreamKParams {
   CUTLASS_HOST_DEVICE
   static bool
   should_perform_separate_reduction(uint32_t epilogue_subtile, uint64_t sk_units, uint64_t sk_tiles, uint64_t dp_tiles, uint64_t ctas_per_wave) {
-    // We perform separate reduction if we have fewer than one wave of output tiles
-    // and each output tile is covered by at least two stream-K units. When sk_units is
-    // a multiple of sk_tiles, we will choose the basic split-K path instead of separate reduction for now.
-    return (epilogue_subtile != 1) && (dp_tiles == 0) && (sk_units > 2u * sk_tiles) &&
-           (sk_units + sk_tiles * epilogue_subtile <= ctas_per_wave);
+    return false;
   }

   // Get the amount of scratch workspace needed for the kernel. This variant of the method should only be used when
```
How long until this bug is expected to be fixed on the main branch? If it will take a while, maybe I should fork the branch and use it with the patch you provided. The affected GEMM shapes come from LLMs that are quite popular right now. I also wonder whether there is any performance implication to applying your patch; that is, is there a potential performance penalty from always turning off separate reduction?
There is no timeline for when the separate reduction implementation will be fixed. We plan to roll out the patch I described soon, though. There is no performance implication because, as far as I have seen, separate reduction is currently broken in all of its use cases.
@jackkosaian, curious how long the separate-reduction fix is expected to take, and whether there are any suggested workarounds? My understanding is that for small GEMM shapes with a large K dimension, separate reduction would be very helpful, so disabling it directly affects performance for these GEMMs. One such GEMM configuration is m=16, n=2560, k=8192.
Describe the bug

GEMM kernels with the following configuration hang for specific GEMM shapes:

- Data types: e4m3 x e4m3 -> bf16
- CTA tile shape: 256x32x128
- Cluster shape: 2x1x1
- Kernel schedule: KernelTmaWarpSpecializedCooperative
- Epilogue schedule: TmaWarpSpecializedCooperative
- Tile scheduler: Stream-K

Tested GEMM shapes (MxNxK):

When I change the epilogue schedule to NoSmemWarpSpecialized, this issue seems to disappear. Therefore, I suspect something is wrong with the TMA epilogue when it is used with Stream-K.
Steps/Code to reproduce bug

Apply the following patch file to 48_hopper_warp_specialized_gemm.cu (to apply the patch, use `patch -p1 < xxx.patch`), then execute the example with the command:
Environment details