[Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4
#11495
Fix a hardcoded index in `access_ptr` rewriting, which assumed that the number of stages is 2.

Refactored the MMA code in `test_tir_schedule_tensorize_ldmatrix_mma.py` so that it can be used by other tests. The new test in `test_tir_transform_inject_software_pipeline.py` applies software pipelining annotations to the MMA-tensorized schedule with `software_pipeline_stage = [0, 0, 3]`, which pipelines the global-to-shared load with depth 4. Without async copy, this is not useful for performance, but it does demonstrate that a multi-stage pipeline with depth > 2 works on a semi-realistic GPU schedule.

The test uses large dynamic shared memory, which serves as a test case for #11478.
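To illustrate why a hardcoded 2 breaks at depth 4, here is a minimal plain-Python sketch of the offset arithmetic. It is not the code from this PR: the helper names, the `version_stride` value, and the way `num_versions` is derived from the stage list are all simplifying assumptions made for illustration.

```python
# Illustrative sketch only; these helpers are hypothetical and do not
# correspond to functions in the TVM code base.

def versioned_offset(old_offset, loop_iter, num_versions, version_stride):
    """Offset into a multi-versioned buffer for one pipeline loop iteration."""
    return old_offset + (loop_iter % num_versions) * version_stride


def hardcoded_offset(old_offset, loop_iter, version_stride):
    """Same computation with the version count fixed at 2 (the assumed bug)."""
    return old_offset + (loop_iter % 2) * version_stride


# Stage annotation from the new test: producers at stage 0, consumer at stage 3.
stages = [0, 0, 3]

# Assumption: the number of buffer versions equals the stage distance plus one.
num_versions = max(stages) - min(stages) + 1

# Assumption: each version of the shared buffer occupies 1024 elements.
version_stride = 1024

# Distinct buffer slots touched over eight loop iterations.
slots_general = sorted(
    {versioned_offset(0, i, num_versions, version_stride) for i in range(8)}
)
slots_hardcoded = sorted(
    {hardcoded_offset(0, i, version_stride) for i in range(8)}
)

print(slots_general)
print(slots_hardcoded)
```

With the version count taken from the pipeline depth, the iterations cycle through four distinct slots. With the constant 2, only two slots are ever addressed, so in-flight loads would overwrite data that a later stage has not yet consumed.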
@vinx13 @junrushao1994 @csullivan