Streamk v0.2 #646
base: main_perf
Conversation
Additionally, changes the locks to use uint8 instead of int32 for a smaller memory footprint.
I put in some comments.
Note: on gfx90a, loads/stores with cache_modifiers do not work. Documented here: https://github.com/ROCm/triton-internal/issues/311
rm1 = tl.max_contiguous(tl.multiple_of(rm1, BLOCK_SIZE_M), BLOCK_SIZE_M)
rn1 = tl.max_contiguous(tl.multiple_of(rn1, BLOCK_SIZE_N), BLOCK_SIZE_N)
P_ = P + pid * BLOCK_SIZE_M * BLOCK_SIZE_N + rm1[:, None] * BLOCK_SIZE_N + rn1[None, :]
tl.store(P_, acc, cache_modifier=".wt")
Note: on gfx90a, loads/stores with cache_modifiers do not work. Documented here: https://github.com/ROCm/triton-internal/issues/311
# todo: try use tl.load once cache modifier landed upstream
while tl.atomic_cas(locks + next_pid, 1, 1) != 1:
while (end < tile_iter_end and next_pid < NUM_SMS):
while tl.load(locks + next_pid, cache_modifier=".cv", volatile=True) != 1:
This also does not work on gfx90a: https://github.com/ROCm/triton-internal/issues/311
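For readers unfamiliar with the pattern in the hunk above: the kernel spins on a flag that another program id sets once its partial tile is written, using a write-through store (`.wt`) on the producer side and an uncached load (`.cv`, volatile) on the consumer side. A hedged CPU analogue of that producer/consumer handshake (pure Python with threads standing in for programs, a list element standing in for `locks[next_pid]`):

```python
# CPU sketch (assumption: illustrative analogue only, not the kernel itself).
# producer()  ~ tl.store(locks + pid, 1, cache_modifier=".wt")
# consumer()  ~ while tl.load(locks + next_pid, cache_modifier=".cv", volatile=True) != 1
import threading
import time

flag = [0]      # stands in for locks[next_pid]
result = []

def producer():
    time.sleep(0.01)
    flag[0] = 1                 # publish: "my partial tile is ready"

def consumer():
    while flag[0] != 1:         # spin until the producer's flag is visible
        time.sleep(0.001)
    result.append("done")       # safe to read the producer's partial results

t_cons = threading.Thread(target=consumer)
t_prod = threading.Thread(target=producer)
t_cons.start()
t_prod.start()
t_prod.join()
t_cons.join()
assert result == ["done"]
```

The cache modifiers matter on the GPU because, without them, the consumer can keep re-reading a stale cached copy of the lock; that is exactly the behavior the linked gfx90a issue reports.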
EVEN_K: tl.constexpr,
):
pid = tl.program_id(0)
pid = get_new_pid(pid, num_cus)
pid = (pid % 8) * (NUM_SMS // 8) + (pid // 8)
This is not needed for anything but gfx942, so we will actually remove this if the arch is gfx90a.
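To see what the `(pid % 8) * (NUM_SMS // 8) + (pid // 8)` line in the hunk above does: it permutes program ids so that consecutive pids are spread across the 8 XCDs of gfx942 instead of landing on the same die. A minimal sketch (assumption: `NUM_SMS` is a multiple of 8; the value 64 below is illustrative, the real value is the GPU's CU count):

```python
# Pure-Python sketch of the pid swizzle from the kernel prologue.
def remap_pid(pid: int, num_sms: int) -> int:
    # Consecutive pids go to different groups of num_sms // 8 programs,
    # so neighbouring tiles are spread across the 8 dies.
    return (pid % 8) * (num_sms // 8) + (pid // 8)

NUM_SMS = 64  # illustrative; must be a multiple of 8 for this remap
remapped = [remap_pid(p, NUM_SMS) for p in range(NUM_SMS)]

# The remap is a bijection on [0, NUM_SMS): every pid is used exactly once.
assert sorted(remapped) == list(range(NUM_SMS))
```

Because the remap is a permutation, no work is lost or duplicated; only the assignment of tiles to CUs changes.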
P = torch.zeros((num_cus, block_m * block_n), device="cuda", dtype=torch.float32)
triton_output = matmul(a, b, c, P, locks, num_cus, block_m, block_n, block_k, group_m, num_warps, num_stages,
                       waves_per_eu, mfmaInstrSize, kpack, EVEN_K)
locks = torch.zeros((num_sms, ), device="cuda", dtype=torch.int32)
`locks` can be smaller than int32; we only need 1 byte per lock, so uint8 should work.
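The footprint argument in the comment above is straightforward: each lock only ever holds 0 or 1, so a 1-byte element suffices. A pure-Python illustration using the standard `array` module (assumption: the PR itself allocates torch tensors; `num_sms = 304` below is just an example CU count):

```python
# int32 vs uint8 lock arrays: same length, 4x difference in bytes.
from array import array

num_sms = 304  # illustrative CU count
locks_i32 = array('i', [0] * num_sms)  # 'i' = signed int, 4 bytes per lock
locks_u8 = array('B', [0] * num_sms)   # 'B' = unsigned char, 1 byte per lock

assert locks_i32.itemsize == 4
assert locks_u8.itemsize == 1
# Total buffer sizes: 1216 bytes vs 304 bytes.
assert locks_i32.itemsize * num_sms == 4 * locks_u8.itemsize * num_sms
```

In the torch version this corresponds to allocating the locks tensor with `dtype=torch.uint8` instead of `dtype=torch.int32`.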
streamk v0.2:
- new streamk tuning script to reduce compile and profiling time
- use load/store cache modifiers to reimplement the spin lock
- add a CI test for the streamk kernel
- enable StreamPipelineV2