[CUDA] Allow dynamic shmem of size > 48K in runtime #11478
Conversation
if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {
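The `48 << 10` threshold corresponds to the default 48KB per-block limit on dynamic shared memory in CUDA; requesting more requires an explicit opt-in via `cuFuncSetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` before launch. A minimal sketch of the size check (the helper name `NeedsShmemOptIn` is hypothetical, not part of the PR):

```cpp
#include <cstddef>

// Hypothetical helper mirroring the runtime check above: kernels that want
// more than the default 48KB of dynamic shared memory must opt in via
// cuFuncSetAttribute(f, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, size)
// before they can be launched.
bool NeedsShmemOptIn(std::size_t dyn_shmem_size) {
  return dyn_shmem_size >= (48 << 10);  // 48KB = 49152 bytes
}
```

For example, `NeedsShmemOptIn(64 << 10)` is true, while a 32KB request fits under the default limit and needs no attribute call.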
If the dynamic shared memory is too large, will it pass the `VerifyGPUCode` check?
Haven't tested, but yeah, it seems `VerifyGPUCode` checks the static alloc size against `max_shared_memory_per_block`, which would fail if `dyn_shmem_size >= (48 << 10)`:

tvm/src/tir/analysis/verify_gpu_code.cc, Lines 70 to 71 in 534205b
} else if (storage_scope.rank == runtime::StorageRank::kShared) {
  size_t size = static_cast<size_t>(op->ConstantAllocationSize());
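In other words, the verifier sums the constant sizes of shared-scope allocations against the `max_shared_memory_per_block` limit, so a large dynamic shmem request would be rejected. A standalone sketch of that kind of check (the names `VerifySharedMemLimit` and `shared_alloc_sizes` are illustrative, not the actual TVM code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative version of the kind of check VerifyGPUCode performs: sum the
// constant allocation sizes of all shared-scope buffers and compare against
// the device limit. With a 48KB limit, any single >= 48KB allocation fails.
bool VerifySharedMemLimit(const std::vector<std::size_t>& shared_alloc_sizes,
                          std::size_t max_shared_memory_per_block) {
  std::size_t total = 0;
  for (std::size_t s : shared_alloc_sizes) total += s;
  return total <= max_shared_memory_per_block;
}
```

With a 48KB limit, two 16KB allocations pass, but one 64KB allocation fails, which is why tuning integration needs extra work here.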
Can we defer this issue until later? I need this to demonstrate that a multi-stage pipeline with depth > 2 works on a semi-realistic CUDA schedule.
Yeah, let's defer this particular issue.
if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {
    // Assumption: dyn_shmem_size doesn't change across different invocations of
    // fcache_[device_id]
This assumption could be controversial, but it should be mostly OK in practice. To support a kernel that uses different large shmem sizes depending on the input, we would need to call `cuFuncSetAttribute` on every invocation.
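Under the current assumption the attribute only needs to be set once, when the function is first cached. The sketch below (with a stubbed `SetMaxDynShmem` counter standing in for the real `cuFuncSetAttribute` driver call, since no GPU is assumed, and a hypothetical `CachedLauncher` in place of the actual runtime class) shows the once-per-cache-entry pattern, as opposed to the once-per-launch pattern that varying sizes would require:

```cpp
#include <cstddef>

// Stub standing in for cuFuncSetAttribute(f,
// CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, size); it counts calls so
// we can observe how often the attribute would actually be set.
int g_set_attr_calls = 0;
void SetMaxDynShmem(std::size_t /*size*/) { ++g_set_attr_calls; }

// Hypothetical launcher mirroring the cached-function pattern in the PR: the
// attribute is set only when the function is first cached, which is correct
// only if dyn_shmem_size never changes across invocations.
struct CachedLauncher {
  bool cached = false;
  void Launch(std::size_t dyn_shmem_size) {
    if (!cached) {
      cached = true;
      if (dyn_shmem_size >= (48 << 10)) SetMaxDynShmem(dyn_shmem_size);
    }
    // ... the actual kernel launch would go here ...
  }
};
```

Launching the same cached function three times with a 64KB request triggers exactly one attribute call; supporting per-input sizes would mean moving the `SetMaxDynShmem` call out of the `if (!cached)` branch.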
Currently, we have functioning dynamic shared memory support on CUDA, but we haven't actually explored allocating more than 48KB of dynamic shmem. This PR updates the CUDA runtime to support launching a kernel that wants to use dynamic shmem of size > 48KB. This is already useful for manually rewritten schedules, but integrating this feature into tuning requires more work (see the discussion on `VerifyGPUCode` above). I'll add a test which actually uses a big dynamic shmem in the next PR (need to fix one bug in the software pipelining transform).

Reference in cutlass code: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L479-L482
@vinx13 @junrushao1994 @tqchen @yzh119 @Hzfengsy