[CUDA] Allow dynamic shmem of size > 48K in runtime #11478
Conversation
if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {
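The `48 << 10` threshold corresponds to the default 48KB per-block limit on dynamic shared memory in CUDA; requesting more requires an explicit opt-in via `cuFuncSetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` before launch. A minimal sketch of the size check (the helper name `NeedsShmemOptIn` is hypothetical, not part of the PR):

```cpp
#include <cstddef>

// Hypothetical helper mirroring the runtime check above: kernels that want
// more than the default 48KB of dynamic shared memory must opt in via
// cuFuncSetAttribute(f, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, size)
// before they can be launched.
bool NeedsShmemOptIn(std::size_t dyn_shmem_size) {
  return dyn_shmem_size >= (48 << 10);  // 48KB = 49152 bytes
}
```

For example, `NeedsShmemOptIn(64 << 10)` is true, while a 32KB request fits under the default limit and needs no attribute call.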
If the dynamic shared memory is too large, will it pass the `VerifyGPUCode` check?
Haven't tested, but yeah, it seems `VerifyGPUCode` checks the static alloc size against `max_shared_memory_per_block`, which would fail if `dyn_shmem_size >= (48 << 10)`:

tvm/src/tir/analysis/verify_gpu_code.cc, Lines 70 to 71 in 534205b
} else if (storage_scope.rank == runtime::StorageRank::kShared) {
  size_t size = static_cast<size_t>(op->ConstantAllocationSize());
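In other words, the verifier sums the constant sizes of shared-scope allocations against the `max_shared_memory_per_block` limit, so a large dynamic shmem request would be rejected. A standalone sketch of that kind of check (the names `VerifySharedMemLimit` and `shared_alloc_sizes` are illustrative, not the actual TVM code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative version of the kind of check VerifyGPUCode performs: sum the
// constant allocation sizes of all shared-scope buffers and compare against
// the device limit. With a 48KB limit, any single >= 48KB allocation fails.
bool VerifySharedMemLimit(const std::vector<std::size_t>& shared_alloc_sizes,
                          std::size_t max_shared_memory_per_block) {
  std::size_t total = 0;
  for (std::size_t s : shared_alloc_sizes) total += s;
  return total <= max_shared_memory_per_block;
}
```

With a 48KB limit, two 16KB allocations pass, but one 64KB allocation fails, which is why tuning integration needs extra work here.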
Can we defer this issue until later? I need this to demonstrate that a multi-stage pipeline with depth > 2 works on a semi-realistic CUDA schedule.
Yeah, let's defer this particular issue.
if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {
    // Assumption: dyn_shmem_size doesn't change across different invocations of
    // fcache_[device_id]
This assumption could be controversial, but it should be mostly OK in practice. To support a kernel that uses different large shmem sizes depending on the input, we would need to call `cuFuncSetAttribute` on every invocation.
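Under the current assumption the attribute only needs to be set once, when the function is first cached. The sketch below (with a stubbed `SetMaxDynShmem` counter standing in for the real `cuFuncSetAttribute` driver call, since no GPU is assumed, and a hypothetical `CachedLauncher` in place of the actual runtime class) shows the once-per-cache-entry pattern, as opposed to the once-per-launch pattern that varying sizes would require:

```cpp
#include <cstddef>

// Stub standing in for cuFuncSetAttribute(f,
// CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, size); it counts calls so
// we can observe how often the attribute would actually be set.
int g_set_attr_calls = 0;
void SetMaxDynShmem(std::size_t /*size*/) { ++g_set_attr_calls; }

// Hypothetical launcher mirroring the cached-function pattern in the PR: the
// attribute is set only when the function is first cached, which is correct
// only if dyn_shmem_size never changes across invocations.
struct CachedLauncher {
  bool cached = false;
  void Launch(std::size_t dyn_shmem_size) {
    if (!cached) {
      cached = true;
      if (dyn_shmem_size >= (48 << 10)) SetMaxDynShmem(dyn_shmem_size);
    }
    // ... the actual kernel launch would go here ...
  }
};
```

Launching the same cached function three times with a 64KB request triggers exactly one attribute call; supporting per-input sizes would mean moving the `SetMaxDynShmem` call out of the `if (!cached)` branch.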
Currently, we have functioning dynamic shared memory support on CUDA, but we haven't actually explored allocating more than 48KB of dynamic shmem. This PR updates the CUDA runtime to support launching a kernel that wants to use dynamic shmem of size > 48KB. This is already useful for manually rewritten schedules, but integrating this feature into tuning requires more work (see the discussion on `VerifyGPUCode` above). I'll add a test which actually uses a big dynamic shmem in the next PR (need to fix one bug in the software pipelining transform).

Reference in cutlass code: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L479-L482
@vinx13 @junrushao1994 @tqchen @yzh119 @Hzfengsy