`reuseMemoryAllocations` pass needs rewrite #221
I actually wonder if we should stop worrying about aliasing and, for smem and global memory, focus on a stack of allocs and frees. A reuse pass has always seemed a bit suspect to me. Yes, in theory it could do more interesting reuse than simply offsetting a pointer with a high-water mark, but I've never been sure how much better.
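A minimal sketch of the stack-of-allocs-and-frees idea mentioned above, assuming a host-side planner that hands out byte offsets into dynamic shared memory and records a high-water mark. The class name `SmemStackPlanner` and its interface are made up for illustration and are not part of nvFuser.

```cpp
// Hypothetical sketch, not nvFuser code: a stack-style shared-memory planner
// that assigns 16-byte-aligned offsets with a bump pointer, frees in LIFO
// order, and reports the high-water mark as the dynamic smem size to request.
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

class SmemStackPlanner {
 public:
  // Push an allocation of `bytes`, aligned to `align`, and return its offset.
  int64_t allocate(int64_t bytes, int64_t align = 16) {
    saved_tops_.push_back(top_);
    int64_t offset = (top_ + align - 1) / align * align;  // align up
    top_ = offset + bytes;
    high_water_mark_ = std::max(high_water_mark_, top_);
    return offset;
  }

  // Pop the most recent allocation so later buffers can reuse its space.
  void free() {
    if (saved_tops_.empty()) {
      throw std::runtime_error("free() without a matching allocate()");
    }
    top_ = saved_tops_.back();
    saved_tops_.pop_back();
  }

  // Total dynamic shared memory the kernel would need to launch with.
  int64_t highWaterMark() const { return high_water_mark_; }

 private:
  std::vector<int64_t> saved_tops_;
  int64_t top_ = 0;
  int64_t high_water_mark_ = 0;
};
```

With this scheme, a buffer's space is reclaimed as soon as it is freed, so reuse falls out of allocation order rather than from an aliasing analysis.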
jacobhinkle added a commit that referenced this issue on Aug 8, 2023:
This is related to #221. Currently, dynamic shared memory is used in two ways: either implicitly by reductions or when we have a `TensorView` with `MemoryType::Shared`. Note that our IR does not explicitly represent reduction (or Welford) smem at all. I'll refer to the shared memory reserved for reductions and Welfords as "reduction smem" and to all the shared memory above it allocated via `TensorView`s as `tv smem`.

Currently, when we generate CUDA kernels, we look for reduction and Welford ops, find the largest data type, and reserve that amount times the block size at the beginning of the dynamic smem array `shared_mem`. We then set `smem_offset` to that size. When `tv smem` is defined in the kernel, we align the `smem_offset` address to 16 bytes and use it for the buffer definition, then add the new buffer's size to `smem_offset` in the generated kernel. This method makes memory reuse cumbersome: we currently support aliasing identically-sized buffers, but we could not create a new allocation with a different size between previous memory locations, since we do not represent the smem offsets anywhere.

This PR creates an `address()` attribute on `kir::Allocate` which holds a scalar expression for the number of bytes above `smem_offset` at which to start the allocation. When generating a kernel, we use these expressions to compute the address of the new buffer directly, and we never modify `smem_offset`. During lowering, in the `reuseMemoryAllocations` pass, we assign those address `Val` attributes without reusing any `tv smem`.

Note that this PR could potentially increase register use since we do not do any explicit CSE on these expressions: if there are multiple smem TVs, their offset expressions can grow in size due to intermediate alignment expressions. However, I did not observe any change in register usage in our current tests.
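To make the two schemes concrete, here is an illustrative CUDA sketch, not actual nvFuser-generated code; the buffer names `T2`/`T3`, the float reduction workspace, and the literal offsets are assumptions for the example.

```cpp
// Illustrative sketch only; nvFuser's real codegen differs in detail.
extern __shared__ char shared_mem[];

__global__ void example_kernel() {
  // "Reduction smem": largest reduction data type times the block size,
  // reserved at the base of the dynamic smem array.
  unsigned smem_offset =
      blockDim.x * blockDim.y * blockDim.z * sizeof(float);

  // Old scheme: align smem_offset to 16 bytes, carve out the buffer, then
  // bump smem_offset by the buffer's size before the next tv-smem allocation:
  //   smem_offset = (smem_offset + 15) & ~15u;
  //   float* T2 = reinterpret_cast<float*>(shared_mem + smem_offset);
  //   smem_offset += 4096;  // size of T2 in bytes (example value)

  // New scheme: each kir::Allocate carries an address() expression giving the
  // byte offset above smem_offset, so buffers are placed directly and
  // smem_offset itself is never modified.
  float* T2 = reinterpret_cast<float*>(shared_mem + smem_offset + /*address=*/0);
  float* T3 = reinterpret_cast<float*>(shared_mem + smem_offset + /*address=*/4096);
  (void)T2;
  (void)T3;
}
```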
First, I don't think it makes sense to alias register tensors, regardless of their size. Modern compilers commonly convert user code into SSA form. For CUDA, user C++ code is lowered to NVVM IR, which is based on LLVM IR, which is SSA. Aliasing register tensors is at best a no-op, and at worst it would increase the compilation time of the `C++ -> NVVM IR` lowering in NVRTC. So we should focus only on the aliasing of shared memory and global memory.

Currently, `reuseMemoryAllocations` can only alias tensors with the same size, which does not work well for applications like matmul. In a matmul prologue, the shared memory tensors have sizes `cta_tile.M x cta_tile.K` and `cta_tile.N x cta_tile.K`. In the epilogue, the shared memory tensor has size `cta_tile.M x cta_tile.N`, which is typically different from the previous shared memory tensor sizes. We need a smarter algorithm to be able to reuse the prologue tensors' memory for the epilogue.

Reference implementation: csarofeen/pytorch#1979