
reuseMemoryAllocations pass needs rewrite #221

Closed
zasdfgbnm opened this issue Apr 25, 2023 · 2 comments
zasdfgbnm commented Apr 25, 2023

First, I don't think it makes sense to alias register tensors, regardless of their size. Modern compilers commonly convert user code into SSA form. For CUDA, user C++ code is lowered to NVVM IR, which is based on LLVM IR and is therefore SSA. Aliasing a register tensor is at best a no-op, and at worst it increases the compilation time of the C++ -> NVVM IR lowering in nvRTC. So we should focus only on aliasing shared memory and global memory.
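To illustrate what this means in generated code (a minimal sketch with hypothetical names, not actual nvFuser output): once the compiler promotes these locals to SSA values, the aliased and non-aliased forms lower to the same virtual registers, so the source-level reuse buys nothing.

```cuda
__global__ void register_alias_sketch() {
  float T1[4];  // register-resident tensor
  for (int i = 0; i < 4; ++i) T1[i] = static_cast<float>(i);
  // "Aliasing": reuse T1's storage for T2 instead of declaring float T2[4].
  float* T2 = T1;
  for (int i = 0; i < 4; ++i) T2[i] = T1[i] * 2.0f;
  // After SSA conversion (e.g. LLVM's SROA/mem2reg), both variants become
  // the same sequence of virtual registers; physical register allocation
  // happens later and ignores this source-level reuse.
  (void)T2;
}
```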

Currently, reuseMemoryAllocations can only alias tensors of the same size, which does not work well for applications like matmul. In a matmul prologue, the shared memory tensors have sizes cta_tile.M x cta_tile.K and cta_tile.N x cta_tile.K. In the epilogue, the shared memory tensor has size cta_tile.M x cta_tile.N, which typically differs from the prologue sizes. We need a smarter algorithm to reuse the prologue tensors' memory for the epilogue.
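As a concrete illustration (the tile sizes and fp16 data type below are assumptions for the example, not taken from a real schedule), a sketch of the dynamic shared memory layout:

```cuda
#include <cuda_fp16.h>

__global__ void matmul_smem_layout_sketch() {
  extern __shared__ char shared_mem[];
  // Prologue tiles, assuming cta_tile = (M=128, N=128, K=32), fp16 operands:
  __half* smem_a = reinterpret_cast<__half*>(shared_mem);         // 128*32*2 B = 8 KiB
  __half* smem_b = reinterpret_cast<__half*>(shared_mem + 8192);  // 128*32*2 B = 8 KiB
  // ... main loop consumes smem_a / smem_b ...
  // Epilogue tile needs 128*128*2 B = 32 KiB. An equal-size-only aliasing
  // pass cannot reuse either 8 KiB prologue buffer, even though both are
  // dead here and together could cover the first 16 KiB of this allocation.
  __half* smem_c = reinterpret_cast<__half*>(shared_mem + 16384);
  (void)smem_c;
}
```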

Reference implementation: csarofeen/pytorch#1979

zasdfgbnm commented

cc @naoyam @drzejan2

csarofeen commented

I actually wonder if we should stop worrying about aliasing and, for smem and global memory, focus on a stack of allocs and frees. A reuse pass has always seemed a bit suspect to me. Yes, in theory it could do more interesting reuse than simply offsetting a pointer against a high-water mark, but I've never been sure how much better that actually is.
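A minimal sketch of such a stack-style allocator (the struct and member names are hypothetical, not nvFuser code): allocations bump an offset, frees pop back to a saved mark, and the peak offset becomes the kernel's dynamic smem size.

```cuda
#include <algorithm>
#include <cstdint>

// Hypothetical stack allocator for dynamic shared memory offsets.
struct SmemStack {
  int64_t offset = 0;      // current top of the stack, in bytes
  int64_t high_water = 0;  // peak usage; sets the kernel's dynamic smem size

  // Returns the byte offset of a new allocation, aligned as requested.
  int64_t alloc(int64_t bytes, int64_t align = 16) {
    offset = (offset + align - 1) / align * align;
    const int64_t addr = offset;
    offset += bytes;
    high_water = std::max(high_water, offset);
    return addr;
  }

  // Frees must be stack-ordered: popping to addr releases everything above it.
  void free(int64_t addr) { offset = addr; }
};
```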

jacobhinkle added a commit that referenced this issue Aug 8, 2023
This is related to #221. Currently, dynamic shared memory is used in two
ways: either implicitly by reductions, or when we have a `TensorView`
with `MemoryType::Shared`. Note that our IR does not explicitly
represent reduction (or Welford) smem at all. I'll refer to the shared
memory space reserved for reductions and Welfords as "reduction smem"
and to all the shared memory allocated above it via `TensorView`s as "tv smem".

Currently, when we generate CUDA kernels, we look for reduction and
Welford ops, find the largest data type, and reserve that size times
the block size at the beginning of the dynamic smem array
`shared_mem`. We then set `smem_offset` to that size. When tv smem is
defined in the kernel, we align the `smem_offset` address to 16 bytes,
use it for the buffer definition, and then add the new buffer's size
to `smem_offset` in the generated kernel.
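
A simplified sketch of that generated pattern (the buffer name `T2` and
its size are hypothetical; actual nvFuser codegen differs in detail):

```cuda
__global__ void current_codegen_sketch() {
  extern __shared__ char shared_mem[];
  // Reduction smem reserved up front: largest reduction dtype times block size.
  unsigned smem_offset =
      blockDim.x * blockDim.y * blockDim.z * sizeof(float);
  // Each tv smem buffer aligns smem_offset to 16 bytes, then bumps it.
  smem_offset = (smem_offset + 15u) & ~15u;
  float* T2 = reinterpret_cast<float*>(shared_mem + smem_offset);
  smem_offset += 128 * sizeof(float);  // assume T2 holds 128 floats
  (void)T2;
}
```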

This method makes memory reuse cumbersome. We currently support
aliasing identically-sized buffers, but we could not place a new
allocation of a different size between previously allocated regions,
since we don't represent the smem offsets anywhere.

This PR creates an `address()` attribute on `kir::Allocate` which holds
a scalar expression for the number of bytes above `smem_offset` at which
to start the allocation. When generating a kernel, we use these
expressions to find the address of the new buffer directly, and we never
modify `smem_offset`. During lowering, in the `reuseMemoryAllocations`
pass, we assign those address `Val` attributes without reusing any tv smem.
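
Under the new scheme, generated code might look like this sketch
(hypothetical names and sizes; the exact expressions nvFuser emits will
differ): each buffer's offset above `smem_offset` is a precomputed
`address()` expression with alignment padding folded in, and
`smem_offset` is read but never written.

```cuda
__global__ void address_attribute_sketch() {
  extern __shared__ char shared_mem[];
  const unsigned smem_offset =
      blockDim.x * blockDim.y * blockDim.z * sizeof(float);  // reduction smem
  // address() expression for T2: pad smem_offset up to 16 bytes.
  float* T2 = reinterpret_cast<float*>(
      shared_mem + smem_offset + (16u - smem_offset % 16u) % 16u);
  // address() expression for T3: T2's padding plus T2's 512-byte size.
  float* T3 = reinterpret_cast<float*>(
      shared_mem + smem_offset + (16u - smem_offset % 16u) % 16u + 512u);
  (void)T2; (void)T3;
}
```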

Note that this PR could potentially increase register use since we don't
do any explicit CSE on these expressions: if there are multiple smem
TVs, their offset expressions can grow due to intermediate alignment
expressions. However, I didn't observe any change in register usage in
our current tests.
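
Schematically, with no CSE each buffer's start address inlines all
earlier offsets, so the expressions grow (using hypothetical `align16`
and `size` shorthands for the rounding and byte-size expressions above):

```
start(T2) = align16(smem_offset)
start(T3) = align16(align16(smem_offset) + size(T2))
start(T4) = align16(align16(align16(smem_offset) + size(T2)) + size(T3))
```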