`reuseMemoryAllocations` pass needs rewrite #221
I actually wonder if we should stop worrying about aliasing and, for smem and global memory, focus on a stack of allocs and frees. A reuse pass has always seemed a bit suspect to me. Yes, in theory it could do more interesting reuse than simply offsetting a pointer with a high-water mark, but I've never been sure how much better.
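A minimal sketch of the stack-of-allocs-and-frees idea mentioned above, assuming a host-side planner that hands out byte offsets into dynamic shared memory and records a high-water mark. The class name `SmemStackPlanner` and its interface are made up for illustration and are not part of nvFuser.

```cpp
// Hypothetical sketch, not nvFuser code: a stack-style shared-memory planner
// that assigns 16-byte-aligned offsets with a bump pointer, frees in LIFO
// order, and reports the high-water mark as the dynamic smem size to request.
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

class SmemStackPlanner {
 public:
  // Push an allocation of `bytes`, aligned to `align`, and return its offset.
  int64_t allocate(int64_t bytes, int64_t align = 16) {
    saved_tops_.push_back(top_);
    int64_t offset = (top_ + align - 1) / align * align;  // align up
    top_ = offset + bytes;
    high_water_mark_ = std::max(high_water_mark_, top_);
    return offset;
  }

  // Pop the most recent allocation so later buffers can reuse its space.
  void free() {
    if (saved_tops_.empty()) {
      throw std::runtime_error("free() without a matching allocate()");
    }
    top_ = saved_tops_.back();
    saved_tops_.pop_back();
  }

  // Total dynamic shared memory the kernel would need to launch with.
  int64_t highWaterMark() const { return high_water_mark_; }

 private:
  std::vector<int64_t> saved_tops_;
  int64_t top_ = 0;
  int64_t high_water_mark_ = 0;
};
```

With this scheme, a buffer's space is reclaimed as soon as it is freed, so reuse falls out of allocation order rather than from an aliasing analysis.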
jacobhinkle added a commit that referenced this issue on Aug 8, 2023:
This is related to #221. Currently, dynamic shared memory is used in two ways: either implicitly by reductions or when we have a `TensorView` with `MemoryType::Shared`. Note that our IR does not explicitly represent reduction (or Welford) smem at all. I'll refer to the shared memory reserved for reductions and Welfords as "reduction smem" and to all the shared memory above it allocated via `TensorView`s as `tv smem`.

Currently, when we generate CUDA kernels, we look for reduction and Welford ops, find the largest data type, and reserve that amount times the block size at the beginning of the dynamic smem array `shared_mem`. We then set `smem_offset` to that size. When `tv smem` is defined in the kernel, we align the `smem_offset` address to 16 bytes and use it for the buffer definition, then add the new buffer's size to `smem_offset` in the generated kernel. This method makes memory reuse cumbersome: we currently support aliasing identically-sized buffers, but we could not create a new allocation with a different size between previous memory locations, since we do not represent the smem offsets anywhere.

This PR creates an `address()` attribute on `kir::Allocate` which holds a scalar expression for the number of bytes above `smem_offset` at which to start the allocation. When generating a kernel, we use these expressions to compute the address of the new buffer directly, and we never modify `smem_offset`. During lowering, in the `reuseMemoryAllocations` pass, we assign those address `Val` attributes without reusing any `tv smem`.

Note that this PR could potentially increase register use since we do not do any explicit CSE on these expressions: if there are multiple smem TVs, their offset expressions can grow in size due to intermediate alignment expressions. However, I did not observe any change in register usage in our current tests.
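To make the two schemes concrete, here is an illustrative CUDA sketch, not actual nvFuser-generated code; the buffer names `T2`/`T3`, the float reduction workspace, and the literal offsets are assumptions for the example.

```cpp
// Illustrative sketch only; nvFuser's real codegen differs in detail.
extern __shared__ char shared_mem[];

__global__ void example_kernel() {
  // "Reduction smem": largest reduction data type times the block size,
  // reserved at the base of the dynamic smem array.
  unsigned smem_offset =
      blockDim.x * blockDim.y * blockDim.z * sizeof(float);

  // Old scheme: align smem_offset to 16 bytes, carve out the buffer, then
  // bump smem_offset by the buffer's size before the next tv-smem allocation:
  //   smem_offset = (smem_offset + 15) & ~15u;
  //   float* T2 = reinterpret_cast<float*>(shared_mem + smem_offset);
  //   smem_offset += 4096;  // size of T2 in bytes (example value)

  // New scheme: each kir::Allocate carries an address() expression giving the
  // byte offset above smem_offset, so buffers are placed directly and
  // smem_offset itself is never modified.
  float* T2 = reinterpret_cast<float*>(shared_mem + smem_offset + /*address=*/0);
  float* T3 = reinterpret_cast<float*>(shared_mem + smem_offset + /*address=*/4096);
  (void)T2;
  (void)T3;
}
```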
First, I don't think it makes sense to alias register tensors, regardless of their size. Modern compilers commonly convert user code into SSA form. For CUDA, user C++ code is lowered to NVVM IR, which is based on LLVM IR, which is SSA. Aliasing register tensors is at best a no-op, and at worst it would increase the compilation time of the `C++ -> NVVM IR` lowering in NVRTC. So we should focus only on the aliasing of shared memory and global memory.

Currently, `reuseMemoryAllocations` can only alias tensors with the same size, which does not work well for applications like matmul. In a matmul prologue, the shared memory tensors have sizes `cta_tile.M x cta_tile.K` and `cta_tile.N x cta_tile.K`. In the epilogue, the shared memory tensor has size `cta_tile.M x cta_tile.N`, which is typically different from the previous shared memory tensor sizes. We need a smarter algorithm to be able to reuse the prologue tensors' memory for the epilogue.

Reference implementation: csarofeen/pytorch#1979