Challenges and Potential Solutions for Implementing Bufferization of Triton/TritonGPU Tensors #659
Replies: 6 comments 13 replies
-
Do you know why doesn't
-
Do you think it's a good idea to pass a buffer to
-
Hey! Thanks for the write-up. I think I have a proposal idea that should address most of the issues, but before I dive in let me clarify a few things:
This applies not just to Triton tensors but to tensors in general. The MLIR docs specify: "During bufferization, we convert immutable value types (tensors) to mutable types (memref)." This should also be the approach that we take.
Also, memrefs do support a layout attribute, which specifies how the data is laid out in memory:
Of course, this doesn't have the same semantics as our TritonGPU layouts (which encode how data is distributed between threads, not how it's stored). But we could bufferize #shared tensors into memrefs whose layout represents row/col-major ordering and swizzling.

Now, I've spent some time thinking about the general problem of TritonGPU bufferization, and I came to the conclusion that we may be taking the problem the wrong way by trying to bufferize things while we still have distributed tensors around. I believe we should vectorize our distributed tensors before bufferizing them. Specifically, we want to get rid of distributed tensors and jump down one level of abstraction, so that each MLIR program represents control flow running on a single CUDA thread. This will likely require the introduction of a new dialect (not sure how to name it yet), which would include utilities to retrieve thread indices from layouts (e.g., inverse layout functions). To give a very brief example, the following piece of dummy IR:
Could be converted to something like:
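The IR snippets in the original post did not survive formatting. A hypothetical sketch of the kind of rewrite being proposed (all op names, the `new_dialect` prefix, and the types are invented for illustration):

```mlir
// Before (dummy IR): a load producing a distributed tensor.
//   %x = tt.load %ptrs : tensor<256xf32, #blocked>
//
// After vectorization, the program models a single CUDA thread. The
// per-thread fragment becomes a plain vector, and the element offsets are
// recovered from the layout via a hypothetical inverse-layout utility:
%tid  = new_dialect.thread_id : index
%offs = new_dialect.layout_offsets %tid {layout = #blocked} : vector<8xindex>
%x    = new_dialect.masked_load %base[%offs] : vector<8xf32>
```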
Then we could probably just one-shot bufferize this IR, and then do the LLVM codegen to convert the resulting program into LLVM. Of course, this would also require a major refactor of the LLVM backend, so I think we shouldn't try to merge such a change before the

What do you think?
-
Yes for insert, but because the tensor dialect doesn't have an
-
Yeah, I think that's a good name, but maybe TritonDistributedVector would be better? I'm hoping at some point in the future, when the dialects are more stable, we'll be able to give them names that are more descriptive than "Triton" (:
-
I think @ptillet's proposal essentially postpones bufferization one step further, into the middle of the "backend", so that we have buffer semantics pure enough for the general bufferization pass in MLIR. This should also alleviate another problem (I personally consider this a disadvantage of the current pass pipeline): the lowering from tritongpu -> llvm is too direct and abrupt, making it hard to debug/analyze in case of a codegen bug or performance issue.

BTW, I think convert_layout(blocked -> blocked) may well be regarded as convert_layout(blocked -> shared) + convert_layout(shared -> blocked) (of course, with a different kind of shared layout). In this way, new_dialect.shuffle could well be replaced with tensor.insert + barrier + tensor.extract. This would save us the redundant complexity of dealing with scratch buffer allocation.
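The decomposition described above could look roughly like the following sketch (the layout names and the exact `convert_layout` syntax are illustrative, not taken from the actual dialect definition):

```mlir
// Hypothetical: split a blocked->blocked layout conversion into a round
// trip through shared memory, using a layout-specific #shared encoding:
%s = triton_gpu.convert_layout %x
       : tensor<128x128xf32, #blocked0> -> tensor<128x128xf32, #shared>
%y = triton_gpu.convert_layout %s
       : tensor<128x128xf32, #shared> -> tensor<128x128xf32, #blocked1>
```

With this formulation, the shared-memory staging buffer is just another tensor value, so the standard bufferization machinery can allocate it rather than a custom scratch-buffer pass.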
-
Useful links
[1] The One Shot Bufferization framework and basic concepts:
https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf
[2] MLIR bufferization APIs:
https://mlir.llvm.org/docs/Bufferization/
[3] Custom bufferization:
https://github.com/Jokeren/triton/tree/keren/bufferization
[4] Custom analysis:
https://github.com/Jokeren/triton/tree/b341c3275a377387b7afc1bd4e8f945e69c3870f
Introduction
[1] and [2] define what bufferization is and how to use the MLIR bufferization APIs. The main goal of bufferization is to allocate and deallocate memory correctly while using as little memory as possible. Without bufferization analysis, we could still allocate memory as if every operation requested a new piece of memory. However, that is wasteful, especially under memory-constrained circumstances such as scratch memory on the GPU.
There exist many optimizations to reduce memory consumption based on static analysis. The one-shot bufferization pass focuses on the most primitive and important one: alias analysis. If an operation does not have aliasing semantics, we can always allocate a new memory region. In addition, if an instruction's result is always an alias of an operand, we can safely skip memory allocation. What complicates matters are operations with "potential alias" semantics, which may or may not be eligible for an in-place update without any copy/allocation. In Triton, we have the `extract_slice` and `insert_slice_async` operations to facilitate loop pipelining; they are very useful but impose challenges on bufferization.

Here's an example from page 46 in [1]: we must make a copy of `%a` on Line 2. Because we need to read the original `%0` on Line 3, we cannot directly modify `%0` on Line 2. This is the so-called RAW conflict. More examples and principles can be found in [1].

Instead of inspecting the gory details of the APIs and concepts, let's get a high-level sketch of the one-shot bufferization code and understand the gap between the current Triton/TritonGPU dialect and a working bufferization pass. Please note that we are referring to LLVM 14.0, not the latest code on GitHub; there have been significant changes to the bufferization-related code since LLVM 14.0. For example, `InitAllDialects.h` in LLVM 16.0 registers all bufferization interfaces by default, so that we can use the `-one-shot-bufferize` option with `mlir-opt`, which is not available in LLVM 14.0. More than that, plenty of code in the Bufferization dialect has been rewritten since LLVM 14.0.

Below is the interface to the bufferization analysis.
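The interface snippet referenced here did not survive formatting. Roughly, the LLVM 14-era `BufferizableOpInterface` exposes query methods along the following lines (a paraphrased sketch, not exact signatures, which have changed across LLVM versions):

```cpp
// Paraphrased sketch of BufferizableOpInterface (LLVM 14 era; exact
// signatures differ between releases):
struct BufferizableOpInterface {
  // Does this operand read from its (future) buffer?
  bool bufferizesToMemoryRead(OpOperand &opOperand, const AnalysisState &state);
  // Does this operand write to its (future) buffer?
  bool bufferizesToMemoryWrite(OpOperand &opOperand, const AnalysisState &state);
  // Which results may alias this operand's buffer?
  SmallVector<OpResult> getAliasingOpResult(OpOperand &opOperand,
                                            const AnalysisState &state);
  // Rewrite the op from tensor semantics to memref semantics.
  LogicalResult bufferize(RewriterBase &rewriter, const BufferizationState &state);
};
```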
The bufferization process is done in two phases. The first phase (`analyzeOp`) analyzes every OpOperand to check whether it can be bufferized in place. Using the `testAnalysisOnly` and `printConflicts` options, we can generate annotations as shown in the figure above. The second phase (`bufferizeOp`) generates memref code (or other forms) based on BufferizableOpInterface callbacks. We observed challenges in both phases.

Overall, the benefits of using the standard bufferization pass are getting away from our non-standard allocation policy, taking advantage of the existing infrastructure, and employing optimization passes such as hoisting and allocation removal to optimize the code.
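To make the RAW conflict discussed earlier concrete, here is a small hypothetical example (not the exact snippet from [1]; the values and indices are invented):

```mlir
// %0 is still read after a would-be in-place write, so the analysis must
// insert a copy rather than update %0's buffer in place:
%0 = tensor.extract_slice %a[0] [4] [1] : tensor<8xf32> to tensor<4xf32>
%1 = tensor.insert %f into %0[%c0] : tensor<4xf32>  // wants to write in place
%2 = tensor.extract %0[%c1] : tensor<4xf32>         // still reads original %0
```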
Challenges
Analysis Phase
There are some inconveniences caused by the analysis phase, but they are not as critical as the issues involved in the bufferize phase. That is, I suppose we could get around these issues even without changing the analysis code (`analyzeOp`). The key challenge is that the default alias analysis has many constraints, at least in LLVM 14.0. It assumes that every operation is written in destination-passing style, which simplifies the aliasing analysis problem but, on the other hand, prohibits some features. In a nutshell, destination-passing style doesn't allow a return-like operation (e.g., `yield`) to return a buffer allocated within the same region. In Triton, however, we don't enforce this constraint. Thankfully, we might be able to work around this entirely by enabling `allowReturnMemref=true` to skip the default postmortem checks. To ensure that the in-place choices are still valid, we may add our own postmortem checks.

Bufferize Phase
The bufferize phase is designed to convert ops with tensor semantics to ops with memref semantics. Although Triton tensors are TensorType, they differ from general tensors in that they carry encodings. The bufferize phase currently only works for tensors without encodings: MemrefType shares the shape and elementType fields with TensorType, but it has no encoding field. Thus, we cannot convert between a MemrefType and a TensorType without losing information.
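For illustration (the op, encoding parameters, and shapes below are hypothetical), the layout encoding on a TritonGPU tensor has no counterpart on the memref side:

```mlir
// A TritonGPU tensor carries a layout encoding as part of its type...
%t = "some.op"() : () -> tensor<128x64xf32, #blocked>
// ...but MemRefType has no encoding field, so a naive conversion would
// have to drop the layout entirely:
%m = bufferization.to_memref %t : memref<128x64xf32>
```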
Above is example code generated by removing the encoding from some tensors. This trick is not general at all, since it makes the Triton/TritonGPU dialect completely invalid. For instance, `scf.for` checks that the iteration args and yield args have the same type. I don't have a good way to get around this issue. According to @goostavz, a solution for now might be to keep a local llvm repo and apply small patches to the code.

Triton tensors are always read-only, according to @ptillet. However, after lowering to memref, we must support writable tensors (i.e., destination-passing style), and memref's existing ops may not meet our needs. Here are two use cases.
The above code is invalid. Remember that `%src` is a ptr tensor, while `%dst_new_sub` is a float/integer tensor. Because the element types of `%src` and `%dst_new_sub` do not match, the above code fails in the verification pass. One might be able to use a new op like `triton_gpu.tensor_store`, but that would violate the read-only principle.

Suppose both `%src` and `%dst` have blocked layouts; then we need a temporary buffer, whereas Triton doesn't provide a way to pass an explicitly allocated tensor to an operation.

As we can see above, IRs with memref become lengthy. Sometimes an allocation is moved to the beginning of the block; we'd like the allocation to happen as close to the operation as possible, at least in debug mode. This is not a critical issue, though. I'll probably have to try more examples and get familiar with this style…
In LLVM 14, we cannot specify the memory space in which to allocate, but this has been resolved in LLVM 16 with an option called `defaultMemorySpace`.
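In recent LLVM versions this is a field on the one-shot bufferization options; setting it looks something like the following (the attribute construction is illustrative and not verified against a specific release):

```cpp
// Sketch: ask the bufferizer to place unannotated allocations in a given
// memory space (3 could model CUDA shared memory; the numbering is an
// assumption). Exact field types differ between LLVM releases.
bufferization::OneShotBufferizationOptions options;
options.defaultMemorySpace =
    IntegerAttr::get(IntegerType::get(ctx, 64), /*memorySpace=*/3);
```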