Challenges and Potential Solutions for Implementing Bufferization of Triton/TritonGPU Tensors #659

Jokeren · 2022-09-15T05:12:32Z

Jokeren
Sep 15, 2022
Maintainer

Useful links

[1] The One Shot Bufferization framework and basic concepts:
https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf
[2] MLIR bufferization APIs:
https://mlir.llvm.org/docs/Bufferization/
[3] Custom bufferization:
https://github.com/Jokeren/triton/tree/keren/bufferization
[4] Custom analysis:
https://github.com/Jokeren/triton/tree/b341c3275a377387b7afc1bd4e8f945e69c3870f

Introduction

[1] and [2] define what bufferization is and how to use MLIR bufferization APIs. The main goal of bufferization is allocating and deallocating memory correctly and using as little memory as possible. Without bufferization analysis, we could still allocate memory as if every operation requests a new piece of memory. However, it is definitely wasteful especially under memory constraint circumstances, such as the scratch memory on the GPU.

There exist many optimizations to save memory consumption based on static analysis. The one shot bufferization pass focuses on the most primitive and important one–alias analysis. If an operation does not have the alias semantic, we can always allocate a new memory region. In addition, if an instruction’s result is always an alias of an operand, we could safely skip memory allocation. The operations that make this complicated can have the “potential alias” semantic that may or may not be eligible for an in place operation without any copy/allocation. In triton, we have the extract_slice and insert_slice_async operations to facilitate loop pipelining, which are very useful operations but impose challenges on bufferization.

Here’s an example from page 46 in [1], we must make a copy of %a on Line 2. Because we need to read the original %0 on Line 3, we cannot directly modify %0 on Line 2. This is the so-called the RAW conflict. More examples and principles can be found in [1].

Instead of inspecting gory details of APIs and concepts, let’s get a high level sketch of the one shot bufferization code and understand the gap between the current Triton/TritonGPU dialect and a working bufferization pass. Please note that we are referring to the LLVM-14.0 but not the latest code on Github. There have been significant changes on the bufferization related code since LLVM-14.0. For example, InitAllDialects.h in LLVM-16.0 by default registers all bufferization interfaces so that we can use the -one-shot-bufferize option with mlir-opt, which is not available in LLVM-14.0. More than that, plenty of code in the Bufferization Dialect.has been rewritten since LLVM-14.0.

Below is the interface to the bufferization analysis.

LogicalResult bufferization::runOneShotBufferize(
    Operation *op, std::unique_ptr<AnalysisBufferizationOptions> options) {
  AnalysisBufferizationState state(op, *options);
  if (failed(analyzeOp(op, state)))
    return failure();
  if (options->testAnalysisOnly)
    return success();
  return bufferizeOp(op, state);
}

The bufferization process is done with phases. The first phase (analyzeOp) just analyzes every OpOperands to check if it could or couldn’t be inplace. Using the testAnalysisOnly and printConflicts can generate annotations as shown in the figure above. The second phase (bufferizeOp) generates memref code (or others) based on BufferizableOpInterface callbacks. We observed challenges in both phases.

Overall the benefits of using the standard bufferization pass is to get away from our non-standard allocation policy, taking advantage of the existing infrastructure, and employing some optimization passes like hosting and remove allocations to optimize code.

Challenges

Analysis Phase

There are some inconveniences caused by the analysis phase, but they are not as critical as the issues involved in the bufferize phase. That is, I suppose we could get around these issues even without changing the analysis code (analyzeOp). The key challenge is that the default alias analysis has many constraints, at least in LLVM-14.0. It assumes that every operation must be written in the destination passing style, which simplifies the aliasing analysis problem but in other hands prohibits some features. In a nutshell, the destination passing style doesn’t allow a return-like operation (e.g., yield) to return a buffer allocated within the same region. In triton, however, we don’t enforce this constraint. Thankfully, we might be able to fully workaround by enabling allowReturnMemref=true to skip the default postmortem checks. To ensure that the inplace choices are still valid, we may add postmortem checks.

Bufferize Phase

Encoding Mismatch [critical]

The bufferize phase is designed to convert ops with tensor semantics to ops with memref semantics. Although triton tensors are TensorType, they are different from general tensors since they have encodings. The bufferize phase currently only works for tensors without encodings because there’s no encoding field available in the MemrefType though it shares the shape and elementType fields with the TensorType. Thus, we are not able to convert between a MemrefType and a TensorType without losing information.

%0 = memref.alloc() {alignment = 128 : i64} : memref<1x16x16xf16>
  %c0 = arith.constant 0 : index
  %1 = memref.alloc() {alignment = 128 : i64} : memref<16x16xf16>
  %2 = tt.broadcast %arg0 : (!tt.ptr<f16>) -> tensor<16x16x!tt.ptr<f16>, #triton_gpu.blocked<{sizePerThread = [1, 4], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>>
  %3 = tt.splat %arg1 : (i1) -> tensor<16x16xi1, #triton_gpu.blocked<{sizePerThread = [1, 4], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>>
  %4 = memref.subview %0[%c0, 0, 0] [1, 16, 16] [1, 1, 1] : memref<1x16x16xf16> to memref<16x16xf16, affine_map<(d0, d1)[s0] -> (d0 * 16 + s0 + d1)>>
  %5 = bufferization.to_tensor %4 : memref<16x16xf16, affine_map<(d0, d1)[s0] -> (d0 * 16 + s0 + d1)>>
  %6 = triton_gpu.convert_layout %5 : (tensor<16x16xf16>) -> tensor<16x16xf16>
  %7 = bufferization.to_tensor %1 : memref<16x16xf16>
  memref.tensor_store %6, %1 : memref<16x16xf16>
  %8 = triton_gpu.convert_layout %7 : (tensor<16x16xf16>) -> tensor<16x16xf16, #triton_gpu.blocked<{sizePerThread = [1, 4], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>>
  tt.store %2, %8, %3 {cache = 1 : i32, evict = 1 : i32, isOtherUnspecified = false, isVolatile = false} : tensor<16x16xf16, #triton_gpu.blocked<{sizePerThread = [1, 4], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>>
  return

Above is an example code generated by removing the encoding of some tensors. This trick is not general at all since it will make the Triton/TritonGPU dialect completely invalid. For instance, scf.for checks if the iteration args and yield args have the same type. I don’t have any good way to get around this issue. According to @goostavz, I suppose a solution for now might be to make a local llvm repo and make small patches on the code.

Missing Semantics [critical]

Triton tensors are always read-only according to @ptillet . However, after lowering to memref, we must support writable tensors (e.g., the destination pass style). And using memref’s existing ops may not suffice our demands. Here are two use cases.

%dst = insert_slice_async %src, %dst, %index
—------------------------------------------------------------
%dst_new = memref.alloc %dst_tensor_type
%dst_new_sub = memref.subview %dst_new, ...
memref.tensor_store %src, %dst_new_sub {async}

The above code is invalid. Remember that %src is a ptr tensor, while %dst_new_sub is a float/integer tensor. Because %src and %dst_new_sub’s element type do not match, the above code fails in the verification pass. One might be able to use a new op like triton_gpu.tensor_store, but then it violates the read-only principle.

%dst = convert_layout %src

Suppose both %src and %dst are blocked layouts, then we need a temporary buffer. Whereas triton doesn’t provide a way to pass an explicitly allocated tensor to an operation.

Bufferized code is difficult to debug

As we can see above, IRs with memref become lengthy. Sometimes an allocation is moved to the beginning of the block. We’d like the allocation to happen as close to the operation as possible, at least in the debugging mode. This is not a critical issue though. I’ll probably have to try more examples and get familiar with this style…

Allocation doesn’t associate with memory space

In LLVM-14, we cannot specify at which memory space to allocate memory, but it has been resolved in LLVM-16 with an option called defaultMemorySpace.

daadaada · 2022-09-15T06:04:46Z

daadaada
Sep 15, 2022

Do you know why doesn't MemrefType support encoding? Is it possible that this is an MLIR bug? I remember there are some other functions doesn't support encoding and I think those are MLIR bugs.

2 replies

Jokeren Sep 15, 2022
Maintainer Author

I have no idea at all...Phil suggested me chat with MLIR developers if possible.

ptillet Sep 15, 2022
Maintainer

I don't think this is an MLIR bug. Memrefs layout and Tensor encodings don't mean the same thing, so one cannot be implicitly converted to the other

daadaada · 2022-09-15T06:10:16Z

daadaada
Sep 15, 2022

Suppose both %src and %dst are blocked layouts, then we need a temporary buffer. Whereas triton doesn’t provide a way to pass an explicitly allocated tensor to an operation.

Do you think it's a good idea to pass a buffer to convert_layout as an argument?

4 replies

Jokeren Sep 15, 2022
Maintainer Author

We will have other ops that require a temporary buffer, such as the reduce op. I'm not sure what would be the most uniform way to address this problem.

daadaada Sep 15, 2022

I think one solution is to have a pass that inserts alloc before convert_layout and reduce.

Jokeren Sep 15, 2022
Maintainer Author

Yeah, sure, but then we will have to change alloc_layout and reduce's syntax to have the destination passing style.

ptillet Sep 15, 2022
Maintainer

I don't think we could do that in TritonGPU programs. There are no buffers in TritonGPU programs, only tensors.

ptillet · 2022-09-15T22:20:57Z

ptillet
Sep 15, 2022
Maintainer

Hey! Thanks for the write-up. I think I have a proposal idea that should address most of the issues, but before I dive in let me clarify a few things:

Triton tensors are always read-only according to @ptillet

This is not just Triton tensors but tensors in general. The MLIR docs specifies: "During Bufferization, we convert immutable value types (tensors) to mutable types (memref). This should also be the approach that we take.

One might be able to use a new op like triton_gpu.tensor_store, but then it violates the read-only principle.
My understanding of memref.tensor_store is that it consumes a tensor but writes into a memref, so it doesn't mutate any tensor.

Also, memrefs do support layout attribute, which specifies how the data should be laid out in memory:

memref-type ::= `memref` `<` dimension-list-ranked type
            	(`,` layout-specification)? (`,` memory-space)? `>`
layout-specification ::= attribute-value
memory-space ::= attribute-value

Of course this doesn't have the same semantics as our TritonGPU layouts (which should encode how data is distributed between threads, not how it's stored). But we could bufferize #shared tensors into memrefs that have a layout that represents row/col-major ordering and swizzling.

Now, I've spent some time thinking about the general problem of TritonGPU bufferization, and I came to the conclusion that we may be taking the problem the wrong way by trying to bufferize things while we still have distributed tensors around. I believe we should vectorize our distributed tensors before bufferizing it. Specifically, we want to get rid of distributed tensors, and jump down one level of abstraction so each MLIR program represents control flow running on a single CUDA thread.

It probably means introducing a new Parallelize pass that takes in our optimized TritonGPU program and transforms it into something that uses both the GPU dialect and the Tensor dialect (for the shared values). Once we're in that realm, I'm pretty sure we'll be able to use one-shot bufferize out-of-the-box.

This likely will require the introduction of a new dialect (not sure how to name it yet?) which would include utilities to retrieve thread indices from layouts (e.g., inverse layout functions). To give a very brief example, the following piece of dummy IR:

#layout1 = ... // not shared
#layout2 = ... // not shared


%0 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #layout1>
%1 = arith.max %0, 17 -> tensor<64xi32, #layout1>
%2 = tt.convert_layout %1 -> tensor<64xi32, #layout2>
%3 = arith.add %2, 10
%4 = tt.convert_layout %3 -> tensor<64xi32, #shared>

Could be converted to something like:

%tid   = new_dialect.thread_id // similar to gpu.thread_id, but we don't want multiple dimensions I think
// "%idx_0, %idx1_1" are the indices of %tid in the layout matrix of #layout1. They are vectors of width sizePerThread
%idx1_0, %idx1_1 = new_dialect.get_layout_idx %tid {layout = #layout1} 
%1_0  	= arith.max (%idx1_0, 17)
%1_1  	= arith.max (%idx1_1, 17)
%idx2_0, %idx2_1  = new_dialect.get_layout_idx %tid {layout = #layout2}
// the following line shuffles data around, so that thread of layout indices [%idx1_0, %idx1_1] get the value of  [%1_0, %1_1] owned by thread of layout indices [%idx2_0, %idx2_1]. Semantics can probably be refined, but the general idea should be ok?
%2_0, %2_1 = new_dialect.shuffle [%1_0, %1_1], [%idx1_0, %idx1_1], [%idx2_0, %idx2_1]
%3_0 = arith.add %2_0, 10
%3_1 = arith.add %2_1, 10

%shared = bufferization.alloc_tensor <64xi32> // probably should reside in a region where things aren’t parallel, otherwise it’s unclear what that statement means.

%4_0 = tensor.insert %3_0 into %shared[%idx2_0]
%4_1 = tensor.insert %3_1 into %shared[%idx2_1]

Then we could probably just one-shot bufferize this IR, and then do the LLVM codegen to convert the resulting program into LLVM.

Of course, this would also require a major refactor of the LLVM backend, so I think we shouldn’t try to merge such a change before the triton-mlir branch lands on master.

What do you think?

5 replies

Jokeren Sep 15, 2022
Maintainer Author

But we could bufferize #shared tensors into memrefs that have a layout that represents row/col-major ordering and swizzling.

Technically I think it's possible, which means we ought to convert encodings to layout maps. Since the Triton/TritonGPU dialects only accept tensors with encodings, I agree that we need to have an intermediate Dialect.

The new Dialect could be "TritonVector" I guess? It is somehow different from the standard Vector Dialect which only models single thread execution.

%4_0 = tensor.insert %3_0 into %shared[%idx2_0]
%4_1 = tensor.insert %3_1 into %shared[%idx2_1]

So tensor.insert refers to the insert operation in the standard Tensor Dialect? If so, I think bufferization won't be a hurdle anymore.

Jokeren Sep 15, 2022
Maintainer Author

Of course, this would also require a major refactor of the LLVM backend, so I think we shouldn’t try to merge such a change before the triton-mlir branch lands on master.

Yeah, I think we'll have to upgrade our LLVM dependency to LLVM-16 before merging. The key feature that specifies the memory space of memref is missing in LLVM-14.

daadaada Sep 16, 2022

This likely will require the introduction of a new dialect (not sure how to name it yet?) which would include utilities to retrieve thread indices from layouts (e.g., inverse layout functions).

I've been waiting for this for so long!

goostavz Sep 19, 2022
Collaborator

I think after we upgrade to LLVM-16 (I believe this is not very small effort, probably 1 person * week. Maybe also after triton-mlir branch lands on master?), we probably need to upgrade it periodically (monthly or bi-monthly maybe) in order to control the efforts in further upgrades.

ptillet Sep 19, 2022
Maintainer

@goostavz Yeah I totally agree with that. There are a lot of potential Triton users who work with llvm@head, so staying up-to-date will be quite important.

ptillet · 2022-09-15T23:08:46Z

ptillet
Sep 15, 2022
Maintainer

So tensor.insert refers to the insert operation in the standard Tensor Dialect? If so, I think bufferization won't be a hurdle anymore.

Yes for insert, but because the tensor dialect doesn't have an insert_slice_async instruction we'd still need triton_gpu.insert_slice_async unless we can efficiently model it with the async dialect (which I don't think we can since the barriers don't have the same semantics)

1 reply

Jokeren Sep 15, 2022
Maintainer Author

Maybe we could add an "async" attribute to the Tensor Dialect? It won't affect bufferization since the attribute is added to operations but not tensors. We are still going to generate LLVM code the same way as triton_gpu.insert_slice_async.

ptillet · 2022-09-15T23:16:03Z

ptillet
Sep 15, 2022
Maintainer

The new Dialect could be "TritonVector" I guess? It is somehow different from the standard Vector Dialect which only models single thread execution.

Yeah I think that's a good name, but maybe TritonDistributedVector would be better? I'm hoping at some point in the future, when the dialects are more stable, we'll be able to give them names that are more descriptive than "Triton" (:

1 reply

Jokeren Sep 15, 2022
Maintainer Author

Sure, "TritonDistributedVector" seems like a good name to me.

goostavz · 2022-09-19T02:24:26Z

goostavz
Sep 19, 2022
Collaborator

I think @ptillet 's proposal is essentially postponing the time point of bufferization a further step, into the middle of the "backend", so that we could have the semantic of buffer that is pure enough for the general bufferization pass in MLIR. This should also improve another 'problem' (I personally think this is a disadvantage of the current pass pipeline), that the lowering of tritongpu -> llvm is too direct and abrupt, making it hard to debug/analysis in case of a codegen bug or performance issue.
If we are only to solve the problem to reuse the MLIR's bufferization, i think probably we may have other more light-weighted patches, for example making patches on LLVM or proposing CRs to LLVM (it's quite natural for MemRef/Bufferization pass to support with customized attributes, isn't it?). But I think it's more meaningful if we could have an intermediate logical layer in scalar/vector level in the backend during the process of codegen. So generally, I'm personally positive to this proposal in the longer term.

BTW, I think convert_layout(blocked -> blocked) may well be regarded as convert_layout(blocked->shared) + convert_layout(shared->blocked). (of course it's a different kind of shared layout). In this way, new_dialect.shuffle could well be replaced with tensor.insert + barrier + tensor.extract. This should save us the redundant complexity in dealing with scratch buffer allocation.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Challenges and Potential Solutions for Implementing Bufferization of Triton/TritonGPU Tensors #659

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 13 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Challenges and Potential Solutions for Implementing Bufferization of Triton/TritonGPU Tensors #659

Jokeren Sep 15, 2022 Maintainer

Useful links

Introduction

Challenges

Analysis Phase

Bufferize Phase

Replies: 6 comments · 13 replies

daadaada Sep 15, 2022

Jokeren Sep 15, 2022 Maintainer Author

ptillet Sep 15, 2022 Maintainer

daadaada Sep 15, 2022

Jokeren Sep 15, 2022 Maintainer Author

daadaada Sep 15, 2022

Jokeren Sep 15, 2022 Maintainer Author

ptillet Sep 15, 2022 Maintainer

ptillet Sep 15, 2022 Maintainer

Jokeren Sep 15, 2022 Maintainer Author

Jokeren Sep 15, 2022 Maintainer Author

daadaada Sep 16, 2022

goostavz Sep 19, 2022 Collaborator

ptillet Sep 19, 2022 Maintainer

ptillet Sep 15, 2022 Maintainer

Jokeren Sep 15, 2022 Maintainer Author

ptillet Sep 15, 2022 Maintainer

Jokeren Sep 15, 2022 Maintainer Author

goostavz Sep 19, 2022 Collaborator

Jokeren
Sep 15, 2022
Maintainer

Replies: 6 comments 13 replies

daadaada
Sep 15, 2022

Jokeren Sep 15, 2022
Maintainer Author

ptillet Sep 15, 2022
Maintainer

daadaada
Sep 15, 2022

Jokeren Sep 15, 2022
Maintainer Author

Jokeren Sep 15, 2022
Maintainer Author

ptillet Sep 15, 2022
Maintainer

ptillet
Sep 15, 2022
Maintainer

Jokeren Sep 15, 2022
Maintainer Author

Jokeren Sep 15, 2022
Maintainer Author

goostavz Sep 19, 2022
Collaborator

ptillet Sep 19, 2022
Maintainer

ptillet
Sep 15, 2022
Maintainer

Jokeren Sep 15, 2022
Maintainer Author

ptillet
Sep 15, 2022
Maintainer

Jokeren Sep 15, 2022
Maintainer Author

goostavz
Sep 19, 2022
Collaborator