
[DISCUSS] Layout transformation in TIR graph #162

Closed
vinx13 opened this issue Jun 11, 2022 · 9 comments

Comments

@vinx13
Contributor

vinx13 commented Jun 11, 2022

As we start to work on specific hardware, many operators expect a specific layout for both data and weight, even though the layouts logically start out as simple ones. This thread discusses an example of how to handle layout transformation in a Relax-TIR setting.

The general idea is to lift layout transformations into the graph and cancel out pairs of pre-compute and post-compute transformations.

The same principle can also be applied to other cases like adding padding before layout transformation.

Example problem of layout transformation

@tvm.script.ir_module
class MyMod:
  @T.func
  def matmul(A: T.Buffer((128, 128)), B: T.Buffer((128, 128)), C: T.Buffer((128, 128))):
    for i, j, k in grid(128, 128, 128):
      with T.block("compute"):
        with T.init():
          C[i, j] = 0
        C[i, j] = C[i, j] + A[i, k] * B[k, j]

  @R.func
  def main(x: R.Tensor((128, 128))):
    w0 = R.const(shape=(128, 128))
    w1 = R.const(shape=(128, 128))
    lv0 = call_tir(matmul, [x, w0], (128, 128))
    lv1 = call_tir(matmul, [lv0, w1], (128, 128))

Consider the example program: the data layouts of A and C are the normal row-major layout.

Assume that, due to hardware or other restrictions, we need to convert to a different layout. Say the new layout is represented by the following mapping:

Aphysical[i // 8, j, i % 8] = A[i, j] for i, j in grid(128, 128)
Cphysical[i // 8, j, i % 8] = C[i, j] for i, j in grid(128, 128)
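
For concreteness, here is a small numpy sketch (illustrative only, not part of the proposal) of the same mapping:

import numpy as np

# A[i, j] with i = i0 * 8 + i1 corresponds to Aphysical[i0, j, i1].
A = np.random.rand(128, 128).astype("float32")
A_physical = A.reshape(16, 8, 128).transpose(0, 2, 1)  # shape (16, 128, 8)

i, j = 37, 5
assert A_physical[i // 8, j, i % 8] == A[i, j]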

For simplicity, let us assume that we do not change the layout of B (and the weights), but the same principle applies. Given this layout requirement, the first step is for the TIR function to express it through a transformation.

Step 0: PrimFunc transformation

In the first step, the PrimFunc is transformed to a program with three steps:

  • pre-compute layout transform (A-layout-convert): transforms the input data into the desirable layout
  • compute: the computation that leverages the new layout
  • post-compute layout transform (C-layout-convert): converts the layout back to the original one.
@tvm.script.ir_module
class MyMod:
  @T.func
  def matmul(A: T.Buffer((128, 128)), B: T.Buffer((128, 128)), C: T.Buffer((128, 128))):
    Aphysical = T.alloc_buffer((16, 128, 8))
    Cphysical = T.alloc_buffer((16, 128, 8))
    for i0, j, i1 in grid(16, 128, 8):
      with block("A-layout-convert"):
        Aphysical[i0, j, i1] = A[i0 * 8 + i1, j]

    for i0, j, i1, k in grid(16, 128, 8, 128):
      with block("compute"):
        with T.init():
          Cphysical[i0, j, i1] = 0
        Cphysical[i0, j, i1] = Cphysical[i0, j, i1] + Aphysical[i0, k, i1] * B[k, j]

    for i, j in grid(128, 128):
      with block("C-layout-convert"):
        C[i, j] = Cphysical[i // 8, j, i % 8]

  @R.func
  def main(x: R.Tensor((128, 128))):
    w0 = R.const(shape=(128, 128))
    w1 = R.const(shape=(128, 128))
    lv0 = call_tir(matmul, [x, w0], (128, 128))
    lv1 = call_tir(matmul, [lv0, w1], (128, 128))

Step 1: Lift Layout Convert into Graph

If we stop at step 0, the additional layout conversion brings extra cost and is sometimes infeasible if the memory does not support the layout natively. In this second step, we lift the layout conversion into the graph.

@tvm.script.ir_module
class MyMod:
  @T.func
  def pre_layout_convert(A: T.Buffer((128, 128)), Aphysical: T.Buffer((16, 128, 8))):
    for i0, j, i1 in grid(16, 128, 8):
      with block("A-layout-convert"):
        Aphysical[i0, j, i1] = A[i0 * 8 + i1, j]

  @T.func
  def post_layout_convert(Cphysical: T.Buffer((16, 128, 8)), C: T.Buffer((128, 128))):
    for i, j in grid(128, 128):
      with block("C-layout-convert"):
        C[i, j] = Cphysical[i // 8, j, i % 8]

  @T.func
  def matmul_physical(Aphysical: T.Buffer((16, 128, 8)), B: T.Buffer((128, 128)), Cphysical: T.Buffer((16, 128, 8))):
    for i0, j, i1, k in grid(16, 128, 8, 128):
      with block("compute"):
        with T.init():
          Cphysical[i0, j, i1] = 0
        Cphysical[i0, j, i1] = Cphysical[i0, j, i1] + Aphysical[i0, k, i1] * B[k, j]

  @R.func
  def main(x: R.Tensor((128, 128))):
    w0 = R.const(shape=(128, 128))
    w1 = R.const(shape=(128, 128))
    lv0 = call_tir(pre_layout_convert, [x], (16, 128, 8))
    lv1 = call_tir(matmul_physical, [lv0, w0], (16, 128, 8))
    lv2 = call_tir(post_layout_convert, [lv1], (128, 128))
    lv3 = call_tir(pre_layout_convert, [lv2], (16, 128, 8))
    lv4 = call_tir(matmul_physical, [lv3, w1], (16, 128, 8))
    lv5 = call_tir(post_layout_convert, [lv4], (128, 128))

The resulting program is shown above. Note that the layout conversions are lifted into the graph part, and matmul_physical now runs entirely in the desirable (physical) layout.

Step 2: Fold Layout conversion

The above step still leaves many layout conversions in the graph code. In this step, we run folding to eliminate them. Note that in the code segment above, post_layout_convert and pre_layout_convert cancel each other out and form an identity (this can be established by TIR analysis):

  lv2 = call_tir(post_layout_convert, [lv1], (128, 128))
  lv3 = call_tir(pre_layout_convert, [lv2], (16, 128, 8))
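
One way such a cancellation check could be implemented (a sketch only; no such pass exists yet) is to compose the two index expressions and ask TVM's arithmetic analyzer whether the round trip simplifies back to the identity:

import tvm
from tvm import tir

ana = tvm.arith.Analyzer()
i = tir.Var("i", "int32")

# post_layout_convert reads index (i // 8, j, i % 8); pre_layout_convert then
# rebuilds the row index as i0 * 8 + i1. Composing the two gives
# (i // 8) * 8 + i % 8, which must simplify back to i for the pair to fold.
roundtrip = ana.simplify((i // 8) * 8 + i % 8)
assert tvm.ir.structural_equal(roundtrip, i)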

So we can run folding; the final code becomes as follows.

@tvm.script.ir_module
class MyMod:
  @T.func
  def pre_layout_convert(A: T.Buffer((128, 128)), Aphysical: T.Buffer((16, 128, 8))):
    for i0, j, i1 in grid(16, 128, 8):
      with block("A-layout-convert"):
        Aphysical[i0, j, i1] = A[i0 * 8 + i1, j]

  @T.func
  def post_layout_convert(Cphysical: T.Buffer((16, 128, 8)), C: T.Buffer((128, 128))):
    for i, j in grid(128, 128):
      with block("C-layout-convert"):
        C[i, j] = Cphysical[i // 8, j, i % 8]

  @T.func
  def matmul_physical(Aphysical: T.Buffer((16, 128, 8)), B: T.Buffer((128, 128)), Cphysical: T.Buffer((16, 128, 8))):
    for i0, j, i1, k in grid(16, 128, 8, 128):
      with block("compute"):
        with T.init():
          Cphysical[i0, j, i1] = 0
        Cphysical[i0, j, i1] = Cphysical[i0, j, i1] + Aphysical[i0, k, i1] * B[k, j]

  @R.func
  def main(x: R.Tensor((128, 128))):
    w0 = R.const(shape=(128, 128))
    w1 = R.const(shape=(128, 128))
    lv0 = call_tir(pre_layout_convert, [x], (16, 128, 8))
    lv1 = call_tir(matmul_physical, [lv0, w0], (16, 128, 8))
    # NOTE how we directly take the output from the previous call, which is already in the desirable physical layout
    lv4 = call_tir(matmul_physical, [lv1, w1], (16, 128, 8))
    lv5 = call_tir(post_layout_convert, [lv4], (128, 128))

Importantly, if we have a long chain of matmuls, the final code becomes:

  @R.func
  def main(x: R.Tensor((128, 128))):
    ...
    # transform data to the desirable layout.
    lv0 = call_tir(pre_layout_convert, [x], (16, 128, 8))

    # Steady state computation, all in the correct physical layout
    lv1 = call_tir(matmul_physical, [lv0, w0], (16, 128, 8))
    lv2 = call_tir(matmul_physical, [lv1, w1], (16, 128, 8))
    lv3 = call_tir(matmul_physical, [lv2, w2], (16, 128, 8))
    lv4 = call_tir(matmul_physical, [lv3, w3], (16, 128, 8))
    ....

    # transform data back to the desired logical output layout
    lv5 = call_tir(post_layout_convert, [lv4], (128, 128))
    ...

Discussion and Remarks

There are several advantages of this method.

  • The interface of the PrimFunc remains unchanged in step 0; this preserves the property that TIR transforms do not change interface semantics.
  • This approach lets us express different layout semantics if needed (e.g. if we choose to give C a different layout, that choice is contained in the IR).
  • The layout conversions are always inserted correctly (at the beginning and the end), even when there are many complicated intermediate layouts.
  • We can also specify layout conversion on the weights; in this case the conversion can be constant-folded into the weights, as sketched below.
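
As a rough illustration of the last point (the weights keep their layout in the running example, so this assumes a hypothetical packed weight layout), the conversion amounts to a compile-time reshape of the constant:

import numpy as np

# The layout conversion applied to a constant weight is folded at compile time
# into a new constant with the physical shape; no runtime conversion remains.
w0 = np.random.rand(128, 128).astype("float32")
w0_physical = np.ascontiguousarray(w0.reshape(16, 8, 128).transpose(0, 2, 1))  # shape (16, 128, 8)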

The layout handling is useful for several use cases:

  • In meta-scheduling, where the tuner suggests one possible layout.
  • When we integrate a library that has a desired layout (e.g. NHWc), we can write the TIR with explicit pre and post layout conversions, which get canceled out, so we end up with the ideal computation and the correct intermediate layouts, as in the sketch below:
@T.func
def matmul(A: T.Buffer((128, 128)), B: T.Buffer((128, 128)), C: T.Buffer((128, 128))):
  Aphysical = T.alloc_buffer((16, 128, 8))
  Cphysical = T.alloc_buffer((16, 128, 8))
  for i0, j, i1 in grid(16, 128, 8):
    with block("A-layout-convert"):
      Aphysical[i0, j, i1] = A[i0 * 8 + i1, j]

  with block("compute"):
    T.call_extern("oneDNNkernel", Aphysical, B, Cphysical)

  for i, j in grid(128, 128):
    with block("C-layout-convert"):
      C[i, j] = Cphysical[i // 8, j, i % 8]
  • When we are handling specialized hardware (an NPU), we convert the layout into the desired physical layout supported by the NPU.

The same principle also applies to padding axes. Here are some additional examples.

Example Problem of layout padding

@tvm.script.ir_module
class MyMod:
  @T.func
  def add(A: T.Buffer((128, 127)), B: T.Buffer((128, 127)), C: T.Buffer((128, 127))):
    for i, j in grid(128, 127):
      with T.block("compute"):
        C[i, j] = A[i, j] + B[i, j]

  @R.func
  def main(x: R.Tensor((128, 127))):
    w0 = R.const(shape=(128, 127))
    w1 = R.const(shape=(128, 127))
    lv0 = call_tir(add, [x, w0], (128, 127))
    lv1 = call_tir(add, [lv0, w1], (128, 127))

Step 0: PrimFunc transformation

Suppose the hardware requires the input to be padded to a multiple of 128; this can be expressed in the PrimFunc:

@T.func
def add(A: T.Buffer((128, 127)), B: T.Buffer((128, 127)), C: T.Buffer((128, 127))):
  A_physical = T.alloc_buffer((128, 128))
  B_physical = T.alloc_buffer((128, 128))
  C_physical = T.alloc_buffer((128, 128))
  for i, j in T.grid(128, 128):
    with T.block("A-convert-layout"):
      A_physical[i, j] = T.if_then_else(i < 128 and j < 127, A[i, j], 0.)  # we can also pad with T.undef()

  for i, j in T.grid(128, 128):
    with T.block("B-convert-layout"):
      B_physical[i, j] = T.if_then_else(i < 128 and j < 127, B[i, j], 0.)  # we can also pad with T.undef()
  
  for i, j in T.grid(128, 128):
    with T.block("compute"):
      C_physical[i, j] = A_physical[i, j] + B_physical[i, j]
  
  for i, j in T.grid(128, 127):
    with T.block("C-convert-layout"):
      C[i, j] = C_physical[i, j]

Step 1: Lift padding and cropping to the graph

@tvm.script.ir_module
class MyMod:
  @T.func
  def pad(A: T.Buffer((128, 127)), Aphysical: T.Buffer((128, 128))):
    for i, j in T.grid(128, 128):
      with T.block("pad"):
        Aphysical[i, j] = T.if_then_else(i < 128 and j < 127, A[i, j], 0.)

  @T.func
  def crop(Cphysical: T.Buffer((128, 128)), C: T.Buffer((128, 127))):
    for i, j in T.grid(128, 127):
      with T.block("C-layout-convert"):
        C[i, j] = Cphysical[i, j]

  @T.func
  def add_physical(A: T.Buffer((128, 128)), B: T.Buffer((128, 128)), C: T.Buffer((128, 128))):
    for i, j in T.grid(128, 128):
      with T.block("compute"):
        C[i, j] = A[i, j] + B[i, j]

  @R.func
  def main(x: R.Tensor((128, 127))):
    w0 = R.const(shape=(128, 127))
    w1 = R.const(shape=(128, 127))
    lv0 = call_tir(pad, [x], (128, 128))
    lv1 = call_tir(pad, [w0], (128, 128))
    lv2 = call_tir(add_physical, [lv0, lv1], (128, 128))
      
    lv3 = call_tir(crop, [lv2], (128, 127))
    lv4 = call_tir(pad, [lv3], (128, 128))
      
    lv5 = call_tir(pad, [w1], (128, 128))
    lv6 = call_tir(add_physical, [lv4, lv5], (128, 128))
    lv7 = call_tir(crop, [lv6], (128, 127))

Step 2: Fold layout conversion

lv3 and lv4 are a pair of crop and pad with the same shape before padding. If the padding value in lv4 is the same as the value before cropping in lv3, or the padding value is T.undef(), they can be cancelled out.
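
A quick numpy sketch (illustrative only) of this cancellation condition: cropping followed by re-padding is an identity exactly when the pad value matches what was cropped away.

import numpy as np

# Physical buffer whose last column is the padding region, filled with zeros.
x_physical = np.zeros((128, 128), dtype="float32")
x_physical[:, :127] = np.random.rand(128, 127).astype("float32")

cropped = np.array(x_physical[:, :127])       # crop, as in lv3
repadded = np.pad(cropped, ((0, 0), (0, 1)))  # zero-pad, as in lv4
assert np.array_equal(repadded, x_physical)   # the crop/pad pair is an identity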

The conversions of w0 and w1 can also be folded at compile time; the result will be:

@R.func
def main(x: R.Tensor((128, 127))):
  w0_padded = R.const(shape=(128, 128))
  w1_padded = R.const(shape=(128, 128))
  lv0 = call_tir(pad, [x], (128, 128))
  lv1 = call_tir(add_physical, [lv0, w0_padded], (128, 128))
  lv2 = call_tir(add_physical, [lv1, w1_padded], (128, 128))
  lv3 = call_tir(crop, [lv2], (128, 127))

This ensures that each transformation step maintains the original semantics in the TIR and the Relax graph. If we are allowed to output the final result in the physical shape, with some undefined values in the padding region, the final crop (the post-compute layout conversion) can also be eliminated.

Padding on reduction dimensions

When padding is introduced on reduction dimensions, there are requirements on the padding value so that crop and pad can be cancelled. Although the padding value must ensure semantic correctness (e.g. pad_value = 0.0 for conv, -inf for max_pool), the values in the padding region of the output can still be arbitrary, as they will be cropped out anyway. In this case, we will need to insert hints about the padding value (which is usually an operator-specific property; for example, applying a conv filter to a padding region of value 0.0 produces output 0.0).
Here is an example of conv1d with padding=2 on both sides of the input.

@R.func
def main(X: R.Tensor[16]):
  F: R.Const[3]
  Y: R.Tensor[18] = conv1d(X, F, pad=2)
  Z: R.Tensor[20] = conv1d(Y, F, pad=2)

Suppose the conv1d physically requires the input length to be a multiple of 8, with the padding already explicitly inserted. After inserting padding and crop:

@R.func
def main(X: R.Tensor[16]):
  X_pad: R.Tensor[24] = pad(X, before=2, after=6)
  Y: R.Tensor[22] = conv1d(X_pad, F, pad=0)
  assert(Y[18:] == 0)
  Y_crop: R.Tensor[18] = crop(Y[0:18])
  Y_crop_pad: R.Tensor[24] = pad(Y_crop, before=2, after=4)
  Z: R.Tensor[22] = conv1d(Y_crop_pad, F, pad=0)
  Z_crop: R.Tensor[20] = crop(Z[0:20])

The assertion Y[18:] == 0 is needed as a hint so that the next padding can be simplified. In the general case of multiple convolutions, this may also require non-local transformations, such as propagating the padding to the beginning.
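
As a quick numerical check (illustrative only, using a random filter and no bias), the tail Y[18:] only ever sees the zero padding of X_pad, which is why the assertion holds:

import numpy as np

x = np.random.rand(16).astype("float32")
f = np.random.rand(3).astype("float32")

x_pad = np.pad(x, (2, 6))                # pad(X, before=2, after=6) -> length 24
y = np.convolve(x_pad, f, mode="valid")  # conv1d with pad=0 -> length 22
assert np.allclose(y[18:], 0.0)          # output tail sees only zero padding
assert np.allclose(y[:18], np.convolve(np.pad(x, (2, 2)), f, mode="valid"))  # matches Y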

cc @YuchenJin @Hzfengsy @jinhongyii @sunggg @junrushao1994 @tqchen

@psrivas2
Contributor

> pre_layout_convert and post_layout_convert cancel each other out and form an identity (this can be established by TIR analysis)

cc @junrushao @vinx13
Does this analysis currently exist in TIR, or is it yet to be implemented? Can you comment on how difficult it would be to do?

@vinx13
Contributor Author

vinx13 commented Aug 23, 2022

It doesn't exist yet. It will use arithmetic analysis, or some annotations as hints. Simple layout conversions like reordering / packing axes should be well supported, because they are basically affine transformations.
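
For reference, here is a sketch of how the A-packing from the original example could be expressed with the tir.Schedule.transform_layout primitive (assuming a recent TVM; note this rewrites the buffer layout directly rather than inserting the explicit conversion blocks of step 0):

import tvm
from tvm import tir

# Assuming MyMod is the module from the original post.
sch = tir.Schedule(MyMod["matmul"])
block = sch.get_block("compute")
# Assuming A is the first read buffer of the compute block; pack it as [i // 8, j, i % 8].
sch.transform_layout(block, ("read", 0), lambda i, j: [i // 8, j, i % 8])
print(sch.mod.script())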

@psrivas2
Contributor

Thanks @vinx13 for the response! I have a few follow up questions too :)

Given the discussion, it seems most people are in agreement that we need to flow/merge layout constraints through other ops. This means that, given the following graph:

(pre_layout_convert) -> matmul -> (post_layout_convert) -> add -> (pre_layout_convert) -> matmul -> (post_layout_convert)

we should be able to flow the post_layout_convert of matmul down past add. (It could also go the other way, but that is not relevant to the discussion yet.) So the graph after merging the layout constraint with add would be:

(pre_layout_convert) -> matmul -> add -> (post_layout_convert) -> (pre_layout_convert) -> matmul -> (post_layout_convert)
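
A small numpy sketch (illustrative only) of why an elementwise op such as add lets the conversion flow past it: a pure data-movement layout transform commutes with elementwise computation.

import numpy as np

def to_physical(x):  # the [i // 8, j, i % 8] packing used in the proposal
    return x.reshape(16, 8, 128).transpose(0, 2, 1)

a = np.random.rand(128, 128).astype("float32")
b = np.random.rand(128, 128).astype("float32")

# add-then-transform equals transform-then-add, so post_layout_convert can move
# from before add to after it without changing the result.
assert np.allclose(to_physical(a + b), to_physical(a) + to_physical(b))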

I think such transformations would eventually allow us to fold (post_layout_convert) -> (pre_layout_convert) into a no-op. For this to happen, TIR should be able to do the following:

  • Check "if a TIR PrimFunc is layout agnostic". For example, element-wise ops are layout agnostic. Similarly Relay identifies other ops as layout agnostic. Can TIR PrimFunc analysis do this?
  • Figure out "how a TIR PrimFunc should change to accommodate the new layout", i.e., if instead of add we were flowing the layout across concatenate or pad, how would that affect the schedule? This seems really hard to do at the TIR level. At the operator level we can pick a different TOPI schedule for the op, but at the TIR level this change is not obvious.

@sunggg
Collaborator

sunggg commented Aug 24, 2022

Thank you for the great proposal, @vinx13! I also have a question.
Since the layout rewrite happens purely at the TIR level, IIUC, these two paths might produce different PrimFuncs:

  • P1: topi.cuda.conv2d_NCHW → emitTE→TIR layout transform to NHWC
  • P2: topi.cuda.conv2d_NHWC → emitTE

If so, do we have some sort of guarantee that the performance of P1 and P2 will match each other? Otherwise, we may need to be smart about picking the right path in different situations.

@vinx13
Contributor Author

vinx13 commented Aug 24, 2022

Thanks for the discussion.
@psrivas2 There is no analysis to check whether a PrimFunc is layout agnostic. After graph op lowering, there is no such layout information in the PrimFunc (unless we add annotations). Instead, in a TIR PrimFunc we can check whether it is layout-unconstrained during scheduling. For example, if (auto-)scheduling doesn't apply tensor intrinsics to some ops, those ops are not constrained. For top-down conversions like relay.convert_layout, there is indeed a loss of information after lowering to TIR that makes it hard to tell whether an op is layout agnostic.
@sunggg There are indeed two different PrimFuncs. P1 contains pre and post layout conversions that will need to be optimized away (if possible). For the computation part, it is feasible to make sure that the conv2d NHWC generated from conv2d NCHW is equivalent to a directly written conv2d NHWC.

@psrivas2
Contributor

Ah interesting. If I am interpreting your comment correctly, the computation part for P1 & P2 is equivalent (can be made equivalent?) modulo post/pre layout conversions.

@vinx13
Contributor Author

vinx13 commented Aug 25, 2022

@psrivas2 exactly, it can be made equivalent

@csullivan
Collaborator

csullivan commented Sep 2, 2022

Many many thanks for the great RFC and discussions everyone! I wanted to initiate a discussion around the hoisting and splitting step involved when padding is present.

When splitting / hoisting out padding and cropping transformations, in order to preserve the ability to simplify between split producer and consumer primfuncs, in some cases we should expect to leave behind assumptions to preserve the local information that becomes non-local after splitting. Let us take the following TIR as an example. Prior to hoisting,

@ir_module
class BeforeHoist:
    @R.func
    def main():
        R.call_tir(func)

    @T.prim_func
    def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
        AC = T.alloc_buffer([4, 4], "int32")

        for io, ii in T.grid(4, 4):
            with T.block():
                T.block_attr("preproc", "pad")
                AC[io, ii] = if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

        BC = T.alloc_buffer([4, 4], "int32")
        for i, j in T.grid(4, 4):
            BC[i, j] = 2 * AC[i, j]

        for i in T.serial(14):
            with T.block():
                T.block_attr("postproc", ["crop", 0])
                B[i] = BC[i // 4, i % 4]

Above we have a padding step that allows the inner compute statement to operate on a padded space. When splitting, we can leave behind the now non-local assumptions that could help simplify the inner compute. In this case the inner compute is already simple, but the example can still help the discussion. Consider hoisting the padding loopnest out from the rest of func:

@ir_module
class AfterHoistOfPadStage:
    @R.func
    def main():
        R.call_tir(pad)
        R.call_tir(func)

    @T.prim_func
    def pad(A: T.Buffer[14, "int32"], AC: T.Buffer([4, 4], "int32")):
        for io, ii in T.grid(4, 4):
            with T.block():
                T.block_attr("preproc", "pad")
                AC[io, ii] = if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    @T.prim_func
    def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
        for io, ii in T.grid(4, 4):
            T.assume(4 * io + ii < 14 or AC[io, ii] == 0)

        BC = T.alloc_buffer([4, 4], "int32")
        for i, j in T.grid(4, 4):
            BC[i, j] = 2 * AC[i, j]

        for i in T.serial(14):
            with T.block():
                T.block_attr("postproc", ["crop", 0])
                B[i] = BC[i // 4, i % 4]

Here the choice was made to leave behind the assumption T.assume(4 * io + ii < 14 or AC[io, ii] == 0), which can be used to simplify the subsequent compute loopnest. This is especially helpful if the inner compute performs a reduction. In general there are choices about which assumptions to leave in the producer and the consumer, depending on the operations each performs. Here are the options the layout planner could have taken when hoisting the above example:

O1. if_then_else(4 * io + ii < 14, A[4 * io + ii], 0) in pad,
T.assume(4 * io + ii < 14 or AC[io, ii] == 0) in func.

  • This option for assumptions is what is demonstrated above.

  • Maximum flexibility to the consumer. Minimal flexibility to the producer. Consumer may use the assumption of 0 if and as needed. Producer may not change the padded value away from being zero.

  • This is most useful for a sequence of convolutions/reductions. The consumer needs to know the locations of zero values. For an example of this option of assumptions applied to a conv1d, see here.

O2. if_then_else(4 * io + ii < 14, A[4 * io + ii], T.undef()) in pad,
T.assume(4 * io + ii < 14 or AC[io, ii] == T.undef()) in func.

  • Medium flexibility to the consumer. Medium flexibility to the producer. Consumer may read from the padded values, but has no information about what is in those locations. Producer must write a value to the padding to prevent uninitialized reads, but may write any value at all.

  • This is most useful for padding of elementwise operations. See here for a detailed example using the above primfunc as a starting point and option O2 for the assumptions applied.

O3. if 4 * io + ii < 14: AC[io,ii] = A[4 * io + ii] in pad,
T.evaluate(0) in func.

  • Minimal flexibility to the consumer. Maximal flexibility to the producer. Consumer may not read from the padded values, as they may not even be initialized (undefined behavior in CodegenC). Producer may write any value to the padding, or may leave it uninitialized.

Given that these options constitute a degree of freedom, for the purpose of staging the effort we could consider focusing the layout planner's initial work on leaving the assumptions proposed in O1, as they will be the most immediately applicable to convolutions and other contraction-based operations where it is desirable to operate in a padded and block-transformed data space.
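
As a rough numpy illustration (not the actual planner) of the O1 contract: because the padded region is guaranteed to be zero, a reduction consumer can operate on the full padded buffer without any predicate.

import numpy as np

A = np.random.rand(14).astype("float32")
AC = np.zeros((4, 4), dtype="float32")
for io in range(4):
    for ii in range(4):
        if 4 * io + ii < 14:
            AC[io, ii] = A[4 * io + ii]   # pad stage: zeros remain in the padded tail

# Under O1 the consumer may rely on those zeros: a full reduction over AC needs
# no 4 * io + ii < 14 predicate, since the padded elements contribute 0.
assert np.isclose(AC.sum(), A.sum())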

Please let us know your thoughts on this aspect of hoisting for the layout planner. cc @Lunderberg who is fully paged in on this topic. Thanks !!

@tqchen
Contributor

tqchen commented Jan 6, 2023

See #277

@tqchen tqchen closed this as completed Jan 6, 2023