
Compilation flags for dispatch formation for deeplabv3 #515

Open
MaheshRavishankar opened this issue Jul 8, 2024 · 7 comments

Compiling the DeeplabV3 i8 model with the default options might not create efficient dispatches. Some flags to try:

To start with, we need --iree-flow-enable-aggressive-fusion --iree-opt-data-tiling=off

To fuse padding with consumer convolutions, we would need to add --iree-flow-enable-fuse-padding-into-linalg-consumer-ops

To enable conversion of NCHW convolutions to NHWC, we would need:

  • --iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline)
  • To enable transpose propagation to move these transposes away, we will need --iree-global-opt-propagate-transposes=true --iree-opt-aggressively-propagate-transposes=true
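
Putting these together, here is a hedged sketch of what a full invocation could look like. The input/output file names and the target-backend value are placeholders, not taken from this issue; the flag set is exactly the one listed above.

iree-compile deeplabv3_i8.mlir -o deeplabv3_i8.vmfb \
  --iree-hal-target-backends=<target> \
  --iree-flow-enable-aggressive-fusion \
  --iree-opt-data-tiling=off \
  --iree-flow-enable-fuse-padding-into-linalg-consumer-ops \
  --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline)" \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true

(The preprocessing pipeline string is quoted so the shell does not interpret the parentheses.)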
yzhang93 (Contributor) commented Jul 8, 2024

@MaheshRavishankar With these flags, it now generates 83 dispatches. Most dispatches look reasonable to me, but there are still 7 standalone transpose dispatches, like this:

builtin.module {
      func.func @tf2onnx$async_dispatch_4_transpose_32x67081_f32() {
        %c14795072 = arith.constant 14795072 : index
        %c0 = arith.constant 0 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c14795072) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<67081x32xf32>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<32x67081xf32>>
        %2 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [67081, 32], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<67081x32xf32>> -> tensor<67081x32xf32>
        %3 = tensor.empty() : tensor<32x67081xf32>
        %4 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%2 : tensor<67081x32xf32>) outs(%3 : tensor<32x67081xf32>) {
        ^bb0(%in: f32, %out: f32):
          linalg.yield %in : f32
        } -> tensor<32x67081xf32>
        flow.dispatch.tensor.store %4, %1, offsets = [0, 0], sizes = [32, 67081], strides = [1, 1] : tensor<32x67081xf32> -> !flow.dispatch.tensor<writeonly:tensor<32x67081xf32>>
        return
      }
    }

MaheshRavishankar (Collaborator, Author) commented:

Can you provide a dump of the IR after iree-flow-form-dispatch-regions? It will be easier to see what those transposes are.
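
For reference, one way to capture such a dump is MLIR's standard IR-printing flag, assuming it is exposed by iree-compile as usual. This is a sketch: the file names are placeholders, and the per-pass IR dump goes to stderr, hence the redirect.

iree-compile deeplabv3_i8.mlir -o /dev/null \
  --iree-flow-enable-aggressive-fusion --iree-opt-data-tiling=off \
  --mlir-print-ir-after=iree-flow-form-dispatch-regions \
  2> ir_after_form_dispatch_regions.mlir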

yzhang93 (Contributor) commented Jul 9, 2024

Can you provide a dump of the IR after iree-flow-form-dispatch-regions? It will be easier to see what those transposes are.

Yes, here is the IR dump: https://gist.github.com/yzhang93/456640440608e48550308bf87245523c

MaheshRavishankar (Collaborator, Author) commented Jul 9, 2024

There are some obvious things here that could help more:

%11 = flow.dispatch.region -> (tensor<1x32x259x259xf32>) {
    %237 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d2, d3, d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_112 : tensor<259x259x1x32xf32>) outs(%10 : tensor<1x32x259x259xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<1x32x259x259xf32>
    flow.return %237 : tensor<1x32x259x259xf32>
  }
  %12 = tensor.empty() : tensor<1x32x257x257xf32>
  %13 = linalg.fill ins(%cst_14 : f32) outs(%12 : tensor<1x32x257x257xf32>) -> tensor<1x32x257x257xf32>
  %14 = flow.dispatch.region -> (tensor<1x32x257x257xf32>) {
    %237 = linalg.depthwise_conv_2d_nchw_chw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%11, %cst_27 : tensor<1x32x259x259xf32>, tensor<32x3x3xf32>) outs(%13 : tensor<1x32x257x257xf32>) -> tensor<1x32x257x257xf32>
    flow.return %237 : tensor<1x32x257x257xf32>
  }

This probably requires the depthwise convs to also be converted to NHWC; then the transpose should fold away.
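
For illustration, here is a hedged sketch of the NHWC form that depthwise convolution could take after such a conversion; the %conv, %input, %filter, and %init values are placeholder names, not values from the dump.

%conv = linalg.depthwise_conv_2d_nhwc_hwc
          {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>}
          ins(%input, %filter : tensor<1x259x259x32xf32>, tensor<3x3x32xf32>)
          outs(%init : tensor<1x257x257x32xf32>) -> tensor<1x257x257x32xf32>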

%14 = flow.dispatch.region -> (tensor<1x32x257x257xf32>) {
    %237 = linalg.depthwise_conv_2d_nchw_chw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%11, %cst_27 : tensor<1x32x259x259xf32>, tensor<32x3x3xf32>) outs(%13 : tensor<1x32x257x257xf32>) -> tensor<1x32x257x257xf32>
    flow.return %237 : tensor<1x32x257x257xf32>
  }
  %15 = tensor.empty() : tensor<257x257x1x32xf32>
  %16 = flow.dispatch.region -> (tensor<257x257x1x32xf32>) {
    %237 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d2, d3, d0, d1)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%14 : tensor<1x32x257x257xf32>) outs(%15 : tensor<257x257x1x32xf32>) {
    ^bb0(%in: f32, %out: f32):
      %238 = arith.cmpf ult, %in, %cst_14 : f32
      %239 = arith.select %238, %cst_14, %in : f32
      %240 = arith.cmpf ugt, %239, %cst_2 : f32
      %241 = arith.select %240, %cst_2, %239 : f32
      linalg.yield %241 : f32
    } -> tensor<257x257x1x32xf32>
    flow.return %237 : tensor<257x257x1x32xf32>
  }

These two should be the same dispatch. I don't know why they aren't; something is going wrong with the iree-flow-form-dispatch-regions pass here.

  %52 = flow.dispatch.region -> (tensor<129x129x1x144xf32>) {
    %237 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d2, d3, d0, d1)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%50 : tensor<1x144x129x129xf32>) outs(%51 : tensor<129x129x1x144xf32>) {
    ^bb0(%in: f32, %out: f32):
      %238 = arith.cmpf ult, %in, %cst_14 : f32
      %239 = arith.select %238, %cst_14, %in : f32
      %240 = arith.cmpf ugt, %239, %cst_2 : f32
      %241 = arith.select %240, %cst_2, %239 : f32
      linalg.yield %241 : f32
    } -> tensor<129x129x1x144xf32>
    flow.return %237 : tensor<129x129x1x144xf32>
  }
  %collapsed_125 = tensor.collapse_shape %52 [[0], [1], [2, 3]] : tensor<129x129x1x144xf32> into tensor<129x129x144xf32>
  %expanded_126 = tensor.expand_shape %collapsed_125 [[0, 1], [2], [3]] output_shape [1, 129, 129, 144] : tensor<129x129x144xf32> into tensor<1x129x129x144xf32>
  %53 = flow.dispatch.region -> (tensor<1x129x129x144xf32>) {
    %237 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_126 : tensor<1x129x129x144xf32>) outs(%41 : tensor<1x129x129x144xf32>) {
    ^bb0(%in: f32, %out: f32):
      %238 = arith.divf %in, %cst_10 : f32
      %239 = math.round %238 : f32
      %240 = arith.addf %239, %cst_14 : f32
      %241 = arith.cmpf ult, %240, %cst_16 : f32
      %242 = arith.cmpf ugt, %240, %cst_15 : f32
      %243 = arith.select %241, %cst_16, %240 : f32
      %244 = arith.select %242, %cst_15, %243 : f32
      %245 = arith.fptosi %244 : f32 to i8
      %246 = arith.extsi %245 : i8 to i32
      %247 = arith.sitofp %246 : i32 to f32
      %248 = arith.mulf %247, %cst_10 : f32
      linalg.yield %248 : f32
    } -> tensor<1x129x129x144xf32>
    flow.return %237 : tensor<1x129x129x144xf32>
  }

The collapse_shape -> expand_shape should be folded away, but that isn't happening for some reason. If it were folded, the two dispatches would become one dispatch.

Fixing these three things should get us into much better shape.

yzhang93 (Contributor) commented:

@MaheshRavishankar I tried to pad the conv ops, and this is the IR afterwards: https://gist.github.com/yzhang93/d0b09b559800f74314eb2d95c0aa2b7d

Here I modified the code to pad not only the intrinsic dimensions (OW, OC, IC) but also the OH dimension. The OH dimension has to be padded in order to distribute the inputs evenly across the 4 AIE cores.
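
As a hedged illustration with made-up numbers (not taken from this model): an OH of 257 does not divide evenly by 4, but padding it to 260 lets each of the 4 cores take 65 output rows.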

After padding I noticed some issues

newling (Contributor) commented Jul 11, 2024

The collapse_shape -> expand_shape should be folded away, but that isn't happening for some reason. If it were folded, the two dispatches would become one dispatch.

@MaheshRavishankar Well, it's not just a simple fold; it goes from tensor<129x129x1x144xf32> to tensor<1x129x129x144xf32>. To get them to merge, there needs to be some bubbling up or pushing down of the elementwise operation through the reshape operations, I think. This kind of optimization can get complicated, because it isn't a local optimization. Is there a pass in IREE that is responsible for this? I'm thinking of some optimization where you switch between "bubble up" and "push down" phases, trying to get "math" ops and "reshape" ops to separate into isolated clusters.

This separation of conv and elementwise ops still appears in the quantized model.
