[ARITH] Improve Canonical Simplification to Handle Fused Pattern #1711

ke1337 · 2018-09-13T06:12:25Z

I have the following C++ code for testing TVM with CUDA:

    tvm::Array<tvm::Expr> a_shape1 = {5, 6};
    // tvm::Array<tvm::Expr> a_shape1 = {5, 2, 3};
    tvm::Tensor tvm_X = tvm::placeholder(a_shape1, tvm::Float(32), "A");
    tvm::Tensor tvm_Y = ::topi::where(less(0, tvm_X), 1 / (1 + exp(negative(tvm_X))), 1 - 1 / (1 + exp(tvm_X)));
    auto target1 = tvm::target::cuda();
    auto S1 = topi::cuda::schedule_injective(target1, {tvm_Y});

    auto args1 = tvm::Array<tvm::Tensor>({tvm_X, tvm_Y});
    std::unordered_map<tvm::Tensor, tvm::Buffer> binds1;
    auto config1 = tvm::build_config();
    config1->restricted_func = true;
    auto lowered1 = tvm::lower(S1, args1, "Sigmoid", binds1, config1);

    std::cout << lowered1[0]->body << std::endl;

When the input shape is 2D (5, 6), the lowered function looks close to a handwritten kernel:

  if ((threadIdx.x < 30)) {
    tensor[threadIdx.x] = tvm_if_then_else(((0.000000f < A[threadIdx.x]) == (uint1)0), (1.000000f - (1.000000f/(exp(A[threadIdx.x]) + 1.000000f))), (1.000000f/(exp((0.000000f - A[threadIdx.x])) + 1.000000f)))
  }

However, for the 3D input shape (5, 2, 3), the lowered function looks different:

  if ((threadIdx.x < 30)) {
    tensor[(((threadIdx.x/6)*6) + ((((threadIdx.x/3) % 2)*3) + (threadIdx.x % 3)))] = tvm_if_then_else(((0.000000f < A[(((threadIdx.x/6)*6) + ((((threadIdx.x/3) % 2)*3) + (threadIdx.x % 3)))]) == (uint1)0), (1.000000f - (1.000000f/(exp(A[(((threadIdx.x/6)*6) + ((((threadIdx.x/3) % 2)*3) + (threadIdx.x % 3)))]) + 1.000000f))), (1.000000f/(exp((0.000000f - A[(((threadIdx.x/6)*6) + ((((threadIdx.x/3) % 2)*3) + (threadIdx.x % 3)))])) + 1.000000f)))
  }

From my reading of the injective schedule, all input axes are fused before the split, so the two cases above should generate identical code. Is my understanding correct?


merrymercy · 2018-09-13T16:10:17Z

Ideally, (((threadIdx.x/6)*6) + ((((threadIdx.x/3) % 2)*3) + (threadIdx.x % 3))) should be simplified to threadIdx.x, since the two are equivalent. But the simplifier in TVM cannot handle this case. (cc @tqchen)

Although the unsimplified expression introduces some extra arithmetic operations, we have not observed a performance regression in practice, so it is still okay.
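
The equivalence between the fused index expression and threadIdx.x can be checked by brute force. The sketch below is plain C++, not TVM code; the helper names `fused_index` and `is_identity_on` are hypothetical and exist only for this illustration. It copies the index expression from the 3D lowering and compares it against the flat thread id.

```cpp
#include <cassert>

// Index expression copied from the 3D lowering: split a flat thread id over
// shape (5, 2, 3) and recombine it with row-major strides (6, 3, 1).
int fused_index(int t) {
    // C++ '/' and '%' truncate toward zero, so only non-negative ids are checked.
    assert(t >= 0);
    return ((t / 6) * 6) + ((((t / 3) % 2) * 3) + (t % 3));
}

// Returns true if fused_index(t) == t for every t in [0, extent).
bool is_identity_on(int extent) {
    for (int t = 0; t < extent; ++t) {
        if (fused_index(t) != t) {
            return false;
        }
    }
    return true;
}
```

Calling `is_identity_on(30)` covers exactly the thread ids admitted by the `threadIdx.x < 30` guard in the lowered function.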

tqchen · 2018-09-13T17:07:58Z

As far as I recall, the buffer index fetch has some ability to simplify such expressions, written by @sxjscience. Maybe someone can follow up on this.

sxjscience · 2018-09-14T08:31:14Z

Yes, I've run into this problem and wrote the following code to optimize some predefined patterns: https://github.com/dmlc/tvm/blob/master/src/lang/buffer.cc#L152-L220

xqdan · 2018-09-14T11:24:44Z

I've had a similar issue, and my pattern looks more complicated. The codegen mechanism in TVM can't handle it, so we chose the low-level IR builder for 3D conv.
Anyway, @sxjscience, could you take a look at this pattern? Can we transform it the way you did?

for (j, 0, 32) {
 for (k, 0, 2) {
   for (m, 0, 16) {
     for (n, 0, 16) {
       Apad5d[((((j*512) + (k*256)) + (m*16)) + n)] = select((((((((((bo.outer*512) + (j*16)) + m)/32) + (((((ko.outer*7) + (k*16)) + n) % 25)/5)) >= 2) && (((((((bo.outer*512) + (j*16)) + m)/32) + (((((ko.outer*7) + (k*16)) + n) % 25)/5)) - 2) < 32)) && (((((j*16) + m) % 32) + (((((ko.outer*7) + (k*16)) + n) % 25) % 5)) >= 2)) && ((((((j*16) + m) % 32) + (((((ko.outer*7) + (k*16)) + n) % 25) % 5)) - 2) < 32)), A.local.L1[((((((((((bo.outer*512) + (j*16)) + m)/32)*576) + ((((j*16) + m) % 32)*16)) + (((((ko.outer*32) + (k*16)) + n)/400)*20736)) + ((((((ko.outer*7) + (k*16)) + n) % 25)/5)*576)) + (((((ko.outer*32) + (k*16)) + n)/25) % 16)) + ((((((ko.outer*7) + (k*16)) + n) % 25) % 5)*16))], 0.000000h)
     }
   }
 }
}

Thanks,

sxjscience · 2018-09-14T11:55:46Z

The current implementation does not consider cases that involve "<", such as "... < 32". Also, what is the simplified version of this pattern? (It looks complicated.)

merrymercy · 2018-09-14T12:20:00Z

I found a related Halide PR (halide/Halide#2845) that might be interesting.
It introduces a template for rewrite rules, so we could define new rules as follows:

rewrite((x + y) + w < x + z, y + w < z)
rewrite(select(x, y, z) + select(x, w, u), select(x, y + w, z + u))
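
As a sanity check, the two rules quoted above can be tested exhaustively over a small integer range. This is a plain C++ sketch, not Halide or TVM code; the function names are hypothetical, and `select(c, a, b)` is modeled with the ternary operator. The ranges are kept small so that no integer overflow can occur.

```cpp
#include <cassert>

// Checks that ((x + y) + w < x + z) agrees with (y + w < z)
// for all x, y, w, z in [lo, hi].
bool comparison_rule_holds(int lo, int hi) {
    for (int x = lo; x <= hi; ++x)
        for (int y = lo; y <= hi; ++y)
            for (int w = lo; w <= hi; ++w)
                for (int z = lo; z <= hi; ++z)
                    if (((x + y) + w < x + z) != (y + w < z))
                        return false;
    return true;
}

// Checks that select(c, y, z) + select(c, w, u) equals
// select(c, y + w, z + u) for both values of c and all operands in [lo, hi].
bool select_rule_holds(int lo, int hi) {
    for (int c = 0; c <= 1; ++c)
        for (int y = lo; y <= hi; ++y)
            for (int z = lo; z <= hi; ++z)
                for (int w = lo; w <= hi; ++w)
                    for (int u = lo; u <= hi; ++u) {
                        int lhs = (c ? y : z) + (c ? w : u);
                        int rhs = c ? (y + w) : (z + u);
                        if (lhs != rhs)
                            return false;
                    }
    return true;
}
```

A finite check like this does not prove a rule, but it catches mistakes quickly before a rule is added to a simplifier.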

xqdan · 2018-09-14T12:42:04Z

@sxjscience It's an im2col convolution.

tqchen · 2018-09-14T15:36:08Z

Most of the simplifications we talked about here involve bound checking as well as arithmetic templates, which is harder than simple rewrites. I have wanted to do this for quite a while; maybe it is a good time to rethink our arithmetic simplifier to handle these cases.
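
To illustrate why bound checking matters, here is a minimal interval-analysis sketch in plain C++. It is not TVM's implementation; `Interval`, `add`, `mul_const`, `div_const`, and `mod_is_noop` are hypothetical names introduced for this example. The idea is that `e % c` can be rewritten to `e` only when the bounds of `e` are known to lie in [0, c), which a purely syntactic rewrite rule cannot see.

```cpp
#include <cassert>

// Closed integer interval [lo, hi] bounding the value of an expression.
struct Interval {
    long lo;
    long hi;
};

// Bounds of a + b.
Interval add(Interval a, Interval b) {
    return {a.lo + b.lo, a.hi + b.hi};
}

// Bounds of a * c for a non-negative constant c.
Interval mul_const(Interval a, long c) {
    assert(c >= 0);
    return {a.lo * c, a.hi * c};
}

// Bounds of a / c for a positive constant c and non-negative a.
Interval div_const(Interval a, long c) {
    assert(c > 0 && a.lo >= 0);
    return {a.lo / c, a.hi / c};
}

// True when (a % c) always equals a, so the modulo can be dropped.
bool mod_is_noop(Interval a, long c) {
    return c > 0 && a.lo >= 0 && a.hi < c;
}
```

For example, with `j` in [0, 31] and `m` in [0, 15] as in the loop nest above, `(j*16) + m` is bounded by [0, 511], so a modulo by 512 would be redundant while the `% 32` that appears in the pattern is not.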

tqchen · 2019-02-12T03:45:32Z

Consolidating this issue into #2588.

ke1337 changed the title ~~CUDA schedule_injective creates different code after lower with different input shape~~ CUDA schedule_injective creates different code with different input shape Sep 13, 2018

tqchen changed the title ~~CUDA schedule_injective creates different code with different input shape~~ [ARITH] Improve Canonical Simplification to Handle Fused Pattern Sep 13, 2018

tqchen added the status: help wanted label Sep 13, 2018

tqchen closed this as completed Feb 12, 2019

tqchen mentioned this issue Feb 12, 2019

[RFC][EXPR] Formalize Integer Arithmetic Analysis #2588

Closed

8 tasks

merrymercy mentioned this issue Mar 4, 2019

[ARITH] Analyzer RewriteSimplifier: add/sub/mul/div/mod #2722

Merged

3 tasks
