-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing performance issue in PassUpDomain when fusing and splitting axes #3073
Fixing performance issue in PassUpDomain when fusing and splitting axes #3073
Conversation
There is a problem related to #1139: import tvm
m = tvm.convert(12)
l = tvm.convert(6)
A = tvm.placeholder((m, l), name='A')
A1 = tvm.compute((m, l), lambda i, j: A[i, j], name='A1')
A2 = tvm.compute((m, l), lambda i, j: A1[i, j] + 3, name='A2')
s = tvm.create_schedule(A2.op)
fused_axes = s[A2].fuse(A2.op.axis[0], A2.op.axis[1])
xo, xi = s[A2].split(fused_axes, 10)
s[A1].compute_at(s[A2], xo)
print(tvm.lower(s, [A, A2], simple_mode=True)) produces: // attr [A1] storage_scope = "global"
allocate A1[float32 * (((((i.j.fused.outer*10) % 6) + 15)/6)*6)]
produce A2 {
for (i.j.fused.outer, 0, 8) {
produce A1 {
for (i, 0, ((((i.j.fused.outer*10) % 6) + 15)/6)) {
for (j, 0, 6) {
if (likely((((i.j.fused.outer*10)/6) < (12 - i)))) {
A1[((i*6) + j)] = A[(((((i.j.fused.outer*10)/6) + i)*6) + j)]
}
}
}
}
for (i.j.fused.inner, 0, 10) {
if (likely(((i.j.fused.outer*10) < (72 - i.j.fused.inner)))) {
A2[((i.j.fused.outer*10) + i.j.fused.inner)] = (A1[(((i.j.fused.outer*10) + i.j.fused.inner) - (((i.j.fused.outer*10)/6)*6))] + 3.000000f)
}
}
}
} Allocate size depends on the loop variable. With |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can help a lot for reducing unnecessary computing. And the new bound rule is correct and tight. I think it can be merged into master as long as it solve the problem of allocate generated depended on block/thread size.
I think this should work if #1139 gets fixed. |
sorry for the delayed action on this. @bulanova-huawei can you rebase against the master and update the PR? Please also make sure CI passes |
@tqchen I added a CI test that demonstrates the problem with alloc size depending on a loop variable before it is defined. I can remove the second commit with this test, but the problem will still be there. |
We need to make sure CI is always green, so if there will be further fixes that solves future problems, we should keep that in a separate PR |
dd6c970
to
9ee7dbd
Compare
Please review @merrymercy @sgrechanik-h |
9ee7dbd
to
4f3a032
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! lgtm. just cosmetic comment.
@bulanova-huawei can you rebase again? Sorry about the request but there is a conflict we need to resove wrt to the latest merge |
Apply suggestions from code review Co-Authored-By: Wei Chen <[email protected]>
469dad0
to
010f217
Compare
Thanks @bulanova-huawei @wweic , this PR is now merged |
Apply suggestions from code review Co-Authored-By: Wei Chen <[email protected]>
Apply suggestions from code review Co-Authored-By: Wei Chen <[email protected]>
This fixes #3072
Currently bounding box for
fused
parameter in PassUpDomain is the full extent ofouter
andinner
dimensions of the output tensor.Proposed change computes tight bounds for
outer
dimension. Iffused
IntSet
always fits insideinner
dimension, then tight bound forinner
also can be computed. Otherwise, full extent ofinner
is used.For example, if 2 axes of a 2d tensor with shape (12, 6) are fused and split with factor 12, then outer and inner bounds will have dimension (2, 6). If the tensor is split with factor 9, bounds will be the same, because 9 elements wrap around 2 rows of (12, 6), and some redundant computation will occur. If split factor is 3, we can again avoid redundant computation by using (1, 3) shape.