[TIR] Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern #9527

Conversation

wrongtest-intellif (Contributor)

Hi there~ This PR is an enhancement to the compute_at and reverse_compute_at primitives. Binding a block into loops may create non-trivial iter bounds. Complex iter bounds are neither human friendly nor compatible with backend passes that target bounds and conditions (e.g., loop partition). So this PR tries to recognize some of these complex bounds and use block predicates instead, to keep the IR structure simpler.

A working example is shown below: we want to create spatial tiles and read each tile's data from a cache, so the schedule operation is to compute_at the cache_read block into the tiled loops.

@T.prim_func
def tiled_pooling_read_cache(a: T.handle, b: T.handle) -> None:
    X = T.match_buffer(a, [224, 224], dtype="float32")
    Y = T.match_buffer(b, [224, 224], dtype="float32")
    cache = T.alloc_buffer([224, 224], dtype="float32")
    for hh, ww in T.grid(224, 224):
        with T.block("cache"):
            h, w = T.axis.remap("SS", [hh, ww])
            T.reads([X[h, w]])
            T.writes([cache[h, w]])
            cache[h, w] = X[h, w]
    for hh_0, ww_0, hh_1, ww_1, khh, kww in T.grid(28, 28, 8, 8, 3, 3):
        with T.block("compute"):
            h = T.axis.spatial(224, hh_0 * 8 + hh_1)
            w = T.axis.spatial(224, ww_0 * 8 + ww_1)
            kh, kw = T.axis.remap("RR", [khh, kww])
            T.reads([Y[h, w], cache[h + kh - 1, w + kw - 1]])
            T.writes([Y[h, w]])
            with T.init():
                Y[h, w] = 0.0
            Y[h, w] = T.max(Y[h, w], T.if_then_else(
                T.likely(1 <= h + kh, dtype="bool") and \
                T.likely(h + kh < 225, dtype="bool") and \
                T.likely(1 <= w + kw, dtype="bool") and \
                T.likely(w + kw < 225, dtype="bool"),
                cache[h + kh - 1, w + kw - 1], 0.0, dtype="float32"))
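
For reference, here is a minimal sketch of the schedule calls assumed to take the prim_func above to the IR snippets below; the loop handle names and the choice of attaching at ww_0 are my own, not spelled out in the PR text:

from tvm import tir

sch = tir.Schedule(tiled_pooling_read_cache)
cache_block = sch.get_block("cache")
# loop handles of the already-tiled "compute" block
hh_0, ww_0, hh_1, ww_1, khh, kww = sch.get_loops(sch.get_block("compute"))
# move the cache-read block under the spatial tile loops
sch.compute_at(cache_block, ww_0)
print(sch.mod.script())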

Mainline code will produce:

@T.prim_func
def func(a: T.handle, b: T.handle) -> None:
    X = T.match_buffer(a, [224, 224], dtype="float32")
    Y = T.match_buffer(b, [224, 224], dtype="float32")
    # body
    # with T.block("root")
    cache = T.alloc_buffer([224, 224], dtype="float32")
    for hh_0, ww_0 in T.grid(28, 28):
        for ax0 in T.serial(0, T.min(hh_0 * 8 + 8, 223) + 1 - T.max(hh_0 * 8 - 1, 0)):
            for ax1 in T.serial(0, T.min(ww_0 * 8 + 8, 223) + 1 - T.max(ww_0 * 8 - 1, 0)):
                with T.block("cache"):
                    h = T.axis.spatial(224, T.max(hh_0 * 8 - 1, 0) + ax0)
                    w = T.axis.spatial(224, T.max(ww_0 * 8 - 1, 0) + ax1)
                    T.reads([X[h, w]])
                    T.writes([cache[h, w]])
                    cache[h, w] = X[h, w]
        for hh_1, ww_1, khh, kww in T.grid(8, 8, 3, 3):
            with T.block("compute"):
                ...

This PR will produce:

@T.prim_func
def tiled_pooling_read_cache_after_compute_at(a: T.handle, b: T.handle) -> None:
    X = T.match_buffer(a, [224, 224], dtype="float32")
    Y = T.match_buffer(b, [224, 224], dtype="float32")
    cache = T.alloc_buffer([224, 224], dtype="float32")
    for hh_0, ww_0 in T.grid(28, 28):
        for ax0, ax1 in T.grid(10, 10):
            with T.block("cache"):
                h = T.axis.spatial(224, hh_0 * 8 - 1 + ax0)
                w = T.axis.spatial(224, ww_0 * 8 - 1 + ax1)
                T.where(1 <= hh_0 * 8 + ax0 and hh_0 * 8 + ax0 < 225 and 1 <= ww_0 * 8 + ax1 and ww_0 * 8 + ax1 < 225)
                T.reads([X[h, w]])
                T.writes([cache[h, w]])
                cache[h, w] = X[h, w]
        for hh_1, ww_1, khh, kww in T.grid(8, 8, 3, 3):
            with T.block("compute"):
                ...

The modification is to delay the intersection of the intset deduced from the required uses and the intset enforced by the buffer shape / original iter bound. Instead of a direct intset intersection (which can create quite complex min/max expressions), a BlockVarDomainInfo class is added to maintain the two intsets above, named dom and bound. The implementation can then choose, with some heuristics:

  1. use (dom ^ bound) as the iter domain if it is simple enough, or
  2. use dom as the iter domain and add a block predicate for bound.

The two choices are illustrated by the sketch below.
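
As an illustration of the two choices (my own sketch, not code from the PR), here is how the two interval sources line up for the h axis of the "cache" block in the example above, written with tvm.arith.IntervalSet to mimic the dom/bound pair kept by BlockVarDomainInfo:

from tvm import arith, tir

hh_0 = tir.Var("hh_0", "int32")
# dom: deduced from the reads of "compute" (h + kh - 1 over the 8 x 3 tile)
dom = arith.IntervalSet(hh_0 * 8 - 1, hh_0 * 8 + 8)
# bound: enforced by the buffer shape / original iter extent 224
bound = arith.IntervalSet(0, 223)
# Choice 1, (dom ^ bound): loop extent min(hh_0 * 8 + 8, 223) + 1 - max(hh_0 * 8 - 1, 0),
#   i.e. the mainline IR above.
# Choice 2, dom plus a block predicate for bound: fixed extent 10 with
#   T.where(1 <= hh_0 * 8 + ax0 and hh_0 * 8 + ax0 < 225), i.e. the IR this PR produces.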

The PR also adds minimal support for analyzing floordiv/floormod in the provide-required region mapping (illustrated below).
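
A common place where such floordiv/floormod bindings show up (an illustrative snippet, assumed rather than taken from the PR's tests) is a block bound to a fused loop; the regions such a block provides or requires then involve fused // n and fused % n terms that the mapping analysis has to handle:

@T.prim_func
def fused_copy(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, [64, 64], dtype="float32")
    B = T.match_buffer(b, [64, 64], dtype="float32")
    for fused in T.serial(0, 64 * 64):
        with T.block("copy"):
            vi = T.axis.spatial(64, fused // 64)
            vj = T.axis.spatial(64, fused % 64)
            T.reads([A[vi, vj]])
            T.writes([B[vi, vj]])
            B[vi, vj] = A[vi, vj]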

@Hzfengsy (Member) left a comment

Thanks for the PR. It is super helpful for the imperfect tiling case.

For the region cover problem, I will look at it. It's better to fix it before this PR is merged.
cc @junrushao1994

@@ -514,6 +605,14 @@ void ComputeAtOrReverseComputeAtImpl(ScheduleState self, const StmtSRef& block_s
/*realize=*/reconstructor.new_block_realize_,
/*loop_var_ranges=*/LoopDomainOfSRefTreePath(GetRef<StmtSRef>(block_sref->parent)),
/*analyzer=*/&analyzer);
// The verifier cannot prove the region cover state if some complex predicate is introduced,
// so here we explicitly reset these flags below.
if (is_compute_at && !is_const_int(reconstructor.new_block_realize_->predicate)) {
Member

It's a bug of RegionCoverCheck. We should fix it instead of working around it.

@junrushao (Member)

This is very helpful! Would love to let @Hzfengsy shepherd this PR. Thanks a lot!

@junrushao (Member)

CC @Hzfengsy

@wrongtest-intellif (Contributor, Author)

Added an option allow_block_predicate; users can set it to False if the old behavior (dynamic loop extents) is preferred.

@wrongtest-intellif force-pushed the tir_compute_at_support_block_predicate branch from f91c1fc to 15e3963 on December 16, 2021
@spectrometerHBH (Contributor)

Great job! Here are some comments.
It looks like you added allow_block_predicate; why do you think it is necessary to keep the dynamic loop extent behavior? It looks to me that we can abandon it.

@wrongtest-intellif (Contributor, Author) commented on Dec 17, 2021

why do you think it is necessary to keep the dynamic loop extent behavior

After discussion with @Hzfengsy, I decided to revert the allow_block_predicate option to keep a unified behavior, since there is no sound demand for it yet.

The original concern was the case where the desired pattern is exactly the dynamic loop extents. Take the "cache" block as an example: a user may want to lower it into some DMA operations. If the DMA intrinsic happens to support dynamic shapes but not conditional accesses, it would be non-trivial to pattern match it during lowering.

@gumingsiyi

In your example, why is the extent of ax0 and ax1 equal to 10?

@wrongtest-intellif (Contributor, Author)

In your example, why is the extent of ax0 and ax1 equal to 10?

This is the extent needed to cover the region required by the compute block's reads: with hh_1 in [0, 8) and khh in [0, 3), the read index h + kh - 1 = hh_0 * 8 + hh_1 + khh - 1 ranges from hh_0 * 8 - 1 to hh_0 * 8 + 8, i.e. 10 values (and likewise for ax1 with ww_1 and kww).

@wrongtest-intellif force-pushed the tir_compute_at_support_block_predicate branch from 0280297 to dca31be on February 8, 2022
@Hzfengsy (Member) left a comment

LGTM. Thanks @wrongtest for the hard and long-term work!

@Hzfengsy merged commit 8c53f62 into apache:main on Feb 9, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022
…and support floordiv pattern (apache#9527)

* allow generate block predicate in compute_at schedule

* revert apache#9880 and add more testcases