Changes to make tensorize work. These changes also fix the previously broken test. #3981

kimishpatel · 2019-09-20T14:18:06Z

Summary:
Tensorize was breaking for a few reasons.
1) Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:

expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)]
The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)).

This can be simplified, but it is not: to simplify the division, the simplifier
must prove that ax1.outer is non-negative, and since ax1.outer is a free var it
cannot. The fix is to find all the vars in the expr and replace them with some
const value.
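
To see why the extent only collapses to 1 under a non-negativity assumption, here is a minimal sketch that evaluates the extent expression numerically. It assumes that `/` in the printed expression is C-style integer division that rounds toward zero; the helper names `truncdiv` and `extent` are hypothetical and are not TVM APIs.

```python
def truncdiv(a: int, b: int) -> int:
    """Integer division rounding toward zero (C/C++ semantics)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def extent(ax1_outer: int) -> int:
    """The unprovable extent: ((ax1.outer*16 + 15)/16 + 1) - ax1.outer."""
    return truncdiv(ax1_outer * 16 + 15, 16) + 1 - ax1_outer


# For every non-negative value the extent is exactly 1 ...
assert all(extent(x) == 1 for x in range(0, 1000))

# ... but for a negative value the division rounds toward zero and the
# identity breaks, so it cannot be rewritten to 1 without a bound on the var.
print(extent(-1))  # prints 2
```

The counterexample at -1 is the reason a simplifier that knows nothing about ax1.outer has to leave the expression alone.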

2) Equivalence between the tensorized expr and the one being asked to tensorize. For example,
the error would be:

TVMError: Check failed: Equal(lhs, rhs):
Failed to match the compute with TensorIntrin tensor_intrin's declaration
provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0),
intrin=  reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0)

Difference is mainly in the source part:

source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))]
source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))]

This was not being simplified because compute_intrin_iter_space (the map from
iter var to range) did not contain the leaf iter vars.
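
As a sanity check on the claim that the two `source` expressions are equivalent, the sketch below brute-forces the `kernel` indices of the "provided" expression over a small iteration space and compares them with `(i, k)`. The ranges chosen for k.outer.outer, k.outer.inner and i are assumptions for illustration only (the error message only gives `k` in [0, 2)); the function names are hypothetical. Because every value is non-negative here, Python's `//` and `%` agree with truncating division and modulo.

```python
from itertools import product


def provided_index(ko_o: int, ko_i: int, k: int, i: int) -> tuple:
    """Indices of `kernel` in the 'provided' expression from the error message."""
    fused = ko_o * 64 + ko_i * 2 + k
    row = (fused // 2) * 128 + i - ko_i * 128 - ko_o * 4096
    col = fused % 2
    return row, col


def mismatches(n_outer: int = 4, n_inner: int = 32, n_i: int = 128) -> int:
    """Count points of the assumed iteration space where indices differ from (i, k)."""
    return sum(
        provided_index(ko_o, ko_i, k, i) != (i, k)
        for ko_o, ko_i, k, i in product(
            range(n_outer), range(n_inner), range(2), range(n_i)
        )
    )


print(mismatches())  # prints 0
```

The cancellation relies on knowing that `k` lies in [0, 2) and that the outer vars are non-negative, which is exactly the range information the leaf iter vars were missing.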

3) Here it fails with:

Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4]

This is in buffer binding, where it thinks the expected shape and the bound
buffer's shape are different. If we could simplify the expr, this would not
be the case.

Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py

kimishpatel · 2019-09-20T14:25:49Z

cc: @tqchen. I have attempted a fix here; please take a look. I am particularly curious whether the way I have attempted to simplify the index expression is correct. I might have missed something. Thanks!

tqchen · 2019-09-20T15:48:51Z

I think a better fix would be to make better use of Analyzer. Because the simplification depends on the context (the var being non-negative), we could populate the range information of the related axes before we call analyzer->Simplify; this should resolve the problem. https://github.com/dmlc/tvm/blob/master/include/tvm/arithmetic.h#L126

Can you look into whether we can take this better approach?
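
To illustrate the idea of context-dependent simplification, here is a toy sketch in which a claim only becomes provable once a variable's range has been bound. `ToyAnalyzer`, `bind` and `can_prove_equal` are hypothetical names made up for this example; this is not TVM's Analyzer, which reasons symbolically rather than by enumeration. The loop range [0, 8) is likewise an assumption.

```python
from itertools import product
from typing import Callable, Dict, Tuple


def truncdiv(a: int, b: int) -> int:
    """Integer division rounding toward zero (C/C++ semantics)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


class ToyAnalyzer:
    """Toy stand-in for an arithmetic analyzer; NOT TVM's Analyzer."""

    def __init__(self) -> None:
        self._ranges: Dict[str, Tuple[int, int]] = {}

    def bind(self, var: str, min_value: int, extent: int) -> None:
        """Record that `var` takes values in [min_value, min_value + extent)."""
        self._ranges[var] = (min_value, extent)

    def can_prove_equal(self, expr: Callable[..., int], value: int) -> bool:
        """Prove expr == value by enumerating the bound domain.

        With no bound variables there is no context, so nothing is provable.
        """
        if not self._ranges:
            return False
        names = list(self._ranges)
        domains = [range(lo, lo + ext) for lo, ext in self._ranges.values()]
        return all(
            expr(**dict(zip(names, point))) == value for point in product(*domains)
        )


def extent(ax1_outer: int) -> int:
    """((ax1.outer*16 + 15)/16 + 1) - ax1.outer with truncating division."""
    return truncdiv(ax1_outer * 16 + 15, 16) + 1 - ax1_outer


analyzer = ToyAnalyzer()
before = analyzer.can_prove_equal(extent, 1)  # no range known yet
analyzer.bind("ax1_outer", 0, 8)              # hypothetical loop range [0, 8)
after = analyzer.can_prove_equal(extent, 1)
print(before, after)  # prints False True
```

The point is only the ordering: the range is registered first, and the same query that failed before then succeeds.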

kimishpatel · 2019-09-20T15:50:48Z

Sure. Let me look into it.

tqchen · 2019-09-21T00:49:04Z

related RFC https://discuss.tvm.ai/t/discuss-embed-more-bound-information-into-var-or-expr/4079

kimishpatel · 2019-09-22T21:43:26Z

@tqchen, I moved to using domain map for itervar everywhere. This should be better right?

kimishpatel · 2019-09-22T21:44:26Z

> related RFC https://discuss.tvm.ai/t/discuss-embed-more-bound-information-into-var-or-expr/4079

Thanks for the link. I will go through it and provide any feedback if I have.

tqchen · 2019-09-23T04:32:06Z

src/codegen/build_module.cc

@@ -392,7 +392,7 @@ Stmt BuildStmt(Schedule sch,

  // Phase 1
  stmt = ir::StorageFlatten(stmt, out_binds, 64,
-                            config->instrument_bound_checkers);
+                            config->instrument_bound_checkers, bounds);


We don't have to get bounds from the previous scheduling stage. Instead, we could get them from the ranges of the loops and bind this information with analyzer->Bind. Related: https://github.com/dmlc/tvm/blob/master/src/arithmetic/ir_mutator_with_analyzer.h

@tqchen, an IR mutator would imply you want to mutate exprs, whereas all we want is the bound information. Can we not just get new bounds after ScheduleOps?

So that we have bounds information from the new schedule.

Oh, I do not mean we should reuse the IR mutator, but simply that we could populate the bounds in the visitor the same way IRMutatorWithAnalyzer does.

The main reason to re-populate these bounds is to make sure that StorageFlatten stands on its own as a pass, since this information is already available in the context. That way we can apply the same pass to code that may not have bound information available.

@tqchen, to make sure, you are suggesting that we populate bounds inside StorageFlatten and not pass them in from outside? Sorry to drag this out, but I just want to make sure.

Yes, what I mean is to populate the bounds (from the loops) inside the StorageFlatten pass, so the pass itself can be self-contained without relying on external information :) . Thanks for the clarification questions. It is great that we can have these kinds of discussions in public, as the Apache way always encourages public discussion and allows other developers to follow the development.
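
A minimal sketch of what "populate the bounds from the loops inside the pass" could look like, using a toy statement tree rather than TVM's IR. The `For` and `Evaluate` classes, the `collect_loop_bounds` function and the loop extents are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Evaluate:
    """Leaf statement; stands in for any loop body without further loops."""
    label: str


@dataclass
class For:
    """Loop `for loop_var in [min_value, min_value + extent)` over a body."""
    loop_var: str
    min_value: int
    extent: int
    body: List[Union["For", Evaluate]] = field(default_factory=list)


def collect_loop_bounds(stmt, bounds=None) -> Dict[str, Tuple[int, int]]:
    """Walk the statement tree and record (min, extent) for every loop var."""
    if bounds is None:
        bounds = {}
    if isinstance(stmt, For):
        bounds[stmt.loop_var] = (stmt.min_value, stmt.extent)
        for child in stmt.body:
            collect_loop_bounds(child, bounds)
    return bounds


# Hypothetical loop nest; the extents 8 and 16 are made up.
nest = For("ax1.outer", 0, 8, [For("k.outer", 0, 16, [Evaluate("compute")])])
print(collect_loop_bounds(nest))
# prints {'ax1.outer': (0, 8), 'k.outer': (0, 16)}
```

Because the ranges are read off the loops the pass is already visiting, nothing has to be threaded in from the scheduling stage, which is the self-containment argued for above.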

@tqchen, updated the PR. Hopefully this is akin to what you had in mind :).

tqchen

some comments, getting close

tqchen · 2019-09-23T21:42:36Z

include/tvm/bounded_analyzer.h

+  }
+
+  /*! \brief internal analyzer field. */
+  arith::Analyzer analyzer;


Make the field protected and name it analyzer_, to be consistent with the Google C++ style.

tqchen · 2019-09-23T21:43:23Z

include/tvm/bounded_analyzer.h

+namespace tvm {
+namespace ir {
+
+class BoundedAnalyzer final : public IRVisitor {


How about IRVisitorWithAnalyzer, to be consistent with IRMutatorWithAnalyzer? Let us move the header into src/arithmetic for now, as we want to minimize the number of public headers (a public header means we promise a stable API).

tqchen · 2019-09-23T21:43:57Z

src/op/tensorize.cc

-      for (size_t i = 0; i < e.start; ++i) {
-        CHECK(is_one(e.region[i]->extent))
+      for (size_t j = 0; j < e.start; ++j) {
+        auto canonical_extent = Simplify(e.region[j]->extent, *compute_intrin_iter_space);


If we have an analyzer, we don't have to call Simplify; instead we can call analyzer_->Simplify.

Actually here, I think it is going to be tricky. I suppose that is the reason dom_map is constructed and passed from outside.

As an alternative, we could create an Analyzer outside, populate the iteration space (via binding) using dom_map and analyzer->Bind, then pass the analyzer pointer in.

That sounds good for consolidating all the simplify stuff. However, I think that would be a bit of a hefty change, and I was wondering if we can split it into a different PR.

kimishpatel · 2019-09-23T22:59:51Z

@tqchen, updated PR and left a comment.

tqchen · 2019-09-23T23:38:08Z

Please look into the CI lint error(missing ASF header)

tqchen · 2019-09-23T23:58:20Z

more cpplint errors, you should be able to repro locally via make cpplint

kimishpatel · 2019-09-24T00:00:12Z

Aah my bad. Let me fix this once and for all.

kimishpatel · 2019-09-24T14:23:00Z

@tqchen, seems like it passed all the checks.

tqchen

Some final follow up comments to see if we can consolidate a bit more to the analyzer. Thanks for making improvements to the code :) we are getting very close

tqchen · 2019-09-24T17:22:05Z

src/pass/arg_binder.cc

@@ -128,7 +128,7 @@ void ArgBinder::BindBuffer(const Buffer& arg,
    CHECK(fuzzy_match) << "Argument " << arg_name << " size mismatch";
    size_t diff = value->shape.size() - arg->shape.size();
    for (size_t i = 0; i < diff; ++i) {
-      CHECK(is_one(value->shape[i]))
+      CHECK(is_one(Simplify(value->shape[i])))


Is this necessary? In most cases we eagerly simplify the constants. While calling Simplify is certainly more powerful, it would be great if we could move such simplification to the place that produces this expression.

I did hit this assert, which is why I made the change; otherwise I don't try to eagerly fix things that are not a problem. Do you want me to reproduce the error and paste it here?

OK, we can do that as a followup PR if you can dig a bit further.

kimishpatel · 2019-09-24T17:41:30Z

src/pass/storage_flatten.cc

+   * However there is no copy/move operator on analyzer that safely copies
+   * or moves data. Perhaps we should disable copy operator and implement
+   * move operator.
+   */


@tqchen, btw in case you missed this comment of mine here, I think we would want to file task/issue to fix Analyzer so as to not have to resort to shared_ptr.

IRVisitorWithAnalyzer analyzer
StorageFlattener(..., &analyzer,).Mutate()

Can we simply pass a raw pointer (IRVisitorWithAnalyzer*) into StorageFlattener, given that we do not need to share the life-cycle between here and the flattener?

disable copy constructor sounds good

kimishpatel · 2019-09-24T17:42:16Z

@tqchen, I responded to your comments. Let me know what you think.

tqchen

We can defer the rest to a followup PR, let us change the shared_ptr

kimishpatel · 2019-09-24T20:13:05Z

@tqchen, made the requested changes, but note that using a raw pointer is a little risky. We have to guarantee that it is not forwarded to someone else whose lifetime extends beyond that of the object the raw pointer points to.

tqchen · 2019-09-24T20:33:15Z

Yap, the use of a raw pointer is fine as long as we do not use it to retain ownership or extend the life-cycle (treat it as a weakref), which is our case.

kimishpatel · 2019-09-24T22:54:44Z

@tqchen, can you merge this please?

tqchen · 2019-09-25T00:08:08Z

Thanks @kimishpatel !

kimishpatel mentioned this pull request Sep 20, 2019

Added tesnorizeation for avx2 based gemm. #3982

Merged

tqchen self-assigned this Sep 20, 2019

tqchen added the status: need update need update based on feedbacks label Sep 20, 2019

anijain2305 mentioned this pull request Sep 20, 2019

Int8 NHWC pack tensorize failure #3598

Closed

kimishpatel force-pushed the tensorize_fix branch from cd9a849 to 517bbe9 Compare September 22, 2019 21:40

tqchen requested changes Sep 23, 2019

Implemented bounded analyzer which traverses tree and for reduce/for statements binds the bound of the analyzer. Later this is used to simplify expressions. Inspired from ir_mutator_with_analyzer

cb8614a

kimishpatel force-pushed the tensorize_fix branch from 517bbe9 to cb8614a Compare September 23, 2019 20:01

tqchen requested changes Sep 23, 2019

Addressed comments.

05eae48

kimishpatel force-pushed the tensorize_fix branch from 7e69ef1 to 5d193cc Compare September 24, 2019 00:20

Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_ Some lint fixes as well.

5e40764

kimishpatel force-pushed the tensorize_fix branch from 5d193cc to 5e40764 Compare September 24, 2019 01:14

Relax the assumption that dom_map must always contain all leaf itervars.

380bc9a

tqchen requested changes Sep 24, 2019

kimishpatel commented Sep 24, 2019

tqchen requested changes Sep 24, 2019

Disable copy constructor and move to raw ptr.

05ac0fd

kimishpatel force-pushed the tensorize_fix branch from 49abeb5 to 05ac0fd Compare September 24, 2019 20:11

tqchen approved these changes Sep 24, 2019

tqchen merged commit b410df8 into apache:master Sep 25, 2019

tqchen removed the status: need update need update based on feedbacks label Sep 25, 2019

tqchen removed their assignment Sep 25, 2019

tqchen added the status: accepted label Sep 25, 2019

yzhliu mentioned this pull request Nov 11, 2019

[RELEASE][DRAFT] TVM v0.6 Release candidate #4259

Closed
