Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to make tensorize work. These changes also fix the previously broken test. #3981

Merged
merged 6 commits into from
Sep 25, 2019

Conversation

kimishpatel
Copy link
Contributor

Summary:
Tensorize was breaking for a few reasons.
1)
Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:

expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)]
The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)).

This can be simplified but it is not because to simplify divide, it must
prove ax1.outer > 0 and since it is var it cannot. The fix for this to
just find all the vars in expr in relace them with some const value.

  1. Equivalence between tensorized expr and one being asked to tensorize. For example,
    the error would be.
TVMError: Check failed: Equal(lhs, rhs):
Failed to match the compute with TensorIntrin tensor_intrin's declaration
provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0),
intrin=  reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0)

Difference is mainly in the source part:

source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))]
source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))]

This was not being simpifiled due to compute_intrin_iter_space (map for
iter var to range) not containing leaf iter vars.

  1. Here it fails with:
Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4]

This is in buffer binding where it thinks expected and buffer bound
shape is different. Although if we could simplify expr, this would not
be the case.

Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py

Reviewers:

Subscribers:

Tasks:

Tags:

Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers.

broken test.

Summary:
Tensorize was breaking  for a few reasons.
1)
Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:
expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)]
The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)).
This can be simplified but it is not because to simplify divide, it must
prove ax1.outer > 0 and since it is var it cannot. The fix for this to
just find all the vars in expr in relace them with some const value.

2) Equivalence between tensorized expr and one being asked to tensorize. For example,
the error would be.
TVMError: Check failed: Equal(lhs, rhs):
Failed to match the compute with TensorIntrin tensor_intrin's declaration
provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0),
intrin=  reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0)
Difference is mainly in the source part:
source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))]
source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))]
This was not being simpifiled due to compute_intrin_iter_space (map for
iter var to range) not containing leaf iter vars.

3) Here it fails with:
Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4]
This is in buffer binding where it thinks expected and buffer bound
shape is different. Although if we could simplify expr, this would not
be the case.

Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py

Reviewers:

Subscribers:

Tasks:

Tags:
@kimishpatel
Copy link
Contributor Author

cc: @tqchen. I have attempted a fix here. Please take a look. I am particularly curious if the way I have attempted to simplify index expression is correct? I might have missed something. Thanks!

@tqchen
Copy link
Member

tqchen commented Sep 20, 2019

I think a better fix would be to make a better use of Analyzer. Because the simplification depends on the context(of var being nonnegative), we could populate the range information of the related axis before we call analyzer->Simplify, this should resolve the problem. https://github.com/dmlc/tvm/blob/master/include/tvm/arithmetic.h#L126

Can you look into if we can take this better approach?

@tqchen tqchen self-assigned this Sep 20, 2019
@tqchen tqchen added the status: need update need update based on feedbacks label Sep 20, 2019
@kimishpatel
Copy link
Contributor Author

Sure. Let me look into it.

@tqchen
Copy link
Member

tqchen commented Sep 21, 2019

@kimishpatel
Copy link
Contributor Author

@tqchen, I moved to using domain map for itervar everywhere. This should be better right?

@kimishpatel
Copy link
Contributor Author

related RFC https://discuss.tvm.ai/t/discuss-embed-more-bound-information-into-var-or-expr/4079

Thanks for the link. I will go through it and provide any feedback if I have.

@@ -392,7 +392,7 @@ Stmt BuildStmt(Schedule sch,

// Phase 1
stmt = ir::StorageFlatten(stmt, out_binds, 64,
config->instrument_bound_checkers);
config->instrument_bound_checkers, bounds);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to get bounds from the previous scheduling stage, instead, we could get them from the range of the loops an bind these information with analyzer->Bind related https://github.com/dmlc/tvm/blob/master/src/arithmetic/ir_mutator_with_analyzer.h

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tqchen, ir mutator would imply you want to mutate exprs whereas all we want is the bound information. Can we not just get new bounds after ScheduleOps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that we have bounds information from the new schedule.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, i do not mean we should reuse the IR mutator, but simply means we could populate the bounds in the same way as the IRMutatorWithAnalyzer in the visitor.

The main reason to re-populate these bounds is to make sure that StorageFlatten stand on its own as a pass. As these information are already available in the context. So that we can apply the same pass for code that may not have bound information available

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tqchen, to make sure, you are suggesting that we populate bounds inside StorageFlatten and now pass it from outside? Sorry to drag this, but just want to make sure.

Copy link
Member

@tqchen tqchen Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what i mean is to populate the bounds(from the loops) inside the StorageFlatten pass, so the pass itself can be self-contained without relying out the external information :) . Thanks for the clarification questions. It is great that we can do these kind of discussions in public, as the apache way always encourages public discussions and allow other developers to follow the development.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tqchen, updated the PR. Hopefully this is akin to what you had in mind :).

statements binds the bound of the analyzer. Later this is used to
simplify expressions. Inspired from ir_mutator_with_analyzer

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Copy link
Member

@tqchen tqchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments, getting close

}

/*! \brief internal analyzer field. */
arith::Analyzer analyzer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the field protected, analyzer_ to be consistent with google c format

namespace tvm {
namespace ir {

class BoundedAnalyzer final : public IRVisitor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about IRVisitorWithAnalyzer to be consistent with IRMutatorWithAnalyzer, let us move the header into src/arithmetic for now as we want to minimize the amount of public headers(which means we promise stable APIs)

for (size_t i = 0; i < e.start; ++i) {
CHECK(is_one(e.region[i]->extent))
for (size_t j = 0; j < e.start; ++j) {
auto canonical_extent = Simplify(e.region[j]->extent, *compute_intrin_iter_space);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we have a analyzer, we don't have to call Simplify, but instead we can call analyzer_->Simplify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually here, I think it is going to be tricky. I suppose that is the reason dom_map constructed and passed from outside.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, we could create an Analyzer outside, populate the iteration space(via binding) using dom_map, and analyzer->Bind, then pass the analyzer pointer in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good for consolidating all the simplify stuff, however I think that would be a bit of hefty change and I was wondering if we can split it into a different PR.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@kimishpatel
Copy link
Contributor Author

@tqchen, updated PR and left a comment.

@tqchen
Copy link
Member

tqchen commented Sep 23, 2019

Please look into the CI lint error(missing ASF header)

@tqchen
Copy link
Member

tqchen commented Sep 23, 2019

more cpplint errors, you should be able to repro locally via make cpplint

@kimishpatel
Copy link
Contributor Author

Aah my bad. Let me fix this once and for all.

…R_VISITOR_WITH_ANALYZER_H_

Some lint fixes as well.
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@kimishpatel
Copy link
Contributor Author

@tqchen, seems like it passed all the checks.

Copy link
Member

@tqchen tqchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final follow up comments to see if we can consolidate a bit more to the analyzer. Thanks for making improvements to the code :) we are getting very close

@@ -128,7 +128,7 @@ void ArgBinder::BindBuffer(const Buffer& arg,
CHECK(fuzzy_match) << "Argument " << arg_name << " size mismatch";
size_t diff = value->shape.size() - arg->shape.size();
for (size_t i = 0; i < diff; ++i) {
CHECK(is_one(value->shape[i]))
CHECK(is_one(Simplify(value->shape[i])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary? as most cases we eagerly simplify the constants, while calling simplify is certainly more powerful, it would be great if we can move such simplification to the place that produces this expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I did hit into this assert, thats why the change else I dont try to eagerly fix things which are not a problem. Do you want me to reproduce the error and paste here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we can do that as a followup PR if you can dig a bit further.

for (size_t i = 0; i < e.start; ++i) {
CHECK(is_one(e.region[i]->extent))
for (size_t j = 0; j < e.start; ++j) {
auto canonical_extent = Simplify(e.region[j]->extent, *compute_intrin_iter_space);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, we could create an Analyzer outside, populate the iteration space(via binding) using dom_map, and analyzer->Bind, then pass the analyzer pointer in.

* However there is no copy/move operator on analyzer that safely copies
* or moves data. Perhaps we should disable copy operator and implement
* move operator.
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tqchen, btw in case you missed this comment of mine here, I think we would want to file task/issue to fix Analyzer so as to not have to resort to shared_ptr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IRVisitorWithAnalyzer analyzer

StorageFlattener(..., &analyzer,).Mutate()

Can we simply pass in raw pointer of IRVisitorWithAnalyzer* into StorageFlattener? given that we do not need to share the life-cycles between here and the flattener.

disable copy constructor sounds good

@kimishpatel
Copy link
Contributor Author

@tqchen, I responded to your comments. Let me know what you think.

Copy link
Member

@tqchen tqchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can defer the rest to a followup PR, let us change the shared_ptr

* However there is no copy/move operator on analyzer that safely copies
* or moves data. Perhaps we should disable copy operator and implement
* move operator.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IRVisitorWithAnalyzer analyzer

StorageFlattener(..., &analyzer,).Mutate()

Can we simply pass in raw pointer of IRVisitorWithAnalyzer* into StorageFlattener? given that we do not need to share the life-cycles between here and the flattener.

disable copy constructor sounds good

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@kimishpatel
Copy link
Contributor Author

@tqchen, made the requested changes, but note that using raw pointer is little risky. We have to guarantee that it is not forwarded to someone else whose lifetime extend the object to which raw pointer is passed.

@tqchen
Copy link
Member

tqchen commented Sep 24, 2019

yap, the use of raw pointer is fine as long as we do not use it to retain ownerships and extend the life-cycle(treat it as a weakref), which is our case

@kimishpatel
Copy link
Contributor Author

@tqchen, can you merge this please?

@tqchen tqchen merged commit b410df8 into apache:master Sep 25, 2019
@tqchen
Copy link
Member

tqchen commented Sep 25, 2019

Thanks @kimishpatel !

@tqchen tqchen removed the status: need update need update based on feedbacks label Sep 25, 2019
@tqchen tqchen removed their assignment Sep 25, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
… broken test. (apache#3981)

* Changes to make tensorize work. These changes also fix the previously
broken test.

Summary:
Tensorize was breaking  for a few reasons.
1)
Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:
expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)]
The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)).
This can be simplified but it is not because to simplify divide, it must
prove ax1.outer > 0 and since it is var it cannot. The fix for this to
just find all the vars in expr in relace them with some const value.

2) Equivalence between tensorized expr and one being asked to tensorize. For example,
the error would be.
TVMError: Check failed: Equal(lhs, rhs):
Failed to match the compute with TensorIntrin tensor_intrin's declaration
provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0),
intrin=  reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0)
Difference is mainly in the source part:
source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))]
source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))]
This was not being simpifiled due to compute_intrin_iter_space (map for
iter var to range) not containing leaf iter vars.

3) Here it fails with:
Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4]
This is in buffer binding where it thinks expected and buffer bound
shape is different. Although if we could simplify expr, this would not
be the case.

Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py

Reviewers:

Subscribers:

Tasks:

Tags:

* Implemented bounded analyzer which traverses tree and for reduce/for
statements binds the bound of the analyzer. Later this is used to
simplify expressions. Inspired from ir_mutator_with_analyzer

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Addressed comments.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_
Some lint fixes as well.

* Relax the assumption that dom_map must always contain all leaf itervars.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Disable copy constructor and move to raw ptr.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
… broken test. (apache#3981)

* Changes to make tensorize work. These changes also fix the previously
broken test.

Summary:
Tensorize was breaking  for a few reasons.
1)
Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:
expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)]
The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)).
This can be simplified but it is not because to simplify divide, it must
prove ax1.outer > 0 and since it is var it cannot. The fix for this to
just find all the vars in expr in relace them with some const value.

2) Equivalence between tensorized expr and one being asked to tensorize. For example,
the error would be.
TVMError: Check failed: Equal(lhs, rhs):
Failed to match the compute with TensorIntrin tensor_intrin's declaration
provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0),
intrin=  reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0)
Difference is mainly in the source part:
source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))]
source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))]
This was not being simpifiled due to compute_intrin_iter_space (map for
iter var to range) not containing leaf iter vars.

3) Here it fails with:
Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4]
This is in buffer binding where it thinks expected and buffer bound
shape is different. Although if we could simplify expr, this would not
be the case.

Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py

Reviewers:

Subscribers:

Tasks:

Tags:

* Implemented bounded analyzer which traverses tree and for reduce/for
statements binds the bound of the analyzer. Later this is used to
simplify expressions. Inspired from ir_mutator_with_analyzer

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Addressed comments.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_
Some lint fixes as well.

* Relax the assumption that dom_map must always contain all leaf itervars.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Disable copy constructor and move to raw ptr.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019
… broken test. (apache#3981)

* Changes to make tensorize work. These changes also fix the previously
broken test.

Summary:
Tensorize was breaking  for a few reasons.
1)
Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:
expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)]
The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)).
This can be simplified but it is not because to simplify divide, it must
prove ax1.outer > 0 and since it is var it cannot. The fix for this to
just find all the vars in expr in relace them with some const value.

2) Equivalence between tensorized expr and one being asked to tensorize. For example,
the error would be.
TVMError: Check failed: Equal(lhs, rhs):
Failed to match the compute with TensorIntrin tensor_intrin's declaration
provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0),
intrin=  reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0)
Difference is mainly in the source part:
source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))]
source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))]
This was not being simpifiled due to compute_intrin_iter_space (map for
iter var to range) not containing leaf iter vars.

3) Here it fails with:
Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4]
This is in buffer binding where it thinks expected and buffer bound
shape is different. Although if we could simplify expr, this would not
be the case.

Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py

Reviewers:

Subscribers:

Tasks:

Tags:

* Implemented bounded analyzer which traverses tree and for reduce/for
statements binds the bound of the analyzer. Later this is used to
simplify expressions. Inspired from ir_mutator_with_analyzer

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Addressed comments.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_
Some lint fixes as well.

* Relax the assumption that dom_map must always contain all leaf itervars.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Disable copy constructor and move to raw ptr.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants