[Relay] Add new IR pass CombineParallelDense #3862

Merged: 21 commits into apache:master from soiferj/paralleldense on Sep 24, 2019

Conversation

@soiferj (Contributor) commented Aug 30, 2019

This IR pass is similar to CombineParallelConv2D, but with the Dense operator. The pass takes multiple parallel dense operators and combines them into one batch_matmul operator. Element-wise operations after dense are also stacked and fused.

The code is refactored so that CombineParallelConv2D and CombineParallelDense can share code, which also makes it easier to write a new "combine" pass later.

This pass is registered at opt level 4. In a follow-up PR, we can add tuning to decide whether or not to apply this pass.

This will help models like BERT, which have parallel Dense + BiasAdd at the start of each layer.

Discussed in this RFC: https://discuss.tvm.ai/t/discussion-new-ir-pass-proposal-combineparalleldense/3813
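As a concrete illustration (not taken from this PR's tests), here is a minimal sketch of driving the new pass from Python. It uses present-day API names (tvm.IRModule; the 2019-era equivalent was relay.Module), and min_num_branches=2 plus the explicit InferType call are assumptions made just so two branches are enough to trigger the rewrite:

```python
# Minimal sketch (illustrative, not from this PR's test suite).
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 16), dtype="float32")
w1 = relay.var("w1", shape=(32, 16), dtype="float32")
w2 = relay.var("w2", shape=(32, 16), dtype="float32")
b1 = relay.var("b1", shape=(32,), dtype="float32")
b2 = relay.var("b2", shape=(32,), dtype="float32")

# Two parallel dense + bias_add branches sharing the same input, as in BERT.
y1 = relay.nn.bias_add(relay.nn.dense(x, w1), b1)
y2 = relay.nn.bias_add(relay.nn.dense(x, w2), b2)
func = relay.Function([x, w1, w2, b1, b2], relay.Tuple([y1, y2]))

mod = tvm.IRModule.from_expr(func)
mod = relay.transform.InferType()(mod)
# min_num_branches=2 so that just two parallel branches trigger the rewrite.
mod = relay.transform.CombineParallelDense(min_num_branches=2)(mod)
print(mod)  # the two dense calls are rewritten into a single batch_matmul
```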

@vinx13 @icemelon9 would you be able to take a look?

@jroesch (Member) commented Sep 1, 2019

@vinx13 I will add my review; can you as well?

@jroesch jroesch requested a review from vinx13 September 1, 2019 00:56
@jroesch (Member) commented Sep 1, 2019

@MarisaKirisame can you add a review as well?

Review comments (resolved): src/relay/pass/combine_parallel_op.h, python/tvm/relay/transform.py, src/relay/pass/combine_parallel_op.cc
@jroesch (Member) commented Sep 1, 2019

Overall LGTM, thanks for your contribution! My comments are mostly about documentation.

We are making a push this release cycle to improve the docs, and I am trying to encourage everyone to grow them incrementally with each PR to passes, etc.

I will take another pass after other reviews come in, and then merge.

Review comments (resolved): src/relay/pass/combine_parallel_op.h, src/relay/pass/combine_parallel_conv2d.cc
@MarisaKirisame (Contributor) commented

As most operators take batched input, it is not hard to imagine different versions of CombineXXX... Furthermore, if we do dynamic batching à la TensorFlow Fold (which I think we will do), we need a CombineXXX for every XXX operator. So, is it possible to merge the two combine passes into a generic pass?

@soiferj (Contributor, Author) commented Sep 3, 2019

@MarisaKirisame that's a really good idea. The only tricky part is that IsOpSupported and AreOpsCompatible may need to change for each op. For example, with Dense, we need to make sure that the units attribute is empty.

How about this: we create a new class, CombineParallelOpBatch, that has a default implementation of IsOpSupported which returns true, and a default implementation of AreOpsCompatible which returns true if the shapes and dtypes of both ops are the same. From there, we can have CombineParallelDense inherit from CombineParallelOpBatch and override IsOpSupported and AreOpsCompatible. The constructor can take the name of the op and the min number of branches to combine. This way, we can easily add new IR passes by adding a line like CombineParallelOpBatch('add', 3).

I'm not sure it's possible to merge CombineParallelConv2D and CombineParallelDense themselves, because CombineParallelConv2D concatenates the inputs rather than stacking them; the implementations of most methods would have to change.
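To make the proposed hierarchy concrete, here is a purely illustrative Python sketch (hypothetical helper names such as input_shapes; the actual implementation in this PR is C++ under src/relay/pass/):

```python
# Illustrative sketch of the proposed design only; the real pass is written in C++.
class CombineParallelOpBatch:
    """Combines parallel calls to `op_name` into one batched call."""

    def __init__(self, op_name, min_num_branches):
        self.op_name = op_name
        self.min_num_branches = min_num_branches

    def is_op_supported(self, call):
        # Default: every call to the op can be combined.
        return True

    def are_ops_compatible(self, a, b):
        # Default: combinable when input shapes and dtypes match.
        # `input_shapes` / `input_dtypes` are hypothetical helper attributes.
        return a.input_shapes == b.input_shapes and a.input_dtypes == b.input_dtypes


class CombineParallelDense(CombineParallelOpBatch):
    def __init__(self, min_num_branches=3):
        super().__init__("nn.dense", min_num_branches)

    def is_op_supported(self, call):
        # dense can only become batch_matmul when `units` is left unset.
        return call.attrs.units is None


# A future element-wise pass could then be a single line, e.g.:
# combine_add = CombineParallelOpBatch("add", 3)
```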

What do you think? @jroesch any thoughts?

@MarisaKirisame (Contributor) commented

@soiferj sounds good to me. Maybe the default should be False, because ops need manual batching to be supported (for example, changing dense to BatchMatMul). For the easy cases, we can override DeclareUnaryOp and DeclareBinaryOp, which will do the job.

@soiferj (Contributor, Author) commented Sep 3, 2019

@MarisaKirisame, sounds good. For now, I'll add support for this functionality. I think we can decide how we actually want to enable it for all ops in another PR. Is that alright?

@MarisaKirisame (Contributor) commented

@soiferj exactly.

@soiferj (Contributor, Author) commented Sep 3, 2019

@MarisaKirisame do you think that the new base class, CombineParallelOpBatch, should handle inputs with any number of dimensions and convert them to a batched version of ndim + 1 (for example, (2, 3, 4) -> (1, 2, 3, 4)), or should we just handle inputs with 2 dimensions and convert them to 3 dimensions (for example, (3, 4) -> (1, 3, 4))?

The reason why I ask is that broadcasting logic for element-wise ops that follow the main op can get tricky if we support an arbitrary number of dimensions.
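For concreteness, a small NumPy sketch (not from the PR) of the 2-D case: stacking turns two (units, K) weights into a (2, units, K) batch, and any element-wise constant that follows has to be reshaped so it broadcasts per branch, which is exactly the part that gets trickier with arbitrary ranks:

```python
# NumPy mock-up of the shape bookkeeping (illustrative only).
import numpy as np

M, K, units = 5, 16, 32
x = np.random.rand(M, K).astype("float32")        # shared 2-D input
w1 = np.random.rand(units, K).astype("float32")   # branch 1 weight
w2 = np.random.rand(units, K).astype("float32")   # branch 2 weight
b1 = np.random.rand(units).astype("float32")
b2 = np.random.rand(units).astype("float32")

# (3, 4) -> (1, 3, 4) style batching: stack the weights, broadcast the input.
w = np.stack([w1, w2])                            # (2, units, K)
xb = np.broadcast_to(x, (2, M, K))                # (2, M, K)
out = xb @ w.transpose(0, 2, 1)                   # batch_matmul -> (2, M, units)

# Element-wise ops that follow must broadcast per branch, hence the reshape.
b = np.stack([b1, b2]).reshape(2, 1, units)       # (2, 1, units)
out = out + b

# Splitting along axis 0 recovers the original dense + bias_add outputs.
y1, y2 = out[0], out[1]
assert np.allclose(y1, x @ w1.T + b1, atol=1e-4)
assert np.allclose(y2, x @ w2.T + b2, atol=1e-4)
```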

@MarisaKirisame (Contributor) commented

@soiferj both are completely OK with me.
The main scenario I am afraid of is: you add CombineDense, someone adds CombineRelu next week, another one adds CombineSoftmax, and so on, and after one year we see 10 separate passes that could be unified. If we start with a better design, we save ourselves a big refactor down the road.
It doesn't need to support much (more dimensions, broadcasting, etc.); the most important part is that the infrastructure itself works and allows extension.

@jroesch (Member) commented Sep 4, 2019

@soiferj sounds good to me in general; I am always in favor of layering changes in slowly while setting yourself up for future success.

I think Marisa's point is that we should be able to hook just a few methods and keep the analysis/traversal code generic between the passes. For example, we could also overload how the branches are combined with another hook, i.e. concat vs. stack.

If I understand correctly, the goal is that given parallel operators like op1(args, attrs), we should only need to add three methods: one to check whether op1 is supported (should be an easy one-liner), one to check that the args and attrs match, and one that says how to combine the values, i.e. call stack on a sequence of arguments.

@soiferj (Contributor, Author) commented Sep 4, 2019

Just pushed a new set of changes that introduces CombineParallelOpBatch. There is also a little bit of special logic for combining bias_add, as it needs to be broadcast. For now, I'm pretty happy with the code design. I think we can look at moving the stack and concatenate logic into the base class if we see a need in the future. Let me know what you think!

@soiferj (Contributor, Author) commented Sep 4, 2019

Also, I have a design question: after we support CombineParallelXXXX for many ops, is there still a purpose to fusing element-wise operators in these classes? If we can really combine most parallel ops into batched ops, the regular fusion pass should do this for us, right? That would clean up the code a lot.

@vinx13 (Member) commented Sep 10, 2019

@soiferj Likely some fused operations are not inlined. Can you check which sub-function after fusion raised this error during Relay compilation?

@soiferj (Contributor, Author) commented Sep 10, 2019

The error says it's coming from split_dev_host_funcs in codegen/build_module.cc. The line is

CHECK(ir::VerifyMemory(x, target->device_type)) << "Direct host side access to device memory is detected in " << x->func_name() << ". Did you forget to bind?";

@vinx13 (Member) commented Sep 10, 2019

We need to trace the Relay part of these functions (and see how their TOPI schedules are called). You can use the debug logging here to get more info:
https://github.com/dmlc/tvm/blob/master/python/tvm/relay/backend/_backend.py#L51-L53
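A minimal sketch, assuming the linked lines emit their output through Python's standard logging module, so raising the log level before compiling prints the lowered schedule of each fused function:

```python
# Assumption: the debug output in _backend.py goes through the standard
# `logging` module, so enabling DEBUG-level logging surfaces it.
import logging
logging.basicConfig(level=logging.DEBUG)

# ...then compile as usual, e.g. (sketch):
# with relay.build_config(opt_level=4):
#     graph, lib, params = relay.build(mod, target="llvm")
```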

@soiferj (Contributor, Author) commented Sep 10, 2019

Okay, I have the output. What exactly should I be looking for?

@vinx13 (Member) commented Sep 10, 2019

@soiferj After fusion the Relay program is split into several parts like

%x = fn(...) {
  dense
  relu
}

Each part is passed to tvm.build_module independently, so the task is to find out exactly which part is broken.

@soiferj (Contributor, Author) commented Sep 10, 2019

@vinx13, nevermind, I made a mistake on my testing branch. I created a new branch, merged my changes with master, and can run the model now. Sorry about that.

@soiferj (Contributor, Author) commented Sep 10, 2019

The CI failure doesn't look related to this change... it is occurring in test_forward_conv_batch_norm, and the output from the test actually looks correct. Is this a known issue? Edit: the issue has been resolved.

@soiferj (Contributor, Author) commented Sep 12, 2019

Thanks everyone for your help and feedback :) This PR has been approved for the past few days - would someone be able to take a look at merging when they get a chance?

Review comments (resolved): src/relay/pass/combine_parallel_op.h, src/relay/pass/combine_parallel_op_batch.h, python/tvm/relay/transform.py
@soiferj soiferj requested a review from icemelon September 13, 2019 15:46
Review comments (resolved): src/relay/pass/combine_parallel_op.h
@soiferj soiferj requested a review from icemelon September 16, 2019 17:24
@icemelon (Member) commented

@jroesch Could you take one more look at this PR?

@soiferj (Contributor, Author) commented Sep 20, 2019

@jroesch would you mind taking a look when you have a chance?

@jroesch (Member) commented Sep 24, 2019

Sorry for dropping the ball on this one; I've been busy with life and other work things :) Will merge tonight.

@jroesch (Member) commented Sep 24, 2019

Everything looks good to me. Thanks for the contribution, especially all the extra work generalizing the interface.

@jroesch jroesch merged commit ed9fdfb into apache:master Sep 24, 2019
@soiferj soiferj deleted the soiferj/paralleldense branch September 24, 2019 17:14
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
* Refactor to create abstract ParallelOpCombiner

* First draft of CombineParallelDense

* Begin to work on tests

* Test

* Refactor to move out more common code

* Clean up

* Fix

* Remove statics

* fix wording

* Start to add combine_parallel_op_batch

* Resolve PR comments

* Resolve PR comments

* dummy change to retrigger CI

* Change special case from bias_add to add

* Revert special case change

* Ignore units check

* dummy change to retrigger CI

* dummy change to re-trigger CI

* Improve docs

* Update docs

* Update docs
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019 (same commit list as above)
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019 (same commit list as above)