
[BYOC] Enhance partitioning and external codegen #5310

Merged: 6 commits into apache:master on Apr 13, 2020

Conversation

zhiics (Member) commented on Apr 11, 2020

This PR

  • Removes duplicated outputs to downstream subgraphs. Previously, if both B and C consumed the output of A, there would be two outputs from A to (B, C), where one is actually enough.

  • Enables the MobileNet test on DNNL by allocating the large constant arrays in the static data section instead of allocating them on the heap and assigning the values one by one. Note that the compilation time is slightly longer (< 2 mins on a P2 instance).

  • Fixes the way we save intermediate outputs by directly caching and returning them during the post-order traversal.

@comaniac @masahi @soiferj @manupa-arm @trevor-m

masahi self-assigned this on Apr 11, 2020
masahi (Member) commented on Apr 11, 2020

Great! Can you also enable mobilenet exec in the DNNL fuse test by uncommenting this?
https://github.com/apache/incubator-tvm/blob/master/tests/python/relay/test_pass_partition_graph.py#L971-L973

zhiics (Member, Author) commented on Apr 11, 2020

@masahi Thanks for the review. I uncommented those lines. It worked fine.

comaniac (Contributor) left a review:

LGTM. Just minor comments.

Review comments on src/relay/backend/contrib/codegen_c/codegen.cc (3 threads, outdated/resolved)
masahi (Member) commented on Apr 12, 2020

@zhiics Do you know why compiling manually inlined constants is massively faster and more memory efficient for a static array than for a heap one? For my education :)

I guess it is because, for an array inlined on the heap, the compiler needs to generate an instruction for every value?

zhiics (Member, Author) commented on Apr 12, 2020

@masahi Yeah, I only have some hypothetical explanations, which are what motivated me to make this change. I think the heap version was slow because we generated a huge number of statements for array initialization. That makes the code size huge, and even though each statement is simple, every optimization and codegen pass over them (particularly since they go through pointers) is still very slow. After moving the array to the data segment, we can drop those statements and keep only one pointer, so compilation should be much faster. In that case linking and data loading take more time, but that is still insignificant compared to compilation. Does this make sense?

Update: I didn't see your updated paragraph earlier, but yes, it is right.
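To make the contrast concrete, here is a minimal hypothetical sketch of the two initialization styles being compared; the names, sizes, and values are made up and this is not the exact code the DNNL codegen emits.

```cpp
#include <cstdlib>

// Old style: the constant lives on the heap and every element is written by its
// own assignment statement, so a large weight tensor turns into hundreds of
// thousands of statements the C++ compiler has to parse and optimize.
float* dnnl_0_const_0 = static_cast<float*>(std::malloc(sizeof(float) * 1000000));
void InitConstOld() {
  dnnl_0_const_0[0] = 0.0132f;
  dnnl_0_const_0[1] = -0.0871f;
  // ... one assignment per element ...
}

// New style: a single static array with a brace initializer. The values end up
// in the object file's data segment, and the compiler only sees one declaration.
static float dnnl_0_const_1[1000000] = {0.0132f, -0.0871f /* , ... */};
```

In the first form every element costs the compiler a statement through a pointer; in the second the bulk of the data is handled by the assembler/linker and loaded with the binary, which matches the trade-off described above (slightly more linking/loading work, much faster compilation).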

masahi (Member) commented on Apr 12, 2020

Let's keep it open for a couple of days so that ARM people can have a look next week. @mbaret @lhutton1 @manupa-arm

zhiics (Member, Author) commented on Apr 13, 2020

@mbaret @lhutton1 @manupa-arm Could any of you take a look? I'd like to merge this enhancement soon.

zhiics (Member, Author) commented on Apr 13, 2020

Let's bring this in since ARM folks haven't really changed the DNNL codegen code and the fix in partitioning is simple.

zhiics merged commit 5958d60 into apache:master on Apr 13, 2020
zhiics deleted the partition branch on Apr 13, 2020, 21:06
zhiics (Member, Author) commented on Apr 13, 2020

Thanks @masahi @comaniac

soiferj (Contributor) commented on Apr 14, 2020

Hi all, sorry for being late to the party, it's been a really busy last couple of months. I was just pulling these changes and trying them out, and I have a question on the handling of constants. @zhiics, is there a reason why you copy the entire content of the constants instead of just using the value of the pointer?

In other words, instead of float const_1[1000] = {......}, we could have float* const_1 = dl_tensor.data. We can pass that pointer around directly.

zhiics (Member, Author) commented on Apr 14, 2020

The reason is that we need to serialize it. We cannot save the pointer; we need the data itself.
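As a rough illustration of the difference (the array name comes from the comment above, the values are made up, and this is not the exact emitted code), the two options look like this; only the first survives being written out and reloaded as C source:

```cpp
// What the codegen emits: the values are embedded, so the generated C source
// (and the CSourceModule built from it) is self-contained and can be serialized.
static float const_1[1000] = {0.01f, -0.02f /* , ... remaining values */};

// The alternative suggested above: keep only a pointer into the runtime
// NDArray's buffer. That works in-process, but a raw pointer value is
// meaningless once the generated source/module has been saved and reloaded.
// float* const_1 = static_cast<float*>(dl_tensor.data);
```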

soiferj (Contributor) commented on Apr 14, 2020

You’re right, sorry about that. So if our target was CUDA, we would need to first memcpy back to the host?

It turns out the constant propagation in the PartitionGraph pass changed this behavior quite a bit.

Update: I think I see what’s happening - the value of the constant will be on the host. Nvm!

zhiics (Member, Author) commented on Apr 14, 2020

They weren't really constants before; they were variables. Therefore, you wouldn't have constants in the external codegen. With constant propagation we will have them, and we leave them for the external codegen to handle. I think there is a thread about this in the discussion forum.
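A hypothetical sketch of what that means for the generated external function (illustrative names and signatures only, not the actual DNNL wrapper emitted by TVM): before the weights are turned into constants, the partitioned function only sees free variables; afterwards the codegen can bake the data in.

```cpp
// Before: the weight arrived as just another input variable of the external
// function, so there was no constant data for the codegen to embed:
//   extern "C" void dnnl_0(float* input0, float* weight0, float* out0);

// After constant propagation / binding params: the weight is a Constant in the
// partitioned function, so the codegen can emit its data into the source and
// drop it from the argument list.
static float dnnl_0_const_0[64] = {/* values emitted at codegen time */};
extern "C" void dnnl_0(float* input0, float* out0);
```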

lhutton1 (Contributor) commented

Apologies, it was a bank holiday in the UK :) Although this has now been merged, I looked into the constant tensor issue previously, and this method of writing out the constant tensor causes very long compilation times for models like VGG16. I couldn't find a nice way to solve it when I was looking into it, but I just wanted to raise it here so people are aware. Is this a pattern that we want people to follow in the future?

zhiics (Member, Author) commented on Apr 14, 2020

This is what I think we can do for CSourceModule-style external codegen. I have some initial thoughts/work on a different runtime so that we only need to serialize a Relay program and interpret it (i.e. the one used as the minimal example). I don't have many cycles recently, but I will send an RFC once I have bandwidth for it. Anyway, let's not continue the discussion here as it is a different topic.

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
* Remove duplicated output args

* address comment

* fix codegen c

* improve comment

* VisitExprDefault_

* deduce type
zhiics added a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020
dpankratz pushed a commit to dpankratz/incubator-tvm that referenced this pull request Apr 24, 2020