-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BYOC] Enhance partitioning and external codegen #5310
Conversation
Great! Can you also enable mobilenet exec in the dnnl fuse test, by uncommenting this? |
@masahi Thanks for the review. I uncommented those lines. It worked fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just minor comments.
@zhiics Do you know why compiling manually inlined constants is massively faster and memory efficient for static array than heap one? For my education :) I guess it is because for inlined array on heap the compiler needs to generate instruction for every value? |
@masahi Yeah, I only have some hypothetic explanations which motivated me to make this change. I think the reason why heap was slow was because we generated a huge amount of stmts for array initialization. It makes the code size so huge even though many of them are simple. But each optimization and codegen on it would be still very slow (particularly they are pointers). However, after we move the array to the data segment, we can remove these stmts but only have one pointer. Therefore, the compilation should be much faster. In the latter case, linking and data loading would take more time, but it is still insignificant compared to compilation. Does this make sense? Update: yeah, I didn't see your updated paragraph, but it is right. |
Let's keep it open for a couple of days so that ARM people can have a look next week. @mbaret @lhutton1 @manupa-arm |
@mbaret @lhutton1 @manupa-arm Could any of you take a look? I'd like to merge enhancement soon. |
Let's bring this in since ARM folks haven't really changed the DNNL codegen code and the fix in partitioning is simple. |
Hi all, sorry for being late to the party, it's been a really busy last couple of months. I was just pulling these changes and trying them out, and I have a question on the handling of constants. @zhiics, is there a reason why you copy the entire content of the constants instead of just using the value of the pointer? In other words, instead of |
The reason is because we need to serialize it. We cannot save the pointer but we need the data. |
You’re right, sorry about that. So if our target was CUDA, we would need to first memcpy back to the host? It turns out the constant propagation in the PartitionGraph pass changed this behavior quite a bit. Update: I think I see what’s happening - the value of the constant will be on the host. Nvm! |
It wasn't really constant before. It was variables. Therefore, you wouldn't have constants in the external codegen. With constant propagation, we will have them and leave it for the external codegen to handle. I think there is a thread in the discussion forum. |
Apologies, it was bank holiday in the UK :) Although this has now been merged, I've had a look into the constant tensor issue previously and this method of writing out the constant tensor causes very long compilation time for models like VGG16. I couldn't find a nice way to solve this when I was looking into it, but just wanted to raise it here so people are aware. Is this a pattern that we want people to follow in the future? |
This is what I think of we can do for CSourceModule style external codegen. I have some initial thoughts/work of having a different runtime so that we only need to serialize a relay program and interpret it (i.e. the one used as the minimal example). I don't have much cycle recently. I will send an RFC once I have bandwidth on it later. But anyway, let's not continue talking about it here as it is a different topic. |
* Remove duplicated output args * address comment * fix codegen c * improve comment * VisitExprDefault_ * deduce type
* Remove duplicated output args * address comment * fix codegen c * improve comment * VisitExprDefault_ * deduce type
* Remove duplicated output args * address comment * fix codegen c * improve comment * VisitExprDefault_ * deduce type
This PR
Removes the duplicated outputs to the followed subgraphs. Previously, there would be two outputs from A to (B, C) if both B and C use the output from A where actually one is enough.
Enables MobileNet test on DNNL by allocating the large constant array on the static section instead of the heap and then assigning then one by one. Note the compilation time would be slightly longer (< 2mins on a P2 instance).
Fixes the way we used to save the intermediate output by directly caching and returning it in the post order way.
@comaniac @masahi @soiferj @manupa-arm @trevor-m