[Relay] Use opaque for where op #4324
Conversation
LGTM
Can we look a bit into what is going on? e.g. is the fusion too deep, or is the wrong schedule being selected? This seems to be quite an arbitrary change.
ping @kevinthesun @icemelon9 Please take a look into this. I still think this is quite an arbitrary change, given that operators like relu can be implemented using where, and it makes sense to fuse in those cases. Perhaps we can find a better way to resolve the problem.
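To illustrate the relu point above, here is a minimal NumPy sketch (not TVM code, just an illustration) showing that relu is expressible as a where/select over an elementwise comparison, which is why treating where as fusable can pay off:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# relu written directly
relu_direct = np.maximum(x, 0.0)

# relu expressed via a select: where(x > 0, x, 0)
relu_via_where = np.where(x > 0, x, 0.0)

assert np.array_equal(relu_direct, relu_via_where)
```

If where were marked opaque, a pattern like this could no longer fuse with its surrounding elementwise ops.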
Sorry about the delay. I'll look into this this week.
After some benchmarking, I do see a slight improvement on the CPU instances. On C5.9xl, with where being opaque, the latency of BERT is 34.51ms (std: 0.59ms), while with where being broadcast, the latency is 36.06ms (std: 0.50ms). The difference is about 1.5ms, or 4%. The difference on C5.4xl is smaller: opaque 46.99ms (std: 0.13ms) vs. broadcast 47.87ms (std: 0.24ms), a 1.8% improvement. I also evaluated this on a GPU instance, P3.2xl, where I didn't see much difference in performance. Given that the improvement is not very significant on CPU, I think we shouldn't make this change for now. In the future, we could potentially have a performance-based op fusion pass that determines which kind of fusion gives the best performance. @tqchen @kevinthesun thoughts?
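For reference, a small sketch of how the relative improvements quoted above work out from the reported latencies (instance names and numbers are taken from the comment; the script itself is just illustrative arithmetic):

```python
# BERT latencies (ms) reported in the benchmark above.
bert_latency_ms = {
    "c5.9xlarge": {"opaque": 34.51, "broadcast": 36.06},
    "c5.4xlarge": {"opaque": 46.99, "broadcast": 47.87},
}

for instance, t in bert_latency_ms.items():
    # Relative improvement of opaque over broadcast.
    speedup_pct = (t["broadcast"] - t["opaque"]) / t["broadcast"] * 100
    print(f"{instance}: {speedup_pct:.1f}% faster with where marked opaque")
```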
@icemelon9 Looks like the performance gap is not huge for CPU. We can keep the current fusion pattern then.
Agree. Closing this PR for now.
@tqchen @kevinthesun Here's the example I used.
It's probably because the batch_matmul schedule doesn't consider the situation where it's not the final output. I'll see if I can update the schedule to fix this issue.
#4537 fixes the batch_matmul schedule issue.
Fusing the where op causes significant performance degradation in the BERT model