
[Relay] Use opaque for where op #4324

Closed · icemelon wants to merge 1 commit

Conversation

icemelon
Member

Fusing the where op causes significant performance degradation in the BERT model.
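For context, a minimal sketch of what such a change looks like in Relay's Python op registration. The module path and the default BROADCAST pattern are assumptions based on the 0.6-era code layout; the actual one-commit diff may differ.

from tvm.relay.op import op as _reg

# Marking `where` as OPAQUE makes the FuseOps pass treat it as a fusion
# boundary instead of a fusable broadcast op. This line is meant to replace
# the existing registration in the op registry, not to be run on top of it.
# before (assumed): _reg.register_pattern("where", _reg.OpPattern.BROADCAST)
_reg.register_pattern("where", _reg.OpPattern.OPAQUE)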

@icemelon icemelon requested a review from kevinthesun November 13, 2019 22:44
Contributor

@kevinthesun kevinthesun left a comment


LGTM

@tqchen
Member

tqchen commented Nov 14, 2019

Can we look a bit into what is going on, e.g. too deep fusion or a selection issue? This seems like quite an arbitrary change.

@tqchen tqchen added the status: need update need update based on feedbacks label Nov 20, 2019
@tqchen
Member

tqchen commented Nov 20, 2019

ping @kevinthesun @icemelon9 Please take a look at this. I still think it is quite an arbitrary change, given that operators like relu can be implemented using where, and it makes sense to fuse in those cases.

Perhaps we can find better ways to resolve the problem.
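To make the relu-as-where point concrete, a small illustration using the Relay Python builder (variable name and shape here are arbitrary):

from tvm import relay

x = relay.var("x", shape=(4, 8), dtype="float32")
zeros = relay.zeros_like(x)
# relu(x) is just where(x > 0, x, 0): an elementwise select that the fusion
# pass would normally be free to merge with its producers and consumers.
relu_via_where = relay.where(relay.greater(x, zeros), x, zeros)
func = relay.Function([x], relu_via_where)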

@icemelon
Member Author

Sorry about the delay. I'll look into it this week.

@icemelon
Member Author

icemelon commented Nov 22, 2019

After some benchmarking, I do see a slight improvement on the CPU instances. On C5.9xl, with where as opaque, the BERT latency is 34.51ms (std: 0.59ms), while with where as broadcast it is 36.06ms (std: 0.50ms); the difference is about 1.5ms, or 4%. The difference on C5.4xl is smaller: opaque 46.99ms (std: 0.13ms) vs. broadcast 47.87ms (std: 0.24ms), a 1.8% improvement. I also evaluated this on the GPU instance P3.2xl, where I didn't see much difference in performance.
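For reference, a sketch of how such latencies can be measured with the graph runtime's time_evaluator. It assumes graph, lib, and params come from relay.build on the BERT model; this is illustrative, not the exact script used above.

import numpy as np
from tvm.contrib import graph_runtime

def measure_latency_ms(graph, lib, params, ctx, number=10, repeat=30):
    # Build a runtime module, bind the weights, and time the whole graph.
    module = graph_runtime.create(graph, lib, ctx)
    module.set_input(**params)
    ftimer = module.module.time_evaluator("run", ctx, number=number, repeat=repeat)
    prof = np.array(ftimer().results) * 1000.0  # seconds -> milliseconds
    return prof.mean(), prof.std()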

Given that the improvement is not very significant on CPU, I think we shouldn't make this change for now. In the future, we could potentially have a performance-based op fusion pass that can determine which fusion pattern gives the best performance.

@tqchen @kevinthesun thoughts?

@kevinthesun
Contributor

@icemelon9 Looks like the performance gap is not huge for CPU. We can keep the current fusion pattern then.

@icemelon
Member Author

Agreed. Closing this PR for now.

@icemelon icemelon closed this Nov 24, 2019
@icemelon
Member Author

icemelon commented Dec 17, 2019

@tqchen @kevinthesun
After further investigation, I found that when the sequence length is 256, the fused batch_matmul + ones_like + where is significantly slower (~50%) than the non-fused version (code shown below). I checked the schedule and found that the culprit is likely the memory allocation in the fused op. This also explains why there isn't much of a performance penalty when the sequence length is smaller.

Here's the example I used.

fn (%x: Tensor[(12, 256, 64), float32], %y: Tensor[(12, 256, 64), float32], %z: Tensor[(12, 256, 256), float32]) {
  %0 = nn.batch_matmul(%x, %y);
  %1 = ones_like(%0);
  where(%z, %0, %1)
}

This is probably because the batch_matmul schedule doesn't consider the situation where it is not the final output. I'll see if I can update the schedule to fix this issue.
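The same expression built with the Relay Python API, in case anyone wants to reproduce the comparison (a sketch; the target and build flow are up to the reader):

from tvm import relay

x = relay.var("x", shape=(12, 256, 64), dtype="float32")
y = relay.var("y", shape=(12, 256, 64), dtype="float32")
z = relay.var("z", shape=(12, 256, 256), dtype="float32")

bmm = relay.nn.batch_matmul(x, y)   # (12, 256, 64) x (12, 256, 64)^T -> (12, 256, 256)
ones = relay.ones_like(bmm)
out = relay.where(z, bmm, ones)
func = relay.Function([x, y, z], out)
# compile, e.g.: graph, lib, params = relay.build(func, target="llvm")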

@icemelon icemelon reopened this Dec 17, 2019
@icemelon
Member Author

#4537 fixes the batch_matmul schedule issue.
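As a rough illustration of the idea behind such a fix (assuming the 0.6-era topi package layout; this is not the actual #4537 diff), x86 TOPI schedules generally handle an op that is not the final output of a fused group like this:

from topi.util import traverse_inline

def _schedule_batch_matmul(s, outs):
    def _callback(op):
        if "batch_matmul" in op.tag:
            C = op.output(0)
            O = outs[0]
            # ... tile / parallelize the batch_matmul stage here ...
            if C.op not in s.outputs:
                # batch_matmul feeds fused elementwise ops (e.g. where):
                # compute it under the real output's outer loop so the
                # intermediate buffer stays small.
                s[C].compute_at(s[O], s[O].op.axis[0])

    # Walk back from the final (possibly fused) output, inlining injective
    # stages and scheduling batch_matmul wherever it appears.
    traverse_inline(s, outs[0].op, _callback)
    return s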

@icemelon icemelon closed this Dec 18, 2019
@icemelon icemelon deleted the where branch December 18, 2019 00:11