
[Relay] Use opaque for where op #4324

Closed · icemelon wants to merge 1 commit

Conversation

icemelon
Member

Fusing the where op causes significant performance degradation in the BERT model.
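For context, a minimal sketch of what such a change looks like in Relay's Python op registration. The module path and the default BROADCAST pattern are assumptions based on the 0.6-era code layout; the actual one-commit diff may differ.

from tvm.relay.op import op as _reg

# Marking `where` as OPAQUE makes the FuseOps pass treat it as a fusion
# boundary instead of a fusable broadcast op. This line is meant to replace
# the existing registration in the op registry, not to be run on top of it.
# before (assumed): _reg.register_pattern("where", _reg.OpPattern.BROADCAST)
_reg.register_pattern("where", _reg.OpPattern.OPAQUE)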

@icemelon icemelon requested a review from kevinthesun November 13, 2019 22:44
Contributor

@kevinthesun kevinthesun left a comment


LGTM

@tqchen
Member

tqchen commented Nov 14, 2019

Can we look a bit into what is going on, e.g. too deep fusion or a selection issue? This seems like quite an arbitrary change.

@tqchen tqchen added the status: need update need update based on feedbacks label Nov 20, 2019
@tqchen
Member

tqchen commented Nov 20, 2019

ping @kevinthesun @icemelon9 Please take a look at this. I still think it is quite an arbitrary change, given that operators like relu can be implemented using where, and it makes sense to fuse in those cases.

Perhaps we can find better ways to resolve the problem.
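To make the relu-as-where point concrete, a small illustration using the Relay Python builder (variable name and shape here are arbitrary):

from tvm import relay

x = relay.var("x", shape=(4, 8), dtype="float32")
zeros = relay.zeros_like(x)
# relu(x) is just where(x > 0, x, 0): an elementwise select that the fusion
# pass would normally be free to merge with its producers and consumers.
relu_via_where = relay.where(relay.greater(x, zeros), x, zeros)
func = relay.Function([x], relu_via_where)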

@icemelon
Member Author

Sorry about the delay. I'll look into it this week.

@icemelon
Member Author

icemelon commented Nov 22, 2019

After some benchmarking, I do see a slight improvement on the CPU instances. On C5.9xl, with where as opaque, the BERT latency is 34.51ms (std: 0.59ms), while with where as broadcast it is 36.06ms (std: 0.50ms); the difference is about 1.5ms, or 4%. The difference on C5.4xl is smaller: opaque 46.99ms (std: 0.13ms) vs. broadcast 47.87ms (std: 0.24ms), a 1.8% improvement. I also evaluated this on the GPU instance P3.2xl, where I didn't see much difference in performance.
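For reference, a sketch of how such latencies can be measured with the graph runtime's time_evaluator. It assumes graph, lib, and params come from relay.build on the BERT model; this is illustrative, not the exact script used above.

import numpy as np
from tvm.contrib import graph_runtime

def measure_latency_ms(graph, lib, params, ctx, number=10, repeat=30):
    # Build a runtime module, bind the weights, and time the whole graph.
    module = graph_runtime.create(graph, lib, ctx)
    module.set_input(**params)
    ftimer = module.module.time_evaluator("run", ctx, number=number, repeat=repeat)
    prof = np.array(ftimer().results) * 1000.0  # seconds -> milliseconds
    return prof.mean(), prof.std()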

Given that the improvement is not very significant on CPU, I think we shouldn't make this change for now. In the future, we could potentially have a performance-based op fusion pass that can determine which fusion pattern gives the best performance.

@tqchen @kevinthesun thoughts?

@kevinthesun
Contributor

@icemelon9 Looks like the performance gap is not huge for CPU. We can keep the current fusion pattern then.

@icemelon
Member Author

Agreed. Closing this PR for now.

@icemelon icemelon closed this Nov 24, 2019
@icemelon
Member Author

icemelon commented Dec 17, 2019

@tqchen @kevinthesun
After further investigation, I found that when the sequence length is 256, the fused batch_matmul + ones_like + where is significantly slower (~50%) than the non-fused version (code shown below). I checked the schedule and found that the culprit is likely the memory allocation in the fused op. This also explains why there isn't much of a performance penalty when the sequence length is smaller.

Here's the example I used.

fn (%x: Tensor[(12, 256, 64), float32], %y: Tensor[(12, 256, 64), float32], %z: Tensor[(12, 256, 256), float32]) {
  %0 = nn.batch_matmul(%x, %y);
  %1 = ones_like(%0);
  where(%z, %0, %1)
}

This is probably because the batch_matmul schedule doesn't consider the situation where it is not the final output. I'll see if I can update the schedule to fix this issue.
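The same expression built with the Relay Python API, in case anyone wants to reproduce the comparison (a sketch; the target and build flow are up to the reader):

from tvm import relay

x = relay.var("x", shape=(12, 256, 64), dtype="float32")
y = relay.var("y", shape=(12, 256, 64), dtype="float32")
z = relay.var("z", shape=(12, 256, 256), dtype="float32")

bmm = relay.nn.batch_matmul(x, y)   # (12, 256, 64) x (12, 256, 64)^T -> (12, 256, 256)
ones = relay.ones_like(bmm)
out = relay.where(z, bmm, ones)
func = relay.Function([x, y, z], out)
# compile, e.g.: graph, lib, params = relay.build(func, target="llvm")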

@icemelon icemelon reopened this Dec 17, 2019
@icemelon
Member Author

#4537 fixes the batch_matmul schedule issue.
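As a rough illustration of the idea behind such a fix (assuming the 0.6-era topi package layout; this is not the actual #4537 diff), x86 TOPI schedules generally handle an op that is not the final output of a fused group like this:

from topi.util import traverse_inline

def _schedule_batch_matmul(s, outs):
    def _callback(op):
        if "batch_matmul" in op.tag:
            C = op.output(0)
            O = outs[0]
            # ... tile / parallelize the batch_matmul stage here ...
            if C.op not in s.outputs:
                # batch_matmul feeds fused elementwise ops (e.g. where):
                # compute it under the real output's outer loop so the
                # intermediate buffer stays small.
                s[C].compute_at(s[O], s[O].op.axis[0])

    # Walk back from the final (possibly fused) output, inlining injective
    # stages and scheduling batch_matmul wherever it appears.
    traverse_inline(s, outs[0].op, _callback)
    return s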

@icemelon icemelon closed this Dec 18, 2019
@icemelon icemelon deleted the where branch December 18, 2019 00:11