[GPU] Optimize iGPU FC with prime number batch size #24893

riverlijunjie · 2024-06-07T03:15:13Z

Details: Fix low iGPU FC performance when the FC batch size is not aligned with 2/4 (not a multiple of 2 or 4)

Desc: Some FC input shapes are not aligned with 2/4; for example, ViT models use 257x4096 or 577x4096.
With such an unaligned batch size, the iGPU runs FC very slowly: about 23 ms for 257x4096->257x1024 and
50 ms for 577x4096->577x1024.
Root cause: When the FC batch size is not aligned with 2/4, the best TuneParams are not chosen and the
default parameters are used instead, which leads to the worst performance.
See the figure (https://github.com/openvinotoolkit/openvino/assets/31196718/a9debd4e-bc77-45ac-9942-01813b0d61ab): EU active is about 3.5% while XVE thread occupancy is almost 100%, and global memory read bandwidth is 77 GB/s, which has reached the hardware bandwidth limit (~75 GB/s). This means that memory utilization in the L3 cache is too low.

Solution: If the FC batch size is not aligned with 2/4, we can still use tile_b=16 with dispatch_bsv==1 as TuneParams,
which benefits from the higher ratio of GFLOPS to data read bandwidth.
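To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch, not the kernel's actual logic. It assumes a simplified model in which the full weight matrix is streamed from memory once per batch tile with no cache reuse, that the default fallback corresponds to tile_b=1, and that weights are 2 bytes each (fp16). None of these assumptions is stated in this PR, and the function names are hypothetical.

```python
def batch_tiles(batch: int, tile_b: int) -> int:
    """Number of batch tiles needed to cover `batch` rows, `tile_b` rows per tile."""
    return -(-batch // tile_b)  # integer ceiling division


def tail_rows(batch: int, tile_b: int) -> int:
    """Valid rows in the last tile, which is partial when batch % tile_b != 0."""
    rem = batch % tile_b
    return rem if rem else tile_b


def weight_bytes_streamed(batch: int, in_features: int, out_features: int,
                          tile_b: int, bytes_per_weight: int = 2) -> int:
    """Weight traffic under the simplified model: one full pass per batch tile.

    Assumption: no cache reuse between tiles and fp16 weights by default.
    """
    return batch_tiles(batch, tile_b) * in_features * out_features * bytes_per_weight


if __name__ == "__main__":
    batch, in_f, out_f = 257, 4096, 1024  # FC shape from the PR description
    for tile_b in (1, 16):  # tile_b=1 is an assumed stand-in for the default fallback
        tiles = batch_tiles(batch, tile_b)
        traffic_mb = weight_bytes_streamed(batch, in_f, out_f, tile_b) / 1e6
        print(f"tile_b={tile_b}: {tiles} tiles, "
              f"last tile has {tail_rows(batch, tile_b)} valid row(s), "
              f"~{traffic_mb:.0f} MB of weight reads")
```

Under this model a batch of 257 needs 17 tiles at tile_b=16, with only one valid row in the last tile, so each loaded weight is reused across up to 16 rows and the weight traffic drops by about 15x. Real numbers depend on caching and on the rest of the TuneParams, so this only illustrates the direction of the effect.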
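Test results on MTL (https://github.com/openvinotoolkit/openvino/assets/31196718/8c6b566c-8389-419f-836e-eaab29f8ef02):

FC 257x4096->257x1024: latency improved from 23 ms to 0.9 ms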

| Model | master | PR to opt |
| -- | -- | -- |
| CLIP visual | 0.99 FPS | 13.00 FPS |
| ViT_B | 5.37 FPS | 20.40 FPS |
| ViT_L | 0.56 FPS | 4.91 FPS |
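For readers who want the gains as ratios, the figures above can be converted with a few lines of arithmetic. This is only a convenience sketch over the numbers reported in this PR; the helper name `speedup` is hypothetical.

```python
def speedup(before: float, after: float, higher_is_better: bool = True) -> float:
    """Improvement ratio: after/before for throughput, before/after for latency."""
    return after / before if higher_is_better else before / after


# (master, PR) throughput in FPS, as reported in the table above
fps_results = {
    "CLIP visual": (0.99, 13.00),
    "ViT_B": (5.37, 20.40),
    "ViT_L": (0.56, 4.91),
}

if __name__ == "__main__":
    for model, (master_fps, pr_fps) in fps_results.items():
        print(f"{model}: {speedup(master_fps, pr_fps):.2f}x throughput")
    # FC 257x4096->257x1024 latency in ms: lower is better
    print(f"FC latency: {speedup(23.0, 0.9, higher_is_better=False):.2f}x faster")
```

This gives roughly 13.1x for CLIP visual, 3.8x for ViT_B, 8.8x for ViT_L, and 25.6x for the standalone FC latency.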

Tickets:

CVS-142833

...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp

yeonbok · 2024-08-16T03:14:27Z

LGTM if the regression check is done.

wenjiew · 2024-09-04T05:53:51Z

Can we conduct further code review to merge? @vladimir-paramuzov @geunhwan Thanks!

github-actions · 2024-09-20T00:22:12Z

This PR will be closed in a week because of 2 weeks of no activity.

github-actions · 2024-10-22T00:23:09Z

This PR will be closed in a week because of 2 weeks of no activity.

yeonbok · 2024-10-23T21:42:08Z

I triggered our internal benchmark for MTL

riverlijunjie requested review from a team as code owners June 7, 2024 03:15

github-actions bot added the category: GPU OpenVINO GPU plugin label Jun 7, 2024

riverlijunjie force-pushed the river/gpu_opt_unaligned_batch_size branch from c2843be to 42f7d3a June 11, 2024 02:20

riverlijunjie added 4 commits June 11, 2024 10:32

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

42f7d3a

Avoid writing unnecessary output buffer for fc_tiled kernel

bf874ad

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

8a17bea

riverlijunjie force-pushed the river/gpu_opt_unaligned_batch_size branch from 9c4ffd1 to d7bd900 June 16, 2024 14:37

Fix bug

d7bd900

songbell reviewed Jul 15, 2024

...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp (outdated)

songbell reviewed Jul 15, 2024

...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp (outdated)

riverlijunjie added 2 commits July 22, 2024 16:02

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

f1d567a

update

03b5d39

peterchen-intel changed the title ~~[GPU] Solve iGPU FC low performance issue when FC batch size is not a…~~ [GPU] Optimize iGPU FC with prime number batch size Jul 29, 2024

riverlijunjie requested review from isanghao and vladimir-paramuzov July 30, 2024 07:07

vladimir-paramuzov added the under_perf_check label Aug 5, 2024

yeonbok reviewed Aug 5, 2024

...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp

peterchen-intel requested a review from yeonbok August 15, 2024 13:05

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

fb9ffc1

peterchen-intel assigned vladimir-paramuzov Aug 20, 2024

wenjiew added this to the 2024.5 milestone Sep 2, 2024

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

58502ad

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

28fed45

peterchen-intel requested a review from ceciliapeng2011 September 5, 2024 04:09

github-actions bot added the Stale label Sep 20, 2024

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

ec09d22

github-actions bot removed the Stale label Sep 25, 2024

peterchen-intel added a commit to peterchen-intel/openvino that referenced this pull request Sep 25, 2024

PR#24893

5d2d0fc

openvinotoolkit#24893 Signed-off-by: Chen, Peter <[email protected]>

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

d585aaf

github-actions bot added the Stale label Oct 22, 2024

Merge branch 'master' into river/gpu_opt_unaligned_batch_size

d1778e4

yeonbok approved these changes Oct 23, 2024

yeonbok added this pull request to the merge queue Oct 23, 2024

github-actions bot removed the Stale label Oct 24, 2024

Merged via the queue into openvinotoolkit:master with commit f4e8b82 Oct 24, 2024
142 of 144 checks passed
