[GPU] Optimize iGPU FC with prime number batch size #24893

Merged

Conversation

riverlijunjie
Contributor

@riverlijunjie riverlijunjie commented Jun 7, 2024

Details: Fix slow iGPU FC performance when the FC batch size is not aligned with 2/4

  • Desc: The FC input shape is sometimes not aligned with 2/4; for example, ViT models use 257x4096 or 577x4096.
    With such an unaligned batch size, the iGPU runs FC very slowly: about 23ms for 257x4096->257x1024 and
    50ms for 577x4096->577x1024.

  • Root cause: When the FC batch size is not aligned with 2/4, the best TuneParams are not chosen and the kernel falls back
    to default parameters, which gives the worst performance.
    See the figure below: EU active is about 3.5% while XVE thread occupancy is almost 100%, and global memory read bandwidth is 77 GB/s, which has reached the hardware bandwidth limit (~75 GB/s). This means that L3 cache utilization is too low.

image

  • Solution: If the FC batch size is not aligned with 2/4, we can still use tile_b=16 with dispatch_bsv==1 as TuneParams,
    which benefits from the higher ratio of GFLOPS to data read bandwidth.

  • Test result on MTL:

image
FC 257x4096->257x1024: latency improved from 23ms to 0.9ms

Model | master | PR to opt
-- | -- | --
CLIP visual | 0.99 FPS | 13.00 FPS
ViT_B | 5.37 FPS | 20.40 FPS
ViT_L | 0.56 FPS | 4.91 FPS
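
The root-cause and solution bullets above can be sanity-checked with a rough bandwidth estimate. The sketch below is not taken from the PR: it assumes fp16 weights (2 bytes per element), assumes that each group of tile_b batch rows re-reads the whole weight matrix from memory with no cache reuse, and uses tile_b=1 as a stand-in for the unspecified default parameters. The helper names are hypothetical.

```python
import math

def weight_bytes_read(batch, in_features, out_features, tile_b, bytes_per_elem=2):
    """Weight bytes fetched if every batch tile re-reads the full weight matrix.

    Assumptions (not from the PR): fp16 weights, no cache reuse between tiles.
    """
    passes = math.ceil(batch / tile_b)  # number of batch tiles, including a partial one
    return passes * in_features * out_features * bytes_per_elem

def bandwidth_bound_ms(num_bytes, bandwidth_gb_s):
    """Time in milliseconds to move num_bytes at the given bandwidth (GB/s)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3

if __name__ == "__main__":
    # FC 257x4096 -> 257x1024 at the ~75 GB/s limit quoted in the description.
    for tile_b in (1, 16):  # tile_b=1 is an assumed default, 16 is the PR's choice
        nbytes = weight_bytes_read(257, 4096, 1024, tile_b)
        print(f"tile_b={tile_b}: {nbytes / 1e6:.0f} MB, "
              f"{bandwidth_bound_ms(nbytes, 75.0):.2f} ms")
```

Under these assumptions the estimate is roughly 29 ms of weight traffic for tile_b=1 and roughly 1.9 ms for tile_b=16, the same order of magnitude as the measured 23ms and 0.9ms. The real kernel benefits from cache reuse and other tuning parameters, so this is only an illustration of why a larger tile_b raises the ratio of compute to data read.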

Tickets:

@riverlijunjie riverlijunjie requested review from a team as code owners June 7, 2024 03:15
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jun 7, 2024
@riverlijunjie riverlijunjie force-pushed the river/gpu_opt_unaligned_batch_size branch from c2843be to 42f7d3a Compare June 11, 2024 02:20
…ligned with 2/4

   Desc: The FC input shape is sometimes not aligned with 2/4; for example, ViT models use 257x4096 or 577x4096.
         With such an unaligned batch size, the iGPU runs FC very slowly: about 23ms for 257x4096->257x1024 and
         50ms for 577x4096->577x1024

   Root cause: When the FC batch size is not aligned with 2/4, the best TuneParams are not chosen and the kernel falls back
         to default parameters, which gives the worst performance.

   Solution: If the FC batch size is not aligned with 2/4, we can still use tile_b=16 with dispatch_bsv==1 as TuneParams,
         which benefits from the higher ratio of GFLOPS to data read bandwidth.

   Test result on MTL:
             FC 257x4096->257x1024:   latency improved from 23ms to 0.9ms
             ViT B/16:    5.37 fps  -->  20.40 fps
             ViT L/16:    0.56 fps  -->  4.91 fps
             CLIP visual: 0.99 fps  -->  13.00 fps
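
The selection rule described in this commit message can be sketched as follows. This is an illustration only, not the plugin's code: `TuneParams`, `DEFAULT_PARAMS`, `pick_params`, and `tile_split` are hypothetical names, the default values are placeholders because the PR does not state them, and tile_b is assumed to mean the number of batch rows handled per work-item.

```python
from typing import NamedTuple, Tuple

class TuneParams(NamedTuple):
    tile_b: int        # assumed meaning: batch rows processed per work-item
    dispatch_bsv: int  # dispatch parameter named in the PR text

# Hypothetical stand-in for the "default parameters"; the PR gives no values.
DEFAULT_PARAMS = TuneParams(tile_b=1, dispatch_bsv=1)

def is_aligned(batch: int) -> bool:
    """True if the batch is aligned with 2 (which also covers alignment with 4)."""
    return batch % 2 == 0

def pick_params(batch: int, aligned_params: TuneParams, with_fix: bool) -> TuneParams:
    """Aligned batches keep their tuned parameters; unaligned ones either fall
    back to the defaults (before the PR) or use tile_b=16, dispatch_bsv=1."""
    if is_aligned(batch):
        return aligned_params
    return TuneParams(tile_b=16, dispatch_bsv=1) if with_fix else DEFAULT_PARAMS

def tile_split(batch: int, tile_b: int) -> Tuple[int, int]:
    """Return (number of full tiles, leftover rows in a final partial tile)."""
    return divmod(batch, tile_b)

if __name__ == "__main__":
    tuned = TuneParams(tile_b=8, dispatch_bsv=1)  # placeholder for an aligned case
    for batch in (257, 577):
        params = pick_params(batch, tuned, with_fix=True)
        print(batch, params, tile_split(batch, params.tile_b))
```

Both 257 and 577 are prime, so no tile size greater than 1 divides them evenly; with tile_b=16 each batch leaves a single leftover row that the last tile has to handle.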
@riverlijunjie riverlijunjie force-pushed the river/gpu_opt_unaligned_batch_size branch from 9c4ffd1 to d7bd900 Compare June 16, 2024 14:37
@peterchen-intel peterchen-intel changed the title [GPU] Solve iGPU FC low performance issue when FC batch size is not a… [GPU] Optimize iGPU FC with prime number batch size Jul 29, 2024
@yeonbok
Contributor

yeonbok commented Aug 16, 2024

LGTM if the regression check is done.

@wenjiew

wenjiew commented Sep 4, 2024

Can we conduct further code review to merge? @vladimir-paramuzov @geunhwan Thanks!

Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Sep 20, 2024
@github-actions github-actions bot removed the Stale label Sep 25, 2024
peterchen-intel added a commit to peterchen-intel/openvino that referenced this pull request Sep 25, 2024
openvinotoolkit#24893

Signed-off-by: Chen, Peter <[email protected]>
Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Oct 22, 2024
@yeonbok
Contributor

yeonbok commented Oct 23, 2024

I triggered our internal benchmark for MTL

@yeonbok yeonbok added this pull request to the merge queue Oct 23, 2024
@github-actions github-actions bot removed the Stale label Oct 24, 2024
Merged via the queue into openvinotoolkit:master with commit f4e8b82 Oct 24, 2024
142 of 144 checks passed
CuriousPanCake pushed a commit to CuriousPanCake/openvino that referenced this pull request Nov 6, 2024
…24893)

### Details: Fix slow iGPU FC performance when the FC batch size is
not aligned with 2/4

- Desc: The FC input shape is sometimes not aligned with 2/4; for example,
ViT models use 257x4096 or 577x4096.
With such an unaligned batch size, the iGPU runs FC very slowly: about
23ms for 257x4096->257x1024 and
50ms for 577x4096->577x1024.

- Root cause: When the FC batch size is not aligned with 2/4, the best
TuneParams are not chosen and the kernel falls back
to default parameters, which gives the worst performance.
See the figure below: EU active is about 3.5% while XVE thread occupancy
is almost 100%, and global memory read bandwidth is 77 GB/s, which has
reached the hardware bandwidth limit (~75 GB/s). This means that L3 cache
utilization is too low.


![image](https://github.com/openvinotoolkit/openvino/assets/31196718/a9debd4e-bc77-45ac-9942-01813b0d61ab)


- Solution: If the FC batch size is not aligned with 2/4, we can still
use tile_b=16 with dispatch_bsv==1 as TuneParams,
which benefits from the higher ratio of GFLOPS to data read
bandwidth.

   - Test result on MTL:
   

![image](https://github.com/openvinotoolkit/openvino/assets/31196718/8c6b566c-8389-419f-836e-eaab29f8ef02)
FC 257x4096->257x1024: latency improved from 23ms to 0.9ms

Model | master | PR to opt
-- | -- | --
CLIP visual | 0.99 FPS | 13.00 FPS
ViT_B | 5.37 FPS | 20.40 FPS
ViT_L | 0.56 FPS | 4.91 FPS
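
For reference, the relative gains implied by the results above can be computed directly from the reported numbers. The snippet is a small sketch; the dictionary layout and the `speedup` helper are illustrative names, not part of the PR.

```python
# (master, PR) throughput in FPS, copied from the results table above.
fps_results = {
    "CLIP visual": (0.99, 13.00),
    "ViT_B": (5.37, 20.40),
    "ViT_L": (0.56, 4.91),
}

def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after / before

if __name__ == "__main__":
    for model, (master_fps, pr_fps) in fps_results.items():
        print(f"{model}: {speedup(master_fps, pr_fps):.1f}x")
    # For latency the ratio is inverted: old time divided by new time.
    print(f"FC 257x4096->257x1024: {23.0 / 0.9:.1f}x lower latency")
```

This works out to roughly 13x for CLIP visual, 3.8x for ViT_B, and 8.8x for ViT_L, with the standalone FC latency dropping by a factor of about 25.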


### Tickets:
 - CVS-142833

---------

Co-authored-by: Chen Peter <[email protected]>
6 participants