[GPU] Optimize iGPU FC with prime number batch size #24893
Merged
yeonbok
merged 13 commits into
openvinotoolkit:master
from
riverlijunjie:river/gpu_opt_unaligned_batch_size
Oct 24, 2024
riverlijunjie
force-pushed
the
river/gpu_opt_unaligned_batch_size
branch
from
June 11, 2024 02:20
c2843be
to
42f7d3a
Compare
…ligned with 2/4
Desc: Sometimes the FC input shape is not aligned with 2/4; for example, ViT models use 257x4096 or 577x4096. With such an unaligned batch size, the iGPU runs FC very slowly: about 23 ms for 257x4096->257x1024 and 50 ms for 577x4096->577x1024.
Root cause: When FC's batch size is not aligned with 2/4, the best TuneParams are not chosen and the kernel falls back to default parameters, which leads to the worst performance.
Solution: If FC's batch size is not aligned with 2/4, we can still use tile_b=16 with dispatch_bsv==1 as TuneParams, which benefits from a higher ratio of GFLOPS to data read bandwidth.
Test result on MTL:
FC 257x4096->257x1024: latency improved from 23ms to 0.9ms
ViT B/16: 5.37 fps --> 20.40 fps
ViT L/16: 0.56 fps --> 4.91 fps
CLIP visual: 0.99 fps --> 13.00 fps
riverlijunjie
force-pushed
the
river/gpu_opt_unaligned_batch_size
branch
from
June 16, 2024 14:37
9c4ffd1
to
d7bd900
Compare
songbell
reviewed
Jul 15, 2024
...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp
Outdated
songbell
reviewed
Jul 15, 2024
...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp
Outdated
peterchen-intel
changed the title
[GPU] Solve iGPU FC low performance issue when FC batch size is not a…
[GPU] Optimize iGPU FC with prime number batch size
Jul 29, 2024
yeonbok
reviewed
Aug 5, 2024
...ns/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp
LGTM if the regression check is done.
Can we conduct further code review to merge? @vladimir-paramuzov @geunhwan Thanks!
This PR will be closed in a week because of 2 weeks of no activity.
peterchen-intel
added a commit
to peterchen-intel/openvino
that referenced
this pull request
Sep 25, 2024
openvinotoolkit#24893 Signed-off-by: Chen, Peter <[email protected]>
This PR will be closed in a week because of 2 weeks of no activity.
yeonbok
approved these changes
Oct 23, 2024
I triggered our internal benchmark for MTL
Merged
via the queue into
openvinotoolkit:master
with commit Oct 24, 2024
f4e8b82
142 of 144 checks passed
CuriousPanCake
pushed a commit
to CuriousPanCake/openvino
that referenced
this pull request
Nov 6, 2024
…24893)

### Details:
Solve iGPU FC low performance issue when FC batch size is not aligned with 2/4
- Desc: Sometimes the FC input shape is not aligned with 2/4; for example, ViT models use 257x4096 or 577x4096. With such an unaligned batch size, the iGPU runs FC very slowly: about 23 ms for 257x4096->257x1024 and 50 ms for 577x4096->577x1024.
- Root cause: When FC's batch size is not aligned with 2/4, the best TuneParams are not chosen and the kernel falls back to default parameters, which leads to the worst performance. See the figure below: EU active is about 3.5% while XVE thread occupancy is almost 100%, and global memory read bandwidth is 77 GB/s, which has reached the hardware bandwidth limit (~75 GB/s). This means that memory utilization in L3 cache is too low.
  ![image](https://github.com/openvinotoolkit/openvino/assets/31196718/a9debd4e-bc77-45ac-9942-01813b0d61ab)
- Solution: If FC's batch size is not aligned with 2/4, we can still use tile_b=16 with dispatch_bsv==1 as TuneParams, which benefits from a higher ratio of GFLOPS to data read bandwidth.
- Test result on MTL:
  ![image](https://github.com/openvinotoolkit/openvino/assets/31196718/8c6b566c-8389-419f-836e-eaab29f8ef02)
  FC 257x4096->257x1024: latency improved from 23ms to 0.9ms

|  | master | PR to opt |
| -- | -- | -- |
| CLIP visual | 0.99 FPS | 13.00 FPS |
| ViT_B | 5.37 FPS | 20.40 FPS |
| ViT_L | 0.56 FPS | 4.91 FPS |

### Tickets:
- CVS-142833

---------

Co-authored-by: Chen Peter <[email protected]>
Details: Solve iGPU FC low performance issue when FC batch size is not aligned with 2/4
Desc: Sometimes the FC input shape is not aligned with 2/4; for example, ViT models use 257x4096 or 577x4096.
With such an unaligned batch size, the iGPU runs FC very slowly: about 23 ms for 257x4096->257x1024 and
50 ms for 577x4096->577x1024.
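To make "not aligned with 2/4" concrete, here is a minimal sketch that checks the two batch sizes mentioned above. The helper names `is_aligned` and `is_prime` are hypothetical and are not taken from the OpenVINO sources; they only illustrate why 257 and 577 fall outside the aligned case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when `batch` is a multiple of `alignment`.
bool is_aligned(std::uint64_t batch, std::uint64_t alignment) {
    assert(alignment > 0);
    return batch % alignment == 0;
}

// Hypothetical helper: trial-division primality test, adequate for
// batch-size-scale values such as 257 or 577.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}
```

Both 257 and 577 are prime, so neither is a multiple of 2 or 4, whereas a neighbouring batch size such as 256 is aligned to both.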
Root cause: When FC's batch size is not aligned with 2/4, the best TuneParams are not chosen and the kernel falls back
to default parameters, which leads to the worst performance.
See the figure below: EU active is about 3.5% while XVE thread occupancy is almost 100%, and global memory read bandwidth is 77 GB/s, which has reached the hardware bandwidth limit (~75 GB/s). This means that memory utilization in L3 cache is too low.
Solution: If FC's batch size is not aligned with 2/4, we can still use tile_b=16 with dispatch_bsv==1 as TuneParams,
which benefits from a higher ratio of GFLOPS to data read bandwidth.
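The effect of a larger batch tile on the compute-to-read ratio can be sketched with a simplified model. The assumptions here are mine, not the PR's: each weight element is read once per batch tile and reused for every row in that tile, and caching across tiles is ignored. The names `FcShape`, `split_batch`, `weight_elements_read` and `total_flops` are hypothetical, and the `tile_b = 1` baseline is only illustrative, since the PR does not state what the default parameters are.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of an FC problem: batch x in_features -> batch x out_features.
struct FcShape {
    std::uint64_t batch;
    std::uint64_t in_features;
    std::uint64_t out_features;
};

// How a batch splits into tiles of tile_b rows.
struct BatchTiling {
    std::uint64_t full_tiles;     // tiles containing exactly tile_b rows
    std::uint64_t leftover_rows;  // rows left for a final partial tile
};

BatchTiling split_batch(std::uint64_t batch, std::uint64_t tile_b) {
    assert(tile_b > 0);
    return BatchTiling{batch / tile_b, batch % tile_b};
}

// Model assumption: the full weight matrix is read once per batch tile,
// including the final partial tile.
std::uint64_t weight_elements_read(const FcShape& s, std::uint64_t tile_b) {
    assert(tile_b > 0);
    const std::uint64_t tiles = (s.batch + tile_b - 1) / tile_b;  // ceiling division
    return tiles * s.in_features * s.out_features;
}

// One multiply and one add per (batch row, input feature, output feature).
std::uint64_t total_flops(const FcShape& s) {
    return 2 * s.batch * s.in_features * s.out_features;
}
```

For 257x4096->257x1024, `split_batch(257, 16)` gives 16 full tiles plus 1 leftover row, so the model reads the weights 17 times instead of 257 times for the same number of FLOPs. This shows only the direction of the effect and is not a prediction of the measured latency.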
Test result on MTL:
FC 257x4096->257x1024: latency improved from 23ms to 0.9ms
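The reported figures can be turned into speedup factors with a small calculation. The function names below are hypothetical; the inputs are the latency and FPS numbers reported for this PR on MTL.

```cpp
#include <cassert>

// Speedup for a latency metric (lower is better): before / after.
double latency_speedup(double before_ms, double after_ms) {
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

// Speedup for a throughput metric such as FPS (higher is better): after / before.
double throughput_speedup(double before_fps, double after_fps) {
    assert(before_fps > 0.0);
    return after_fps / before_fps;
}
```

With the reported numbers, `latency_speedup(23.0, 0.9)` is about 25.6x for the standalone FC, and `throughput_speedup` gives about 13.1x for CLIP visual (0.99 to 13.00 FPS), 3.8x for ViT B/16 (5.37 to 20.40 FPS) and 8.8x for ViT L/16 (0.56 to 4.91 FPS).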
Tickets: