[GPU] Add SLM support for FC bf tiled kernel #21435
Conversation
Force-pushed from f012c83 to e838b4c
Force-pushed from 2eec4fd to c28adf9
Fix unaligned IFM leftovers processing in case of compressed weights and add decompression scale post op support
I made a patch (sshlyapn#1) for model caching of dynamic models.
Please review and merge it into this PR.
added FullyConnected_bf_tiled::GetUpdateDispatchDataFunc
No critical comments from my side.
auto updated_layout = actual_layout;
for (auto user : get_user_insts()) {
    // Since fake alignment is applicable for input tensor as well, make sure we allocate enough memory
    // to prevemt reading beyound the allocated memory bounds
nit: prevent, beyond
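For context, the snippet above is the start of the worst-case allocation logic; a minimal sketch of how the loop body plausibly continues is below. The fully_connected type filter and the accessor names are assumptions, not code quoted from this PR.

```cpp
// Sketch (assumed continuation): grow the allocation to the largest fake-aligned
// input layout required by any fully_connected user, so the aligned kernel never
// reads beyond the buffer that was actually allocated.
if (user->get_node().is_type<fully_connected>()) {           // assumed filter
    auto fc_params = *user->get_impl_params();                // assumed accessor
    auto aligned_input = user->get_node().type()
                             ->get_fake_aligned_params(fc_params)
                             .input_layouts[0];
    if (aligned_input.bytes_count() > updated_layout.bytes_count())
        updated_layout = aligned_input;  // allocate for the worst case
}
```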
if (tparams.tile_ofm != required_tile_ofm)
    return false;

if (params.weights.GetDType() != WeightsType::INT4 && params.weights.GetDType() != WeightsType::UINT4)
Do you have a check somewhere that the available SLM size is big enough for the given parameters? I haven't found one.
The maximum possible SLM allocation size in the current implementation is 8 KB, and it looks like all current HW meets this requirement. But I will add this condition for clarity, thanks.
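For reference, such a guard could look like the sketch below. The engineInfo.maxLocalMemSize field and the constant name are assumptions based on this thread, not the final code (which landed in PR #21555):

```cpp
// Sketch: reject the SLM variant when the device cannot cover the worst-case
// allocation. Per the comment above, this implementation needs at most 8 KB.
constexpr size_t required_slm_bytes = 8 * 1024;  // illustrative constant name
if (params.engineInfo.maxLocalMemSize < required_slm_bytes)
    return false;  // fall back to the non-SLM tuning parameters
```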
Added in PR #21555
auto dispatchData = SetDefault(prim_params, -1, execute_kernel_idx);
kd.kernels[execute_kernel_idx].params.workGroups.global = dispatchData.gws;
kd.kernels[execute_kernel_idx].params.workGroups.local = dispatchData.lws;
kd.kernels[execute_kernel_idx].skip_execution = KernelData::SkipKernelExecution(prim_params);
[offline discussion] The approach with multiple kernels per single KernelData doesn't look like a future-proof solution, so we proposed to experiment later with returning multiple KernelData objects from the kernel selector. That would mean both kernels are suitable and should be dispatched based on some runtime check. It will likely require modifying the primitive_impl sub-classes and adding a generic wrapper for multiple primitive_impls plus a condition to switch between them.
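To make that proposal concrete, here is a rough sketch of the wrapper idea; every name in it is hypothetical, since nothing like this exists in the PR:

```cpp
// Hypothetical sketch: the kernel selector returns one KernelData per suitable
// kernel, and a generic wrapper impl picks which one to run via a runtime check.
struct multi_impl_wrapper {
    std::vector<std::unique_ptr<primitive_impl>> impls;      // one per KernelData
    std::function<size_t(const kernel_impl_params&)> pick;   // runtime condition

    event::ptr execute(const std::vector<event::ptr>& deps, primitive_inst& inst) {
        // e.g. pick() could switch between SLM and non-SLM kernels by batch size
        return impls[pick(*inst.get_impl_params())]->execute(deps, inst);
    }
};
```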
@@ -187,7 +186,14 @@ kernel_impl_params fully_connected_inst::get_fake_aligned_params(kernel_impl_par
         return std::move(orig_impl_param);
     }

-    size_t fake_align_base = 8;
+    size_t fake_align_base = (orig_impl_param.dev_type == cldnn::device_type::integrated_gpu) ? 16 : 8;
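As a concrete illustration of what the base changes (the rounding formula is standard round-up-to-multiple; that fake alignment pads the batch dimension specifically is my reading of the function name):

```cpp
// Sketch: fake alignment rounds the FC batch dimension up to a multiple of
// fake_align_base so the tiled kernel can assume aligned loads.
size_t aligned_batch = (batch + fake_align_base - 1) / fake_align_base * fake_align_base;
// e.g. batch = 241 -> 256 on integrated GPU (base 16), 248 on discrete GPU (base 8)
```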
[random spot] Is this feature covered by existing test cases?
Yes, some of the existing tests cover the new implementation as well, but I will add more tests in a follow-up PR.
Updated in PR #21555
* [GPU] Add SLM support for FC bf tiled kernel
* Fix unaligned IFM leftovers processing in case of compressed weights and add decompression scale post op support
* added FullyConnected_bf_tiled::GetUpdateDispatchDataFunc
* updated FullyConnected_bf_tiled::GetUpdateDispatchDataFunc for two types of kernels

Co-authored-by: Kim, Eddy <[email protected]>
Details:
This patch implements SLM optimization for FC with compressed (INT4/UINT4) weights. The optimization is expected to improve processing of context sizes >= 241.
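Pulling the thread together, the selection condition implied by this conversation is roughly the sketch below. The function and parameter names are illustrative, combining the INT4/UINT4 check quoted earlier, the 8 KB SLM bound from the review discussion, and the >= 241 context-size figure:

```cpp
// Sketch of the implied dispatch heuristic (illustrative only, not the PR's code).
bool should_use_slm_kernel(WeightsType wtype, size_t context_size, size_t slm_bytes) {
    const bool compressed = wtype == WeightsType::INT4 || wtype == WeightsType::UINT4;
    return compressed && context_size >= 241 && slm_bytes >= 8 * 1024;
}
```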
Tickets: