
[GPU] Add SLM support for FC bf tiled kernel #21435

Merged · 5 commits merged into openvinotoolkit:master on Dec 6, 2023

Conversation

@sshlyapn (Contributor) commented Dec 1, 2023

Details:

This patch implements SLM optimization for FC with compressed (INT4/UINT4) weights. The optimization is expected to improve performance for context sizes >= 241.

  • Added SLM optimization for the FC bf_tiled kernel
  • Changed fake alignment rules for batches >= 256 on iGPU
  • Applied fake alignment to FC input tensors (previously it was applied only to outputs, which caused out-of-bounds memory accesses of the input buffer)
  • Implemented a WA mechanism to keep two precompiled shape-agnostic kernels for FC, one for small batches and one for large batches (see the sketch below)
  • Fixed unaligned IFM leftovers processing
  • Added a dump of all kernels' entry points for multi-stage kernel instances
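
As a rough illustration of the two-kernel WA, a minimal sketch of the runtime choice between the precompiled variants (the variant names and the 256 threshold are assumptions based on the description above, not the PR's actual code):

#include <cstddef>

enum class KernelVariant { SmallBatch, LargeBatch };

// Both variants are precompiled as shape-agnostic kernels; the choice is
// made at runtime once the actual batch size is known. The threshold of
// 256 is an assumption here, not taken from the PR's code.
KernelVariant select_fc_kernel(std::size_t batch_size) {
    constexpr std::size_t kSlmBatchThreshold = 256;
    return batch_size >= kSlmBatchThreshold ? KernelVariant::LargeBatch
                                            : KernelVariant::SmallBatch;
}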

Tickets:

  • ticket-id

@sshlyapn sshlyapn added this to the 2023.3 milestone Dec 1, 2023
@sshlyapn sshlyapn requested review from a team as code owners December 1, 2023 16:52
@sshlyapn sshlyapn force-pushed the fc_slm_optimization branch 5 times, most recently from f012c83 to e838b4c on December 5, 2023 12:02
@sshlyapn sshlyapn force-pushed the fc_slm_optimization branch 3 times, most recently from 2eec4fd to c28adf9 on December 5, 2023 13:58
@e-ddykim (Contributor) left a comment

I made a patch (sshlyapn#1) for model caching of dynamic models.
Please review and merge it into this PR.

@vladimir-paramuzov (Contributor) left a comment

No critical comments from my side.

auto updated_layout = actual_layout;
for (auto user : get_user_insts()) {
// Since fake alignment is applicable for input tensor as well, make sure we allocate enough memory
// to prevemt reading beyound the allocated memory bounds
A contributor left a comment:

nit: prevent, beyond

if (tparams.tile_ofm != required_tile_ofm)
return false;

if (params.weights.GetDType() != WeightsType::INT4 && params.weights.GetDType() != WeightsType::UINT4)
A contributor left a comment:

Do you have a check somewhere that the available SLM size is big enough for the given parameters? I haven't found one.

@sshlyapn (Author) replied:

The maximum possible SLM allocation in the current implementation is 8 KB, and it looks like all current HW meets this requirement. But I will add this condition for clarity, thanks.
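
A minimal sketch of the kind of guard being discussed (the engineInfo.maxLocalMemSize field name is an assumption about the kernel-selector API, so treat the whole snippet as illustrative):

#include <cstddef>

// Reject the SLM path when the worst-case allocation (8 KB, per the reply
// above) would not fit into the device's available local memory.
bool slm_allocation_fits(std::size_t required_slm_bytes, std::size_t max_local_mem_bytes) {
    return required_slm_bytes <= max_local_mem_bytes;
}

// Possible use in the kernel's Validate() (field name is an assumption):
//   if (!slm_allocation_fits(8 * 1024, params.engineInfo.maxLocalMemSize))
//       return false;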

@sshlyapn (Author) replied:

Added in PR #21555

auto dispatchData = SetDefault(prim_params, -1, execute_kernel_idx);
kd.kernels[execute_kernel_idx].params.workGroups.global = dispatchData.gws;
kd.kernels[execute_kernel_idx].params.workGroups.local = dispatchData.lws;
kd.kernels[execute_kernel_idx].skip_execution = KernelData::SkipKernelExecution(prim_params);
A contributor left a comment:

[offline discussion] The approach of keeping multiple kernels in a single KernelData doesn't look like a future-proof solution, so the proposal is to experiment later with returning multiple KernelData objects from the kernel selector; that would mean both kernels are suitable and should be dispatched based on some runtime check. It will likely require modifying the primitive_impl subclasses and adding a generic wrapper for multiple primitive_impls, plus a condition to switch between them (see the sketch below).
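
A purely hypothetical sketch of that direction, with simplified stand-in types (none of these names come from the codebase):

#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct RuntimeParams { std::size_t batch; };

struct Impl {
    virtual void execute(const RuntimeParams&) = 0;
    virtual ~Impl() = default;
};

// Generic wrapper over several suitable implementations: the first candidate
// whose runtime predicate matches is the one that gets executed.
struct MultiImplWrapper : Impl {
    struct Candidate {
        std::function<bool(const RuntimeParams&)> suitable;
        std::unique_ptr<Impl> impl;
    };
    std::vector<Candidate> candidates;

    void execute(const RuntimeParams& p) override {
        for (auto& c : candidates) {
            if (c.suitable(p)) {
                c.impl->execute(p);
                return;
            }
        }
    }
};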

@@ -187,7 +186,14 @@ kernel_impl_params fully_connected_inst::get_fake_aligned_params(kernel_impl_par
return std::move(orig_impl_param);
}

-    size_t fake_align_base = (orig_impl_param.dev_type == cldnn::device_type::integrated_gpu) ? 16 : 8;
+    size_t fake_align_base = 8;
A contributor left a comment:

[random spot] Is this feature covered by existing test cases?

@sshlyapn (Author) replied:

Yes, some of the tests cover the new implementation as well, but I will add more tests in a follow-up PR.

@sshlyapn (Author) replied:

Updated in PR #21555
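
For context, fake alignment rounds the batch dimension up to a multiple of fake_align_base so the tiled kernel can process whole tiles; a minimal sketch of that rounding (the helper name is illustrative, not the actual function):

#include <cstddef>

// Round `batch` up to the nearest multiple of `base` (e.g. 8 or 16), so the
// tiled kernel always sees whole tiles.
std::size_t fake_align(std::size_t batch, std::size_t base) {
    return (batch + base - 1) / base * base;
}

// fake_align(241, 16) == 256; fake_align(241, 8) == 248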

@p-durandin p-durandin enabled auto-merge (squash) December 6, 2023 11:18
@p-durandin p-durandin merged commit 152b4df into openvinotoolkit:master Dec 6, 2023
56 checks passed
akuporos pushed a commit to akuporos/openvino that referenced this pull request Dec 8, 2023
* [GPU] Add SLM support for FC bf tiled kernel

* Fix unaligned IFM leftovers processing in case of compressed weights and add decompression scale post op support

* added FullyConnected_bf_tiled::GetUpdateDispatchDataFunc

* updated FullyConnected_bf_tiled::GetUpdateDispatchDataFunc for two types of kernels

---------

Co-authored-by: Kim, Eddy <[email protected]>