Subgroup load store cleanup #1879

oleksandr-pavlyk · 2024-10-25T20:19:57Z

This PR affects implementation of all element-wise functions for contiguous inputs. It replaces (vec_sz, n_vecs) template parameters used from the same value (4, 2) to (VecSize_v<T>, 1) which is type dependent, chosen to ensure that for each type work-items in the same sub-group load the entire cache line for each vector argument.

The number of iterations used by the kernel is always set to 1 to make memory access pattern in the work-group more predictable.

The work-groups size has been increased from 128 to 256 (8 threads with simd32 used) so that work-group occupies all threads of a single vector core.

The performance of element-wise functions, i.e. add, multiply has increased up to 30% for very large inputs (7300 million elements). No performance degradation was observed.

For example, testing addition of 330241024 element arrays, the kernel execution for type double went down from 13.5 ms to 10.5ms:

this PR:

"dpctl::tensor::kernels::add::add_contig_kernel<double, double, double, 1u, 1u>[SIMD32 {1290004; 1; 1} {256; 1; 1}]",          201,           2117591360,     5.300246,             10535280,              9817920,             14923680

main branch:

"dpctl::tensor::kernels::add::add_contig_kernel<double, double, double, 4u, 2u>[SIMD32 {322501; 1; 1} {128; 1; 1}]",          201,           2712198720,     5.414135,             13493525,             10416640,             15942560

Have you provided a meaningful PR description?
Have you added a test, reproducer or referred to an issue with a reproducer?
Have you tested your changes locally for CPU and GPU devices?
Have you made sure that new changes do not introduce compiler warnings?
Have you checked performance impact of proposed changes?
Have you added documentation for your changes, if necessary?
Have you added your changes to the changelog?
If this PR is a work in progress, are you opening the PR as a draft?

Use sg.get_max_local_range instead. The `sg.get_local_range` must perform lots of checks to determine if this is the last trailing sub-group in the work-group and its actual size may be smaller. We set the local work-group size to be 128, which is a multiple of any sub-group size, and hence get_local_range() always equals to get_max_local_raneg().

For short data types, each work-item may need to load several elements to ensure that it uses all the data from cache-line. For example, with simd32, we load 4 8-bit types (2 cache lines), 2 16-bit types, 1 32-bit and wider types. n_vec is set to 1, to avoid cache thrashing due to second iteration of some work-items beginning to access memory at higher addresses while some work-items continue working on the lower addresses causing cache evictions. The size of the work-groups was increated from 128 to 256, which is chosen so that all 8 threads of single vector with simd32 are used.

…ort function

github-actions · 2024-10-25T20:54:02Z

View rendered docs @ https://intelpython.github.io/dpctl/pulls/1879/index.html

github-actions · 2024-10-25T20:58:55Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_164 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-10-25T21:39:58Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_166 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-10-26T23:24:24Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_167 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-10-27T03:55:02Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_168 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-10-27T13:30:58Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_169 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-10-28T00:21:46Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_170 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

coveralls · 2024-10-28T00:25:37Z

coverage: 87.686% (+0.005%) from 87.681%
when pulling cef4359 on subgroup-load-store-cleanup
into 9b83bef on master.

github-actions · 2024-10-28T03:36:05Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_167 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

vec operator should also apply isnan for sycl::half

github-actions · 2024-10-28T18:42:12Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_167 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-10-28T21:52:38Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_168 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

oleksandr-pavlyk · 2024-11-11T13:35:37Z

Closing this PR

oleksandr-pavlyk added 5 commits October 23, 2024 16:49

Add UL suffix literal integral value

9240d5b

Clean-ups in binary/unary contig call operator

f623209

Set vec_sz and n_vecs in implementations of contig_impl for each supp…

4a22407

…ort function

oleksandr-pavlyk requested a review from ndgrigorian as a code owner October 25, 2024 20:19

oleksandr-pavlyk added 2 commits October 25, 2024 15:45

Add static assert

581e8bd

Add entry to changelog for improved performance of elementwise functions

63c82fc

oleksandr-pavlyk force-pushed the subgroup-load-store-cleanup branch from 6491fad to bd52b27 Compare October 28, 2024 02:58

Fix for test failure on AMD CPU.

cd783e0

vec operator should also apply isnan for sycl::half

oleksandr-pavlyk force-pushed the subgroup-load-store-cleanup branch from bd52b27 to cd783e0 Compare October 28, 2024 18:00

oleksandr-pavlyk mentioned this pull request Oct 28, 2024

Triage ef #1880

Closed

8 tasks

Make local work-groups size dependent on number of elements to process

cef4359

oleksandr-pavlyk closed this Nov 11, 2024

oleksandr-pavlyk deleted the subgroup-load-store-cleanup branch November 11, 2024 13:35

oleksandr-pavlyk mentioned this pull request Nov 12, 2024

Elementwise functions tuning #1889

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subgroup load store cleanup #1879

Subgroup load store cleanup #1879

oleksandr-pavlyk commented Oct 25, 2024

github-actions bot commented Oct 25, 2024

github-actions bot commented Oct 25, 2024

github-actions bot commented Oct 25, 2024

github-actions bot commented Oct 26, 2024

github-actions bot commented Oct 27, 2024

github-actions bot commented Oct 27, 2024

github-actions bot commented Oct 28, 2024

coveralls commented Oct 28, 2024 •

edited

Loading

github-actions bot commented Oct 28, 2024

github-actions bot commented Oct 28, 2024

github-actions bot commented Oct 28, 2024

oleksandr-pavlyk commented Nov 11, 2024

Subgroup load store cleanup #1879

Subgroup load store cleanup #1879

Conversation

oleksandr-pavlyk commented Oct 25, 2024

github-actions bot commented Oct 25, 2024

github-actions bot commented Oct 25, 2024

github-actions bot commented Oct 25, 2024

github-actions bot commented Oct 26, 2024

github-actions bot commented Oct 27, 2024

github-actions bot commented Oct 27, 2024

github-actions bot commented Oct 28, 2024

coveralls commented Oct 28, 2024 • edited Loading

github-actions bot commented Oct 28, 2024

github-actions bot commented Oct 28, 2024

github-actions bot commented Oct 28, 2024

oleksandr-pavlyk commented Nov 11, 2024

coveralls commented Oct 28, 2024 •

edited

Loading