-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subgroup load store cleanup #1879
Conversation
Use sg.get_max_local_range instead. The `sg.get_local_range` must perform lots of checks to determine if this is the last trailing sub-group in the work-group and its actual size may be smaller. We set the local work-group size to be 128, which is a multiple of any sub-group size, and hence get_local_range() always equals to get_max_local_raneg().
For short data types, each work-item may need to load several elements to ensure that it uses all the data from cache-line. For example, with simd32, we load 4 8-bit types (2 cache lines), 2 16-bit types, 1 32-bit and wider types. n_vec is set to 1, to avoid cache thrashing due to second iteration of some work-items beginning to access memory at higher addresses while some work-items continue working on the lower addresses causing cache evictions. The size of the work-groups was increated from 128 to 256, which is chosen so that all 8 threads of single vector with simd32 are used.
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1879/index.html |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_164 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_166 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_167 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_168 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_169 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_170 ran successfully. |
6491fad
to
bd52b27
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_167 ran successfully. |
vec operator should also apply isnan for sycl::half
bd52b27
to
cd783e0
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_167 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_168 ran successfully. |
Closing this PR |
This PR affects implementation of all element-wise functions for contiguous inputs. It replaces
(vec_sz, n_vecs)
template parameters used from the same value(4, 2)
to(VecSize_v<T>, 1)
which is type dependent, chosen to ensure that for each type work-items in the same sub-group load the entire cache line for each vector argument.The number of iterations used by the kernel is always set to 1 to make memory access pattern in the work-group more predictable.
The work-groups size has been increased from 128 to 256 (8 threads with simd32 used) so that work-group occupies all threads of a single vector core.
The performance of element-wise functions, i.e.
add
,multiply
has increased up to 30% for very large inputs (7300 million elements). No performance degradation was observed.For example, testing addition of
330241024
element arrays, the kernel execution for type double went down from 13.5 ms to 10.5ms:this PR:
main branch: