Elementwise functions tuning #1889
dpctl/tensor/libtensor/include/kernels/elementwise_functions/vec_size_util.hpp
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1889/index.html
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_202 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_203 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_209 ran successfully.
ae56e7b to 4c0de00
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_204 ran successfully.
Use `sg.get_max_local_range` instead. `sg.get_local_range` must perform many checks to determine whether this is the last, trailing sub-group in the work-group, whose actual size may be smaller. We set the local work-group size to 128, which is a multiple of any sub-group size, and hence `get_local_range()` always equals `get_max_local_range()`.
The work-group size was increased from 128 to 256, chosen so that all 8 threads of a single vector with simd32 are used.
Set `vec_sz` and `n_vecs` in implementations of `contig_impl` for each supported function
Make local work-group size dependent on the number of elements to process
Fixes for type dispatching utils
1. Add missing include `<type_traits>`, needed for `std::true_type`, `std::disjunction`, and `std::conjunction`
2. Replace `std::bool_constant<std::is_same_v<T1, T2>>` with the direct and simpler `std::is_same<T1, T2>` in a couple of instances
Hide hyperparameter selection struct in anonymous namespace
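The per-type selection of `vec_sz` and `n_vecs` described in these commits can be pictured with a small trait struct kept in an anonymous namespace. This is a minimal sketch in plain C++, not the code from this PR: the struct name `AddContigHyperparams`, the helper functions, and all numeric values are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

namespace
{
// Hypothetical default hyperparameters, used when a type has no
// dedicated entry. The values are illustrative, not taken from dpctl.
template <typename T> struct AddContigHyperparams
{
    static constexpr std::uint8_t vec_sz = 4;
    static constexpr std::uint8_t n_vecs = 2;
};

// Hypothetical type-specific entry: wider vectors for int32_t.
template <> struct AddContigHyperparams<std::int32_t>
{
    static constexpr std::uint8_t vec_sz = 8;
    static constexpr std::uint8_t n_vecs = 2;
};
} // anonymous namespace

// Elements handled by one work-item: n_vecs vectors of vec_sz elements.
template <typename T> constexpr std::size_t elems_per_work_item()
{
    return static_cast<std::size_t>(AddContigHyperparams<T>::vec_sz) *
           static_cast<std::size_t>(AddContigHyperparams<T>::n_vecs);
}

// Work-groups needed to cover nelems elements with local size lws
// (ceiling division by the number of elements one work-group covers).
template <typename T>
constexpr std::size_t n_groups(std::size_t nelems, std::size_t lws)
{
    const std::size_t per_wg = lws * elems_per_work_item<T>();
    return (nelems + per_wg - 1) / per_wg;
}
```

With these illustrative values, `n_groups<std::int32_t>(10000, 256)` evaluates to 3 and `n_groups<std::int8_t>(10000, 256)` to 5, which shows how the per-type choice changes the launch configuration for the same input size.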
vec operator should also apply isnan for sycl::half
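Why the `isnan` check matters can be seen in a scalar sketch. The following is an illustration under stated assumptions rather than the PR's functor: it uses `float` as a stand-in for `sycl::half`, and the function names are hypothetical. A plain comparison silently drops a NaN in the first operand, while the checked variants propagate NaN from either operand.

```cpp
#include <cmath>

// Naive maximum: a NaN in x is lost, because (NaN > y) is false.
template <typename T> T naive_max(const T &x, const T &y)
{
    return (x > y) ? x : y;
}

// NaN-propagating maximum: returns NaN if either operand is NaN.
// If y is NaN and x is not, (x > y) is false and y (NaN) is returned.
template <typename T> T nan_propagating_max(const T &x, const T &y)
{
    return (std::isnan(x) || x > y) ? x : y;
}

// NaN-propagating minimum, built the same way.
template <typename T> T nan_propagating_min(const T &x, const T &y)
{
    return (std::isnan(x) || x < y) ? x : y;
}
```

A vectorized implementation needs the same check applied lane by lane; without it, results depend on operand order whenever NaN is present.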
4c0de00 to 70a0a3f
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_226 ran successfully.
This PR revisits elementwise function functors for contiguous inputs.

Since the work-group size is chosen so that it is always a multiple of any permissible sub-group size, there is no point in using the more expensive `sg.get_local_range()`, so it is replaced with the cheaper `sg.get_max_local_range()`. This change also slightly reduces the binary size due to a leaner kernel (from `36428264` bytes down to `36345112` bytes).

Implementations of each elementwise function for contiguous input can now set the hyperparameters `vec_sz` and `n_vecs` differently for different input types. This ability is applied to `add_contig_impl` for a modest performance improvement for `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `float` and `double`.

Fixed a missing check in the implementation of `minimum` and `maximum` for the `sycl::half` type for vector inputs, which caused test failures on AMD CPUs in CI during earlier iterations of this work (Subgroup load store cleanup #1879).

Added missing `#include <type_traits>` in type dispatching headers, and simplified code.
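The type-dispatch simplification can be illustrated with standard `<type_traits>` facilities only. In this sketch the alias and struct names (`SameVerbose`, `SameSimple`, `is_any_of`, `both_match`) are hypothetical and do not come from the dpctl headers; only `std::is_same`, `std::is_same_v`, `std::bool_constant`, `std::disjunction` and `std::conjunction` are real standard-library names.

```cpp
#include <cstdint>
#include <type_traits> // std::is_same, std::disjunction, std::conjunction

// Verbose spelling: wrap the boolean result back into a type.
template <typename T1, typename T2>
using SameVerbose = std::bool_constant<std::is_same_v<T1, T2>>;

// Simpler spelling: std::is_same is already a bool-constant-like type.
template <typename T1, typename T2>
using SameSimple = std::is_same<T1, T2>;

// Hypothetical helper: true if T matches any of the candidate types.
template <typename T, typename... Candidates>
struct is_any_of : std::disjunction<std::is_same<T, Candidates>...>
{
};

// Hypothetical helper: true only if both argument types match exactly.
template <typename T1, typename ArgT1, typename T2, typename ArgT2>
struct both_match
    : std::conjunction<std::is_same<T1, ArgT1>, std::is_same<T2, ArgT2>>
{
};

// Both spellings agree for matching and non-matching types.
static_assert(SameVerbose<float, float>::value ==
              SameSimple<float, float>::value);
static_assert(SameVerbose<float, double>::value ==
              SameSimple<float, double>::value);
```

Because `std::disjunction` and `std::conjunction` expect types with a `value` member, passing `std::is_same<...>` directly avoids the extra `std::bool_constant` wrapper.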