Elementwise functions tuning #1889

Open: 7 commits to merge into master from elementwise-functions-tuning

oleksandr-pavlyk
Collaborator

This PR revisits elementwise functions functors for contiguous inputs.

  1. Since the work-group size is chosen so that it is always a multiple of any permissible sub-group size, there is no point in calling the more expensive sg.get_local_range(), so it is replaced with the cheaper sg.get_max_local_range(). This change also slightly reduces the binary size thanks to a leaner kernel (from 36428264 bytes down to 36345112 bytes).

  2. The contiguous-input implementation of each elementwise function can now set the hyperparameters vec_sz and n_vecs differently for different input types. This ability is applied to add_contig_impl, giving a modest performance improvement for int32_t, uint32_t, int64_t, uint64_t, float and double.

  3. Fixed a missing check in the implementations of minimum and maximum for the sycl::half type with vector inputs, which caused test failures on AMD CPUs in CI during earlier iterations of this work (Subgroup load store cleanup #1879).

  4. Added the missing #include <type_traits> to the type dispatching headers and simplified the code.
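
As an illustration of item 2, per-type hyperparameters can be expressed as a trait with a fallback and type-specific overrides. This is only a sketch in plain C++: the names `ContigHyperparams` and `work_items_needed` are hypothetical, and the numeric values are placeholders rather than the values chosen in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <type_traits>

namespace
{

// Primary template: fallback hyperparameters for any element type.
// The numbers are placeholders, not the values chosen in the PR.
template <typename T, typename = void> struct ContigHyperparams
{
    static constexpr std::size_t vec_sz = 4;
    static constexpr std::size_t n_vecs = 2;
};

// Partial specialization (illustrative): 8-byte arithmetic types, such as
// std::int64_t and double on common platforms, use shorter vectors.
template <typename T>
struct ContigHyperparams<
    T, std::enable_if_t<std::is_arithmetic_v<T> && sizeof(T) == 8>>
{
    static constexpr std::size_t vec_sz = 2;
    static constexpr std::size_t n_vecs = 2;
};

// Number of work-items needed to cover n_elems elements when each
// work-item processes vec_sz * n_vecs of them.
template <typename T> std::size_t work_items_needed(std::size_t n_elems)
{
    constexpr std::size_t per_item =
        ContigHyperparams<T>::vec_sz * ContigHyperparams<T>::n_vecs;
    static_assert(per_item > 0, "each work-item must process an element");
    // Guard against wrap-around in the rounding-up addition below.
    assert(n_elems <= std::numeric_limits<std::size_t>::max() - (per_item - 1));
    return (n_elems + per_item - 1) / per_item;
}

} // namespace
```

With these placeholder values, `work_items_needed<float>(1000)` covers 8 elements per work-item and `work_items_needed<double>(1000)` covers 4, so the launch geometry follows from the element type alone.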


  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_202 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_203 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@coveralls
Collaborator

coveralls commented Nov 12, 2024

Coverage Status

coverage: 87.725%. remained the same
when pulling 70a0a3f on elementwise-functions-tuning
into 34ae129 on master.

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_209 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_204 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

Use sg.get_max_local_range instead. The `sg.get_local_range` call must perform
lots of checks to determine whether this is the last trailing sub-group in the
work-group, whose actual size may be smaller. We set the local work-group
size to 128, which is a multiple of any sub-group size, and hence
get_local_range() always equals get_max_local_range().
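
The divisibility argument can be checked with a small host-side model. The sketch below is plain C++, not SYCL: it assumes a one-dimensional work-group split into consecutive sub-groups, and the helper names `sub_group_size` and `all_sub_groups_full` are hypothetical. The candidate sub-group sizes 8, 16, 32 and 64 are likewise an assumption about what a device may report.

```cpp
#include <cassert>
#include <cstddef>

// Size of the sub-group with index sg_id in a 1-D work-group of wg_size
// work-items, partitioned into consecutive sub-groups of at most
// max_sg_size work-items (a simplified model of the partitioning).
inline std::size_t sub_group_size(std::size_t wg_size,
                                  std::size_t max_sg_size,
                                  std::size_t sg_id)
{
    assert(max_sg_size > 0);
    const std::size_t first = sg_id * max_sg_size;
    assert(first < wg_size); // sg_id must name an existing sub-group
    const std::size_t remaining = wg_size - first;
    return remaining < max_sg_size ? remaining : max_sg_size;
}

// True when every sub-group, including the trailing one, is full, i.e.
// the per-sub-group size always equals the maximum sub-group size.
inline bool all_sub_groups_full(std::size_t wg_size, std::size_t max_sg_size)
{
    const std::size_t n_sg = (wg_size + max_sg_size - 1) / max_sg_size;
    for (std::size_t sg_id = 0; sg_id < n_sg; ++sg_id) {
        if (sub_group_size(wg_size, max_sg_size, sg_id) != max_sg_size) {
            return false;
        }
    }
    return true;
}
```

In this model `all_sub_groups_full(128, s)` holds for each of the assumed sizes, whereas a work-group of 100 items with sub-groups of 32 leaves a trailing sub-group of only 4 items, which is the case the more expensive query has to detect.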

The size of the work-groups was increased from 128 to 256, which is
chosen so that all 8 threads of a single vector with simd32 are used.

Set vec_sz and n_vecs in implementations of contig_impl for each supported function

Make local work-group size dependent on the number of elements to process

Fixes for type dispatching utils

1. Add missing include <type_traits> needed for std::true_type, and
   std::disjunction, std::conjunction

2. Replace std::bool_constant<std::is_same_v<T1, T2>> with the direct
   and simpler std::is_same<T1, T2> in a couple of instances
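
The equivalence behind the second point can be verified at compile time. In the sketch below, the aliases `SameWrapped`, `SameDirect` and `IsOneOf`, and the function `check_traits`, are hypothetical names used for illustration only; they are not taken from the dpctl headers.

```cpp
#include <cassert>
#include <type_traits>

// Wrapped spelling: std::bool_constant around the _v helper variable.
template <typename T1, typename T2>
using SameWrapped = std::bool_constant<std::is_same_v<T1, T2>>;

// Direct spelling: std::is_same already derives from std::true_type or
// std::false_type, so no wrapper is needed.
template <typename T1, typename T2> using SameDirect = std::is_same<T1, T2>;

// Hypothetical dispatch helper: is T one of the candidate types?
// std::disjunction and std::is_same both come from <type_traits>.
template <typename T, typename... Candidates>
using IsOneOf = std::disjunction<std::is_same<T, Candidates>...>;

static_assert(SameWrapped<int, int>::value == SameDirect<int, int>::value);
static_assert(SameWrapped<int, long>::value == SameDirect<int, long>::value);
static_assert(std::is_base_of_v<std::true_type, SameDirect<float, float>>);

// Runtime mirror of the checks above, convenient to call from a test.
inline void check_traits()
{
    assert((IsOneOf<float, int, float, double>::value));
    assert((!IsOneOf<char, int, float>::value));
}
```

Both spellings yield the same `value`, so the direct form can be used wherever the trait is consumed through `::value` or passed to `std::disjunction` and `std::conjunction`.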

Hide hyperparameter selection struct in anonymous namespace
vec operator should also apply isnan for sycl::half
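
To illustrate why an isnan check matters for maximum, here is a scalar sketch. It assumes the check in question makes the result NaN whenever either operand is NaN, and it uses `float` as a stand-in for sycl::half so that it compiles without SYCL. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Comparison-only maximum: every comparison with NaN is false, so the
// result depends on which operand holds the NaN.
template <typename T> T naive_max(T a, T b) { return (a > b) ? a : b; }

// Maximum with explicit isnan checks: a NaN in either operand is returned.
template <typename T> T nan_propagating_max(T a, T b)
{
    if (std::isnan(a)) return a;
    if (std::isnan(b)) return b;
    return (a > b) ? a : b;
}

// Small self-check, convenient to call from a test.
inline void check_nan_handling()
{
    const float nan = std::numeric_limits<float>::quiet_NaN();
    assert(naive_max(nan, 1.0f) == 1.0f); // the NaN is silently dropped
    assert(std::isnan(nan_propagating_max(nan, 1.0f)));
    assert(std::isnan(nan_propagating_max(1.0f, nan)));
}
```

The comparison-only version drops the NaN when it is the first operand, which is the kind of type-specific gap that a test on one platform can miss and another can expose.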

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_226 ran successfully.
Passed: 894
Failed: 1
Skipped: 119
