Skip to content

Commit

Permalink
[SYCL][PI] New device information descriptors: max_global_work_groups…
Browse files Browse the repository at this point in the history
… and max_work_groups (#4064)

SYCL currently does not provide a way to query a device to get the maximum **number of work groups** that can be submitted in each dimension as well as the number of work groups that can be submitted across all the dimensions.
This query does not exist in openCL, but now that GPU are offered through the PI, this query becomes more relevant as different vendors/devices have their own limits.

This commit implements the feature for the host device, level-zero, openCL, ROCm and CUDA. If the query is not applicable, the maximum acceptable value is returned.

Descriptors added:
 - ext_oneapi_max_global_work_groups
 - ext_oneapi_max_work_groups_1d
 - ext_oneapi_max_work_groups_2d
 - ext_oneapi_max_work_groups_3d

Feature test macro:
 - SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY defined to 1

Signed-off-by: Michel Migdal <[email protected]>
  • Loading branch information
Michoumichmich authored Oct 18, 2021
1 parent e95c184 commit 2fdf940
Show file tree
Hide file tree
Showing 13 changed files with 280 additions and 3 deletions.
84 changes: 84 additions & 0 deletions sycl/doc/extensions/MaxWorkGroupQueries/max_work_group_query.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY

## Notice

This document describes an **experimental** API that applications can use to try
out a new feature. Future versions of this API may change in ways that are
incompatible with this experimental version.


## Introduction

This extension adds functionally two new device information descriptors. They provide the ability to query a device for the maximum numbers of work-groups that can be submitted in each dimension as well as globally (across all dimensions).

OpenCL never offered such query - which is probably why it is absent from SYCL. Now that SYCL supports back-ends where the maximum number of work-groups in each dimension can be different, having the ability to query that limit is crucial in writing safe and portable code.

## Feature test macro

As encouraged by the SYCL specification, a feature-test macro, `SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY`, is provided to determine whether this extension is implemented.

## New device descriptors

| Device descriptors | Return type | Description |
| ------------------------------------------------------ | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| info::device::ext_oneapi_max_work_groups_1d |  id<1> | Returns the maximum number of work-groups that can be submitted in each dimension of the `globalSize` of a `nd_range<1>`. The minimum value is `(1)` if the device is different than `info::device_type::custom`. |
| info::device::ext_oneapi_max_work_groups_2d |  id<2> | Returns the maximum number of work-groups that can be submitted in each dimension of the `globalSize` of a `nd_range<2>`. The minimum value is `(1, 1)` if the device is different than `info::device_type::custom`. |
| info::device::ext_oneapi_max_work_groups_3d |  id<3> | Returns the maximum number of work-groups that can be submitted in each dimension of the `globalSize` of a `nd_range<3>`. The minimum value is `(1, 1, 1)` if the device is different than `info::device_type::custom`. |
| info::device::ext_oneapi_max_global_work_groups |  size_t | Returns the maximum number of work-groups that can be submitted across all the dimensions. The minimum value is `1`. |

### Note

- The returned values have the same ordering as the `nd_range` arguments.
- The implementation does not guarantee that the user could select all the maximum numbers returned by `ext_oneapi_max_work_groups` at the same time. Thus the user should also check that the selected number of work-groups across all dimensions is smaller than the maximum global number returned by `ext_oneapi_max_global_work_groups`.

## Examples

```c++
sycl::device gpu = sycl::device{sycl::gpu_selector{}};
std::cout << gpu.get_info<sycl::info::device::name>() << '\n';

#ifdef SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY
sycl::id<3> groups = gpu.get_info<sycl::info::device::ext_oneapi_max_work_groups_3d>();
size_t global_groups = gpu.get_info<sycl::info::device::ext_oneapi_max_global_work_groups>();
std::cout << "Max number groups: x_max: " << groups[2] << " y_max: " << groups[1] << " z_max: " << groups[0] << '\n';
std::cout << "Max global number groups: " << global_groups << '\n';
#endif
```
Ouputs to the console:
```
NVIDIA ...
Max number groups: x_max: 2147483647 y_max: 65535 z_max: 65535
Max global number groups: 2147483647
```
See: [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities)
Then the following assertions should be satisfied at kernel submission:
```C++
sycl::nd_range<3> work_range(global_size, local_size);
assert(global_size[2] <= groups[2]
&& global_size[1] <= groups[1]
&& global_size[0] <= groups[0]);
assert(global_size[2] * global_size[1] * global_size[0] <= global_groups); //Make sure not to exceed integer representation size in the multiplication.
gpu_queue.submit(work_range, ...);
```

## Implementation

### Templated queries

Right now, DPC++ does not support templated device descriptors as they are defined in the SYCL specification section 4.6.4.2 "Device information descriptors". When the implementation supports this syntax, `ext_oneapi_max_work_groups_[1,2,3]d` should be replaced by the templated syntax: `ext_oneapi_max_work_groups<[1,2,3]>`.
### Consistency with existing checks

The implementation already checks when enqueuing a kernel that the global and per dimension work-group number is smaller than `std::numeric_limits<int>::max`. This check is implemented in `sycl/include/CL/sycl/handler.hpp`. For consistency, values returned by the two device descriptors are bound by this limit.

### Example of returned values

- If the device is the host or has an OpenCL back-end, the values returned - as they are not applicable - are the maximum values accepted at kernel submission (see `sycl/include/CL/sycl/handler.hpp`) which are currently `std::numeric_limits<int>::max`.
- CUDA: Back-end query using `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_[X,Y,Z]`.
5 changes: 3 additions & 2 deletions sycl/doc/extensions/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,9 +41,10 @@ DPC++ extensions status:
| [SYCL_INTEL_group_sort](GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc) | Proposal | |
| [Invoke SIMD](InvokeSIMD/InvokeSIMD.asciidoc) | Proposal | |
| [Uniform](Uniform/Uniform.asciidoc) | Proposal | |
| [Assert](Assert/SYCL_ONEAPI_ASSERT.asciidoc) | Proposal | |
| [Matrix](Matrix/dpcpp-joint-matrix.asciidoc) | Partially supported(AMX AOT) | Not supported: dynamic-extent, wg and wi scopes, layouts other than packed|
| [Assert](Assert/SYCL_ONEAPI_ASSERT.asciidoc) | Proposal | |
| [Matrix](Matrix/dpcpp-joint-matrix.asciidoc) | Partially supported(AMX AOT) | Not supported: dynamic-extent, wg and wi scopes, layouts other than packed|
| [SYCL_INTEL_free_function_queries](FreeFunctionQueries/SYCL_INTEL_free_function_queries.asciidoc) | Supported (experimental) | |
| [EXT_ONEAPI_max_work_groups](MaxWorkGroupQueries/max_work_group_query.md) | Supported | |
| [SYCL_EXT_ONEAPI_DEVICE_GLOBAL](DeviceGlobal/SYCL_INTEL_device_global.asciidoc) | Proposal | |
| [SYCL_INTEL_bf16_conversion](Bf16Conversion/SYCL_INTEL_bf16_conversion.asciidoc) | Partially supported (Level Zero: GPU) | Currently available only on Xe HP GPU. ext_intel_bf16_conversion aspect is not supported. |
| [Property List](PropertyList/SYCL_EXT_ONEAPI_property_list.asciidoc) | Proposal | |
Expand Down
7 changes: 6 additions & 1 deletion sycl/include/CL/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -300,7 +300,12 @@ typedef enum {
PI_DEVICE_INFO_MAX_MEM_BANDWIDTH = 0x10026,
PI_DEVICE_INFO_IMAGE_SRGB = 0x10027,
PI_DEVICE_INFO_ATOMIC_64 = 0x10110,
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES = 0x10111
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES = 0x10111,
PI_EXT_ONEAPI_DEVICE_INFO_MAX_GLOBAL_WORK_GROUPS = 0x20000,
PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_1D = 0x20001,
PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_2D = 0x20002,
PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_3D = 0x20003

} _pi_device_info;

typedef enum {
Expand Down
1 change: 1 addition & 0 deletions sycl/include/CL/sycl/feature_test.hpp.in
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ namespace sycl {
#if SYCL_BUILD_PI_HIP
#define SYCL_EXT_ONEAPI_BACKEND_HIP 1
#endif
#define SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY 1

} // namespace sycl
} // __SYCL_INLINE_NAMESPACE(cl)
4 changes: 4 additions & 0 deletions sycl/include/CL/sycl/info/device_traits.def
Original file line number Diff line number Diff line change
Expand Up @@ -98,3 +98,7 @@ __SYCL_PARAM_TRAITS_SPEC(device, ext_intel_max_mem_bandwidth, pi_uint64)
__SYCL_PARAM_TRAITS_SPEC(device, ext_intel_mem_channel, bool)
__SYCL_PARAM_TRAITS_SPEC(device, ext_oneapi_srgb, bool)
__SYCL_PARAM_TRAITS_SPEC(device, ext_intel_device_info_uuid, detail::uuid_type)
__SYCL_PARAM_TRAITS_SPEC(device, ext_oneapi_max_global_work_groups, size_t)
__SYCL_PARAM_TRAITS_SPEC(device, ext_oneapi_max_work_groups_1d, id<1>)
__SYCL_PARAM_TRAITS_SPEC(device, ext_oneapi_max_work_groups_2d, id<2>)
__SYCL_PARAM_TRAITS_SPEC(device, ext_oneapi_max_work_groups_3d, id<3>)
5 changes: 5 additions & 0 deletions sycl/include/CL/sycl/info/info_desc.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,11 @@ enum class device : cl_device_info {
atomic64 = PI_DEVICE_INFO_ATOMIC_64,
atomic_memory_order_capabilities =
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES,
ext_oneapi_max_global_work_groups =
PI_EXT_ONEAPI_DEVICE_INFO_MAX_GLOBAL_WORK_GROUPS,
ext_oneapi_max_work_groups_1d = PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_1D,
ext_oneapi_max_work_groups_2d = PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_2D,
ext_oneapi_max_work_groups_3d = PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_3D
};

enum class device_type : pi_uint64 {
Expand Down
26 changes: 26 additions & 0 deletions sycl/plugins/cuda/pi_cuda.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -991,6 +991,32 @@ pi_result cuda_piDeviceGetInfo(pi_device device, pi_device_info param_name,
return getInfoArray(max_work_item_dimensions, param_value_size, param_value,
param_value_size_ret, return_sizes);
}

case PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_3D: {
size_t return_sizes[max_work_item_dimensions];
int max_x = 0, max_y = 0, max_z = 0;
cl::sycl::detail::pi::assertion(
cuDeviceGetAttribute(&max_x, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
device->get()) == CUDA_SUCCESS);
cl::sycl::detail::pi::assertion(max_x >= 0);

cl::sycl::detail::pi::assertion(
cuDeviceGetAttribute(&max_y, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y,
device->get()) == CUDA_SUCCESS);
cl::sycl::detail::pi::assertion(max_y >= 0);

cl::sycl::detail::pi::assertion(
cuDeviceGetAttribute(&max_z, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z,
device->get()) == CUDA_SUCCESS);
cl::sycl::detail::pi::assertion(max_z >= 0);

return_sizes[0] = size_t(max_x);
return_sizes[1] = size_t(max_y);
return_sizes[2] = size_t(max_z);
return getInfoArray(max_work_item_dimensions, param_value_size, param_value,
param_value_size_ret, return_sizes);
}

case PI_DEVICE_INFO_MAX_WORK_GROUP_SIZE: {
int max_work_group_size = 0;
cl::sycl::detail::pi::assertion(
Expand Down
26 changes: 26 additions & 0 deletions sycl/plugins/hip/pi_hip.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -980,6 +980,32 @@ pi_result hip_piDeviceGetInfo(pi_device device, pi_device_info param_name,
return getInfoArray(max_work_item_dimensions, param_value_size, param_value,
param_value_size_ret, return_sizes);
}

case PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_3D: {
size_t return_sizes[max_work_item_dimensions];
int max_x = 0, max_y = 0, max_z = 0;
cl::sycl::detail::pi::assertion(
hipDeviceGetAttribute(&max_x, hipDeviceAttributeMaxGridDimX,
device->get()) == hipSuccess);
cl::sycl::detail::pi::assertion(max_x >= 0);

cl::sycl::detail::pi::assertion(
hipDeviceGetAttribute(&max_y, hipDeviceAttributeMaxGridDimY,
device->get()) == hipSuccess);
cl::sycl::detail::pi::assertion(max_y >= 0);

cl::sycl::detail::pi::assertion(
hipDeviceGetAttribute(&max_z, hipDeviceAttributeMaxGridDimZ,
device->get()) == hipSuccess);
cl::sycl::detail::pi::assertion(max_z >= 0);

return_sizes[0] = size_t(max_x);
return_sizes[1] = size_t(max_y);
return_sizes[2] = size_t(max_z);
return getInfoArray(max_work_item_dimensions, param_value_size, param_value,
param_value_size_ret, return_sizes);
}

case PI_DEVICE_INFO_MAX_WORK_GROUP_SIZE: {
int max_work_group_size = 0;
cl::sycl::detail::pi::assertion(
Expand Down
8 changes: 8 additions & 0 deletions sycl/plugins/level_zero/pi_level_zero.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2095,6 +2095,14 @@ pi_result piDeviceGetInfo(pi_device Device, pi_device_info ParamName,
Device->ZeDeviceComputeProperties->maxGroupSizeZ}};
return ReturnValue(MaxGroupSize);
}
case PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_3D: {
struct {
size_t Arr[3];
} MaxGroupCounts = {{Device->ZeDeviceComputeProperties->maxGroupCountX,
Device->ZeDeviceComputeProperties->maxGroupCountY,
Device->ZeDeviceComputeProperties->maxGroupCountZ}};
return ReturnValue(MaxGroupCounts);
}
case PI_DEVICE_INFO_MAX_CLOCK_FREQUENCY:
return ReturnValue(pi_uint32{Device->ZeDeviceProperties->coreClockRate});
case PI_DEVICE_INFO_ADDRESS_BITS: {
Expand Down
19 changes: 19 additions & 0 deletions sycl/plugins/opencl/pi_opencl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -198,6 +198,25 @@ pi_result piDeviceGetInfo(pi_device device, pi_device_info paramName,
std::memcpy(paramValue, &result, sizeof(cl_bool));
return PI_SUCCESS;
}

case PI_EXT_ONEAPI_DEVICE_INFO_MAX_WORK_GROUPS_3D:
// Returns the maximum sizes of a work group for each dimension one
// could use to submit a kernel. There is no such query defined in OpenCL
// so we'll return the maximum value.
{
if (paramValueSizeRet)
*paramValueSizeRet = paramValueSize;
static constexpr size_t Max = (std::numeric_limits<size_t>::max)();
size_t *out = cast<size_t *>(paramValue);
if (paramValueSize >= sizeof(size_t))
out[0] = Max;
if (paramValueSize >= 2 * sizeof(size_t))
out[1] = Max;
if (paramValueSize >= 3 * sizeof(size_t))
out[2] = Max;
return PI_SUCCESS;
}

default:
cl_int result = clGetDeviceInfo(
cast<cl_device_id>(device), cast<cl_device_info>(paramName),
Expand Down
90 changes: 90 additions & 0 deletions sycl/source/detail/device_info.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -473,6 +473,62 @@ template <> struct get_device_info<id<3>, info::device::max_work_item_sizes> {
}
};

template <>
struct get_device_info<size_t,
info::device::ext_oneapi_max_global_work_groups> {
static size_t get(RT::PiDevice dev, const plugin &Plugin) {
(void)dev; // Silence unused warning
(void)Plugin;
return static_cast<size_t>((std::numeric_limits<int>::max)());
}
};

template <>
struct get_device_info<id<1>, info::device::ext_oneapi_max_work_groups_1d> {
static id<1> get(RT::PiDevice dev, const plugin &Plugin) {
size_t result[3];
size_t Limit = get_device_info<
size_t, info::device::ext_oneapi_max_global_work_groups>::get(dev,
Plugin);
Plugin.call<PiApiKind::piDeviceGetInfo>(
dev,
pi::cast<RT::PiDeviceInfo>(info::device::ext_oneapi_max_work_groups_3d),
sizeof(result), &result, nullptr);
return id<1>(std::min(Limit, result[0]));
}
};

template <>
struct get_device_info<id<2>, info::device::ext_oneapi_max_work_groups_2d> {
static id<2> get(RT::PiDevice dev, const plugin &Plugin) {
size_t result[3];
size_t Limit = get_device_info<
size_t, info::device::ext_oneapi_max_global_work_groups>::get(dev,
Plugin);
Plugin.call<PiApiKind::piDeviceGetInfo>(
dev,
pi::cast<RT::PiDeviceInfo>(info::device::ext_oneapi_max_work_groups_3d),
sizeof(result), &result, nullptr);
return id<2>(std::min(Limit, result[1]), std::min(Limit, result[0]));
}
};

template <>
struct get_device_info<id<3>, info::device::ext_oneapi_max_work_groups_3d> {
static id<3> get(RT::PiDevice dev, const plugin &Plugin) {
size_t result[3];
size_t Limit = get_device_info<
size_t, info::device::ext_oneapi_max_global_work_groups>::get(dev,
Plugin);
Plugin.call<PiApiKind::piDeviceGetInfo>(
dev,
pi::cast<RT::PiDeviceInfo>(info::device::ext_oneapi_max_work_groups_3d),
sizeof(result), &result, nullptr);
return id<3>(std::min(Limit, result[2]), std::min(Limit, result[1]),
std::min(Limit, result[0]));
}
};

// Specialization for parent device
template <> struct get_device_info<device, info::device::parent_device> {
static device get(RT::PiDevice dev, const plugin &Plugin) {
Expand Down Expand Up @@ -526,6 +582,40 @@ inline id<3> get_device_info_host<info::device::max_work_item_sizes>() {
return {1, 1, 1};
}

template <>
inline constexpr size_t
get_device_info_host<info::device::ext_oneapi_max_global_work_groups>() {
// See handler.hpp for the maximum value :
return static_cast<size_t>((std::numeric_limits<int>::max)());
}

template <>
inline id<1>
get_device_info_host<info::device::ext_oneapi_max_work_groups_1d>() {
// See handler.hpp for the maximum value :
static constexpr size_t Limit =
get_device_info_host<info::device::ext_oneapi_max_global_work_groups>();
return {Limit};
}

template <>
inline id<2>
get_device_info_host<info::device::ext_oneapi_max_work_groups_2d>() {
// See handler.hpp for the maximum value :
static constexpr size_t Limit =
get_device_info_host<info::device::ext_oneapi_max_global_work_groups>();
return {Limit, Limit};
}

template <>
inline id<3>
get_device_info_host<info::device::ext_oneapi_max_work_groups_3d>() {
// See handler.hpp for the maximum value :
static constexpr size_t Limit =
get_device_info_host<info::device::ext_oneapi_max_global_work_groups>();
return {Limit, Limit, Limit};
}

template <>
inline size_t get_device_info_host<info::device::max_work_group_size>() {
// current value is the required minimum
Expand Down
4 changes: 4 additions & 0 deletions sycl/test/abi/sycl_symbols_linux.dump
Original file line number Diff line number Diff line change
Expand Up @@ -4371,3 +4371,7 @@ _ZNK2cl4sycl9exception8categoryEv
_ZNK2cl4sycl9kernel_id8get_nameEv
__sycl_register_lib
__sycl_unregister_lib
_ZNK2cl4sycl6device8get_infoILNS0_4info6deviceE131072EEENS3_12param_traitsIS4_XT_EE11return_typeEv
_ZNK2cl4sycl6device8get_infoILNS0_4info6deviceE131075EEENS3_12param_traitsIS4_XT_EE11return_typeEv
_ZNK2cl4sycl6device8get_infoILNS0_4info6deviceE131074EEENS3_12param_traitsIS4_XT_EE11return_typeEv
_ZNK2cl4sycl6device8get_infoILNS0_4info6deviceE131073EEENS3_12param_traitsIS4_XT_EE11return_typeEv
4 changes: 4 additions & 0 deletions sycl/test/abi/sycl_symbols_windows.dump
Original file line number Diff line number Diff line change
Expand Up @@ -4585,5 +4585,9 @@
?what@exception@sycl@cl@@UEBAPEBDXZ
?wrapIntoImageBuffer@MemoryManager@detail@sycl@cl@@SAPEAXV?$shared_ptr@Vcontext_impl@detail@sycl@cl@@@std@@PEAXPEAVSYCLMemObjI@234@@Z
DllMain
??$get_info@$0CAAAB@@device@sycl@cl@@QEBA?AV?$id@$00@12@XZ
??$get_info@$0CAAAA@@device@sycl@cl@@QEBA_KXZ
??$get_info@$0CAAAD@@device@sycl@cl@@QEBA?AV?$id@$02@12@XZ
??$get_info@$0CAAAC@@device@sycl@cl@@QEBA?AV?$id@$01@12@XZ
__sycl_register_lib
__sycl_unregister_lib

0 comments on commit 2fdf940

Please sign in to comment.