Skip to content

Commit

Permalink
Upstream staging 2024 03 22 (#351)
Browse files Browse the repository at this point in the history
* refactor: Use rocPRIM overloads for warp_scan::exclusive_scan wo initial value

Exclusive scans without an initial value are now present on the rocPRIM
public API. Use those instead of relying on "hidden" APIs added as a
workaround between the two libraries.

* Implemented WARP_EXCHANGE_SHUFFLE

* Testing WARP_EXCHANGE_SHUFFLE and refactored warp_exchange test suite

* Updated benchmarks with WARP_EXCHANGE_SHUFFLE

* Updated changelog

* NVCC build fixes and warning fixes

* Put host_warp_size_wrapper in ::hipcub::detail namespace

Previously it was in the global ::detail namespace.

* Preprocessor definitions needed for debug_synchronous deprecation

* adjacent_difference: deprecated debug_synchronous

* device histogram: deprecated debug synchronous

* Removed optional debug_synchronous argument in DeviceMemcpy

This argument was never on the CUB interface and shouldn't have been
added in the first place.

* device merge sort: deprecated debug_synchronous

* device partition: deprecated debug_synchronous

* device radix sort: deprecated debug_synchronous

* device reduce: deprecated debug_synchronous

* device run length encode: deprecate debug_synchronous

* device scan: deprecated debug_synchronous

* device segmented radix sort: deprecated debug_synchronous

* device segmented reduce: deprecated debug_synchronous

* Build warning free & enabled -Werror in CI

* device segmented sort: deprecated debug_synchronous

* device select: deprecated debug_synchronous

* Refactored HIPCUB_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR

* device SPMV: deprecated debug_synchronous and added missing test to CMakeLists

* Improved HIPCUB_DETAIL_RUNTIME_LOG_DEBUG_SYNCHRONOUS

* Fixed formatting

* ci: pass-failed warning does not imply failure

* Fixed documentation

* Updated changelog

* Removed DeviceSelectWarpSize from tests

* Removed DeviceSelectWarpSize from benchmarks

* Use hipcub's DiscardOutputIterator instead of custom one

* add device_copy
add test for device_copy
add benchmark for device_copy

* update docs

* fix format

* fix copyright date

* add device_copy to cub backend

* update changelog

* fix review comments

* fix format

* clarify warp scan interface

* Added hipcub::tuple

* Added decomposer overloads to BlockRadixSort

* Testing BlockRadixSort decomposer overloads

* Benchmarking BlockRadixSort decomposer overloads

* Added tuple_element_t to cub/tuple,hpp

* Fixed formatting

* Tidied and updated changelog

* DeviceRadixSort decomposer overloads (CUB backend)

* DeviceRadixSort decomposer overloads (rocprim backend)

* Testing DeviceRadixSort decomposer overloads

* Benchmarking DeviceRadixSort decomposer overloads

* Updated changelog

* Added select_plus_operator_host for calculating on host with double precision

* Host reference calculations are done in double precision in the tests

* Clang format on test_hipcub_block_scan.cpp

* Added precision for different types to test_utils and updated host scan using double precision during calculation

* Added more specific precision checks to the tests

* Changed precision for nvcc support

* Changed rocprim types to test_utils type in device_scan test

* Remove unused variable test device_scan

* Change transform for nvcc compiler device_scan test

* Added more precise precision checks for warp_scan and warp_reduce

* Remove cast_type from test_utils

* Templatize on overloaded operator instead of struct

* hibcup test device_reduce_by_key updated to assert with precision

* Added assert near with better precision to tests device_reduce, device spmv and thread_operators

* Changed for block_reduce and block_scan all asserts to assert_near

* Replaced is_plus by is_add in test_utils

* specify architecture for rocprim build

---------

Co-authored-by: Gergely Meszaros <[email protected]>
Co-authored-by: Lőrinc Serfőző <[email protected]>
Co-authored-by: Beatriz Navidad Vilches <[email protected]>
Co-authored-by: Nol Moonen <[email protected]>
Co-authored-by: Nick Breed <[email protected]>
  • Loading branch information
6 people authored Apr 3, 2024
1 parent 72d340c commit a9b8970
Show file tree
Hide file tree
Showing 101 changed files with 14,186 additions and 6,886 deletions.
5 changes: 3 additions & 2 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2023 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -94,6 +94,7 @@ copyright-date:
-D BUILD_TEST=OFF
-D BUILD_EXAMPLE=OFF
-D ROCM_DEP_ROCMCORE=OFF
-D GPU_TARGETS="$GPU_TARGETS"
-B $CI_PROJECT_DIR/rocPRIM/build
-S $CI_PROJECT_DIR/rocPRIM
- cd $CI_PROJECT_DIR/rocPRIM/build
Expand All @@ -113,7 +114,7 @@ build:rocm:
- cmake
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_CXX_FLAGS="-Wall -Wextra"
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror -Wno-error=pass-failed"
-D CMAKE_BUILD_TYPE=Release
-D BUILD_TEST=ON
-D BUILD_EXAMPLE=ON
Expand Down
21 changes: 20 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,30 @@ Documentation for hipCUB is available at

## (Unreleased) hipCUB-3.2.0 for ROCm 6.2.0

### Added
* Add `DeviceCopy` function to have parity with CUB.
* In the rocPRIM backend, added `enum WarpExchangeAlgorithm`, which is used as the new optional template argument for `WarpExchange`.
* The potential values for the enum are `WARP_EXCHANGE_SMEM` and `WARP_EXCHANGE_SHUFFLE`.
* `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
* `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, but the `ItemsPerThread` must be a divisor of `WarpSize`.
* Added `tuple.hpp` which defines templates `hipcub::tuple`, `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
* Added new overloaded member functions to `BlockRadixSort` and `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom
type (`key_type`) can be sorted via these overloads, if an appropriate decomposer is passed. The decomposer has to implement
`operator(const key_type&)` which returns a `hipcub::tuple` of references pointing to members of `key_type`.

### Changed

* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.

### Fixed

* Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB.
It now derives the accumulator type as the result of the binary operator.
* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.
* `debug_synchronous` has been deprecated in hipCUB-2.13.2, and it no longer has any effect. With this release, passing `debug_synchronous`
to the device functions results in a deprecation warning both at runtime and at compile time.
* The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time
by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
* The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.

## (Unreleased) hipCUB-3.1.0 for ROCm 6.1.0

Expand Down
3 changes: 2 additions & 1 deletion benchmark/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2020-2023 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2020-2024 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -80,6 +80,7 @@ add_hipcub_benchmark(benchmark_block_run_length_decode.cpp)
add_hipcub_benchmark(benchmark_block_scan.cpp)
add_hipcub_benchmark(benchmark_block_shuffle.cpp)
add_hipcub_benchmark(benchmark_device_adjacent_difference.cpp)
add_hipcub_benchmark(benchmark_device_batch_copy.cpp)
add_hipcub_benchmark(benchmark_device_batch_memcpy.cpp)
add_hipcub_benchmark(benchmark_device_histogram.cpp)
add_hipcub_benchmark(benchmark_device_memory.cpp)
Expand Down
103 changes: 62 additions & 41 deletions benchmark/benchmark_block_radix_sort.cpp
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// MIT License
//
// Copyright (c) 2020-2022 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2020-2024 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -60,6 +60,23 @@ struct helper_blocked_blocked
hipcub::BlockRadixSort<T, BlockSize, ItemsPerThread, T> sort;
sort.Sort(keys, values);
}

template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread])
{
using custom_t = benchmark_utils::custom_type<T>;
hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread> sort;
sort.Sort(keys, benchmark_utils::custom_type_decomposer<custom_t>{});
}

template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread],
benchmark_utils::custom_type<T> (&values)[ItemsPerThread])
{
using custom_t = benchmark_utils::custom_type<T>;
hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread, custom_t> sort;
sort.Sort(keys, values, benchmark_utils::custom_type_decomposer<custom_t>{});
}
};

struct helper_blocked_striped
Expand All @@ -84,6 +101,25 @@ struct helper_blocked_striped
hipcub::BlockRadixSort<T, BlockSize, ItemsPerThread, T> sort;
sort.SortBlockedToStriped(keys, values);
}

template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread])
{
using custom_t = benchmark_utils::custom_type<T>;
hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread> sort;
sort.SortBlockedToStriped(keys, benchmark_utils::custom_type_decomposer<custom_t>{});
}

template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread],
benchmark_utils::custom_type<T> (&values)[ItemsPerThread])
{
using custom_t = benchmark_utils::custom_type<T>;
hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread, custom_t> sort;
sort.SortBlockedToStriped(keys,
values,
benchmark_utils::custom_type_decomposer<custom_t>{});
}
};

template<class Helper,
Expand All @@ -93,8 +129,8 @@ template<class Helper,
unsigned int Trials>
__global__ __launch_bounds__(BlockSize) void sort_keys_kernel(const T* input, T* output)
{
const unsigned int lid = hipThreadIdx_x;
const unsigned int block_offset = hipBlockIdx_x * ItemsPerThread * BlockSize;
const unsigned int lid = threadIdx.x;
const unsigned int block_offset = blockIdx.x * ItemsPerThread * BlockSize;

T keys[ItemsPerThread];
Helper::template load<BlockSize>(lid, input + block_offset, keys);
Expand All @@ -115,8 +151,8 @@ template<class Helper,
unsigned int Trials>
__global__ __launch_bounds__(BlockSize) void sort_pairs_kernel(const T* input, T* output)
{
const unsigned int lid = hipThreadIdx_x;
const unsigned int block_offset = hipBlockIdx_x * ItemsPerThread * BlockSize;
const unsigned int lid = threadIdx.x;
const unsigned int block_offset = blockIdx.x * ItemsPerThread * BlockSize;

T keys[ItemsPerThread];
T values[ItemsPerThread];
Expand Down Expand Up @@ -186,25 +222,13 @@ void run_benchmark(benchmark::State& state,

if(benchmark_kind == benchmark_kinds::sort_keys)
{
hipLaunchKernelGGL(
HIP_KERNEL_NAME(sort_keys_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>),
dim3(size / items_per_block),
dim3(BlockSize),
0,
stream,
d_input,
d_output);
sort_keys_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>
<<<dim3(size / items_per_block), dim3(BlockSize), 0, stream>>>(d_input, d_output);
}
else if(benchmark_kind == benchmark_kinds::sort_pairs)
{
hipLaunchKernelGGL(
HIP_KERNEL_NAME(sort_pairs_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>),
dim3(size / items_per_block),
dim3(BlockSize),
0,
stream,
d_input,
d_output);
sort_pairs_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>
<<<dim3(size / items_per_block), dim3(BlockSize), 0, stream>>>(d_input, d_output);
}
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipDeviceSynchronize());
Expand Down Expand Up @@ -246,27 +270,24 @@ void add_benchmarks(benchmark_kinds benchmark_kind
hipStream_t stream,
size_t size)
{
using custom_int_t = benchmark_utils::custom_type<int>;

std::vector<benchmark::internal::Benchmark*> bs = {
BENCHMARK_TYPE(int, 64),
BENCHMARK_TYPE(int, 128),
BENCHMARK_TYPE(int, 192),
BENCHMARK_TYPE(int, 256),
BENCHMARK_TYPE(int, 320),
BENCHMARK_TYPE(int, 512),

BENCHMARK_TYPE(int8_t, 64),
BENCHMARK_TYPE(int8_t, 128),
BENCHMARK_TYPE(int8_t, 192),
BENCHMARK_TYPE(int8_t, 256),
BENCHMARK_TYPE(int8_t, 320),
BENCHMARK_TYPE(int8_t, 512),

BENCHMARK_TYPE(long long, 64),
BENCHMARK_TYPE(long long, 128),
BENCHMARK_TYPE(long long, 192),
BENCHMARK_TYPE(long long, 256),
BENCHMARK_TYPE(long long, 320),
BENCHMARK_TYPE(long long, 512),
BENCHMARK_TYPE(int, 64), BENCHMARK_TYPE(int, 128),
BENCHMARK_TYPE(int, 192), BENCHMARK_TYPE(int, 256),
BENCHMARK_TYPE(int, 320), BENCHMARK_TYPE(int, 512),

BENCHMARK_TYPE(int8_t, 64), BENCHMARK_TYPE(int8_t, 128),
BENCHMARK_TYPE(int8_t, 192), BENCHMARK_TYPE(int8_t, 256),
BENCHMARK_TYPE(int8_t, 320), BENCHMARK_TYPE(int8_t, 512),

BENCHMARK_TYPE(long long, 64), BENCHMARK_TYPE(long long, 128),
BENCHMARK_TYPE(long long, 192), BENCHMARK_TYPE(long long, 256),
BENCHMARK_TYPE(long long, 320), BENCHMARK_TYPE(long long, 512),

BENCHMARK_TYPE(custom_int_t, 64), BENCHMARK_TYPE(custom_int_t, 128),
BENCHMARK_TYPE(custom_int_t, 192), BENCHMARK_TYPE(custom_int_t, 256),
BENCHMARK_TYPE(custom_int_t, 320), BENCHMARK_TYPE(custom_int_t, 512),
};

benchmarks.insert(benchmarks.end(), bs.begin(), bs.end());
Expand Down
4 changes: 2 additions & 2 deletions benchmark/benchmark_block_shuffle.cpp
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// MIT License
//
// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -265,7 +265,7 @@ void add_benchmarks_type(const std::string& name,
benchmarks.insert(benchmarks.end(), bs.begin(), bs.end());
}

#define CREATE_BENCHMARKS(T) add_benchmarks_type<Benchmark, int>(name, benchmarks, stream, size, #T)
#define CREATE_BENCHMARKS(T) add_benchmarks_type<Benchmark, T>(name, benchmarks, stream, size, #T)

template<class Benchmark>
void add_benchmarks(const std::string& name,
Expand Down
10 changes: 4 additions & 6 deletions benchmark/benchmark_device_adjacent_difference.cpp
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// MIT License
//
// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -106,8 +106,6 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
{
using output_type = T;

static constexpr bool debug_synchronous = false;

// Generate data
const std::vector<T> input = benchmark_utils::get_random_data<T>(size, 1, 100);

Expand All @@ -129,7 +127,8 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
std::size_t temp_storage_size{};
void* d_temp_storage = nullptr;

const auto launch = [&] {
const auto launch = [&]
{
return dispatch_adjacent_difference(left_tag,
copy_tag,
d_temp_storage,
Expand All @@ -138,8 +137,7 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
d_output,
size,
hipcub::Sum{},
stream,
debug_synchronous);
stream);
};
HIP_CHECK(launch());
HIP_CHECK(hipMalloc(&d_temp_storage, temp_storage_size));
Expand Down
Loading

0 comments on commit a9b8970

Please sign in to comment.