Upstream staging 2024 03 22 (#351)

* refactor: Use rocPRIM overloads for warp_scan::exclusive_scan wo initial value Exclusive scans without an initial value are now present on the rocPRIM public API. Use those instead of relying on "hidden" APIs added as a workaround between the two libraries. * Implemented WARP_EXCHANGE_SHUFFLE * Testing WARP_EXCHANGE_SHUFFLE and refactored warp_exchange test suite * Updated benchmarks with WARP_EXCHANGE_SHUFFLE * Updated changelog * NVCC build fixes and warning fixes * Put host_warp_size_wrapper in ::hipcub::detail namespace Previously it was in the global ::detail namespace. * Preprocessor definitions needed for debug_synchronous deprecation * adjacent_difference: deprecated debug_synchronous * device histogram: deprecated debug synchronous * Removed optional debug_synchronous argument in DeviceMemcpy This argument was never on the CUB interface and shouldn't have been added in the first place. * device merge sort: deprecated debug_synchronous * device partition: deprecated debug_synchronous * device radix sort: deprecated debug_synchronous * device reduce: deprecated debug_synchronous * device run length encode: deprecate debug_synchronous * device scan: deprecated debug_synchronous * device segmented radix sort: deprecated debug_synchronous * device segmented reduce: deprecated debug_synchronous * Build warning free & enabled -Werror in CI * device segmented sort: deprecated debug_synchronous * device select: deprecated debug_synchronous * Refactored HIPCUB_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR * device SPMV: deprecated debug_synchronous and added missing test to CMakeLists * Improved HIPCUB_DETAIL_RUNTIME_LOG_DEBUG_SYNCHRONOUS * Fixed formatting * ci: pass-failed warning does not imply failure * Fixed documentation * Updated changelog * Removed DeviceSelectWarpSize from tests * Removed DeviceSelectWarpSize from benchmarks * Use hipcub's DiscardOutputIterator instead of custom one * add device_copy add test for device_copy add benchmark for device_copy * update docs * fix format * fix copyright date * add device_copy to cub backend * update changelog * fix review comments * fix format * clarify warp scan interface * Added hipcub::tuple * Added decomposer overloads to BlockRadixSort * Testing BlockRadixSort decomposer overloads * Benchmarking BlockRadixSort decomposer overloads * Added tuple_element_t to cub/tuple,hpp * Fixed formatting * Tidied and updated changelog * DeviceRadixSort decomposer overloads (CUB backend) * DeviceRadixSort decomposer overloads (rocprim backend) * Testing DeviceRadixSort decomposer overloads * Benchmarking DeviceRadixSort decomposer overloads * Updated changelog * Added select_plus_operator_host for calculating on host with double precision * Host reference calculations are done in double precision in the tests * Clang format on test_hipcub_block_scan.cpp * Added precision for different types to test_utils and updated host scan using double precision during calculation * Added more specific precision checks to the tests * Changed precision for nvcc support * Changed rocprim types to test_utils type in device_scan test * Remove unused variable test device_scan * Change transform for nvcc compiler device_scan test * Added more precise precision checks for warp_scan and warp_reduce * Remove cast_type from test_utils * Templatize on overloaded operator instead of struct * hibcup test device_reduce_by_key updated to assert with precision * Added assert near with better precision to tests device_reduce, device spmv and thread_operators * Changed for block_reduce and block_scan all asserts to assert_near * Replaced is_plus by is_add in test_utils * specify architecture for rocprim build --------- Co-authored-by: Gergely Meszaros <[email protected]> Co-authored-by: Lőrinc Serfőző <[email protected]> Co-authored-by: Beatriz Navidad Vilches <[email protected]> Co-authored-by: Nol Moonen <[email protected]> Co-authored-by: Nick Breed <[email protected]>
ROCm · Apr 3, 2024 · a9b8970 · a9b8970
1 parent 72d340c
commit a9b8970
Show file tree

Hide file tree

Showing 101 changed files with 14,186 additions and 6,886 deletions.
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -1,6 +1,6 @@
 # MIT License
 #
-# Copyright (c) 2017-2023 Advanced Micro Devices, Inc. All rights reserved.
+# Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
@@ -94,6 +94,7 @@ copyright-date:
       -D BUILD_TEST=OFF
       -D BUILD_EXAMPLE=OFF
       -D ROCM_DEP_ROCMCORE=OFF
+      -D GPU_TARGETS="$GPU_TARGETS"
       -B $CI_PROJECT_DIR/rocPRIM/build
       -S $CI_PROJECT_DIR/rocPRIM
     - cd $CI_PROJECT_DIR/rocPRIM/build
@@ -113,7 +114,7 @@ build:rocm:
     - cmake
       -G Ninja
       -D CMAKE_CXX_COMPILER="$AMDCLANG"
-      -D CMAKE_CXX_FLAGS="-Wall -Wextra"
+      -D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror -Wno-error=pass-failed"
       -D CMAKE_BUILD_TYPE=Release
       -D BUILD_TEST=ON
       -D BUILD_EXAMPLE=ON

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,11 +5,30 @@ Documentation for hipCUB is available at
 
 ## (Unreleased) hipCUB-3.2.0 for ROCm 6.2.0
 
+### Added
+* Add `DeviceCopy` function to have parity with CUB.
+* In the rocPRIM backend, added `enum WarpExchangeAlgorithm`, which is used as the new optional template argument for `WarpExchange`.
+  * The potential values for the enum are `WARP_EXCHANGE_SMEM` and `WARP_EXCHANGE_SHUFFLE`.
+  * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
+  * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, but the `ItemsPerThread` must be a divisor of `WarpSize`.
+* Added `tuple.hpp` which defines templates `hipcub::tuple`, `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
+* Added new overloaded member functions to `BlockRadixSort` and `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom
+  type (`key_type`) can be sorted via these overloads, if an appropriate decomposer is passed. The decomposer has to implement
+  `operator(const key_type&)` which returns a `hipcub::tuple` of references pointing to members of `key_type`.
+
+### Changed
+
+* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.
+
 ### Fixed
 
 * Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB.
   It now derives the accumulator type as the result of the binary operator.
-* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.
+* `debug_synchronous` has been deprecated in hipCUB-2.13.2, and it no longer has any effect. With this release, passing `debug_synchronous`
+  to the device functions results in a deprecation warning both at runtime and at compile time.
+  * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time
+    by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
+  * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.
 
 ## (Unreleased) hipCUB-3.1.0 for ROCm 6.1.0
 

diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
@@ -1,6 +1,6 @@
 # MIT License
 #
-# Copyright (c) 2020-2023 Advanced Micro Devices, Inc. All rights reserved.
+# Copyright (c) 2020-2024 Advanced Micro Devices, Inc. All rights reserved.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
@@ -80,6 +80,7 @@ add_hipcub_benchmark(benchmark_block_run_length_decode.cpp)
 add_hipcub_benchmark(benchmark_block_scan.cpp)
 add_hipcub_benchmark(benchmark_block_shuffle.cpp)
 add_hipcub_benchmark(benchmark_device_adjacent_difference.cpp)
+add_hipcub_benchmark(benchmark_device_batch_copy.cpp)
 add_hipcub_benchmark(benchmark_device_batch_memcpy.cpp)
 add_hipcub_benchmark(benchmark_device_histogram.cpp)
 add_hipcub_benchmark(benchmark_device_memory.cpp)

diff --git a/benchmark/benchmark_block_radix_sort.cpp b/benchmark/benchmark_block_radix_sort.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2020-2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2020-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -60,6 +60,23 @@ struct helper_blocked_blocked
         hipcub::BlockRadixSort<T, BlockSize, ItemsPerThread, T> sort;
         sort.Sort(keys, values);
     }
+
+    template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
+    HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread])
+    {
+        using custom_t = benchmark_utils::custom_type<T>;
+        hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread> sort;
+        sort.Sort(keys, benchmark_utils::custom_type_decomposer<custom_t>{});
+    }
+
+    template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
+    HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread],
+                                   benchmark_utils::custom_type<T> (&values)[ItemsPerThread])
+    {
+        using custom_t = benchmark_utils::custom_type<T>;
+        hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread, custom_t> sort;
+        sort.Sort(keys, values, benchmark_utils::custom_type_decomposer<custom_t>{});
+    }
 };
 
 struct helper_blocked_striped
@@ -84,6 +101,25 @@ struct helper_blocked_striped
         hipcub::BlockRadixSort<T, BlockSize, ItemsPerThread, T> sort;
         sort.SortBlockedToStriped(keys, values);
     }
+
+    template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
+    HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread])
+    {
+        using custom_t = benchmark_utils::custom_type<T>;
+        hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread> sort;
+        sort.SortBlockedToStriped(keys, benchmark_utils::custom_type_decomposer<custom_t>{});
+    }
+
+    template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
+    HIPCUB_DEVICE static void sort(benchmark_utils::custom_type<T> (&keys)[ItemsPerThread],
+                                   benchmark_utils::custom_type<T> (&values)[ItemsPerThread])
+    {
+        using custom_t = benchmark_utils::custom_type<T>;
+        hipcub::BlockRadixSort<custom_t, BlockSize, ItemsPerThread, custom_t> sort;
+        sort.SortBlockedToStriped(keys,
+                                  values,
+                                  benchmark_utils::custom_type_decomposer<custom_t>{});
+    }
 };
 
 template<class Helper,
@@ -93,8 +129,8 @@ template<class Helper,
          unsigned int Trials>
 __global__ __launch_bounds__(BlockSize) void sort_keys_kernel(const T* input, T* output)
 {
-    const unsigned int lid = hipThreadIdx_x;
-    const unsigned int block_offset = hipBlockIdx_x * ItemsPerThread * BlockSize;
+    const unsigned int lid          = threadIdx.x;
+    const unsigned int block_offset = blockIdx.x * ItemsPerThread * BlockSize;
 
     T keys[ItemsPerThread];
     Helper::template load<BlockSize>(lid, input + block_offset, keys);
@@ -115,8 +151,8 @@ template<class Helper,
          unsigned int Trials>
 __global__ __launch_bounds__(BlockSize) void sort_pairs_kernel(const T* input, T* output)
 {
-    const unsigned int lid = hipThreadIdx_x;
-    const unsigned int block_offset = hipBlockIdx_x * ItemsPerThread * BlockSize;
+    const unsigned int lid          = threadIdx.x;
+    const unsigned int block_offset = blockIdx.x * ItemsPerThread * BlockSize;
 
     T keys[ItemsPerThread];
     T values[ItemsPerThread];
@@ -186,25 +222,13 @@ void run_benchmark(benchmark::State& state,
 
         if(benchmark_kind == benchmark_kinds::sort_keys)
         {
-            hipLaunchKernelGGL(
-                HIP_KERNEL_NAME(sort_keys_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>),
-                dim3(size / items_per_block),
-                dim3(BlockSize),
-                0,
-                stream,
-                d_input,
-                d_output);
+            sort_keys_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>
+                <<<dim3(size / items_per_block), dim3(BlockSize), 0, stream>>>(d_input, d_output);
         }
         else if(benchmark_kind == benchmark_kinds::sort_pairs)
         {
-            hipLaunchKernelGGL(
-                HIP_KERNEL_NAME(sort_pairs_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>),
-                dim3(size / items_per_block),
-                dim3(BlockSize),
-                0,
-                stream,
-                d_input,
-                d_output);
+            sort_pairs_kernel<Helper, T, BlockSize, ItemsPerThread, Trials>
+                <<<dim3(size / items_per_block), dim3(BlockSize), 0, stream>>>(d_input, d_output);
         }
         HIP_CHECK(hipPeekAtLastError());
         HIP_CHECK(hipDeviceSynchronize());
@@ -246,27 +270,24 @@ void add_benchmarks(benchmark_kinds                               benchmark_kind
                     hipStream_t                                   stream,
                     size_t                                        size)
 {
+    using custom_int_t = benchmark_utils::custom_type<int>;
+
     std::vector<benchmark::internal::Benchmark*> bs = {
-        BENCHMARK_TYPE(int, 64),
-        BENCHMARK_TYPE(int, 128),
-        BENCHMARK_TYPE(int, 192),
-        BENCHMARK_TYPE(int, 256),
-        BENCHMARK_TYPE(int, 320),
-        BENCHMARK_TYPE(int, 512),
-
-        BENCHMARK_TYPE(int8_t, 64),
-        BENCHMARK_TYPE(int8_t, 128),
-        BENCHMARK_TYPE(int8_t, 192),
-        BENCHMARK_TYPE(int8_t, 256),
-        BENCHMARK_TYPE(int8_t, 320),
-        BENCHMARK_TYPE(int8_t, 512),
-
-        BENCHMARK_TYPE(long long, 64),
-        BENCHMARK_TYPE(long long, 128),
-        BENCHMARK_TYPE(long long, 192),
-        BENCHMARK_TYPE(long long, 256),
-        BENCHMARK_TYPE(long long, 320),
-        BENCHMARK_TYPE(long long, 512),
+        BENCHMARK_TYPE(int, 64),           BENCHMARK_TYPE(int, 128),
+        BENCHMARK_TYPE(int, 192),          BENCHMARK_TYPE(int, 256),
+        BENCHMARK_TYPE(int, 320),          BENCHMARK_TYPE(int, 512),
+
+        BENCHMARK_TYPE(int8_t, 64),        BENCHMARK_TYPE(int8_t, 128),
+        BENCHMARK_TYPE(int8_t, 192),       BENCHMARK_TYPE(int8_t, 256),
+        BENCHMARK_TYPE(int8_t, 320),       BENCHMARK_TYPE(int8_t, 512),
+
+        BENCHMARK_TYPE(long long, 64),     BENCHMARK_TYPE(long long, 128),
+        BENCHMARK_TYPE(long long, 192),    BENCHMARK_TYPE(long long, 256),
+        BENCHMARK_TYPE(long long, 320),    BENCHMARK_TYPE(long long, 512),
+
+        BENCHMARK_TYPE(custom_int_t, 64),  BENCHMARK_TYPE(custom_int_t, 128),
+        BENCHMARK_TYPE(custom_int_t, 192), BENCHMARK_TYPE(custom_int_t, 256),
+        BENCHMARK_TYPE(custom_int_t, 320), BENCHMARK_TYPE(custom_int_t, 512),
     };
 
     benchmarks.insert(benchmarks.end(), bs.begin(), bs.end());

diff --git a/benchmark/benchmark_block_shuffle.cpp b/benchmark/benchmark_block_shuffle.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -265,7 +265,7 @@ void add_benchmarks_type(const std::string&                            name,
     benchmarks.insert(benchmarks.end(), bs.begin(), bs.end());
 }
 
-#define CREATE_BENCHMARKS(T) add_benchmarks_type<Benchmark, int>(name, benchmarks, stream, size, #T)
+#define CREATE_BENCHMARKS(T) add_benchmarks_type<Benchmark, T>(name, benchmarks, stream, size, #T)
 
 template<class Benchmark>
 void add_benchmarks(const std::string&                            name,

diff --git a/benchmark/benchmark_device_adjacent_difference.cpp b/benchmark/benchmark_device_adjacent_difference.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -106,8 +106,6 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
 {
     using output_type = T;
 
-    static constexpr bool debug_synchronous = false;
-
     // Generate data
     const std::vector<T> input = benchmark_utils::get_random_data<T>(size, 1, 100);
 
@@ -129,7 +127,8 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
     std::size_t temp_storage_size{};
     void*       d_temp_storage = nullptr;
 
-    const auto launch = [&] {
+    const auto launch = [&]
+    {
         return dispatch_adjacent_difference(left_tag,
                                             copy_tag,
                                             d_temp_storage,
@@ -138,8 +137,7 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
                                             d_output,
                                             size,
                                             hipcub::Sum{},
-                                            stream,
-                                            debug_synchronous);
+                                            stream);
     };
     HIP_CHECK(launch());
     HIP_CHECK(hipMalloc(&d_temp_storage, temp_storage_size));