`thrust::all_of` is slower than a naive reduction #720

jrhemstad · 2019-09-17T23:05:04Z

In a thrust::all_of, when the first element that violates the predicate is discovered, the computation can be aborted, i.e., an "early exit".

For example, imagine you are given a thrust::device_vector<int64_t> and want to check if any of the values are negative. You could do this with a thrust::all_of or with a thrust::count_if:

thrust::device_vector<int64_t> values(...);
bool all_positive = thrust::all_of(values.begin(), values.end(), [](auto v){return v > 0;});
bool all_positive = values.size() == thrust::count_if(values.begin(), values.end(), [](auto v){return v > 0;});

count_if must read everything in values, whereas all_of can shortcut if an early exit exists. Therefore, I would expect all_of to out perform count_if when one or more negative values exist. If no negative values are present, then both all_of and count_if must read everything in values and I would expect their performance to be roughly equivalent.

However, this is not the case. I have found that the performance of thrust::all_of is extremely erratic with a 10x difference between the best and worst performance. Furthermore, an all_of is always slower than a naive reduction as in count_if.

Here are the results of performing 100 trials of the example I described above on an input size of 100,000,000 million int64_t elements on a GV100.

No Early Exit

	mean (us)	min (us)	max (us)
`all_of`	75269	56403	104922
`count_if`	3124	1686	4413

Single Early Exit

	mean (us)	min (us)	max (us)
`all_of`	51620	9346	370845
`count_if`	3100	1703	5158

As you can see, whether or not an early exit exists, all_of is always significantly slower than a count_if.

Looking at the profile of all_of (attached), it appears that the reason it is so slow is because a single invocation of all_of results in ~50 invocations of DeviceReduceKernel. I suspect this is because the implementation of all_of does a set of batched reductions in attempt to avoid reading the entire input when an early exit exists. However, launching all of these small kernels (each with their own allocation/free) results in a significant amount of overhead. This overhead is exacerbated by the fact that each batch is executed on the same stream, meaning there is no overlap or concurrency between batches.

I suspect a better implementation would launch a single kernel, where threads occasionally poll an atomic flag to check if an early exit exists, at which point they exit the computation. Or, forgo an attempt at an early exit and just do the naive reduction like in count_if.

nsys_profile.zip

Reproducer code:

// compile with `nvcc --std=c++14 -O3 --expt-extended-lambda thrust_logical.cu -o thrust_logical -lnvToolsExt`
#include <cxxabi.h>
#include <nvToolsExt.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/random.h>
#include <chrono>
#include <limits>

template <typename T>
struct time_result {
  T min{std::numeric_limits<T>::max()};
  T max{std::numeric_limits<T>::lowest()};
  T mean{0};
  T sum{0};
  std::size_t count{0};

  void add_measurement(T new_duration) {
    ++count;
    sum += new_duration;
    mean = sum / count;
    min = std::min(min, new_duration);
    max = std::max(max, new_duration);
  }

  std::string to_string() {
    return std::string{
        "count: " + std::to_string(count) + " mean: " + std::to_string(mean) +
        " min: " + std::to_string(min) + " max: " + std::to_string(max)};
  }
};

template <typename Duration = std::chrono::microseconds, typename F,
          typename... Args>
typename Duration::rep time_it(std::string const& name, F&& fun,
                               Args&&... args) {
  const auto begin = std::chrono::high_resolution_clock::now();
  nvtxRangePushA(name.c_str());
  std::forward<F>(fun)(std::forward<Args>(args)...);
  nvtxRangePop();
  const auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<Duration>(end - begin).count();
}

template <typename Duration = std::chrono::microseconds,
          typename InputGenerator, typename F, typename... Args>
auto time_trial(std::string const& name, std::size_t num_trials, InputGenerator&& generator, F&& f,
                Args&&... args) {
  time_result<typename Duration::rep> result{};
  for (auto i = 0; i < num_trials; ++i) {
    auto input = generator();
    result.add_measurement(time_it<Duration>(name, std::forward<F>(f), input,
                                             std::forward<Args>(args)...));
  }
  return result;
}

struct is_positive {
  template <typename T>
  bool __device__ operator()(T v) {
    return v > 0;
  }
};

int main(void) {
  constexpr std::size_t input_size{100'000'000};
  constexpr std::size_t num_trials{100};

  auto all_of = [](auto const& values) {
    return thrust::all_of(thrust::device, values.begin(), values.end(),
                          is_positive{});
  };

  auto count_if = [](auto const& values) {
    return thrust::count_if(thrust::device, values.begin(), values.end(),
                            is_positive{});
  };

  auto no_early_out = []() {
    nvtxRangePushA("no early out input");
    thrust::device_vector<int64_t> values(input_size, 1);
    cudaDeviceSynchronize();
    nvtxRangePop();
    return values;
  };

  auto early_out = []() {
    nvtxRangePushA("early out input");
    thrust::device_vector<int64_t> values(input_size, 1);
    thrust::default_random_engine engine(
        std::chrono::high_resolution_clock::now().time_since_epoch().count());
    thrust::uniform_int_distribution<std::size_t> distribution{0, input_size};
    auto random_location = distribution(engine);
    values[random_location] = -1;
    cudaDeviceSynchronize();
    nvtxRangePop();
    return values;
  };

  std::cout << "No early out(us):\n";
  std::cout << "all of: "
            << time_trial("all_of", num_trials, no_early_out, all_of).to_string()
            << std::endl;
  std::cout << "count if: "
            << time_trial("count_if", num_trials, no_early_out, count_if).to_string()
            << std::endl;


  std::cout << std::endl << "With early out(us):\n";
  std::cout << "all of: "
            << time_trial("all_of", num_trials, early_out, all_of).to_string()
            << std::endl;
  std::cout << "count if: "
            << time_trial("count_if", num_trials, early_out, count_if).to_string()
            << std::endl;


  return 0;
}

Tasks

[FEA]: Improve performance of thrust::all_of. #2113

The text was updated successfully, but these errors were encountered:

jrhemstad · 2019-09-18T18:11:05Z

I updated the original issue to reflect a correction I made in my benchmark code where the location of the "early out" element changes for each trial. This exposed significantly more variance in the all_of results and shows that it is always slower than a count_if.

karthikeyann · 2019-10-16T18:37:56Z

thrust::all_of and thrust::any_of are implemented using thrust::find_if,
/usr/local/cuda-10.0/targets/x86_64-linux/include/thrust/system/detail/generic/find.inl

 91   // this implementation breaks up the sequence into separate intervals
 92   // in an attempt to early-out as soon as a value is found
 93
 94   // TODO incorporate sizeof(InputType) into interval_threshold and round to multiple of 32
 95   const difference_type interval_threshold = 1 << 20;
 96   const difference_type interval_size = (thrust::min)(interval_threshold, n);

could implementing this // TODO solve the performance issue?
(if this issue does not arise or is less severe for smaller datatype, this might solve the issue)

karthikeyann · 2019-10-16T19:53:24Z

@jrhemstad
Additional details:
thrust::count_if uses transform_reduce, which uses thrust::plus
thrust::find_if uses reduce on a tuple(pred, index) with thrust::min operator on index.

index is not necessary for all_of or any_of. This tuple<bool, size_t> will consume more registers too.

Alternative implementation of thrust::any_of could be using transform_reduce with thrust::maximum or thrust::logical_or operator on pred result.
Similarly for thrust::all_of using transform_reduce with thrust::minimum or thrust::logical_and.
Benchmarked using following extra code.

  auto reduce_and = [](auto const& values) {
    return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                            is_positive{}, true, thrust::logical_and<bool>{} );
  };

  auto reduce_min = [](auto const& values) {
    return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                            is_positive{}, false, thrust::minimum<bool>{} );
  };

Runtime are similar for count, logical_and, minimum.

I created new thrust::any_of using transform_reduce with logical_or, and used it for thrust::all_of (along with early exit). This is faster.

No early out(us):
all of: count: 100 mean: 62023 min: 57710 max: 65030
count if: count: 100 mean: 2478 min: 1577 max: 3471
reduce and: count: 100 mean: 2469 min: 1588 max: 3476
reduce min: count: 100 mean: 2494 min: 1566 max: 3477
new all_of: count: 100 mean: 933 min: 352 max: 1942

With early out(us):
all of: count: 100 mean: 33574 min: 4148 max: 114289
count if: count: 100 mean: 2466 min: 1615 max: 2926
reduce and: count: 100 mean: 2536 min: 1577 max: 3792
reduce min: count: 100 mean: 2518 min: 1585 max: 3528
new all_of: count: 100 mean: 918 min: 381 max: 1949

karthikeyann · 2019-10-16T20:00:25Z

new_any_of and new_all_of implementation attached with

Reproducer code:

thrust_logical.cu.zip

jrhemstad · 2019-10-16T20:04:53Z

I don't know what new all_of is, but there must be something wrong with the implementation because these numbers are impossible:

No early out(us):
new all_of: count: 100 mean: 933 min: 352 max: 1942

With early out(us):
new all_of: count: 100 mean: 918 min: 381 max: 1949

If no early out exists, then you need to read all 100,000,000 int64_t elements in the input.

(100,000,000 * 8B) / 352us -> 2.2 TB/s

That's well over the 900GB/s theoretical peak of a V100 GPU.

jrhemstad · 2019-10-16T20:13:36Z

I would expect any reduction based implementation to perform the same (as your results show). Since reduction is bandwidth bound, it doesn't really matter what your binary operator is (sum, or, and, etc.) in the reduction.

Furthermore, your results are fishy because if an early out does not exist, then the new all_of implementation should not be any faster than any of the other reduction based implementations. Since your new all_of is just doing a batched transform_reduce, how could it be faster than just doing a single transform_reduce?

karthikeyann · 2019-10-16T20:49:02Z

You are right. My implementation has a bug.
I fixed it and have the updated benchmarks.

No early out(us):
all of: count: 100 mean: 63981 min: 59478 max: 67889
count if: count: 100 mean: 2515 min: 1588 max: 3482
reduce and: count: 100 mean: 2454 min: 1582 max: 2946
reduce min: count: 100 mean: 2462 min: 1573 max: 3475
new all_of: count: 100 mean: 38071 min: 34610 max: 102533

With early out(us):
all of: count: 100 mean: 36221 min: 6262 max: 64545
count if: count: 100 mean: 2434 min: 1565 max: 2843
reduce and: count: 100 mean: 2448 min: 1624 max: 3183
reduce min: count: 100 mean: 2462 min: 1562 max: 3470
new all_of: count: 100 mean: 24840 min: 934 max: 94396

new_all_of is slower. In fact, max time is worst among all. (early out min is only faster!)

jrhemstad · 2019-10-16T21:32:02Z

@karthikeyann Those results look much more like what I would expect.

While the new all_of can be faster, the fact that on average it is 10x slower confirms in my mind that the extra complexity of trying to take advantage of an early out actually harms performance in the general case.

…11202) The current implementation of `cudf::contains(column_view, scalar)` uses `thrust::find` and `thrust::any_of` (which also calls `thrust::find_if` under the hood). These thrust APIs were known to have performance regression (https://github.com/NVIDIA/thrust/issues/1016). This PR replaces `thrust::find` and `thrust::any_of` in `cudf::contains` by `thrust::count_if`, which improves performance significantly. Benchmarks show that the run time can be reduced as much as 80% after modification, or up to 5X speedup. Closes #3806. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Bradley Dice (https://github.com/bdice) URL: #11202

Related to: NVIDIA#720

jrhemstad changed the title ~~thrust::all_of is very slow without an early exit~~ thrust::all_of is slower than a naive reduction Sep 18, 2019

jrhemstad mentioned this issue Oct 14, 2019

[FEA] Add libcudf any() and all() reductions rapidsai/cudf#1874

Closed

This was referenced Nov 12, 2019

[REVIEW] Define and implement new stream compaction APIs copy_if, drop_nulls, apply_boolean_mask, drop_duplicate and unique_count. rapidsai/cudf#3303

Merged

[REVIEW] Define and implement new search APIs rapidsai/cudf#3229

Merged

jrhemstad mentioned this issue Nov 26, 2019

[REVIEW] Port scatter to libcudf++ rapidsai/cudf#3354

Merged

1 task

galipremsagar mentioned this issue May 8, 2020

[FEA] Add a new parameter for all_integer & all_float to evaluate null values as True rapidsai/cudf#5136

Closed

mythrocks mentioned this issue Feb 8, 2021

Refactor dictionary support for reductions any/all rapidsai/cudf#7242

Merged

ChuckHastings mentioned this issue Mar 25, 2021

[REVIEW] Implement C/CUDA RandomWalks functionality rapidsai/cugraph#1439

Merged

jrhemstad mentioned this issue Mar 29, 2021

Fix NaN handling in drop_list_duplicates rapidsai/cudf#7662

Merged

ttnghia mentioned this issue Apr 13, 2021

Refactor libcudf strings concatenate to use make_strings_children rapidsai/cudf#7937

Merged

ChuckHastings mentioned this issue Apr 28, 2021

[ENH] Added Random Walks COO convertor and profiling rapidsai/cugraph#1531

Merged

harrism mentioned this issue Apr 28, 2021

Implement concatenate_rows for list type rapidsai/cudf#8049

Merged

harrism mentioned this issue Sep 20, 2021

Add struct type support for drop_list_duplicates rapidsai/cudf#9202

Merged

jrhemstad mentioned this issue Sep 29, 2021

[FEA] Reduce building time NVIDIA/cuCollections#93

Closed

alliepiper mentioned this issue Jan 12, 2022

Python Comparison Scripts NVIDIA/nvbench#9

Open

GregoryKimball mentioned this issue Jun 30, 2022

[FEA] Performance - search contains using slow approach rapidsai/cudf#3806

Closed

ttnghia mentioned this issue Jul 6, 2022

Improve performance for cudf::contains when searching for a scalar rapidsai/cudf#11202

Merged

mythrocks mentioned this issue Aug 10, 2022

Sanitize percentile_approx() output for empty input rapidsai/cudf#11498

Merged

vyasr mentioned this issue Oct 24, 2022

Performance improvement in JSON Tree traversal rapidsai/cudf#11919

Merged

3 tasks

ttnghia mentioned this issue Dec 14, 2022

Add code to detect possible malformed page data in parquet files. rapidsai/cudf#12360

Merged

mythrocks mentioned this issue Feb 10, 2023

Throw an exception if an unsupported page encoding is detected in Parquet reader rapidsai/cudf#12754

Merged

3 tasks

jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023

seunghwak mentioned this issue Apr 17, 2023

Update select_random_vertices to sample from a given distributed set or from (0, V] rapidsai/cugraph#3455

Merged

github-project-automation bot added this to CCCL Nov 8, 2023

github-project-automation bot moved this to Todo in CCCL Nov 8, 2023

jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023

alliepiper mentioned this issue Mar 2, 2024

Speedup of any_of and find_if operations #712

Closed

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jun 14, 2024

Implement thrust::all_of/any_of via thrust::count_if

f68721c

Related to: NVIDIA#720

bernhardmgruber mentioned this issue Jun 14, 2024

Improve thrust::equals #1870

Closed

gevtushenko assigned gonidelis Jun 17, 2024

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jun 24, 2024

Implement thrust::all_of/any_of via thrust::count_if

9b6f790

Related to: NVIDIA#720

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jun 24, 2024

Implement thrust::all_of/any_of via thrust::count_if

1d3085a

Related to: NVIDIA#720

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jul 5, 2024

Implement thrust::all_of/any_of via thrust::count_if

abdb32e

Related to: NVIDIA#720

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jul 5, 2024

Implement thrust::all_of/any_of via thrust::count_if

974b0bb

Related to: NVIDIA#720

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jul 8, 2024

Implement thrust::all_of/any_of via thrust::count_if

1daea9e

Related to: NVIDIA#720

wmaxey moved this from Todo to In Progress in CCCL Jul 17, 2024

gonidelis mentioned this issue Jul 30, 2024

[FEA]: Improve performance of thrust::all_of. #2113

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`thrust::all_of` is slower than a naive reduction #720

`thrust::all_of` is slower than a naive reduction #720

jrhemstad commented Sep 17, 2019 •

edited

Loading

jrhemstad commented Sep 18, 2019

karthikeyann commented Oct 16, 2019 •

edited

Loading

karthikeyann commented Oct 16, 2019

karthikeyann commented Oct 16, 2019

jrhemstad commented Oct 16, 2019

jrhemstad commented Oct 16, 2019

karthikeyann commented Oct 16, 2019

jrhemstad commented Oct 16, 2019

thrust::all_of is slower than a naive reduction #720

thrust::all_of is slower than a naive reduction #720

Comments

jrhemstad commented Sep 17, 2019 • edited Loading

No Early Exit

Single Early Exit

Tasks

jrhemstad commented Sep 18, 2019

karthikeyann commented Oct 16, 2019 • edited Loading

karthikeyann commented Oct 16, 2019

karthikeyann commented Oct 16, 2019

jrhemstad commented Oct 16, 2019

jrhemstad commented Oct 16, 2019

karthikeyann commented Oct 16, 2019

jrhemstad commented Oct 16, 2019

`thrust::all_of` is slower than a naive reduction #720

`thrust::all_of` is slower than a naive reduction #720

jrhemstad commented Sep 17, 2019 •

edited

Loading

karthikeyann commented Oct 16, 2019 •

edited

Loading