Add sort benchmark #278

Closed

aprokop
Contributor

@aprokop aprokop commented Apr 21, 2020

No description provided.

@aprokop aprokop added the testing and performance labels Apr 21, 2020
@aprokop aprokop force-pushed the sort_benchmark branch 2 times, most recently from 441f39f to cca7303 Compare April 21, 2020 19:09
@aprokop aprokop marked this pull request as ready for review April 21, 2020 19:09
@aprokop
Contributor Author

aprokop commented Apr 21, 2020

So far, I cannot reproduce results from here, at least on my workstation.

@masterleinad
Collaborator

Do you see anything similar to my observations in #222?

@aprokop
Contributor Author

aprokop commented Apr 22, 2020

I think so: for smaller sizes, Serial outperforms Cuda for both sort and permutations, though that was on random data.

@masterleinad
Collaborator

Looks like there is some guard missing:

In file included from /var/jenkins/workspace/ArborX_PR-278/benchmarks/sort/sort_benchmark.cpp:21:
/var/jenkins/workspace/ArborX_PR-278/benchmarks/sort/pss_parallel_stable_sort.hpp:108:12: error: use of undeclared identifier 'omp_get_max_threads'
  auto t = omp_get_max_threads();
           ^
1 error generated when compiling for sm_70.
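
One possible shape for such a guard, as a minimal sketch: it assumes the failure comes from the header being compiled in a pass where OpenMP is not enabled, so `omp.h` is never included. The helper name `max_threads` is hypothetical and not part of `pss_parallel_stable_sort.hpp`.

```cpp
#include <cassert>

// Only include the OpenMP runtime header when the compiler has OpenMP enabled
// (compilers define _OPENMP in that case).
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of threads OpenMP may use, or 1 when the
// translation unit is compiled without OpenMP support.
inline int max_threads()
{
#ifdef _OPENMP
  int const n = omp_get_max_threads();
  assert(n >= 1);
  return n;
#else
  return 1;
#endif
}
```

With this, a line such as `auto t = omp_get_max_threads();` could become `auto t = max_threads();`, which still compiles when OpenMP is unavailable.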


Kokkos::BinSort<ViewType, CompType, typename ViewType::device_type,
                SizeType>
    bin_sort(view, CompType(n / 2, result.min_val, result.max_val), true);
Collaborator

Can you remind me why we set the maximum number of bins to n/2 here?

Contributor Author

I don't recall, maybe @dalg24 remembers. Maybe this is one of the things we need to look at.
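
For context on what the bin count controls: with n values and n / 2 bins, each bin holds two values on average when the values are spread uniformly. The sketch below shows a plain linear value-to-bin mapping. It is an illustration only, not the Kokkos implementation, and `bin_index` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the Kokkos bin operator): maps a value in
// [min_val, max_val] to one of `num_bins` equally sized bins.
inline std::size_t bin_index(double value, double min_val, double max_val,
                             std::size_t num_bins)
{
  assert(num_bins > 0 && max_val > min_val);
  assert(value >= min_val && value <= max_val);
  double const scaled = (value - min_val) / (max_val - min_val) * num_bins;
  // Clamp so that value == max_val falls into the last bin.
  return std::min(static_cast<std::size_t>(scaled), num_bins - 1);
}
```

Fewer bins mean more values per bin and more work sorting within each bin; more bins mean more memory and more empty bins, which is the trade-off behind the choice of n / 2.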

int const n = in.extent(0);
Kokkos::parallel_for(
    Kokkos::RangePolicy<ExecutionSpace>(0, n),
    KOKKOS_LAMBDA(int const i) { out(permute(i)) = in(i); });
Collaborator

It might also be interesting to see what happens if the read access is non-consecutive and the write access is consecutive.

Contributor Author

Yes, I was thinking the same thing; maybe introduce a reverse permutation option.
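
The two access patterns discussed here can be sketched on the host with plain `std::vector` (no Kokkos). The function names are hypothetical. The scatter version mirrors `out(permute(i)) = in(i)`; the gather version uses the inverse permutation, so reads are permuted and writes are consecutive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter: consecutive reads from `in`, permuted writes to `out`.
template <typename T>
std::vector<T> apply_scatter(std::vector<std::size_t> const &permute,
                             std::vector<T> const &in)
{
  assert(permute.size() == in.size());
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[permute[i]] = in[i];
  return out;
}

// Inverse permutation: inverse[permute[i]] == i for all i.
inline std::vector<std::size_t> invert(std::vector<std::size_t> const &permute)
{
  std::vector<std::size_t> inverse(permute.size());
  for (std::size_t i = 0; i < permute.size(); ++i)
    inverse[permute[i]] = i;
  return inverse;
}

// Gather: permuted reads from `in`, consecutive writes to `out`.
template <typename T>
std::vector<T> apply_gather(std::vector<std::size_t> const &inverse,
                            std::vector<T> const &in)
{
  assert(inverse.size() == in.size());
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = in[inverse[i]];
  return out;
}
```

Both produce the same output, so a benchmark could time them side by side to isolate the cost of non-consecutive reads versus non-consecutive writes.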

@aprokop
Contributor Author

aprokop commented Apr 26, 2020

I have a new branch, sort_benchmark_update, in my repo that allows comparing some Kokkos Cuda kernels against the same kernels run on the host after copying the data there. However, I'm not sure what I did during refactoring: some timings in the new branch (even Kokkos Serial) are significantly slower than those from this branch. I would appreciate some help.

@masterleinad
Collaborator

> I have a new branch, sort_benchmark_update, in my repo that allows comparing some Kokkos Cuda kernels against the same kernels run on the host after copying the data there. However, I'm not sure what I did during refactoring: some timings in the new branch (even Kokkos Serial) are significantly slower than those from this branch. I would appreciate some help.

So you obtained some timings already? Did you save your numbers somewhere?

@masterleinad
Collaborator

These are some of the numbers I am seeing:

sort_10:apply_permutation<Kokkos_Serial>/10/manual_time                          1 us          2 us     560675
sort_10:apply_permutation<Kokkos_OpenMP>/10/manual_time                          9 us         10 us      76886
sort_10:apply_permutation<Kokkos_Cuda>/10/manual_time                           11 us         12 us      62390
sort_1e2:apply_permutation<Kokkos_Serial>/100/manual_time                          2 us          2 us     466290
sort_1e2:apply_permutation<Kokkos_OpenMP>/100/manual_time                         10 us         11 us      68370
sort_1e2:apply_permutation<Kokkos_Cuda>/100/manual_time                           11 us         12 us      63734
sort_1e3:apply_permutation<Kokkos_Serial>/1000/manual_time                          4 us          4 us     181814
sort_1e3:apply_permutation<Kokkos_OpenMP>/1000/manual_time                         17 us         17 us      38531
sort_1e3:apply_permutation<Kokkos_Cuda>/1000/manual_time                           10 us         11 us      69827
sort_1e4:apply_permutation<Kokkos_Serial>/10000/manual_time                         30 us         31 us      23890
sort_1e4:apply_permutation<Kokkos_OpenMP>/10000/manual_time                        102 us        103 us       6786
sort_1e4:apply_permutation<Kokkos_Cuda>/10000/manual_time                           10 us         10 us      72407
sort_2e3:apply_permutation<Kokkos_Serial>/2000/manual_time                          7 us          7 us     105497
sort_2e3:apply_permutation<Kokkos_OpenMP>/2000/manual_time                         37 us         37 us      19499
sort_2e3:apply_permutation<Kokkos_Cuda>/2000/manual_time                           11 us         12 us      62461
sort_3e3:apply_permutation<Kokkos_Serial>/3000/manual_time                          9 us         10 us      71311
sort_3e3:apply_permutation<Kokkos_OpenMP>/3000/manual_time                         35 us         35 us      20418
sort_3e3:apply_permutation<Kokkos_Cuda>/3000/manual_time                           10 us         11 us      62284
sort_4e3:apply_permutation<Kokkos_Serial>/4000/manual_time                         12 us         13 us      60270
sort_4e3:apply_permutation<Kokkos_OpenMP>/4000/manual_time                         58 us         59 us      11607
sort_4e3:apply_permutation<Kokkos_Cuda>/4000/manual_time                           11 us         12 us      63973
sort_5e3:apply_permutation<Kokkos_Serial>/5000/manual_time                         14 us         15 us      47432
sort_5e3:apply_permutation<Kokkos_OpenMP>/5000/manual_time                         64 us         65 us       8979
sort_5e3:apply_permutation<Kokkos_Cuda>/5000/manual_time                           10 us         11 us      69617
sort_6e3:apply_permutation<Kokkos_Serial>/6000/manual_time                         17 us         18 us      42626
sort_6e3:apply_permutation<Kokkos_OpenMP>/6000/manual_time                         97 us         97 us       7122
sort_6e3:apply_permutation<Kokkos_Cuda>/6000/manual_time                           11 us         12 us      62130
sort_7e3:apply_permutation<Kokkos_Serial>/7000/manual_time                         20 us         21 us      34622
sort_7e3:apply_permutation<Kokkos_OpenMP>/7000/manual_time                        127 us        128 us       5607
sort_7e3:apply_permutation<Kokkos_Cuda>/7000/manual_time                           10 us         10 us      70036
sort_8e3:apply_permutation<Kokkos_Serial>/8000/manual_time                         24 us         24 us      31461
sort_8e3:apply_permutation<Kokkos_OpenMP>/8000/manual_time                        114 us        115 us       6493
sort_8e3:apply_permutation<Kokkos_Cuda>/8000/manual_time                           10 us         11 us      68817
sort_9e3:apply_permutation<Kokkos_Serial>/9000/manual_time                         25 us         26 us      25509
sort_9e3:apply_permutation<Kokkos_OpenMP>/9000/manual_time                         94 us         94 us       7402
sort_9e3:apply_permutation<Kokkos_Cuda>/9000/manual_time                           10 us         11 us      66405

masterleinad and others added 2 commits April 30, 2020 12:08
This avoids doing extra work (e.g., creating extra views) for the regular
use case (memory is accessible from the execution space), which affects
timings for small sizes.
int const n = view.extent(0);

auto view_mirror =
    Kokkos::create_mirror_view_and_copy(execution_space{}, view);
Collaborator

If we use execution space instances, it would make sense to at least reuse them where possible.

Contributor Author

Not sure what you mean.

int const n = view.extent(0);

auto begin_ptr = thrust::device_ptr<ValueType>(view.data());
auto end_ptr = thrust::device_ptr<ValueType>(view.data() + n);
Contributor

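// thrust::device_ptr's constructor from a raw pointer is explicit, so construct it by name.
thrust::device_ptr<value_type> begin(Kokkos::View<value_type *, memory_space> const &v) { return thrust::device_ptr<value_type>(v.data()); }
thrust::device_ptr<value_type> end(Kokkos::View<value_type *, memory_space> const &v) { return thrust::device_ptr<value_type>(v.data() + v.size()); }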

@masterleinad
Collaborator

Some results on Summit:

sort<Kokkos_Serial>/10/manual_time_median            12.3 us         13.0 us           10
sort<StdSort_Serial>/10/manual_time_median          0.600 us         1.15 us           10
sort<Kokkos_OpenMP>/10/manual_time_median             274 us          285 us           10
sort<PSS_OpenMP>/10/manual_time_median               40.2 us         51.1 us           10
sort<GnuParallel_OpenMP>/10/manual_time_median      0.898 us         1.45 us           10

sort<Kokkos_Serial>/100/manual_time_median            30.9 us         31.6 us           10
sort<StdSort_Serial>/100/manual_time_median           12.4 us         12.9 us           10
sort<Kokkos_OpenMP>/100/manual_time_median             265 us          277 us           10
sort<PSS_OpenMP>/100/manual_time_median               45.0 us         57.1 us           10
sort<GnuParallel_OpenMP>/100/manual_time_median       17.8 us         18.4 us           10

sort<Kokkos_Serial>/900/manual_time_median             205 us          206 us           10
sort<StdSort_Serial>/900/manual_time_median            191 us          192 us           10
sort<Kokkos_OpenMP>/900/manual_time_median             416 us          430 us           10
sort<PSS_OpenMP>/900/manual_time_median                136 us          150 us           10
sort<GnuParallel_OpenMP>/900/manual_time_median        237 us          238 us           10

sort<Kokkos_Serial>/1000/manual_time_median             233 us          234 us           10
sort<StdSort_Serial>/1000/manual_time_median            213 us          213 us           10
sort<Kokkos_OpenMP>/1000/manual_time_median             473 us          487 us           10
sort<PSS_OpenMP>/1000/manual_time_median                153 us          167 us           10
sort<GnuParallel_OpenMP>/1000/manual_time_median        316 us          330 us           10

sort<Kokkos_Serial>/1100/manual_time_median             257 us          258 us           10
sort<StdSort_Serial>/1100/manual_time_median            241 us          242 us           10
sort<Kokkos_OpenMP>/1100/manual_time_median             434 us          448 us           10
sort<PSS_OpenMP>/1100/manual_time_median                164 us          178 us           10
sort<GnuParallel_OpenMP>/1100/manual_time_median        381 us          396 us           10

sort<Kokkos_Serial>/1200/manual_time_median             276 us          277 us           10
sort<StdSort_Serial>/1200/manual_time_median            268 us          269 us           10
sort<Kokkos_OpenMP>/1200/manual_time_median             477 us          491 us           10
sort<PSS_OpenMP>/1200/manual_time_median                177 us          192 us           10
sort<GnuParallel_OpenMP>/1200/manual_time_median        336 us          351 us           10

sort<Kokkos_Serial>/1500/manual_time_median             349 us          350 us           10
sort<StdSort_Serial>/1500/manual_time_median            346 us          347 us           10
sort<Kokkos_OpenMP>/1500/manual_time_median             550 us          565 us           10
sort<PSS_OpenMP>/1500/manual_time_median                216 us          231 us           10
sort<GnuParallel_OpenMP>/1500/manual_time_median        417 us          433 us           10

sort<Kokkos_Serial>/2000/manual_time_median             460 us          461 us           10
sort<StdSort_Serial>/2000/manual_time_median            475 us          476 us           10
sort<Kokkos_OpenMP>/2000/manual_time_median             645 us          660 us           10
sort<PSS_OpenMP>/2000/manual_time_median                293 us          309 us           10
sort<GnuParallel_OpenMP>/2000/manual_time_median        436 us          452 us           10

sort<Kokkos_Serial>/4000/manual_time_median             932 us          933 us           10
sort<StdSort_Serial>/4000/manual_time_median           1040 us         1042 us           10
sort<Kokkos_OpenMP>/4000/manual_time_median             823 us          844 us           10
sort<PSS_OpenMP>/4000/manual_time_median                588 us          609 us           10
sort<GnuParallel_OpenMP>/4000/manual_time_median       1014 us         1035 us           10

sort<Kokkos_Serial>/8000/manual_time_median            1937 us         1939 us           10
sort<StdSort_Serial>/8000/manual_time_median           2225 us         2227 us           10
sort<Kokkos_OpenMP>/8000/manual_time_median            1393 us         1422 us           10
sort<PSS_OpenMP>/8000/manual_time_median               1217 us         1245 us           10
sort<GnuParallel_OpenMP>/8000/manual_time_median       1607 us         1636 us           10

sort<Kokkos_Serial>/16000/manual_time_median            3850 us         3852 us           10
sort<StdSort_Serial>/16000/manual_time_median           4791 us         4793 us           10
sort<Kokkos_OpenMP>/16000/manual_time_median            2687 us         2732 us           10
sort<PSS_OpenMP>/16000/manual_time_median               2567 us         2614 us           10
sort<GnuParallel_OpenMP>/16000/manual_time_median       2938 us         2985 us           10

sort<Kokkos_Serial>/32000/manual_time_median            9974 us        10100 us           10
sort<StdSort_Serial>/32000/manual_time_median          14778 us        14906 us           10
sort<Kokkos_OpenMP>/32000/manual_time_median            5469 us         5604 us           10
sort<PSS_OpenMP>/32000/manual_time_median               5390 us         5524 us           10
sort<GnuParallel_OpenMP>/32000/manual_time_median       6365 us         6498 us           10

sort<Kokkos_Serial>/64000/manual_time_median           19870 us        20085 us           10
sort<StdSort_Serial>/64000/manual_time_median          29788 us        29997 us           10
sort<Kokkos_OpenMP>/64000/manual_time_median            9440 us         9653 us           10
sort<PSS_OpenMP>/64000/manual_time_median              11397 us        11624 us           10
sort<GnuParallel_OpenMP>/64000/manual_time_median      13021 us        13256 us           10

sort<Kokkos_Serial>/128000/manual_time_median           34259 us        34662 us           10
sort<StdSort_Serial>/128000/manual_time_median          49748 us        50154 us           10
sort<Kokkos_OpenMP>/128000/manual_time_median           18938 us        19398 us           10
sort<PSS_OpenMP>/128000/manual_time_median              24335 us        24241 us           10
sort<GnuParallel_OpenMP>/128000/manual_time_median      29428 us        29906 us           10

So StdSort_Serial is faster than Kokkos_Serial up to 1500 values. Except for really small sizes (10 and 100, where GnuParallel_OpenMP wins), PSS_OpenMP is the fastest OpenMP variant up to around 32000 values; after that, Kokkos_OpenMP, i.e., BinSort, wins.
On my notebook, I see the first difference at 1000 values, and GnuParallel_OpenMP is always the fastest OpenMP variant and the fastest overall from around 1000 values on.
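
Assuming these are Google Benchmark results, the `manual_time_median` suffix indicates manually measured time aggregated by median over repetitions. The following is a minimal standard-library sketch of that measuring approach, not the benchmark's actual code. `median_sort_time_us` is a hypothetical helper and the random input is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Median of a non-empty list of samples (average of the two middle elements
// for an even count).
inline double median(std::vector<double> samples)
{
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  std::size_t const mid = samples.size() / 2;
  return samples.size() % 2 == 1 ? samples[mid]
                                 : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Hypothetical helper: sorts a fresh copy of the same random input
// `repetitions` times and returns the median wall-clock time in microseconds.
inline double median_sort_time_us(std::size_t n, int repetitions)
{
  assert(repetitions > 0);
  std::mt19937 gen(0); // fixed seed so every run sorts the same input
  std::uniform_real_distribution<float> dist(0.f, 1.f);
  std::vector<float> input(n);
  for (auto &x : input)
    x = dist(gen);

  std::vector<double> times;
  for (int r = 0; r < repetitions; ++r)
  {
    auto data = input; // copy outside the timed region
    auto const start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    auto const stop = std::chrono::steady_clock::now();
    times.push_back(
        std::chrono::duration<double, std::micro>(stop - start).count());
  }
  return median(times);
}
```

The median is less sensitive than the mean to a single slow repetition, which matters for runs as short as the ones at 10 or 100 values.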

@aprokop
Contributor Author

aprokop commented May 2, 2020

Results from the previous comment for easier reading (Time column only, in us, rounded):

Sort                    10    100    900   1000   1100   1200   1500   2000   4000   8000  16000  32000  64000  128000
Kokkos_Serial           12     31    205    233    257    276    349    460    932   1937   3850   9974  19870   34259
StdSort_Serial           1     12    191    213    241    268    346    475   1040   2225   4791  14778  29788   49748
Kokkos_OpenMP          274    265    416    473    434    477    550    645    823   1393   2687   5469   9440   18938
PSS_OpenMP              40     45    136    153    164    177    216    293    588   1217   2567   5390  11397   24335
GnuParallel_OpenMP       1     18    237    316    381    336    417    436   1014   1607   2938   6365  13021   29428
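
The crossover points can be read off mechanically from the rounded Time values in the table above. This is a small sketch; the helper name `first_size_where_faster` is hypothetical, and the arrays simply restate four of the table rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rounded Time values (us) copied from the table, one entry per problem size.
std::vector<int> const sizes = {10,   100,  900,  1000,  1100,  1200,  1500,
                                2000, 4000, 8000, 16000, 32000, 64000, 128000};
std::vector<int> const kokkos_serial = {12,  31,  205, 233,  257,  276,  349,
                                        460, 932, 1937, 3850, 9974, 19870, 34259};
std::vector<int> const stdsort_serial = {1,   12,   191,  213,  241,   268,   346,
                                         475, 1040, 2225, 4791, 14778, 29788, 49748};
std::vector<int> const kokkos_openmp = {274, 265, 416,  473,  434,  477,  550,
                                        645, 823, 1393, 2687, 5469, 9440, 18938};
std::vector<int> const pss_openmp = {40,  45,  136,  153,  164,  177,   216,
                                     293, 588, 1217, 2567, 5390, 11397, 24335};

// Returns the first problem size at which variant `a` is strictly faster than
// variant `b`, or 0 if that never happens.
inline int first_size_where_faster(std::vector<int> const &a,
                                   std::vector<int> const &b)
{
  assert(a.size() == sizes.size() && b.size() == sizes.size());
  for (std::size_t i = 0; i < sizes.size(); ++i)
    if (a[i] < b[i])
      return sizes[i];
  return 0;
}
```

For these numbers, `first_size_where_faster(kokkos_serial, stdsort_serial)` gives 2000 and `first_size_where_faster(kokkos_openmp, pss_openmp)` gives 64000.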

@masterleinad
Collaborator

Results on CADES Condo:

sort<Kokkos_Serial>/10/manual_time_median            5.65 us         5.79 us           10
sort<StdSort_Serial>/10/manual_time_median          0.136 us        0.250 us           10
sort<Kokkos_OpenMP>/10/manual_time_median             102 us          103 us           10
sort<PSS_OpenMP>/10/manual_time_median                200 us          200 us           10
sort<GnuParallel_OpenMP>/10/manual_time_median      0.125 us        0.225 us           10

sort<Kokkos_Serial>/100/manual_time_median            8.91 us         9.04 us           10
sort<StdSort_Serial>/100/manual_time_median          0.862 us        0.951 us           10
sort<Kokkos_OpenMP>/100/manual_time_median             162 us          161 us           10
sort<PSS_OpenMP>/100/manual_time_median               2208 us         2207 us           10
sort<GnuParallel_OpenMP>/100/manual_time_median      0.754 us        0.832 us           10

sort<Kokkos_Serial>/900/manual_time_median            47.5 us         47.8 us           10
sort<StdSort_Serial>/900/manual_time_median           22.6 us         22.9 us           10
sort<Kokkos_OpenMP>/900/manual_time_median             184 us          185 us           10
sort<PSS_OpenMP>/900/manual_time_median               2156 us         2154 us           10
sort<GnuParallel_OpenMP>/900/manual_time_median       19.4 us         19.6 us           10

sort<Kokkos_Serial>/1000/manual_time_median            30.7 us         30.9 us           10
sort<StdSort_Serial>/1000/manual_time_median           42.1 us         42.3 us           10
sort<Kokkos_OpenMP>/1000/manual_time_median             189 us          188 us           10
sort<PSS_OpenMP>/1000/manual_time_median               2332 us         2331 us           10
sort<GnuParallel_OpenMP>/1000/manual_time_median       88.4 us         89.3 us           10

sort<Kokkos_Serial>/1100/manual_time_median            36.2 us         36.4 us           10
sort<StdSort_Serial>/1100/manual_time_median           47.3 us         47.7 us           10
sort<Kokkos_OpenMP>/1100/manual_time_median             194 us          181 us           10
sort<PSS_OpenMP>/1100/manual_time_median               1640 us         1637 us           10
sort<GnuParallel_OpenMP>/1100/manual_time_median       57.0 us         57.0 us           10

sort<Kokkos_Serial>/1200/manual_time_median            34.9 us         35.1 us           10
sort<StdSort_Serial>/1200/manual_time_median           39.7 us         40.0 us           10
sort<Kokkos_OpenMP>/1200/manual_time_median             244 us          246 us           10
sort<PSS_OpenMP>/1200/manual_time_median               2271 us         2271 us           10
sort<GnuParallel_OpenMP>/1200/manual_time_median       90.4 us         91.4 us           10

sort<Kokkos_Serial>/1500/manual_time_median            44.1 us         44.1 us           10
sort<StdSort_Serial>/1500/manual_time_median           83.3 us         83.9 us           10
sort<Kokkos_OpenMP>/1500/manual_time_median             203 us          202 us           10
sort<PSS_OpenMP>/1500/manual_time_median               2361 us         2362 us           10
sort<GnuParallel_OpenMP>/1500/manual_time_median       75.2 us         75.6 us           10

sort<Kokkos_Serial>/2000/manual_time_median            53.8 us         54.2 us           10
sort<StdSort_Serial>/2000/manual_time_median           93.3 us         93.6 us           10
sort<Kokkos_OpenMP>/2000/manual_time_median             247 us          248 us           10
sort<PSS_OpenMP>/2000/manual_time_median               2178 us         2172 us           10
sort<GnuParallel_OpenMP>/2000/manual_time_median       81.7 us         82.6 us           10

sort<Kokkos_Serial>/4000/manual_time_median             141 us          142 us           10
sort<StdSort_Serial>/4000/manual_time_median            327 us          328 us           10
sort<Kokkos_OpenMP>/4000/manual_time_median          216663 us        49479 us           10
sort<PSS_OpenMP>/4000/manual_time_median              26278 us         9786 us           10
sort<GnuParallel_OpenMP>/4000/manual_time_median      74421 us        23616 us           10

sort<Kokkos_Serial>/8000/manual_time_median             536 us          539 us           10
sort<StdSort_Serial>/8000/manual_time_median            879 us          882 us           10
sort<Kokkos_OpenMP>/8000/manual_time_median          229902 us        75167 us           10
sort<PSS_OpenMP>/8000/manual_time_median              27045 us         9699 us           10
sort<GnuParallel_OpenMP>/8000/manual_time_median      70039 us        23930 us           10

sort<Kokkos_Serial>/16000/manual_time_median            1052 us         1053 us           10
sort<StdSort_Serial>/16000/manual_time_median           2059 us         2064 us           10
sort<Kokkos_OpenMP>/16000/manual_time_median          229119 us        70178 us           10
sort<PSS_OpenMP>/16000/manual_time_median              20491 us         8086 us           10
sort<GnuParallel_OpenMP>/16000/manual_time_median        317 us          327 us           10

sort<Kokkos_Serial>/32000/manual_time_median            1705 us         1740 us           10
sort<StdSort_Serial>/32000/manual_time_median           4174 us         9231 us           10
sort<Kokkos_OpenMP>/32000/manual_time_median            1158 us         1365 us           10
sort<PSS_OpenMP>/32000/manual_time_median               2112 us         2135 us           10
sort<GnuParallel_OpenMP>/32000/manual_time_median        251 us          265 us           10

sort<Kokkos_Serial>/64000/manual_time_median            7874 us        10454 us           10
sort<StdSort_Serial>/64000/manual_time_median          11196 us        13511 us           10
sort<Kokkos_OpenMP>/64000/manual_time_median             848 us          869 us           10
sort<PSS_OpenMP>/64000/manual_time_median               1561 us         1580 us           10
sort<GnuParallel_OpenMP>/64000/manual_time_median        375 us          389 us           10

sort<Kokkos_Serial>/128000/manual_time_median           17848 us        19064 us           10
sort<StdSort_Serial>/128000/manual_time_median          22936 us        24243 us           10
sort<Kokkos_OpenMP>/128000/manual_time_median            1831 us         1848 us           10
sort<PSS_OpenMP>/128000/manual_time_median               1838 us         1863 us           10
sort<GnuParallel_OpenMP>/128000/manual_time_median        692 us          718 us           10

I am not quite sure what happened for 4000, 8000, and 16000. Possibly some other process was running even though I requested one full node.

@masterleinad
Collaborator

These are the Summit CUDA results:

apply_permutation<Kokkos_Cuda>/10/manual_time_median                               9.09 us         9.14 us           10
apply_permutation<Kokkos_Cuda_Host>/10/manual_time_median                           410 us          405 us           10
sort_and_compute_permutation<Kokkos_Cuda>/10/manual_time_median                     271 us          300 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/10/manual_time_median               58.6 us         77.1 us           10
sort_and_compute_permutation<Thrust_Cuda>/10/manual_time_median                     108 us          136 us           10
sort<Kokkos_Cuda>/10/manual_time_median                                             280 us          300 us           10
sort<Kokkos_Cuda_Host>/10/manual_time_median                                       58.9 us         77.3 us           10
sort<Thrust_Cuda>/10/manual_time_median                                            46.6 us         66.1 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/10/manual_time_median               284 us          315 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/10/manual_time_median          520 us          549 us           10
apply_permutation<Kokkos_Cuda_Serial>/10/manual_time_median                        74.8 us         74.8 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/10/manual_time_median             58.6 us         77.3 us           10
sort<Kokkos_Cuda_Serial>/10/manual_time_median                                     58.7 us         77.2 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/10/manual_time_median        134 us          153 us           10

apply_permutation<Kokkos_Cuda>/100/manual_time_median                               9.06 us         9.10 us           10
apply_permutation<Kokkos_Cuda_Host>/100/manual_time_median                           409 us          408 us           10
sort_and_compute_permutation<Kokkos_Cuda>/100/manual_time_median                     274 us          302 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/100/manual_time_median               60.1 us         78.8 us           10
sort_and_compute_permutation<Thrust_Cuda>/100/manual_time_median                     108 us          136 us           10
sort<Kokkos_Cuda>/100/manual_time_median                                             285 us          305 us           10
sort<Kokkos_Cuda_Host>/100/manual_time_median                                       61.1 us         79.8 us           10
sort<Thrust_Cuda>/100/manual_time_median                                            46.3 us         65.7 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/100/manual_time_median               287 us          318 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/100/manual_time_median          530 us          560 us           10
apply_permutation<Kokkos_Cuda_Serial>/100/manual_time_median                        75.9 us         76.0 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/100/manual_time_median             60.8 us         79.7 us           10
sort<Kokkos_Cuda_Serial>/100/manual_time_median                                     61.0 us         79.7 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/100/manual_time_median        137 us          156 us           10

apply_permutation<Kokkos_Cuda>/900/manual_time_median                               9.33 us         9.37 us           10
apply_permutation<Kokkos_Cuda_Host>/900/manual_time_median                           414 us          410 us           10
sort_and_compute_permutation<Kokkos_Cuda>/900/manual_time_median                     283 us          312 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/900/manual_time_median                102 us          122 us           10
sort_and_compute_permutation<Thrust_Cuda>/900/manual_time_median                     109 us          137 us           10
sort<Kokkos_Cuda>/900/manual_time_median                                             289 us          309 us           10
sort<Kokkos_Cuda_Host>/900/manual_time_median                                        102 us          121 us           10
sort<Thrust_Cuda>/900/manual_time_median                                            46.5 us         65.7 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/900/manual_time_median               287 us          318 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/900/manual_time_median          644 us          654 us           10
apply_permutation<Kokkos_Cuda_Serial>/900/manual_time_median                        80.5 us         80.5 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/900/manual_time_median              101 us          120 us           10
sort<Kokkos_Cuda_Serial>/900/manual_time_median                                      102 us          120 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/900/manual_time_median        183 us          203 us           10

apply_permutation<Kokkos_Cuda>/1000/manual_time_median                               9.06 us         9.10 us           10
apply_permutation<Kokkos_Cuda_Host>/1000/manual_time_median                           409 us          408 us           10
sort_and_compute_permutation<Kokkos_Cuda>/1000/manual_time_median                     278 us          306 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/1000/manual_time_median                108 us          128 us           10
sort_and_compute_permutation<Thrust_Cuda>/1000/manual_time_median                     109 us          137 us           10
sort<Kokkos_Cuda>/1000/manual_time_median                                             287 us          307 us           10
sort<Kokkos_Cuda_Host>/1000/manual_time_median                                        109 us          128 us           10
sort<Thrust_Cuda>/1000/manual_time_median                                            46.7 us         66.2 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/1000/manual_time_median               289 us          320 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/1000/manual_time_median          631 us          663 us           10
apply_permutation<Kokkos_Cuda_Serial>/1000/manual_time_median                        81.4 us         81.5 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/1000/manual_time_median              108 us          128 us           10
sort<Kokkos_Cuda_Serial>/1000/manual_time_median                                      109 us          128 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/1000/manual_time_median        192 us          211 us           10

apply_permutation<Kokkos_Cuda>/1100/manual_time_median                               9.25 us         9.29 us           10
apply_permutation<Kokkos_Cuda_Host>/1100/manual_time_median                           410 us          409 us           10
sort_and_compute_permutation<Kokkos_Cuda>/1100/manual_time_median                     279 us          307 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/1100/manual_time_median                116 us          136 us           10
sort_and_compute_permutation<Thrust_Cuda>/1100/manual_time_median                     110 us          138 us           10
sort<Kokkos_Cuda>/1100/manual_time_median                                             288 us          308 us           10
sort<Kokkos_Cuda_Host>/1100/manual_time_median                                        117 us          136 us           10
sort<Thrust_Cuda>/1100/manual_time_median                                            47.0 us         66.4 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/1100/manual_time_median               290 us          322 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/1100/manual_time_median          647 us          679 us           10
apply_permutation<Kokkos_Cuda_Serial>/1100/manual_time_median                        82.1 us         82.1 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/1100/manual_time_median              116 us          136 us           10
sort<Kokkos_Cuda_Serial>/1100/manual_time_median                                      117 us          136 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/1100/manual_time_median        200 us          220 us           10

apply_permutation<Kokkos_Cuda>/1200/manual_time_median                               9.29 us         9.33 us           10
apply_permutation<Kokkos_Cuda_Host>/1200/manual_time_median                           410 us          408 us           10
sort_and_compute_permutation<Kokkos_Cuda>/1200/manual_time_median                     276 us          305 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/1200/manual_time_median                123 us          142 us           10
sort_and_compute_permutation<Thrust_Cuda>/1200/manual_time_median                     110 us          138 us           10
sort<Kokkos_Cuda>/1200/manual_time_median                                             284 us          304 us           10
sort<Kokkos_Cuda_Host>/1200/manual_time_median                                        123 us          142 us           10
sort<Thrust_Cuda>/1200/manual_time_median                                            47.5 us         66.9 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/1200/manual_time_median               287 us          319 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/1200/manual_time_median          660 us          686 us           10
apply_permutation<Kokkos_Cuda_Serial>/1200/manual_time_median                        82.2 us         82.3 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/1200/manual_time_median              122 us          142 us           10
sort<Kokkos_Cuda_Serial>/1200/manual_time_median                                      123 us          142 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/1200/manual_time_median        207 us          226 us           10

apply_permutation<Kokkos_Cuda>/1500/manual_time_median                               9.26 us         9.30 us           10
apply_permutation<Kokkos_Cuda_Host>/1500/manual_time_median                           410 us          409 us           10
sort_and_compute_permutation<Kokkos_Cuda>/1500/manual_time_median                     278 us          307 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/1500/manual_time_median                144 us          163 us           10
sort_and_compute_permutation<Thrust_Cuda>/1500/manual_time_median                     110 us          138 us           10
sort<Kokkos_Cuda>/1500/manual_time_median                                             287 us          307 us           10
sort<Kokkos_Cuda_Host>/1500/manual_time_median                                        145 us          164 us           10
sort<Thrust_Cuda>/1500/manual_time_median                                            47.5 us         67.0 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/1500/manual_time_median               290 us          322 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/1500/manual_time_median          703 us          731 us           10
apply_permutation<Kokkos_Cuda_Serial>/1500/manual_time_median                        84.3 us         84.3 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/1500/manual_time_median              144 us          164 us           10
sort<Kokkos_Cuda_Serial>/1500/manual_time_median                                      145 us          164 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/1500/manual_time_median        230 us          250 us           10

apply_permutation<Kokkos_Cuda>/2000/manual_time_median                               9.41 us         9.45 us           10
apply_permutation<Kokkos_Cuda_Host>/2000/manual_time_median                           414 us          412 us           10
sort_and_compute_permutation<Kokkos_Cuda>/2000/manual_time_median                     285 us          313 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/2000/manual_time_median                183 us          203 us           10
sort_and_compute_permutation<Thrust_Cuda>/2000/manual_time_median                     113 us          141 us           10
sort<Kokkos_Cuda>/2000/manual_time_median                                             294 us          315 us           10
sort<Kokkos_Cuda_Host>/2000/manual_time_median                                        184 us          203 us           10
sort<Thrust_Cuda>/2000/manual_time_median                                            48.1 us         67.6 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/2000/manual_time_median               295 us          326 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/2000/manual_time_median          796 us          820 us           10
apply_permutation<Kokkos_Cuda_Serial>/2000/manual_time_median                        87.9 us         88.0 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/2000/manual_time_median              183 us          203 us           10
sort<Kokkos_Cuda_Serial>/2000/manual_time_median                                      184 us          203 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/2000/manual_time_median        273 us          293 us           10

apply_permutation<Kokkos_Cuda>/4000/manual_time_median                               9.30 us         9.34 us           10
apply_permutation<Kokkos_Cuda_Host>/4000/manual_time_median                           424 us          420 us           10
sort_and_compute_permutation<Kokkos_Cuda>/4000/manual_time_median                     284 us          312 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/4000/manual_time_median                341 us          361 us           10
sort_and_compute_permutation<Thrust_Cuda>/4000/manual_time_median                     113 us          141 us           10
sort<Kokkos_Cuda>/4000/manual_time_median                                             290 us          310 us           10
sort<Kokkos_Cuda_Host>/4000/manual_time_median                                        340 us          360 us           10
sort<Thrust_Cuda>/4000/manual_time_median                                            48.3 us         67.8 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/4000/manual_time_median               292 us          324 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/4000/manual_time_median         1088 us         1115 us           10
apply_permutation<Kokkos_Cuda_Serial>/4000/manual_time_median                        98.4 us         98.4 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/4000/manual_time_median              339 us          359 us           10
sort<Kokkos_Cuda_Serial>/4000/manual_time_median                                      341 us          361 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/4000/manual_time_median        440 us          460 us           10

apply_permutation<Kokkos_Cuda>/8000/manual_time_median                               9.26 us         9.30 us           10
apply_permutation<Kokkos_Cuda_Host>/8000/manual_time_median                           436 us          432 us           10
sort_and_compute_permutation<Kokkos_Cuda>/8000/manual_time_median                     284 us          313 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/8000/manual_time_median                675 us          696 us           10
sort_and_compute_permutation<Thrust_Cuda>/8000/manual_time_median                     276 us          305 us           10
sort<Kokkos_Cuda>/8000/manual_time_median                                             295 us          316 us           10
sort<Kokkos_Cuda_Host>/8000/manual_time_median                                        679 us          699 us           10
sort<Thrust_Cuda>/8000/manual_time_median                                             266 us          286 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/8000/manual_time_median               299 us          331 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/8000/manual_time_median         1557 us         1581 us           10
apply_permutation<Kokkos_Cuda_Serial>/8000/manual_time_median                         120 us          120 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/8000/manual_time_median              674 us          694 us           10
sort<Kokkos_Cuda_Serial>/8000/manual_time_median                                      673 us          693 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/8000/manual_time_median        793 us          813 us           10

apply_permutation<Kokkos_Cuda>/16000/manual_time_median                               9.23 us         9.27 us           10
apply_permutation<Kokkos_Cuda_Host>/16000/manual_time_median                           467 us          462 us           10
sort_and_compute_permutation<Kokkos_Cuda>/16000/manual_time_median                     285 us          314 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/16000/manual_time_median               1413 us         1434 us           10
sort_and_compute_permutation<Thrust_Cuda>/16000/manual_time_median                     273 us          301 us           10
sort<Kokkos_Cuda>/16000/manual_time_median                                             286 us          305 us           10
sort<Kokkos_Cuda_Host>/16000/manual_time_median                                       1409 us         1429 us           10
sort<Thrust_Cuda>/16000/manual_time_median                                             264 us          283 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/16000/manual_time_median               296 us          328 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/16000/manual_time_median         2622 us         2645 us           10
apply_permutation<Kokkos_Cuda_Serial>/16000/manual_time_median                         164 us          164 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/16000/manual_time_median             1411 us         1432 us           10
sort<Kokkos_Cuda_Serial>/16000/manual_time_median                                     1416 us         1436 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/16000/manual_time_median       1578 us         1599 us           10

apply_permutation<Kokkos_Cuda>/32000/manual_time_median                               9.37 us         9.41 us           10
apply_permutation<Kokkos_Cuda_Host>/32000/manual_time_median                           512 us          506 us           10
sort_and_compute_permutation<Kokkos_Cuda>/32000/manual_time_median                     291 us          321 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/32000/manual_time_median               2905 us         2925 us           10
sort_and_compute_permutation<Thrust_Cuda>/32000/manual_time_median                     975 us         1011 us           10
sort<Kokkos_Cuda>/32000/manual_time_median                                             304 us          325 us           10
sort<Kokkos_Cuda_Host>/32000/manual_time_median                                       2907 us         2928 us           10
sort<Thrust_Cuda>/32000/manual_time_median                                             957 us          982 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/32000/manual_time_median               308 us          342 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/32000/manual_time_median         4201 us         4227 us           10
apply_permutation<Kokkos_Cuda_Serial>/32000/manual_time_median                         245 us          245 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/32000/manual_time_median             2896 us         2917 us           10
sort<Kokkos_Cuda_Serial>/32000/manual_time_median                                     2905 us         2926 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/32000/manual_time_median       3146 us         3167 us           10

apply_permutation<Kokkos_Cuda>/64000/manual_time_median                               9.12 us         9.16 us           10
apply_permutation<Kokkos_Cuda_Host>/64000/manual_time_median                           591 us          588 us           10
sort_and_compute_permutation<Kokkos_Cuda>/64000/manual_time_median                     691 us          997 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/64000/manual_time_median               6151 us         6172 us           10
sort_and_compute_permutation<Thrust_Cuda>/64000/manual_time_median                     963 us          998 us           10
sort<Kokkos_Cuda>/64000/manual_time_median                                             975 us         1000 us           10
sort<Kokkos_Cuda_Host>/64000/manual_time_median                                       6128 us         6149 us           10
sort<Thrust_Cuda>/64000/manual_time_median                                             952 us          977 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/64000/manual_time_median               704 us         1019 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/64000/manual_time_median         7539 us         7562 us           10
apply_permutation<Kokkos_Cuda_Serial>/64000/manual_time_median                         401 us          401 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/64000/manual_time_median             6127 us         6147 us           10
sort<Kokkos_Cuda_Serial>/64000/manual_time_median                                     6144 us         6164 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/64000/manual_time_median       6528 us         6548 us           10

apply_permutation<Kokkos_Cuda>/128000/manual_time_median                               9.20 us         9.24 us           10
apply_permutation<Kokkos_Cuda_Host>/128000/manual_time_median                           759 us          756 us           10
sort_and_compute_permutation<Kokkos_Cuda>/128000/manual_time_median                     317 us          347 us           10
sort_and_compute_permutation<Kokkos_Cuda_Host>/128000/manual_time_median              13097 us        13118 us           10
sort_and_compute_permutation<Thrust_Cuda>/128000/manual_time_median                     960 us          996 us           10
sort<Kokkos_Cuda>/128000/manual_time_median                                             320 us          341 us           10
sort<Kokkos_Cuda_Host>/128000/manual_time_median                                      13052 us        13070 us           10
sort<Thrust_Cuda>/128000/manual_time_median                                             264 us          284 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda>/128000/manual_time_median               972 us         1012 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Host>/128000/manual_time_median        14601 us        14633 us           10
apply_permutation<Kokkos_Cuda_Serial>/128000/manual_time_median                         714 us          714 us           10
sort_and_compute_permutation<Kokkos_Cuda_Serial>/128000/manual_time_median            13041 us        13060 us           10
sort<Kokkos_Cuda_Serial>/128000/manual_time_median                                    13089 us        13107 us           10
sort_compute_and_apply_permutation<Kokkos_Cuda_Serial>/128000/manual_time_median      13769 us        13788 us           10

This basically shows that, for sort_and_compute_permutation, copying to the host makes sense for fewer than 1000 values, Thrust is the best choice up to 16000 values, and BinSort is the best choice for larger sizes.
For sort, Thrust also seems to be the fastest.
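
As a rough illustration of that conclusion, a size-based dispatch could look like the sketch below. The names `SortBackend` and `chooseSortBackend` are hypothetical and not part of ArborX or Kokkos, and the cutoffs are simply read off the timings above (one machine, one data set), so they would likely need retuning elsewhere.

```cpp
#include <cstddef>

// Hypothetical enum: the three strategies compared in the benchmark output.
enum class SortBackend
{
  HostCopy, // copy to the host, sort there, copy back
  Thrust,   // sort on the device with Thrust
  BinSort   // sort on the device with Kokkos::BinSort
};

// Assumed cutoffs taken from the measurements in this comment:
//   n < 1000            -> host copy
//   1000 <= n <= 16000  -> Thrust
//   n > 16000           -> BinSort
constexpr SortBackend chooseSortBackend(std::size_t n)
{
  return n < 1000 ? SortBackend::HostCopy
                  : (n <= 16000 ? SortBackend::Thrust : SortBackend::BinSort);
}
```

Keeping the cutoffs in one small function would make it easy to rerun the benchmark on another architecture and adjust them in a single place.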

@aprokop aprokop marked this pull request as draft August 13, 2020 18:54
@aprokop
Contributor Author

aprokop commented Mar 12, 2021

Closing for now due to the lack of activity and interest. Will resurrect if need arises.
