Add sort benchmark #278
Force-pushed from 441f39f to cca7303.
So far, I cannot reproduce results from here, at least on my workstation.
Do you see anything similar to my observations in #222?
I think so: for smaller sizes, Serial outperforms every Cuda variant, for both sort and permutation. That was on random data, though.
This is a modified version of the original PSS (https://github.com/wjakob/pss) taken from https://github.com/SNLComputation/omega_h/tree/144a1c16b5f4c8dd66e55e8d43e9c37af0b5890b/tpl/pss
Looks like there is some guard missing:
benchmarks/sort/sort_benchmark.cpp
Outdated
Kokkos::BinSort<ViewType, CompType, typename ViewType::device_type,
                SizeType>
    bin_sort(view, CompType(n / 2, result.min_val, result.max_val), true);
Can you remind me why we set the maximum number of bins to n/2 here?
I don't recall, maybe @dalg24 remembers. Maybe this is one of the things we need to look at.
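To make the question concrete: with n/2 bins and keys spread evenly over [min_val, max_val], each bin holds about two keys on average. The plain C++ sketch below illustrates that with a simplified linear bin mapping. `LinearBinOp` and `bin_counts` are hypothetical names, and the mapping and clamping are assumptions for illustration, not the Kokkos implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a 1D bin operator (NOT the Kokkos implementation):
// maps a key in [min_val, max_val] linearly onto bin indices 0..max_bins-1.
struct LinearBinOp {
  int max_bins;
  double min_val;
  double scale;  // bins per unit of key range

  LinearBinOp(int max_bins_, double min_, double max_)
      : max_bins(max_bins_),
        min_val(min_),
        scale(max_ > min_ ? max_bins_ / (max_ - min_) : 0.0) {
    assert(max_bins_ > 0);
  }

  int bin(double key) const {
    int const b = static_cast<int>(scale * (key - min_val));
    // Clamp so that key == max_val lands in the last bin (an assumption).
    return std::min(std::max(b, 0), max_bins - 1);
  }
};

// Count how many keys fall into each of max_bins bins.
std::vector<int> bin_counts(std::vector<double> const &keys, int max_bins) {
  assert(!keys.empty());
  auto const mm = std::minmax_element(keys.begin(), keys.end());
  LinearBinOp const op(max_bins, *mm.first, *mm.second);
  std::vector<int> counts(max_bins, 0);
  for (double const k : keys) ++counts[op.bin(k)];
  return counts;
}
```

Calling `bin_counts(keys, keys.size() / 2)` on evenly spaced keys gives two keys per bin; skewed data would concentrate keys in a few bins, which is one reason the bin count may be worth revisiting.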
benchmarks/sort/sort_benchmark.cpp
Outdated
int const n = in.extent(0);
Kokkos::parallel_for(
    Kokkos::RangePolicy<ExecutionSpace>(0, n),
    KOKKOS_LAMBDA(int const i) { out(permute(i)) = in(i); });
It might also be interesting to see what happens when the read access is non-consecutive and the write access is consecutive.
Yes, I was thinking the same thing. Maybe introduce a reverse permutation option.
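The two access patterns discussed here can be sketched in plain serial C++ (a stand-in for the Kokkos kernel, not the benchmark code). The scatter form matches the snippet under review; the gather form reads through the inverse permutation and writes consecutively. All three function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter: consecutive reads, permuted writes (same pattern as the reviewed kernel).
template <typename T>
std::vector<T> apply_scatter(std::vector<int> const &permute,
                             std::vector<T> const &in) {
  assert(permute.size() == in.size());
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) out[permute[i]] = in[i];
  return out;
}

// Build the inverse permutation: inverse[permute[i]] == i.
std::vector<int> invert_permutation(std::vector<int> const &permute) {
  std::vector<int> inverse(permute.size());
  for (std::size_t i = 0; i < permute.size(); ++i)
    inverse[permute[i]] = static_cast<int>(i);
  return inverse;
}

// Gather: permuted reads, consecutive writes.
template <typename T>
std::vector<T> apply_gather(std::vector<int> const &inverse,
                            std::vector<T> const &in) {
  assert(inverse.size() == in.size());
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[inverse[i]];
  return out;
}
```

Both forms produce the same output when the gather is given `invert_permutation(permute)`, so a reverse permutation option would change only the memory access pattern being timed, not the result.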
I have a new branch
So you obtained some timings already? Did you save your numbers somewhere?
That's some of the numbers I am seeing
This avoids doing extra work (e.g., creating extra views) in the regular use case, where the memory is accessible from the execution space. That extra work affects timings for small sizes.
benchmarks/sort/sort_benchmark.cpp
Outdated
int const n = view.extent(0);

auto view_mirror =
    Kokkos::create_mirror_view_and_copy(execution_space{}, view);
If we use execution space instances, it would make sense to at least reuse them where possible.
Not sure what you mean.
int const n = view.extent(0);

auto begin_ptr = thrust::device_ptr<ValueType>(view.data());
auto end_ptr = thrust::device_ptr<ValueType>(view.data() + n);
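// Assumes value_type and memory_space are declared in the enclosing scope.
thrust::device_ptr<value_type> begin(Kokkos::View<value_type *, memory_space> const &v) { return thrust::device_ptr<value_type>(v.data()); }
thrust::device_ptr<value_type> end(Kokkos::View<value_type *, memory_space> const &v) { return thrust::device_ptr<value_type>(v.data() + v.size()); }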
Some results on
So StdSort is faster up to 1500 values. Except for really small values, PSS_OpenMP is the fastest OpenMP variant up to around 32000 values; after that
Results from the previous comment for easier reading (just Time column):
Results on
I am not quite sure what happened for 4000, 8000, and 16000. Possibly some other process was running despite requesting one full node.
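One common way to reduce the effect of such interference is to repeat each measurement and keep the best time. The sketch below does this for `std::sort` on random data; it is a minimal illustration, not the harness used in this PR, and `time_std_sort`, the repeat count, and the seed are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper: best-of-`repeats` wall time, in seconds, for
// std::sort on n uniformly random floats.
double time_std_sort(int n, int repeats = 5, unsigned seed = 0) {
  assert(n >= 0 && repeats > 0);
  std::mt19937 gen(seed);
  std::uniform_real_distribution<float> dist(0.f, 1.f);
  std::vector<float> data(n);
  for (auto &x : data) x = dist(gen);

  double best = std::numeric_limits<double>::max();
  for (int r = 0; r < repeats; ++r) {
    std::vector<float> work = data;  // sort a fresh unsorted copy each time
    auto const start = std::chrono::steady_clock::now();
    std::sort(work.begin(), work.end());
    auto const stop = std::chrono::steady_clock::now();
    assert(std::is_sorted(work.begin(), work.end()));
    best = std::min(best,
                    std::chrono::duration<double>(stop - start).count());
  }
  return best;
}
```

Taking the minimum over repeats filters out one-off slowdowns from other processes, at the cost of hiding run-to-run variance, so reporting both the minimum and the median may be preferable for the sizes that look suspicious.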
These are the
Basically showing for
Closing for now due to the lack of activity and interest. Will resurrect if need arises.
No description provided.