Sort pr #199
Conversation
merge 22.03
for more information, see https://pre-commit.ci
cunumeric/deferred.py (Outdated)
@@ -32,6 +32,7 @@
     UnaryRedCode,
 )
 from .linalg.cholesky import cholesky
+from .sorting import sorting
A minor quibble: why don't we just name things `sort` everywhere?
There is no particular reason behind this; I will change it.
cunumeric/sorting.py (Outdated)
swapped_copy.copy(swapped, deep=True)

# run sort on last axis
sort_result = output.runtime.create_empty_thunk(
Why is this thunk necessary if `swapped_copy` is already a copy we can mutate? Can we do the sorting in place using `swapped_copy`?
I tried to keep the logic simple here. The underlying code does not support input == output at this time. I could change this, but it will still not always be an option (with argsort the input is of a different type than the output).
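For illustration only (a standalone sketch, not the cunumeric task code): a plain sort can reuse its input buffer because input and output share a type, while an argsort must write `int64` indices into a separately allocated output.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

int main()
{
  std::vector<double> values{3.0, 1.0, 2.0};

  // Argsort: the result is an int64 index buffer, a different type than the
  // double input, so a separate output allocation is unavoidable.
  std::vector<int64_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), int64_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&](int64_t a, int64_t b) { return values[a] < values[b]; });

  // Plain sort: input and output have the same type, so in place is possible.
  std::stable_sort(values.begin(), values.end());
  return 0;
}
```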
The code skips the copy whenever possible.
cunumeric/sorting.py (Outdated)
if output.ndim > 1:
    task.add_broadcast(input.base, input.ndim - 1)
elif output.runtime.num_gpus > 0:
I'm asking this again, but why do we use NCCL when there's only one GPU?
I changed this.
Added a couple of minor comments. I'll make another pass tomorrow.
src/cunumeric/sort/sort_omp.cc (Outdated)
{
  if (argptr == nullptr) {
    // sort (in place)
#pragma omp parallel for
I believe your GPU code can be repurposed here, as Thrust can use OpenMP as a device. It'd be interesting to check if that performs any better than this code. I suggest we do that once we wrap up the upcoming release.
I removed the manual pragmas and added the OpenMP execution policy to the Thrust call for now. This might not be optimal in all scenarios, but it keeps things simple until we decide to focus on it.
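For reference, a minimal sketch of dispatching Thrust's stable sort to the OpenMP backend (the function name and flat-buffer layout are illustrative, not the actual task code):

```cpp
#include <cstddef>

#include <thrust/sort.h>
#include <thrust/system/omp/execution_policy.h>

// Hypothetical helper: sort a flat buffer of n values on the host using
// Thrust's OpenMP backend instead of hand-written parallel-for pragmas.
void sort_values_omp(double* data, std::size_t n)
{
  thrust::stable_sort(thrust::omp::par, data, data + n);
}
```

Building this typically requires compiling with OpenMP enabled (e.g. `-fopenmp`); the same `thrust::stable_sort` call can then serve both the GPU and OMP variants, differing only in the execution policy.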
for more information, see https://pre-commit.ci
Added limited sort support. All communication has to be done as part of task preparation.
1-D (or flattened) sort is only supported for non-distributed data and will be broadcast.
N-D data will swap the sort axis to the last dimension, ensure C order, and broadcast the last dimension in order to sort within a single process (sketched below).
Sort is performed by:
* std::stable_sort (CPU)
* thrust::stable_sort (OMP)
* thrust::stable_sort (GPU, complex data)
* cub::DeviceRadixSort / cub::DeviceSegmentedRadixSort (GPU, all primitive types)

Merged with the NCCL branch for distributed 1-D data on GPU.
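To make the N-D path above concrete, here is a hypothetical standalone sketch (not the actual task body): once the sort axis has been swapped to the last dimension and the data is in C order, the problem reduces to independently sorting `volume / segment_size` contiguous segments.

```cpp
#include <algorithm>
#include <cstddef>

// Sort along the last axis of a C-ordered buffer: `segment_size` is the
// extent of the sort axis, `volume` the total number of elements.
void sort_last_axis(double* data, std::size_t volume, std::size_t segment_size)
{
  for (std::size_t start = 0; start < volume; start += segment_size) {
    std::stable_sort(data + start, data + start + segment_size);
  }
}
```

The OMP and GPU backends listed above apply `thrust::stable_sort` and `cub::DeviceSegmentedRadixSort` to the same segment layout instead of this serial loop.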