Add NonsilentRegion GPU, implemented in terms of the CPU version #3874
Conversation
Signed-off-by: Joaquin Anton <[email protected]>
At this moment, the 'gpu' backend of this operator is implemented in terms of the 'cpu'
implementation. This results in a device-to-host copy of the inputs and a host-to-device copy of the
outputs. While using the 'gpu' implementation of this operator doesn't add any performance
benefit on its own, using it might make sense in order to enable moving previous operations to the GPU.
Suggested change:
- benefit on its own, using it might make sense in order to enable moving previous operations to the GPU.
+ benefit on its own, using it might make sense in order to enable moving preceding operations in the pipeline to the GPU.
Done
template <typename CPUOperator>
class FalseGPUOperator : public Operator<GPUBackend> {
 public:
  FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(3, 0, false) {
Suggested change:
- FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(3, 0, false) {
+ FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(PIPELINE_NUMBER_OF_THREADS, REAL_DEVICE_ID, true) {
Any particular reason for `set_affine=true`?
This pins the CPU threads to the cores that are directly connected to the GPU. See https://docs.nvidia.com/deploy/nvml-api/group__nvmlAffinity.html.
bool SetupImpl(std::vector<OutputDesc> &output_desc,
               const workspace_t<GPUBackend> &ws) override {
  if (cpu_ws_.NumInput() == 0 && cpu_ws_.NumOutput() == 0) {
I think you need to set the CPU input and output buffers to be pinned; otherwise we may jam the GPU execution flow even more.
Done
} else {
  // Some GPU operators might accept CPU inputs (e.g. Slice)
  auto& cpu_input = ws.Input<CPUBackend>(input_idx);
  cpu_inputs_[input_idx]->Copy(cpu_input);
Can we share instead of copy?
Done
if (ws.InputIsType<GPUBackend>(0)) {
  auto& gpu_input = ws.Input<GPUBackend>(input_idx);
  cpu_inputs_[input_idx]->Copy(gpu_input, AccessOrder(ws.stream()));
Maybe we should copy inside RunImpl?
Done. I had to move Setup and Run to RunImpl.
Signed-off-by: Joaquin Anton <[email protected]>
Force-pushed from a646457 to 246d287.
auto ret = cpu_impl_.Setup(output_desc_, cpu_ws_);

for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
  auto &desc = output_desc_[output_idx];
  cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
}
Suggested change:
- auto ret = cpu_impl_.Setup(output_desc_, cpu_ws_);
- for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
-   auto &desc = output_desc_[output_idx];
-   cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
- }
+ if (cpu_impl_.Setup(output_desc_, cpu_ws_)) {
+   assert(output_desc_.size() == cpu_ws_.NumOutput());
+   for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
+     auto &desc = output_desc_[output_idx];
+     cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
+   }
+ }
Force-pushed from 890427f to fe1fd3d.
for (int output_idx = 0; output_idx < ws.NumOutput(); output_idx++) {
  const auto& cpu_output = cpu_ws_.Output<CPUBackend>(output_idx);
  ws.Output<GPUBackend>(output_idx).Copy(cpu_output, AccessOrder(ws.stream()));
Better not to create an AccessOrder in a loop, unless you know the device id:
AccessOrder(stream, device) is cheap;
AccessOrder(stream) is much more expensive (it needs to look up the device id by stream).
Signed-off-by: Joaquin Anton <[email protected]>
Force-pushed from fe1fd3d to fba84c0.
explicit FalseGPUOperator(const OpSpec &spec)
    : Operator<GPUBackend>(spec),
      cpu_impl_(spec),
      thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true /** set_affine */ ) {
Nitpick: I'm not sure if we need this argument name.
Suggested change:
- thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true /** set_affine */ ) {
+ thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true) {
 protected:
  bool CanInferOutputs() const override {
    // To run Setup we need to first copy from device to host.
Do we need to copy all the data, or just the CPU arguments?
I have mixed feelings about moving the whole setup to Run, but it is up to you.
To run Setup we need to have a valid HostWorkspace
Force-pushed from 9a67630 to de0edfc.
!build
CI MESSAGE: [4788313]: BUILD STARTED
CI MESSAGE: [4788313]: BUILD FAILED
Force-pushed from de0edfc to afcfbb0.
!build
CI MESSAGE: [4788843]: BUILD STARTED
CI MESSAGE: [4788843]: BUILD FAILED
Signed-off-by: Joaquin Anton <[email protected]>
Force-pushed from afcfbb0 to 5a36d12.
!build
CI MESSAGE: [4790314]: BUILD STARTED
CI MESSAGE: [4790314]: BUILD PASSED
…DIA#3874) Signed-off-by: Joaquin Anton <[email protected]>
Signed-off-by: Joaquin Anton <[email protected]>
Category:
New feature
Description:
Adds a NonsilentRegion GPU operator, implemented by wrapping the CPU implementation.
Effectively, it runs the CPU operator under the hood and copies the memory from device to host and vice versa.
On its own this doesn't add any performance benefit, but it enables moving preceding operators in the pipeline to the GPU (e.g. resampling).
Also added a FalseGPUOperator template that can be used to apply the same pattern to other ops.
Additional information:
Affected modules and functionalities:
NonsilentRegion
Key points relevant for the review:
All
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: NSILENT.02
In fact it doesn't really implement the requirement, as it doesn't use the GPU for processing, but it is closely related to that requirement.
JIRA TASK: DALI-2777