
Add NonsilentRegion GPU, implemented in terms of the CPU version #3874

Merged

Conversation

@jantonguirao (Contributor) commented May 6, 2022

Signed-off-by: Joaquin Anton [email protected]

Category:

New feature

Description:

Adds a NonsilentRegion GPU operator, implemented by wrapping the CPU implementation.
Effectively, it runs the CPU operator under the hood, copying the inputs from device to host and the outputs from host to device.
On its own this adds no performance benefit, but it makes it possible to move earlier operators in the pipeline (e.g. resampling) to the GPU.
Also adds a FalseGPUOperator template that can be used to apply the same pattern to other operators.
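The wrapper pattern described above can be sketched in plain C++. This is an illustrative mock, not DALI's actual API: `NegateCpuOp`, `FalseGpuOp`, and the `Tensor` alias are invented names, and the device-to-host / host-to-device copies are simulated with ordinary vector copies.

```cpp
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for a device/host buffer

// A CPU-only operator; here it simply negates each element.
struct NegateCpuOp {
  Tensor Run(const Tensor &in) {
    Tensor out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
      out[i] = -in[i];
    return out;
  }
};

// "False GPU" wrapper: exposes a GPU-facing interface but runs the CPU
// implementation under the hood. In the real operator the two assignments
// below would be a device-to-host copy of the input and a host-to-device
// copy of the output.
template <typename CpuOp>
class FalseGpuOp {
 public:
  Tensor Run(const Tensor &device_in) {
    Tensor host_in = device_in;             // simulated D2H copy
    Tensor host_out = cpu_impl_.Run(host_in);
    Tensor device_out = host_out;           // simulated H2D copy
    return device_out;
  }

 private:
  CpuOp cpu_impl_;                          // the wrapped CPU operator
};
```

The point of the pattern is not speed: the wrapped operator gains nothing by itself, but earlier pipeline stages can now run on the GPU without forcing a backend switch mid-pipeline.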

Additional information:

Affected modules and functionalities:

NonsilentRegion

Key points relevant for the review:

All

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: NSILENT.02
Strictly speaking, this doesn't implement the requirement, since it doesn't use the GPU for processing, but it is closely related to it.

JIRA TASK: DALI-2777

At this moment, the 'gpu' backend of this operator is implemented in terms of the 'cpu'
implementation. This results in a device-to-host copy of the inputs and a host-to-device copy of the
outputs. While using the 'gpu' implementation of this operator doesn't add any performance
benefit on its own, using it might make sense in order to enable moving previous operations to the GPU.
Reviewer (Contributor):

Suggested change:
- benefit on its own, using it might make sense in order to enable moving previous operations to the GPU.
+ benefit on its own, using it might make sense in order to enable moving preceding operations in the pipeline to the GPU.

Author (@jantonguirao):

Done

template <typename CPUOperator>
class FalseGPUOperator : public Operator<GPUBackend> {
 public:
  FalseGPUOperator(const OpSpec &spec)
      : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(3, 0, false) {
Reviewer (Contributor):

Suggested change:
- FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(3, 0, false) {
+ FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(PIPELINE_NUMBER_OF_THREADS, REAL_DEVICE_ID, true) {

Author (@jantonguirao):

Any particular reason for set_affine=true?

Reviewer (Contributor):

This pins the CPU threads to the cores that are directly connected to the GPU. See https://docs.nvidia.com/deploy/nvml-api/group__nvmlAffinity.html.


bool SetupImpl(std::vector<OutputDesc> &output_desc,
               const workspace_t<GPUBackend> &ws) override {
  if (cpu_ws_.NumInput() == 0 && cpu_ws_.NumOutput() == 0) {
Reviewer (Contributor):

I think you need to make the CPU input and output buffers pinned; otherwise we may jam the GPU execution flow even more.

Author (@jantonguirao):

Done

} else {
  // Some GPU operators might accept CPU inputs (e.g. Slice)
  auto &cpu_input = ws.Input<CPUBackend>(input_idx);
  cpu_inputs_[input_idx]->Copy(cpu_input);
Reviewer (Contributor):

Can we share instead of copy?

Author (@jantonguirao):

Done
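On the share-vs-copy point above: with reference-counted buffers, sharing hands out another handle to the same underlying storage, while a copy allocates and duplicates the data. A minimal sketch, using `std::shared_ptr<std::vector<float>>` as a hypothetical stand-in for DALI's buffer handles:

```cpp
#include <memory>
#include <vector>

using Buffer = std::vector<float>;

// Deep copy: new allocation, data duplicated element by element.
std::shared_ptr<Buffer> CopyBuffer(const std::shared_ptr<Buffer> &src) {
  return std::make_shared<Buffer>(*src);
}

// Share: both handles refer to the same storage; no allocation, no copy.
std::shared_ptr<Buffer> ShareBuffer(const std::shared_ptr<Buffer> &src) {
  return src;
}
```

Sharing is free, but writes through one handle are visible through the other, so it is only safe when the wrapped operator treats the input as read-only.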

Comment on lines 70 to 72:

if (ws.InputIsType<GPUBackend>(0)) {
  auto &gpu_input = ws.Input<GPUBackend>(input_idx);
  cpu_inputs_[input_idx]->Copy(gpu_input, AccessOrder(ws.stream()));
Reviewer (Contributor):

Maybe we should copy inside RunImpl?

Author (@jantonguirao):

Done. I had to move Setup and Run to RunImpl.

Signed-off-by: Joaquin Anton <[email protected]>
@jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from a646457 to 246d287 on May 9, 2022 09:08
Comment on lines 98 to 103:

auto ret = cpu_impl_.Setup(output_desc_, cpu_ws_);

for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
  auto &desc = output_desc_[output_idx];
  cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
}
Reviewer (Contributor):

Suggested change:
- auto ret = cpu_impl_.Setup(output_desc_, cpu_ws_);
- for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
-   auto &desc = output_desc_[output_idx];
-   cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
- }
+ if (cpu_impl_.Setup(output_desc_, cpu_ws_)) {
+   assert(output_desc_.size() == cpu_ws_.NumOutput());
+   for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
+     auto &desc = output_desc_[output_idx];
+     cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
+   }
+ }

@jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from 890427f to fe1fd3d on May 9, 2022 13:08

for (int output_idx = 0; output_idx < ws.NumOutput(); output_idx++) {
  const auto &cpu_output = cpu_ws_.Output<CPUBackend>(output_idx);
  ws.Output<GPUBackend>(output_idx).Copy(cpu_output, AccessOrder(ws.stream()));
Reviewer (Contributor):

Better not to create an AccessOrder in a loop, unless you know the device id.
AccessOrder(stream, device) is cheap; AccessOrder(stream) is much more expensive (it needs to look up the device id by stream).
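The cost difference can be made concrete with a mock. `Order` below is an invented stand-in for `AccessOrder`: its stream-only constructor performs a counted "device lookup", while the stream-plus-device form does not, so hoisting the construction out of the output loop runs the lookup once instead of once per output.

```cpp
// Mock of an object whose one-argument constructor must do an expensive
// lookup, while the two-argument form is cheap (mirrors the stream-only
// vs stream+device distinction described above).
struct Order {
  static int lookups;  // counts the expensive device-id lookups
  explicit Order(int stream) : stream_(stream), device_(LookupDevice(stream)) {}
  Order(int stream, int device) : stream_(stream), device_(device) {}
  static int LookupDevice(int /*stream*/) { ++lookups; return 0; }
  int stream_, device_;
};
int Order::lookups = 0;

// Construct the order once outside the loop and reuse it for every output.
int RunOutputsHoisted(int num_outputs, int stream) {
  Order order(stream);  // one lookup, shared by all iterations
  int copies = 0;
  for (int i = 0; i < num_outputs; ++i) {
    (void)order;        // stands in for Output(i).Copy(..., order)
    ++copies;
  }
  return copies;
}
```

Writing `Order order(stream)` inside the loop body instead would perform one lookup per output.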

Signed-off-by: Joaquin Anton <[email protected]>
@jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from fe1fd3d to fba84c0 on May 9, 2022 13:11
explicit FalseGPUOperator(const OpSpec &spec)
    : Operator<GPUBackend>(spec),
      cpu_impl_(spec),
      thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true /** set_affine */ ) {
Reviewer (Contributor):

Nitpick: I'm not sure we need this argument-name comment.

Suggested change:
- thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true /** set_affine */ ) {
+ thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true) {


 protected:
  bool CanInferOutputs() const override {
    // To run Setup we need to first copy from device to host.
Reviewer (Contributor):

Do we need to copy all the data, or just the CPU arguments?
I have mixed feelings about moving the whole setup to Run, but it's up to you.

Author (@jantonguirao):

To run Setup we need to have a valid HostWorkspace.

@jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from 9a67630 to de0edfc on May 9, 2022 14:20
@jantonguirao (Author):

!build

@dali-automaton (Collaborator):

CI MESSAGE: [4788313]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [4788313]: BUILD FAILED

@jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from de0edfc to afcfbb0 on May 9, 2022 15:53
@jantonguirao (Author):

!build

@dali-automaton (Collaborator):

CI MESSAGE: [4788843]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [4788843]: BUILD FAILED

@jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from afcfbb0 to 5a36d12 on May 9, 2022 17:58
@jantonguirao (Author):

!build

@dali-automaton (Collaborator):

CI MESSAGE: [4790314]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [4790314]: BUILD PASSED

@jantonguirao merged commit a14ab64 into NVIDIA:main on May 10, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
@JanuszL JanuszL mentioned this pull request Jan 11, 2023