GPU jpeg decoder: add batch support and hardware decoding #8496
Conversation
Summary: I'm adding GPU support to the existing torchvision.io.encode_jpeg function. If the input tensors are on the GPU, the CUDA version will be used; otherwise, the CPU version is used. Additionally, I'm adding a new function torchvision.io.encode_jpegs (plural) which uses a fused kernel and may be faster than successive calls to the singular version, each of which incurs kernel launch overhead. If it's alright, I'll be happy to refactor decode_jpeg to follow this convention in a follow-up PR. Test Plan: 1. pytest test -vvv 2. ufmt format torchvision 3. flake8 torchvision
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8496
Note: Links to docs will display an error until the docs builds have been completed. ❌ 13 New Failures, 6 Unrelated Failures as of commit efa746d with merge base 5242d6a. Some jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Flushing out some comments on the C++ code.
const torch::Tensor& data,
ImageReadMode mode,
C10_EXPORT std::vector<torch::Tensor> decode_jpegs_cuda(
    const std::vector<torch::Tensor>& encoded_images,
Why is this not a single torch::Tensor that's stacked as opposed to a std::vector of Tensors?
Because the encoded JPEGs are represented as variable-length byte streams (1-D tensors). They cannot be stacked into a batch because they don't have the same dimensions.
Request: add that as a comment in the code itself for future readers
if (cudaJpegDecoder == nullptr || device != cudaJpegDecoder->target_device) {
  if (cudaJpegDecoder != nullptr)
    delete cudaJpegDecoder.release();
You can do cudaJpegDecoder.reset() instead of manually deleting
if (cudaJpegDecoder != nullptr)
  delete cudaJpegDecoder.release();
cudaJpegDecoder = std::make_unique<CUDAJpegDecoder>(device);
std::atexit([]() { delete cudaJpegDecoder.release(); });
Use .reset() here as well instead of manually deleting?
CUDAJpegDecoder::~CUDAJpegDecoder() {
/*
The below code works on Mac and Linux, but fails on Windows.
Do you want to use a #ifdef _WIN32 here?
I've thought about it, but the C++ order of destruction for static objects is generally unspecified, so even if it passes on the specific Mac and Linux versions I've tested it on, it's still undefined behavior and may fail at any time.
const torch::Device target_device;
const c10::cuda::CUDAStream stream;
protected:
Out of curiosity, why protected and not private?
No real reason. Happy to change it to private.
nvjpegStatus_t status;
cudaError_t cudaStatus;
cudaStatus = cudaStreamSynchronize(stream);
Why is this needed?
That's how they do it here: https://github.com/NVIDIA/CUDALibrarySamples/blob/f17940ac4e705bf47a8c39f5365925c1665f6c98/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp#L36
After the buffers are allocated, they synchronize before starting the decoding.
}
}
cudaStatus = cudaStreamSynchronize(stream);
Why do you have the outer CUDAEvent in the caller of this function if you already wait for all ops to finish here?
Line 563 is needed because I need to do some pruning right after. The outer CUDAEvent is needed to synchronize the internal CUDAJpegDecoder::stream with the calling code's current stream before returning the results.
Looks good. I only have minor comments. Let's wait for Nicolas to review the python changes too.
nvjpegJpegStream_t jpeg_streams[2];
nvjpegDecodeParams_t nvjpeg_decode_params;
nvjpegJpegDecoder_t nvjpeg_decoder;
bool hw_decode_available{true};
It would be more defensive to set this to false by default and only set it to true at the end of the function at line 228 of the decode_jpegs_cuda.cpp file.
status = nvjpegDecodeJpegDevice(
    nvjpeg_handle,
    nvjpeg_decoder,
    nvjpeg_decoupled_state,
    &sw_output_buffer[i],
    stream);
TORCH_CHECK(
    status == NVJPEG_STATUS_SUCCESS,
    "Failed to decode jpeg stream: ",
    status);
}
I don't understand why this is needed if the decoding is done in software on the host.
Why is another decode on the GPU needed here?
(Add a comment in the code)
There are many different types of JPEGs; for our purposes, the most notable are baseline and progressive. The main difference between the two is that progressive JPEGs encapsulate multiple renderings of the same image at different resolutions. Baseline JPEGs can be decoded on the GPU right away, but progressive JPEGs need some preprocessing on the host before the GPU can process them. Added a comment.
std::vector<nvjpegImage_t> hw_output_buffer;
// other JPEG types such as progressive JPEGs can be decoded one-by-one in
// software slow :(
Add more details here about the software decode process since it appears from the code below that some work is done on the GPU even in this case (and 2 transfers are needed)?
Added to comment on line 400
const c10::cuda::CUDAStream stream;
private:
std::tuple<
Since the return type is a tuple, it's hard to tell what it returns (other than the type).
i.e. it's not obvious that the last element is the number of channels. Can you add a comment about the return type? EDIT: I noticed you do have a comment in the implementation. Maybe move that here?
Even more readable would be a struct with proper member names.
torch.uint8 and device cpu
- output_format (nvjpegOutputFormat_t): NVJPEG_OUTPUT_RGB, NVJPEG_OUTPUT_Y
or NVJPEG_OUTPUT_UNCHANGED
- device (torch::Device): The desired CUDA device for the returned Tensors
This is a stale comment since there is no device arg
// which is related to the subsampling used I'm not sure why this is the
// case, but for now we're just using RGB and later removing channels from
// grayscale images.
output_format = NVJPEG_OUTPUT_UNCHANGED;
Nit: would it be simpler to just use NVJPEG_OUTPUT_RGB here since you are assuming this expands the channels anyway?
Also add a TODO to investigate and fix this behavior of pruning
namespace image {
std::mutex decoderMutex;
std::unique_ptr<CUDAJpegDecoder> cudaJpegDecoder;
Nit: maybe use gCudaJpegDecoder to indicate it's a global variable?
We do not have a solution to this problem at the moment, so we'll
just leak the libnvjpeg & cuda variables for the time being and hope
that the CUDA runtime handles cleanup for us.
Please send a PR if you have a solution for this problem.
One request: maybe try the driver API to see if CUDA is available?
int dummy;
if (cuDeviceGetCount(&dummy) == CUDA_SUCCESS) {
...
}
If that doesn't work, don't bother with a unique_ptr for the global variable and just use a regular pointer.
Add a comment above the global variable saying this is not a unique_ptr because our destructor could race with CUDA's own destructors.
Yeah, I tried that and it didn't work :(
"[torchvision.io.encode_jpeg(img) for img in decoded_images_device_trunc]",
"torchvision.io.encode_jpeg(decoded_images_device_trunc)",
],
["unfused", "fused"],
I could be wrong, but batched seems like a better term than fused since it appears to be batching images, not fusing kernels necessarily.
If the images are batched it uses a fused kernel
Thanks a ton @deekay42
…8496) Summary: Co-authored-by: Nicolas Hug <[email protected]> Differential Revision: D60903713 fbshipit-source-id: f0f9908e2be6436372132a575c7ff066129a1f78
Over 8000 imgs/s on 1 A100 GPU