GPU jpeg decoder: add batch support and hardware decoding #8496
Conversation
Summary: I'm adding GPU support to the existing torchvision.io.encode_jpeg function. If the input tensors are on the GPU, the CUDA version will be used; otherwise, the CPU version is used. Additionally, I'm adding a new function torchvision.io.encode_jpegs (plural) which uses a fused kernel and may be faster than successive calls to the singular version, each of which incurs kernel launch overhead. If it's alright, I'll be happy to refactor decode_jpeg to follow this convention in a follow-up PR. Test Plan: 1. pytest test -vvv 2. ufmt format torchvision 3. flake8 torchvision
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8496
Note: Links to docs will display an error until the docs builds have been completed. ❌ 13 New Failures, 6 Unrelated Failures as of commit efa746d with merge base 5242d6a. Some jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Flushing out some comments on the C++ code.
const torch::Tensor& data,
ImageReadMode mode,
C10_EXPORT std::vector<torch::Tensor> decode_jpegs_cuda(
    const std::vector<torch::Tensor>& encoded_images,
Why is this not a single torch::Tensor that's stacked as opposed to a std::vector of Tensors?
Because the encoded JPEGs are represented as variable-length byte streams (1-D tensors). They cannot be stacked into a batch because they don't have the same dimensions.
Request: add that as a comment in the code itself for future readers
if (cudaJpegDecoder == nullptr || device != cudaJpegDecoder->target_device) {
  if (cudaJpegDecoder != nullptr)
    delete cudaJpegDecoder.release();
You can do cudaJpegDecoder.reset() instead of manually deleting
if (cudaJpegDecoder != nullptr)
  delete cudaJpegDecoder.release();
cudaJpegDecoder = std::make_unique<CUDAJpegDecoder>(device);
std::atexit([]() { delete cudaJpegDecoder.release(); });
Use .reset() here as well instead of manually deleting?
CUDAJpegDecoder::~CUDAJpegDecoder() {
/*
The below code works on Mac and Linux, but fails on Windows.
Do you want to use a #ifdef _WIN32 here?
I've thought about it, but the C++ order of destruction for static objects is generally unspecified, so even if it passes on the specific Mac and Linux versions I've tested it on, it's still undefined behavior and may fail at any time.
const torch::Device target_device;
const c10::cuda::CUDAStream stream;
protected:
Out of curiosity, why protected and not private?
No real reason. Happy to change it to private.
nvjpegStatus_t status;
cudaError_t cudaStatus;
cudaStatus = cudaStreamSynchronize(stream);
Why is this needed?
That's how they do it here: https://github.com/NVIDIA/CUDALibrarySamples/blob/f17940ac4e705bf47a8c39f5365925c1665f6c98/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp#L36
After the buffers are allocated, they synchronize before starting the decoding.
}
}
cudaStatus = cudaStreamSynchronize(stream);
Why do you have the outer CUDAEvent in the caller of this function if you already wait for all ops to finish here?
Line 563 is needed because I need to do some pruning right after. The outer CUDAEvent is needed to synchronize the internal CUDAJpegDecoder::stream with the calling code's current stream before returning the results.
Looks good. I only have minor comments. Let's wait for Nicolas to review the python changes too.
nvjpegJpegStream_t jpeg_streams[2];
nvjpegDecodeParams_t nvjpeg_decode_params;
nvjpegJpegDecoder_t nvjpeg_decoder;
bool hw_decode_available{true};
It would be more defensive to set this to false by default and only set it to true at the end of the function at line 228 of the decode_jpegs_cuda.cpp file.
status = nvjpegDecodeJpegDevice(
    nvjpeg_handle,
    nvjpeg_decoder,
    nvjpeg_decoupled_state,
    &sw_output_buffer[i],
    stream);
TORCH_CHECK(
    status == NVJPEG_STATUS_SUCCESS,
    "Failed to decode jpeg stream: ",
    status);
}
I don't understand why this is needed if the decoding is done in software on the host.
Why is another decode on the GPU needed here?
(Add a comment in the code)
There are many different types of JPEGs; for our purposes, the most notable are baseline and progressive. The main difference between the two is that progressive JPEGs encapsulate multiple renderings of the same image at different resolutions. Baseline JPEGs can be decoded on the GPU right away, but progressive JPEGs need some preprocessing on the host before the GPU can process them. Added a comment.
std::vector<nvjpegImage_t> hw_output_buffer;
// other JPEG types such as progressive JPEGs can be decoded one-by-one in
// software slow :(
Add more details here about the software decode process since it appears from the code below that some work is done on the GPU even in this case (and 2 transfers are needed)?
Added to comment on line 400
const c10::cuda::CUDAStream stream;
private:
std::tuple<
Since the return type is a tuple, it's hard to tell what it returns (other than the type).
i.e. it's not obvious that the last element is the number of channels. Can you add a comment about the return type? EDIT: I noticed you do have a comment in the implementation. Maybe move that here?
Even more readable would be a struct with proper member names.
torch.uint8 and device cpu
- output_format (nvjpegOutputFormat_t): NVJPEG_OUTPUT_RGB, NVJPEG_OUTPUT_Y
or NVJPEG_OUTPUT_UNCHANGED
- device (torch::Device): The desired CUDA device for the returned Tensors
This is a stale comment since there is no device arg
// which is related to the subsampling used I'm not sure why this is the
// case, but for now we're just using RGB and later removing channels from
// grayscale images.
output_format = NVJPEG_OUTPUT_UNCHANGED;
Nit: would it be simpler to just use NVJPEG_OUTPUT_RGB here since you are assuming this expands the channels anyway?
Also add a TODO to investigate and fix this behavior of pruning
namespace image {
std::mutex decoderMutex;
std::unique_ptr<CUDAJpegDecoder> cudaJpegDecoder;
Nit: maybe use gCudaJpegDecoder to indicate it's a global variable?
We do not have a solution to this problem at the moment, so we'll
just leak the libnvjpeg & cuda variables for the time being and hope
that the CUDA runtime handles cleanup for us.
Please send a PR if you have a solution for this problem.
One request: maybe try the driver API to see if CUDA is available?
int dummy;
if (cuDeviceGetCount(&dummy) == CUDA_SUCCESS) {
...
}
If that doesn't work, don't bother with a unique_ptr for the global variable and just use a regular pointer.
Add a comment above the global variable saying this is not a unique_ptr because our destructor could race with CUDA's own destructors.
Yeah, I tried that and it didn't work :(
"[torchvision.io.encode_jpeg(img) for img in decoded_images_device_trunc]",
"torchvision.io.encode_jpeg(decoded_images_device_trunc)",
],
["unfused", "fused"],
I could be wrong, but batched seems like a better term than fused since it appears to be batching images, not fusing kernels necessarily.
If the images are batched it uses a fused kernel
Thanks a ton @deekay42
…8496) Summary: Co-authored-by: Nicolas Hug <[email protected]> Differential Revision: D60903713 fbshipit-source-id: f0f9908e2be6436372132a575c7ff066129a1f78
Over 8000 imgs/s on 1 A100 GPU