
v14: latest libraries


Compared to the previous stable (v13.2) release:

General

vsmlrt.py

  • Plugin invocation order in the get_plugin_path() function is now sorted, which reduces memory consumption.
  • Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
  • Added support for SCUNet models for image denoising (see the sketch below).
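
A minimal usage sketch for the new model families, assuming the usual vsmlrt.py calling convention; the enum member names (RIFEModel.v4_7, SCUNetModel.scunet_color_real_psnr) are assumptions, so check the RIFEModel and SCUNetModel enums shipped with this release:

```python
# Sketch only: enum member names are assumptions, not confirmed API.
import vapoursynth as vs
from vsmlrt import RIFE, RIFEModel, SCUNet, SCUNetModel, Backend

core = vs.core
clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080, length=100)

# 2x frame interpolation with one of the newly supported RIFE v4.x models
interpolated = RIFE(clip, multi=2, model=RIFEModel.v4_7,
                    backend=Backend.TRT(fp16=True))

# image denoising with one of the new SCUNet models
denoised = SCUNet(clip, model=SCUNetModel.scunet_color_real_psnr,
                  backend=Backend.ORT_CUDA(fp16=True))
```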

TRT

plugin and runtime libraries

  • Upgraded to TensorRT 10.0.1.
  • Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
  • Reduced GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
  • Reduced engine build time.
  • Added long path support for engines on Windows.
  • cuDNN is no longer a strict runtime dependency.

vsmlrt.py

  • The cuDNN tactic is no longer enabled by default.
  • TF32 acceleration is disabled by default.
  • The maximum workspace now defaults to None, i.e. the total memory size of the GPU.
  • Added parameters builder_optimization_level, max_aux_streams, bf16 (#64), custom_env, custom_args, short_path and engine_folder (#90), as shown in the sketch after this list:
    • builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (TensorRT documentation)
    • max_aux_streams: within-inference multi-streaming; "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (TensorRT documentation)
    • bf16: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." (TensorRT documentation)
    • custom_env, custom_args: custom environment variable and arguments for trtexec engine build.
    • short_path: whether to shorten engine name.
      • On Windows, this could be useful in addressing the maximum path length limitation, and is enabled by default.
    • engine_folder: used to specify custom directory for engines.
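
A sketch of the new knobs on Backend.TRT; the parameter names come from the list above, while the chosen values and the surrounding DPIR call are illustrative only:

```python
# Sketch: parameter names are from this release; values are illustrative.
import vapoursynth as vs
from vsmlrt import DPIR, DPIRModel, Backend

core = vs.core
clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

backend = Backend.TRT(
    fp16=True,
    builder_optimization_level=3,  # longer tactic search, potentially faster engines
    max_aux_streams=None,          # within-inference multi-streaming (None assumes the library default)
    bf16=False,                    # bfloat16 compute, Ampere and later only
    custom_env={},                 # extra environment variables for the trtexec engine build
    custom_args=[],                # extra command-line arguments for trtexec
    short_path=None,               # shorten engine names (enabled by default on Windows)
    engine_folder=None,            # custom directory for engine files
)

denoised = DPIR(clip, strength=5.0, model=DPIRModel.drunet_color, backend=backend)
```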

known issues

  • According to the documentation, there is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2. This affects RIFE and SAFA models.

  • trtexec may report errors like:

    • [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
    • [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)

    This issue has been submitted to NVIDIA.

ORT

  • Upgraded to ONNX Runtime v1.18.0.

interface

  • The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag in these backends are as follows (see the sketch after this list):
    • Enabling fp16 uses a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
    • Disabling fp16 skips the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will still be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
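
A sketch of the first case, assuming a half-precision (RGBH) input clip; output_format is the backend option named above, and the Waifu2x call is just an example model:

```python
# Sketch: fp16 I/O with an ORT_* backend. fp16=True quantizes the fp32 onnx
# to fp16; with an RGBH input clip the generated onnx uses fp16 input, and
# output_format=1 requests fp16 output (0 = fp32).
import vapoursynth as vs
from vsmlrt import Waifu2x, Waifu2xModel, Backend

core = vs.core
clip = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)

backend = Backend.ORT_CUDA(fp16=True, output_format=1)
out = Waifu2x(clip, noise=-1, scale=2,
              model=Waifu2xModel.upconv_7_anime_style_art_rgb, backend=backend)
```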

CUDA

  • Reduced execution overhead.
  • Added support for TF32 acceleration. This is disabled by default.
  • Added experimental prefer_nhwc flag to reduce the number of layout transformations when using tensor cores. This is disabled by default.
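
These map onto Backend.ORT_CUDA flags; a brief sketch, where the parameter names tf32 and prefer_nhwc are assumed from the descriptions above and should be checked against the actual signature:

```python
# Sketch: opting in to the experimental ORT_CUDA flags; parameter names
# are assumed from the flag descriptions above.
from vsmlrt import Backend

backend = Backend.ORT_CUDA(
    fp16=True,
    tf32=False,        # TF32 acceleration, disabled by default
    prefer_nhwc=True,  # experimental: fewer layout transformations on tensor cores
)
```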

OV

  • Upgraded to OpenVINO 2024.2.0.
  • Added experimental OV_NPU backend for Intel NPUs.
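
A one-line sketch of selecting the new backend; the no-argument constructor is an assumption:

```python
# Sketch: the experimental Intel NPU backend added in this release.
from vsmlrt import Backend

backend = Backend.OV_NPU()  # experimental; requires an Intel NPU
```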

MIGX

  • Added support for MIGraphX backend for AMD GPUs. Currently this backend is Linux only.
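
A sketch of selecting the new backend; the fp16 flag is an assumption based on the fp16 I/O support noted in the community contributions below:

```python
# Sketch: the new MIGraphX backend for AMD GPUs (Linux only in this release);
# the fp16 flag is assumed per the vsmigx fp16 I/O change noted below.
from vsmlrt import Backend

backend = Backend.MIGX(fp16=True)
```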

Community contributions

  • scripts/vsmlrt.py: update esrgan janai models by @hooke007 in #53
  • scripts/vsmlrt.py: add more esrgan janai models by @hooke007 in #82
  • vsmigx: allow fp16 input & output by @abihf in #86
  • scripts/vsmlrt.py: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)

Benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

| model | 1 stream | 2 streams | 3 streams |
| --- | --- | --- | --- |
| dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
| waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |
| waifu2x swin_unet | 3.80 / 7304.332 | 4.06 / 14392.408 | 4.06 / 21276.380 |
| real-esrgan (v2/v3, xsx2) | 16.65 / 955.480 | 22.53 / 1645.904 | 22.49 / 2336.324 |
| scunet color | 4.20 / 2847.708 | 4.33 / 6646.884 | 4.33 / 9792.736 |

Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).


This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.

Full Changelog: v13.2...v14