From 6a0ac22e62c0c6b71e5cdaf4c23350c8bf5f5f78 Mon Sep 17 00:00:00 2001 From: Misha Chornyi Date: Wed, 6 Apr 2022 08:48:25 -0700 Subject: [PATCH] Update README and versions for 22.04 branch --- Dockerfile.sdk | 2 +- README.md | 340 +----------------- TRITON_VERSION | 2 +- build.py | 12 +- deploy/aws/values.yaml | 2 +- deploy/fleetcommand/Chart.yaml | 2 +- deploy/fleetcommand/values.yaml | 6 +- deploy/gcp/values.yaml | 2 +- .../perf-analyzer-script/triton_client.yaml | 2 +- .../server-deployer/build_and_push.sh | 6 +- .../server-deployer/chart/triton/Chart.yaml | 4 +- .../server-deployer/chart/triton/values.yaml | 6 +- .../server-deployer/data-test/schema.yaml | 4 +- .../server-deployer/schema.yaml | 2 +- docs/build.md | 8 +- docs/compose.md | 16 +- docs/custom_operations.md | 6 +- docs/test.md | 2 +- qa/common/gen_qa_custom_ops | 2 +- qa/common/gen_qa_model_repository | 2 +- 20 files changed, 46 insertions(+), 382 deletions(-) diff --git a/Dockerfile.sdk b/Dockerfile.sdk index 7798fa9c90..46ab826fb4 100644 --- a/Dockerfile.sdk +++ b/Dockerfile.sdk @@ -29,7 +29,7 @@ # # Base image on the minimum Triton container -ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:22.03-py3-min +ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:22.04-py3-min ARG TRITON_CLIENT_REPO_SUBDIR=clientrepo ARG TRITON_COMMON_REPO_TAG=main diff --git a/README.md b/README.md index 12154f485b..71e39b9577 100644 --- a/README.md +++ b/README.md @@ -30,341 +30,5 @@ # Triton Inference Server -**LATEST RELEASE: You are currently on the main branch which tracks -under-development progress towards the next release.** - -Triton Inference Server provides a cloud and edge inferencing solution -optimized for both CPUs and GPUs. Triton supports an HTTP/REST and -GRPC protocol that allows remote clients to request inferencing for -any model being managed by the server. For edge deployments, Triton is -available as a shared library with a C API that allows the full -functionality of Triton to be included directly in an -application. - -The current release of the Triton Inference Server is 2.20.0 and -corresponds to the 22.03 release of the tritonserver container on -[NVIDIA GPU Cloud (NGC)](https://ngc.nvidia.com). The branch for this -release is -[r22.03](https://github.com/triton-inference-server/server/tree/r22.03). - -## Features - -* [Deep learning - frameworks](https://github.com/triton-inference-server/backend). - Triton supports TensorRT, TensorFlow GraphDef, TensorFlow - SavedModel, ONNX, PyTorch TorchScript and OpenVINO model - formats. Both TensorFlow 1.x and TensorFlow 2.x are - supported. Triton also supports TensorFlow-TensorRT, ONNX-TensorRT - and PyTorch-TensorRT integrated models. - -* [Machine learning - frameworks](https://github.com/triton-inference-server/fil_backend). - Triton supports popular machine learning frameworks such as XGBoost, - LightGBM, Scikit-Learn and cuML using the [RAPIDS Forest Inference - Library](https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35). - -* [Concurrent model - execution](docs/architecture.md#concurrent-model-execution). Triton - can simultaneously run multiple models (or multiple instances of the - same model) using the same or different deep-learning and - machine-learning frameworks. - -* [Dynamic batching](docs/architecture.md#models-and-schedulers). 
For - models that support batching, Triton implements multiple scheduling - and batching algorithms that combine individual inference requests - together to improve inference throughput. These scheduling and - batching decisions are transparent to the client requesting - inference. - -* [Extensible - backends](https://github.com/triton-inference-server/backend). In - addition to deep-learning frameworks, Triton provides a *backend - API* that allows Triton to be extended with any model execution - logic implemented in - [Python](https://github.com/triton-inference-server/python_backend) - or - [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api), - while still benefiting from full CPU and GPU support, concurrent - execution, dynamic batching and other features provided by Triton. - -* Model pipelines using - [Ensembling](docs/architecture.md#ensemble-models) or [Business - Logic Scripting - (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting). - A Triton *ensemble* represents a pipeline of one or more models and - the connection of input and output tensors between those - models. *BLS* allows a pipeline along with extra business logic to - be represented in Python. In both cases a single inference request - will trigger the execution of the entire pipeline. - -* [HTTP/REST and GRPC inference - protocols](docs/inference_protocols.md) based on the community - developed [KServe - protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2). - -* A [C API](docs/inference_protocols.md#c-api) allows Triton to be - linked directly into your application for edge and other in-process - use cases. - -* [Metrics](docs/metrics.md) indicating GPU utilization, server - throughput, and server latency. The metrics are provided in - Prometheus data format. - -## Documentation - -**The master branch documentation tracks the upcoming, -under-development release and so may not be accurate for the current -release of Triton. See the [r22.03 -documentation](https://github.com/triton-inference-server/server/tree/r22.03#documentation) -for the current release.** - -[Triton Architecture](docs/architecture.md) gives a high-level -overview of the structure and capabilities of the inference -server. There is also an [FAQ](docs/faq.md). Additional documentation -is divided into [*user*](#user-documentation) and -[*developer*](#developer-documentation) sections. The *user* -documentation describes how to use Triton as an inference solution, -including information on how to configure Triton, how to organize and -configure your models, how to use the C++ and Python clients, etc. The -*developer* documentation describes how to build and test Triton and -also how Triton can be extended with new functionality. - -The Triton [Release -Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html) -and [Support -Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html) -indicate the required versions of the NVIDIA Driver and CUDA, and also -describe supported GPUs. 
- -### User Documentation - -* [QuickStart](docs/quickstart.md) - * [Install Triton](docs/quickstart.md#install-triton-docker-image) - * [Create Model Repository](docs/quickstart.md#create-a-model-repository) - * [Run Triton](docs/quickstart.md#run-triton) -* [Model Repository](docs/model_repository.md) - * [Cloud Storage](docs/model_repository.md#model-repository-locations) - * [File Organization](docs/model_repository.md#model-files) - * [Model Versioning](docs/model_repository.md#model-versions) -* [Model Configuration](docs/model_configuration.md) - * [Required Model Configuration](docs/model_configuration.md#minimal-model-configuration) - * [Maximum Batch Size - Batching and Non-Batching Models](docs/model_configuration.md#maximum-batch-size) - * [Input and Output Tensors](docs/model_configuration.md#inputs-and-outputs) - * [Tensor Datatypes](docs/model_configuration.md#datatypes) - * [Tensor Reshape](docs/model_configuration.md#reshape) - * [Shape Tensor](docs/model_configuration.md#shape-tensors) - * [Auto-Generate Required Model Configuration](docs/model_configuration.md#auto-generated-model-configuration) - * [Version Policy](docs/model_configuration.md#version-policy) - * [Instance Groups](docs/model_configuration.md#instance-groups) - * [Specifying Multiple Model Instances](docs/model_configuration.md#multiple-model-instances) - * [CPU and GPU Instances](docs/model_configuration.md#cpu-model-instance) - * [Configuring Rate Limiter](docs/model_configuration.md#rate-limiter-configuration) - * [Optimization Settings](docs/model_configuration.md#optimization_policy) - * [Framework-Specific Optimization](docs/optimization.md#framework-specific-optimization) - * [ONNX-TensorRT](docs/optimization.md#onnx-with-tensorrt-optimization) - * [ONNX-OpenVINO](docs/optimization.md#onnx-with-openvino-optimization) - * [TensorFlow-TensorRT](docs/optimization.md#tensorflow-with-tensorrt-optimization) - * [TensorFlow-Mixed-Precision](docs/optimization.md#tensorflow-automatic-fp16-optimization) - * [NUMA Optimization](docs/optimization.md#numa-optimization) - * [Scheduling and Batching](docs/model_configuration.md#scheduling-and-batching) - * [Default Scheduler - Non-Batching](docs/model_configuration.md#default-scheduler) - * [Dynamic Batcher](docs/model_configuration.md#dynamic-batcher) - * [How to Configure Dynamic Batcher](docs/model_configuration.md#recommended-configuration-process) - * [Delayed Batching](docs/model_configuration.md#delayed-batching) - * [Preferred Batch Size](docs/model_configuration.md#preferred-batch-sizes) - * [Preserving Request Ordering](docs/model_configuration.md#preserve-ordering) - * [Priority Levels](docs/model_configuration.md#priority-levels) - * [Queuing Policies](docs/model_configuration.md#queue-policy) - * [Ragged Batching](docs/ragged_batching.md) - * [Sequence Batcher](docs/model_configuration.md#sequence-batcher) - * [Stateful Models](docs/architecture.md#stateful-models) - * [Control Inputs](docs/architecture.md#control-inputs) - * [Implicit State - Stateful Inference Using a Stateless Model](docs/architecture.md#implicit-state-management) - * [Sequence Scheduling Strategies](docs/architecture.md#scheduling-strateties) - * [Direct](docs/architecture.md#direct) - * [Oldest](docs/architecture.md#oldest) - * [Rate Limiter](docs/rate_limiter.md) - * [Model Warmup](docs/model_configuration.md#model-warmup) - * [Inference Request/Response Cache](docs/model_configuration.md#response-cache) -* Model Pipeline - * [Model 
Ensemble](docs/architecture.md#ensemble-models) - * [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) -* [Model Management](docs/model_management.md) - * [Explicit Model Loading and Unloading](docs/model_management.md#model-control-mode-explicit) - * [Modifying the Model Repository](docs/model_management.md#modifying-the-model-repository) -* [Metrics](docs/metrics.md) -* [Framework Custom Operations](docs/custom_operations.md) - * [TensorRT](docs/custom_operations.md#tensorrt) - * [TensorFlow](docs/custom_operations.md#tensorflow) - * [PyTorch](docs/custom_operations.md#pytorch) - * [ONNX](docs/custom_operations.md#onnx) -* [Client Libraries and Examples](https://github.com/triton-inference-server/client) - * [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) - * [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) - * [Java HTTP Library](https://github.com/triton-inference-server/client/tree/main/src/java) - * GRPC Generated Libraries - * [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go) - * [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - * [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) -* [Performance Analysis](docs/optimization.md) - * [Model Analyzer](docs/model_analyzer.md) - * [Performance Analyzer](docs/perf_analyzer.md) - * [Inference Request Tracing](docs/trace.md) -* [Jetson and JetPack](docs/jetson.md) - -The [quickstart](docs/quickstart.md) walks you through all the steps -required to install and run Triton with an example image -classification model and then use an example client application to -perform inferencing using that model. The quickstart also demonstrates -how [Triton supports both GPU systems and CPU-only -systems](docs/quickstart.md#run-triton). - -The first step in using Triton to serve your models is to place one or -more models into a [model -repository](docs/model_repository.md). Optionally, depending on the type -of the model and on what Triton capabilities you want to enable for -the model, you may need to create a [model -configuration](docs/model_configuration.md) for the model. If your -model has [custom operations](docs/custom_operations.md) you will need -to make sure they are loaded correctly by Triton. - -After you have your model(s) available in Triton, you will want to -send inference and other requests to Triton from your *client* -application. The [Python and C++ client -libraries](https://github.com/triton-inference-server/client) provide -APIs to simplify this communication. There are also a large number of -[client examples](https://github.com/triton-inference-server/client) -that demonstrate how to use the libraries. You can also send -HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based -protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or -[generate a GRPC client for many other -languages](https://github.com/triton-inference-server/client). - -Understanding and [optimizing performance](docs/optimization.md) is an -important part of deploying your models. The Triton project provides -the [Performance Analyzer](docs/perf_analyzer.md) and the [Model -Analyzer](docs/model_analyzer.md) to help your optimization -efforts. 
Specifically, you will want to optimize [scheduling and -batching](docs/architecture.md#models-and-schedulers) and [model -instances](docs/model_configuration.md#instance-groups) appropriately -for each model. You can also enable cross-model prioritization using -the [rate limiter](docs/rate_limiter.md) which manages the rate at -which requests are scheduled on model instances. You may also want to -consider combining multiple models and pre/post-processing into a -pipeline using [ensembling](docs/architecture.md#ensemble-models) or -[Business Logic Scripting -(BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting). A -[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize -and monitor aggregate inference metrics. - -NVIDIA publishes a number of [deep learning -examples](https://github.com/NVIDIA/DeepLearningExamples) that use -Triton. - -As part of your deployment strategy you may want to [explicitly manage -what models are available by loading and unloading -models](docs/model_management.md) from a running Triton server. If you -are using Kubernetes for deployment there are simple examples of how -to deploy Triton using Kubernetes and Helm: -[GCP](deploy/gcp/README.md), [AWS](deploy/aws/README.md), and [NVIDIA -FleetCommand](deploy/fleetcommand/README.md) - -The [version 1 to version 2 migration -information](docs/v1_to_v2.md) is helpful if you are moving to -version 2 of Triton from previously using version 1. - -### Developer Documentation - -* [Build](docs/build.md) -* [Protocols and APIs](docs/inference_protocols.md). -* [Backends](https://github.com/triton-inference-server/backend) -* [Repository Agents](docs/repository_agents.md) -* [Test](docs/test.md) - -Triton can be [built using -Docker](docs/build.md#building-triton-with-docker) or [built without -Docker](docs/build.md#building-triton-without-docker). After building -you should [test Triton](docs/test.md). - -It is also possible to [create a Docker image containing a customized -Triton](docs/compose.md) that contains only a subset of the backends. - -The Triton project also provides [client libraries for Python and -C++](https://github.com/triton-inference-server/client) that make it -easy to communicate with the server. There are also a large number of -[example clients](https://github.com/triton-inference-server/client) -that demonstrate how to use the libraries. You can also develop your -own clients that directly communicate with Triton using [HTTP/REST or -GRPC protocols](docs/inference_protocols.md). There is also a [C -API](docs/inference_protocols.md) that allows Triton to be linked -directly into your application. - -A [Triton backend](https://github.com/triton-inference-server/backend) -is the implementation that executes a model. A backend can interface -with a deep learning framework, like PyTorch, TensorFlow, TensorRT or -ONNX Runtime; or it can interface with a data processing framework -like [DALI](https://github.com/triton-inference-server/dali_backend); -or you can extend Triton by [writing your own -backend](https://github.com/triton-inference-server/backend) in either -[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api) -or -[Python](https://github.com/triton-inference-server/python_backend). - -A [Triton repository agent](docs/repository_agents.md) extends Triton -with new functionality that operates when a model is loaded or -unloaded. 
You can introduce your own code to perform authentication, -decryption, conversion, or similar operations when a model is loaded. - -## Papers and Presentation - -* [Maximizing Deep Learning Inference Performance with NVIDIA Model - Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/). - -* [High-Performance Inferencing at Scale Using the TensorRT Inference - Server](https://developer.nvidia.com/gtc/2020/video/s22418). - -* [Accelerate and Autoscale Deep Learning Inference on GPUs with - KFServing](https://developer.nvidia.com/gtc/2020/video/s22459). - -* [Deep into Triton Inference Server: BERT Practical Deployment on - NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736). - -* [Maximizing Utilization for Data Center Inference with TensorRT - Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server). - -* [NVIDIA TensorRT Inference Server Boosts Deep Learning - Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/). - -* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT - Inference Server and - Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/). - -* [Deploying NVIDIA Triton at Scale with MIG and Kubernetes](https://developer.nvidia.com/blog/deploying-nvidia-triton-at-scale-with-mig-and-kubernetes/). - -## Contributing - -Contributions to Triton Inference Server are more than welcome. To -contribute make a pull request and follow the guidelines outlined in -[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client, -example or similar contribution that is not modifying the core of -Triton, then you should file a PR in the [contrib -repo](https://github.com/triton-inference-server/contrib). - -## Reporting problems, asking questions - -We appreciate any feedback, questions or bug reporting regarding this -project. When help with code is needed, follow the process outlined in -the Stack Overflow () -document. Ensure posted examples are: - -* minimal – use as little code as possible that still produces the - same problem - -* complete – provide all parts needed to reproduce the problem. Check - if you can strip external dependency and still show the problem. The - less time we spend on reproducing problems the more time we have to - fix it - -* verifiable – test the code you're about to provide to make sure it - reproduces the problem. Remove all other problems that are not - related to your request/question. +**NOTE: You are currently on the r22.04 branch which tracks stabilization +towards the next release. This branch is not usable during stabilization.** diff --git a/TRITON_VERSION b/TRITON_VERSION index 0c1e43b106..db65e2167e 100644 --- a/TRITON_VERSION +++ b/TRITON_VERSION @@ -1 +1 @@ -2.21.0dev +2.21.0 diff --git a/build.py b/build.py index 7794f0d9a0..1c82b41af8 100755 --- a/build.py +++ b/build.py @@ -86,9 +86,9 @@ # Note: Not all sha ids would successfuly compile and work. # TRITON_VERSION_MAP = { - '2.21.0dev': ( - '22.04dev', # triton container - '22.03', # upstream container + '2.21.0': ( + '22.04', # triton container + '22.04', # upstream container '1.10.0', # ORT '2021.4.582', # ORT OpenVINO (('2021.4', None), ('2021.4', '2021.4.582'), @@ -1732,7 +1732,7 @@ def get_tagged_backend(be, version): action='append', required=False, help= - 'Include specified backend in build as [:]. 
If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.03 -> branch r22.03); otherwise the default is "main" (e.g. version 22.03dev -> branch main).' + 'Include specified backend in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.04 -> branch r22.04); otherwise the default is "main" (e.g. version 22.04dev -> branch main).' ) parser.add_argument( '--build-multiple-openvino', @@ -1746,14 +1746,14 @@ def get_tagged_backend(be, version): action='append', required=False, help= - 'The version of a component to use in the build as :. can be "common", "core", "backend" or "thirdparty". If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.03 -> branch r22.03); otherwise the default is "main" (e.g. version 22.03dev -> branch main).' + 'The version of a component to use in the build as :. can be "common", "core", "backend" or "thirdparty". If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.04 -> branch r22.04); otherwise the default is "main" (e.g. version 22.04dev -> branch main).' ) parser.add_argument( '--repoagent', action='append', required=False, help= - 'Include specified repo agent in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.03 -> branch r22.03); otherwise the default is "main" (e.g. version 22.03dev -> branch main).' + 'Include specified repo agent in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.04 -> branch r22.04); otherwise the default is "main" (e.g. version 22.04dev -> branch main).' 
) parser.add_argument( '--no-force-clone', diff --git a/deploy/aws/values.yaml b/deploy/aws/values.yaml index 43aa25d404..330c5bce80 100644 --- a/deploy/aws/values.yaml +++ b/deploy/aws/values.yaml @@ -27,7 +27,7 @@ replicaCount: 1 image: - imageName: nvcr.io/nvidia/tritonserver:22.03-py3 + imageName: nvcr.io/nvidia/tritonserver:22.04-py3 pullPolicy: IfNotPresent modelRepositoryPath: s3://triton-inference-server-repository/model_repository numGpus: 1 diff --git a/deploy/fleetcommand/Chart.yaml b/deploy/fleetcommand/Chart.yaml index a117e7d67c..119a416249 100644 --- a/deploy/fleetcommand/Chart.yaml +++ b/deploy/fleetcommand/Chart.yaml @@ -26,7 +26,7 @@ apiVersion: v1 # appVersion is the Triton version; update when changing release -appVersion: "2.20.0" +appVersion: "2.21.0" description: Triton Inference Server (Fleet Command) name: triton-inference-server # version is the Chart version; update when changing anything in the chart (semver) diff --git a/deploy/fleetcommand/values.yaml b/deploy/fleetcommand/values.yaml index 358c43e2ae..a57b7afc80 100644 --- a/deploy/fleetcommand/values.yaml +++ b/deploy/fleetcommand/values.yaml @@ -27,7 +27,7 @@ replicaCount: 1 image: - imageName: nvcr.io/nvidia/tritonserver:22.03-py3 + imageName: nvcr.io/nvidia/tritonserver:22.04-py3 pullPolicy: IfNotPresent numGpus: 1 serverCommand: tritonserver @@ -46,13 +46,13 @@ image: # Model Control Mode (Optional, default: none) # # To set model control mode, uncomment and configure below - # See https://github.com/triton-inference-server/server/blob/r22.03/docs/model_management.md + # See https://github.com/triton-inference-server/server/blob/r22.04/docs/model_management.md # for more details #- --model-control-mode=explicit|poll|none # # Additional server args # - # see https://github.com/triton-inference-server/server/blob/r22.03/README.md + # see https://github.com/triton-inference-server/server/blob/r22.04/README.md # for more details service: diff --git a/deploy/gcp/values.yaml b/deploy/gcp/values.yaml index 8e431bf6b3..0344c6df02 100644 --- a/deploy/gcp/values.yaml +++ b/deploy/gcp/values.yaml @@ -27,7 +27,7 @@ replicaCount: 1 image: - imageName: nvcr.io/nvidia/tritonserver:22.02-py3 + imageName: nvcr.io/nvidia/tritonserver:22.04-py3 pullPolicy: IfNotPresent modelRepositoryPath: gs://triton-inference-server-repository/model_repository numGpus: 1 diff --git a/deploy/gke-marketplace-app/benchmark/perf-analyzer-script/triton_client.yaml b/deploy/gke-marketplace-app/benchmark/perf-analyzer-script/triton_client.yaml index e7c82bcd71..a408940cff 100644 --- a/deploy/gke-marketplace-app/benchmark/perf-analyzer-script/triton_client.yaml +++ b/deploy/gke-marketplace-app/benchmark/perf-analyzer-script/triton_client.yaml @@ -33,7 +33,7 @@ metadata: namespace: default spec: containers: - - image: nvcr.io/nvidia/tritonserver:22.03-py3-sdk + - image: nvcr.io/nvidia/tritonserver:22.04-py3-sdk imagePullPolicy: Always name: nv-triton-client securityContext: diff --git a/deploy/gke-marketplace-app/server-deployer/build_and_push.sh b/deploy/gke-marketplace-app/server-deployer/build_and_push.sh index 4ada4c3923..5513bfa285 100644 --- a/deploy/gke-marketplace-app/server-deployer/build_and_push.sh +++ b/deploy/gke-marketplace-app/server-deployer/build_and_push.sh @@ -26,9 +26,9 @@ export REGISTRY=gcr.io/$(gcloud config get-value project | tr ':' '/') export APP_NAME=tritonserver -export MAJOR_VERSION=2.20 -export MINOR_VERSION=2.20.0 -export NGC_VERSION=22.03-py3 +export MAJOR_VERSION=2.21 +export MINOR_VERSION=2.21.0 +export 
NGC_VERSION=22.04-py3 docker pull nvcr.io/nvidia/$APP_NAME:$NGC_VERSION diff --git a/deploy/gke-marketplace-app/server-deployer/chart/triton/Chart.yaml b/deploy/gke-marketplace-app/server-deployer/chart/triton/Chart.yaml index 3dec895fb1..bb1a8cadad 100644 --- a/deploy/gke-marketplace-app/server-deployer/chart/triton/Chart.yaml +++ b/deploy/gke-marketplace-app/server-deployer/chart/triton/Chart.yaml @@ -25,7 +25,7 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. apiVersion: v1 -appVersion: "2.20" +appVersion: "2.21" description: Triton Inference Server name: triton-inference-server -version: 2.20.0 +version: 2.21.0 diff --git a/deploy/gke-marketplace-app/server-deployer/chart/triton/values.yaml b/deploy/gke-marketplace-app/server-deployer/chart/triton/values.yaml index a2df20fa98..cae411237b 100644 --- a/deploy/gke-marketplace-app/server-deployer/chart/triton/values.yaml +++ b/deploy/gke-marketplace-app/server-deployer/chart/triton/values.yaml @@ -31,14 +31,14 @@ maxReplicaCount: 3 tritonProtocol: HTTP # HPA GPU utilization autoscaling target HPATargetAverageValue: 85 -modelRepositoryPath: gs://triton_sample_models/22_03 -publishedVersion: '2.20.0' +modelRepositoryPath: gs://triton_sample_models/22_04 +publishedVersion: '2.21.0' gcpMarketplace: true image: registry: gcr.io repository: nvidia-ngc-public/tritonserver - tag: 22.03-py3 + tag: 22.04-py3 pullPolicy: IfNotPresent # modify the model repository here to match your GCP storage bucket numGpus: 1 diff --git a/deploy/gke-marketplace-app/server-deployer/data-test/schema.yaml b/deploy/gke-marketplace-app/server-deployer/data-test/schema.yaml index 67736e3f52..2d64b90307 100644 --- a/deploy/gke-marketplace-app/server-deployer/data-test/schema.yaml +++ b/deploy/gke-marketplace-app/server-deployer/data-test/schema.yaml @@ -27,7 +27,7 @@ x-google-marketplace: schemaVersion: v2 applicationApiVersion: v1beta1 - publishedVersion: '2.20.0' + publishedVersion: '2.21.0' publishedVersionMetadata: releaseNote: >- Initial release. @@ -89,7 +89,7 @@ properties: modelRepositoryPath: type: string title: Bucket where models are stored. Please make sure the user/service account to create the GKE app has permission to this GCS bucket. Read Triton documentation on configs and formatting details, supporting TensorRT, TensorFlow, Pytorch, Onnx ... etc. - default: gs://triton_sample_models/22_03 + default: gs://triton_sample_models/22_04 image.ldPreloadPath: type: string title: Leave this empty by default. Triton allows users to create custom layers for backend such as TensorRT plugin or Tensorflow custom ops, the compiled shared library must be provided via LD_PRELOAD environment variable. diff --git a/deploy/gke-marketplace-app/server-deployer/schema.yaml b/deploy/gke-marketplace-app/server-deployer/schema.yaml index 67736e3f52..e70d31f3ba 100644 --- a/deploy/gke-marketplace-app/server-deployer/schema.yaml +++ b/deploy/gke-marketplace-app/server-deployer/schema.yaml @@ -89,7 +89,7 @@ properties: modelRepositoryPath: type: string title: Bucket where models are stored. Please make sure the user/service account to create the GKE app has permission to this GCS bucket. Read Triton documentation on configs and formatting details, supporting TensorRT, TensorFlow, Pytorch, Onnx ... etc. - default: gs://triton_sample_models/22_03 + default: gs://triton_sample_models/22_04 image.ldPreloadPath: type: string title: Leave this empty by default. 
Triton allows users to create custom layers for backend such as TensorRT plugin or Tensorflow custom ops, the compiled shared library must be provided via LD_PRELOAD environment variable. diff --git a/docs/build.md b/docs/build.md index e5e66c7b2e..d10bf232b6 100644 --- a/docs/build.md +++ b/docs/build.md @@ -109,8 +109,8 @@ invocation builds all features, backends, and repository agents. If you are building on *main* branch then `` will default to "main". If you are building on a release branch then `` will default to the branch name. For example, if you -are building on the r22.03 branch, `` will default to -r22.03. Therefore, you typically do not need to provide `` will default to +r22.04. Therefore, you typically do not need to provide `` at all (nor the preceding colon). You can use a different `` for a component to instead use the corresponding branch/tag in the build. For example, if you have a branch called @@ -282,8 +282,8 @@ python build.py --cmake-dir=/build --build-dir=/tmp/citritonbuild If you are building on *main* branch then '' will default to "main". If you are building on a release branch then '' will default to the branch name. For example, if you -are building on the r22.03 branch, '' will default to -r22.03. Therefore, you typically do not need to provide '' will default to +r22.04. Therefore, you typically do not need to provide '' at all (nor the preceding colon). You can use a different '' for a component to instead use the corresponding branch/tag in the build. For example, if you have a branch called diff --git a/docs/compose.md b/docs/compose.md index 5b68b150aa..55b495c5ef 100644 --- a/docs/compose.md +++ b/docs/compose.md @@ -44,8 +44,8 @@ from source to get more exact customization. The `compose.py` script can be found in the [server repository](https://github.com/triton-inference-server/server). Simply clone the repository and run `compose.py` to create a custom container. Note: Created container version will depend on the branch that was cloned. -For example branch [r22.03](https://github.com/triton-inference-server/server/tree/r22.03) -should be used to create a image based on the NGC 22.03 Triton release. +For example branch [r22.04](https://github.com/triton-inference-server/server/tree/r22.04) +should be used to create a image based on the NGC 22.04 Triton release. `compose.py` provides `--backend`, `--repoagent` options that allow you to specify which backends and repository agents to include in the custom image. @@ -62,7 +62,7 @@ will provide a container `tritonserver` locally. You can access the container wi $ docker run -it tritonserver:latest ``` -Note: If `compose.py` is run on release versions `r22.03` and earlier, +Note: If `compose.py` is run on release versions `r22.04` and earlier, the resulting container will have DCGM version 2.2.3 installed. This may result in different GPU statistic reporting behavior. @@ -76,19 +76,19 @@ For example, running ``` python3 compose.py --backend tensorflow1 --repoagent checksum ``` -on branch [r22.03](https://github.com/triton-inference-server/server/tree/r22.03) pulls: -- `min` container `nvcr.io/nvidia/tritonserver:22.03-py3-min` -- `full` container `nvcr.io/nvidia/tritonserver:22.03-py3` +on branch [r22.04](https://github.com/triton-inference-server/server/tree/r22.04) pulls: +- `min` container `nvcr.io/nvidia/tritonserver:22.04-py3-min` +- `full` container `nvcr.io/nvidia/tritonserver:22.04-py3` Alternatively, users can specify the version of Triton container to pull from any branch by either: 1. 
Adding flag `--container-version ` to branch ``` -python3 compose.py --backend tensorflow1 --repoagent checksum --container-version 22.03 +python3 compose.py --backend tensorflow1 --repoagent checksum --container-version 22.04 ``` 2. Specifying `--image min, --image full,`. The user is responsible for specifying compatible `min` and `full` containers. ``` -python3 compose.py --backend tensorflow1 --repoagent checksum --image min,nvcr.io/nvidia/tritonserver:22.03-py3-min --image full,nvcr.io/nvidia/tritonserver:22.03-py3 +python3 compose.py --backend tensorflow1 --repoagent checksum --image min,nvcr.io/nvidia/tritonserver:22.04-py3-min --image full,nvcr.io/nvidia/tritonserver:22.04-py3 ``` Method 1 and 2 will result in the same composed container. Furthermore, `--image` flag overrides the `--container-version` flag when both are specified. diff --git a/docs/custom_operations.md b/docs/custom_operations.md index 2eae5fea80..99c68aa538 100644 --- a/docs/custom_operations.md +++ b/docs/custom_operations.md @@ -64,7 +64,7 @@ simple way to ensure you are using the correct version of TensorRT is to use the [NGC TensorRT container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt) corresponding to the Triton container. For example, if you are using -the 22.03 version of Triton, use the 22.03 version of the TensorRT +the 22.04 version of Triton, use the 22.04 version of the TensorRT container. ## TensorFlow @@ -108,7 +108,7 @@ simple way to ensure you are using the correct version of TensorFlow is to use the [NGC TensorFlow container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) corresponding to the Triton container. For example, if you are using -the 22.03 version of Triton, use the 22.03 version of the TensorFlow +the 22.04 version of Triton, use the 22.04 version of the TensorFlow container. ## PyTorch @@ -152,7 +152,7 @@ simple way to ensure you are using the correct version of PyTorch is to use the [NGC PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) corresponding to the Triton container. For example, if you are using -the 22.03 version of Triton, use the 22.03 version of the PyTorch +the 22.04 version of Triton, use the 22.04 version of the PyTorch container. ## ONNX diff --git a/docs/test.md b/docs/test.md index 10fcf269fb..ec1f9c21c3 100644 --- a/docs/test.md +++ b/docs/test.md @@ -49,7 +49,7 @@ $ ./gen_qa_custom_ops ``` This will create multiple model repositories in /tmp//qa_* -(for example /tmp/22.03/qa_model_repository). The TensorRT models +(for example /tmp/22.04/qa_model_repository). The TensorRT models will be created for the GPU on the system that CUDA considers device 0 (zero). If you have multiple GPUs on your system see the documentation in the scripts for how to target a specific GPU. 
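(Editorial note, not part of the patch.) The custom-operations and test changes above all hinge on one rule: the framework container tag must match the Triton release, now 22.04. A minimal, hypothetical sketch of that pairing — the NGC TensorRT tag format, the build script, the plugin name, and the model repository path below are assumptions for illustration only:

```
# Build the custom-op shared library inside the framework container that
# matches the 22.04 Triton release (here the NGC TensorRT container).
docker run --gpus=all --rm -v $(pwd):/workspace \
    nvcr.io/nvidia/tensorrt:22.04-py3 \
    /workspace/build_plugin.sh            # hypothetical build script

# Load the resulting library into Triton 22.04 via LD_PRELOAD, the mechanism
# referenced by the GKE deployer's image.ldPreloadPath setting.
docker run --gpus=all --rm -p 8000:8000 \
    -v /path/to/model_repository:/models \
    -e LD_PRELOAD=/models/libcustomplugin.so \
    nvcr.io/nvidia/tritonserver:22.04-py3 \
    tritonserver --model-repository=/models
```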
diff --git a/qa/common/gen_qa_custom_ops b/qa/common/gen_qa_custom_ops index 28b2994c81..4c9d15988a 100755 --- a/qa/common/gen_qa_custom_ops +++ b/qa/common/gen_qa_custom_ops @@ -37,7 +37,7 @@ ## ############################################################################ -TRITON_VERSION=22.03 +TRITON_VERSION=22.04 TENSORFLOW_IMAGE=${TENSORFLOW_IMAGE:=nvcr.io/nvidia/tensorflow:$TRITON_VERSION-tf1-py3} PYTORCH_IMAGE=${PYTORCH_IMAGE:=nvcr.io/nvidia/pytorch:$TRITON_VERSION-py3} diff --git a/qa/common/gen_qa_model_repository b/qa/common/gen_qa_model_repository index bfca889a30..38c759e629 100755 --- a/qa/common/gen_qa_model_repository +++ b/qa/common/gen_qa_model_repository @@ -48,7 +48,7 @@ ## ############################################################################ -TRITON_VERSION=22.03 +TRITON_VERSION=22.04 # ONNX. Use ONNX_OPSET 0 to use the default for ONNX version ONNX_VERSION=1.10.1
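(Editorial note, not part of the patch.) This patch pins the container tags, chart versions, and script constants to the 22.04 / 2.21.0 release. A small sketch of one way to sanity-check the bump once the 22.04 images are published, using the standard HTTP server metadata endpoint; the model repository path is a placeholder and the workflow itself is an assumption, not something defined in this patch:

```
# Pull the server and SDK images referenced by the updated Dockerfile.sdk,
# Helm charts, and GKE deployer.
docker pull nvcr.io/nvidia/tritonserver:22.04-py3
docker pull nvcr.io/nvidia/tritonserver:22.04-py3-sdk

# Start the server against a real model repository.
docker run -d --rm --name triton -p 8000:8000 \
    -v /path/to/model_repository:/models \
    nvcr.io/nvidia/tritonserver:22.04-py3 \
    tritonserver --model-repository=/models

# The server metadata endpoint should report core version 2.21.0, matching
# the updated TRITON_VERSION file and the appVersion in the charts.
curl -s localhost:8000/v2
```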