From 86d6d0d301c3911ebcee9c9623d21a6b4cd880a6 Mon Sep 17 00:00:00 2001
From: Liangfu Chen
Date: Mon, 5 Feb 2024 16:36:10 -0800
Subject: [PATCH 1/3] add setup document for supporting inferentia

---
 .../getting_started/neuron-installation.rst   | 105 ++++++++++++++++++
 docs/source/index.rst                         |   1 +
 .../source/quantization/fp8_e5m2_kv_cache.rst |   1 +
 3 files changed, 107 insertions(+)
 create mode 100644 docs/source/getting_started/neuron-installation.rst

diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
new file mode 100644
index 0000000000000..28fd818aa01e3
--- /dev/null
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -0,0 +1,105 @@
+.. _installation_neuron:
+
+Installation with Neuron
+========================
+
+vLLM 0.3.0 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
+At the moment Paged Attention is not supported in Neuron SDK, but naive continuous batching is supported in transformers-neuronx.
+Data types currently supported in Neuron SDK are FP16 and BF16.
+
+Requirements
+------------
+
+* OS: Linux
+* Python: 3.8 -- 3.11
+* Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
+* PyTorch 2.0.1/2.1.1
+* AWS Neuron SDK 2.16/2.17 (verified on Python 3.8)
+
+Installation steps:
+
+- :ref:`Build from source `
+
+  - :ref:`Step 0. Launch Trn1/Inf2 instances `
+  - :ref:`Step 1. Install drivers and tools `
+  - :ref:`Step 2. Install transformers-neuronx and its dependencies `
+  - :ref:`Step 3. Install vLLM from source `
+
+.. _build_from_source_neuron:
+
+Build from source
+-----------------
+
+The following instructions apply to Neuron SDK 2.16 and beyond.
+
+.. _launch_instances:
+
+Step 0. Launch Trn1/Inf2 instances
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are the steps to launch trn1/inf2 instances, following the `PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 22.04 LTS `_ guide.
+
+- Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
+- To get more information about instance sizes and pricing, see the `Trn1 web page `_ and the `Inf2 web page `_.
+- Select the Ubuntu Server 22.04 LTS AMI.
+- When launching a Trn1/Inf2 instance, please adjust your primary EBS volume size to a minimum of 512GB.
+- After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance.
+
+.. _install_drivers:
+
+Step 1. Install drivers and tools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Installing drivers and tools is not necessary if the `Deep Learning AMI Neuron `_ is used. If the drivers and tools are not already installed on the operating system, follow the steps below:
+
+.. code-block:: console
+
+   # Configure Linux for Neuron repository updates
+   . /etc/os-release
+   sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <
Date: Mon, 5 Feb 2024 21:43:46 -0800
Subject: [PATCH 2/3] install neuron packages

---
 .../getting_started/neuron-installation.rst | 58 ++++++++++++++-----
 1 file changed, 44 insertions(+), 14 deletions(-)

diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
index 28fd818aa01e3..b68b07270947b 100644
--- a/docs/source/getting_started/neuron-installation.rst
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -53,33 +53,33 @@ Step 1. Install drivers and tools
 Installing drivers and tools is not necessary if the `Deep Learning AMI Neuron `_ is used. If the drivers and tools are not already installed on the operating system, follow the steps below:
 
 .. code-block:: console
-
+
    # Configure Linux for Neuron repository updates
    . /etc/os-release
    sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <
 
+`transformers-neuronx `_ will be the backend to support inference on trn1/inf2 instances.
+Follow the steps below to install the transformers-neuronx package and its dependencies.
+
 .. code-block:: console
 
-   $ pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
+   # Install Python venv
+   sudo apt-get install -y python3.10-venv g++
+
+   # Create Python venv
+   python3.10 -m venv aws_neuron_venv_pytorch
+
+   # Activate Python venv
+   source aws_neuron_venv_pytorch/bin/activate
+
+   # Install Jupyter notebook kernel
+   pip install ipykernel
+   python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
+   pip install jupyter notebook
+   pip install environment_kernels
+
+   # Set pip repository pointing to the Neuron repository
+   python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
+
+   # Install wget, awscli
+   python -m pip install wget
+   python -m pip install awscli
+
+   # Update Neuron Compiler and Framework
+   python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
 
 .. _install_vllm:
 
 Step 3. Install vLLM from source
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+Once the neuronx-cc and transformers-neuronx packages are installed, we will be able to install vLLM as follows:
+
 .. code-block:: console
 
    $ cd vllm
    $ pip install -U -r requirements-neuron.txt
    $ pip install .
+
+If the Neuron packages are detected correctly during installation, ``vllm-0.3.0+neuron212`` will be installed.
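The ``vllm-0.3.0+neuron212`` note above works because the build appends a PEP 440 local version label (here ``neuron212``, reflecting the detected Neuron SDK) after a ``+``. A minimal sketch of how a script could split that label off to confirm the Neuron backend was picked up; the ``neuron_build_tag`` helper is hypothetical and not part of vLLM:

```python
from typing import Optional


def neuron_build_tag(version: str) -> Optional[str]:
    """Return the Neuron label from a local version like '0.3.0+neuron212'.

    Illustrative helper only; vLLM does not ship this function.
    """
    _, sep, local = version.partition("+")
    if sep and local.startswith("neuron"):
        return local
    return None  # plain build: the Neuron backend was not detected


print(neuron_build_tag("0.3.0+neuron212"))  # -> neuron212
```

In practice the version string would come from ``vllm.__version__`` or ``pip show vllm`` after the install step above.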
From 5af99e920c1aedec0eb54635066c844dd26119e9 Mon Sep 17 00:00:00 2001
From: Zhuohan Li
Date: Sun, 3 Mar 2024 15:57:13 -0800
Subject: [PATCH 3/3] Update neuron-installation.rst

---
 docs/source/getting_started/neuron-installation.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
index b68b07270947b..0aff1037d8a29 100644
--- a/docs/source/getting_started/neuron-installation.rst
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -3,7 +3,7 @@
 Installation with Neuron
 ========================
 
-vLLM 0.3.0 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
+vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
 At the moment Paged Attention is not supported in Neuron SDK, but naive continuous batching is supported in transformers-neuronx.
 Data types currently supported in Neuron SDK are FP16 and BF16.
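Since this final patch raises the documented minimum to vLLM 0.3.3, a downstream script that depends on Neuron support may want to check the installed version against that floor, ignoring any ``+neuron...`` local tag. A minimal sketch, assuming a plain ``major.minor.patch`` scheme; the ``meets_minimum`` helper is illustrative, not part of vLLM:

```python
def meets_minimum(version: str, minimum=(0, 3, 3)) -> bool:
    """Compare a version string against the documented 0.3.3 minimum."""
    base = version.split("+", 1)[0]  # drop a local tag such as "+neuron212"
    parts = tuple(int(p) for p in base.split("."))
    return parts >= minimum  # tuple comparison is lexicographic


print(meets_minimum("0.3.3+neuron212"))  # True
print(meets_minimum("0.3.0+neuron212"))  # False
```

A real project would more likely use ``packaging.version.Version`` for this, which handles pre-releases and local tags according to PEP 440; the tuple comparison above is just the shortest correct form for simple ``x.y.z`` versions.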