From 86d6d0d301c3911ebcee9c9623d21a6b4cd880a6 Mon Sep 17 00:00:00 2001
From: Liangfu Chen
Date: Mon, 5 Feb 2024 16:36:10 -0800
Subject: [PATCH 1/3] add setup document for supporting inferentia

---
 .../getting_started/neuron-installation.rst   | 105 ++++++++++++++++++
 docs/source/index.rst                         |   1 +
 .../source/quantization/fp8_e5m2_kv_cache.rst |   1 +
 3 files changed, 107 insertions(+)
 create mode 100644 docs/source/getting_started/neuron-installation.rst

diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
new file mode 100644
index 0000000000000..28fd818aa01e3
--- /dev/null
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -0,0 +1,105 @@
+.. _installation_neuron:
+
+Installation with Neuron
+========================
+
+vLLM 0.3.0 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
+At the moment Paged Attention is not supported in Neuron SDK, but naive continuous batching is supported in transformers-neuronx.
+Data types currently supported in Neuron SDK are FP16 and BF16.
+
+Requirements
+------------
+
+* OS: Linux
+* Python: 3.8 -- 3.11
+* Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
+* PyTorch 2.0.1/2.1.1
+* AWS Neuron SDK 2.16/2.17 (verified on Python 3.8)
+
+Installation steps:
+
+- :ref:`Build from source `
+
+  - :ref:`Step 0. Launch Trn1/Inf2 instances `
+  - :ref:`Step 1. Install drivers and tools `
+  - :ref:`Step 2. Install transformers-neuronx and its dependencies `
+  - :ref:`Step 3. Install vLLM from source `
+
+.. _build_from_source_neuron:
+
+Build from source
+-----------------
+
+The following instructions apply to Neuron SDK 2.16 and beyond.
+
+.. _launch_instances:
+
+Step 0. Launch Trn1/Inf2 instances
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are the steps to launch trn1/inf2 instances, following the `PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 22.04 LTS `_ guide.
+
+- Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
+- To get more information about instance sizes and pricing, see the `Trn1 web page `_ and the `Inf2 web page `_.
+- Select the Ubuntu Server 22.04 LTS AMI.
+- When launching a Trn1/Inf2 instance, please adjust your primary EBS volume size to a minimum of 512GB.
+- After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance.
+
+.. _install_drivers:
+
+Step 1. Install drivers and tools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Installing drivers and tools is not necessary if the `Deep Learning AMI Neuron `_ is used. If the drivers and tools are not already installed on the operating system, follow the steps below:
+
+.. code-block:: console
+
+   # Configure Linux for Neuron repository updates
+   . /etc/os-release
+   sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <
Date: Mon, 5 Feb 2024 21:43:46 -0800
Subject: [PATCH 2/3] install neuron packages

---
 .../getting_started/neuron-installation.rst | 58 ++++++++++++++-----
 1 file changed, 44 insertions(+), 14 deletions(-)

diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
index 28fd818aa01e3..b68b07270947b 100644
--- a/docs/source/getting_started/neuron-installation.rst
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -53,33 +53,33 @@ Step 1. Install drivers and tools
 Installing drivers and tools is not necessary if the `Deep Learning AMI Neuron `_ is used. If the drivers and tools are not already installed on the operating system, follow the steps below:
 
 .. code-block:: console
-
+
    # Configure Linux for Neuron repository updates
    . /etc/os-release
    sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <
 
+`transformers-neuronx `_ will be the backend to support inference on trn1/inf2 instances.
+Follow the steps below to install the transformers-neuronx package and its dependencies.
+
 .. code-block:: console
 
-   $ pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
+   # Install Python venv
+   sudo apt-get install -y python3.10-venv g++
+
+   # Create Python venv
+   python3.10 -m venv aws_neuron_venv_pytorch
+
+   # Activate Python venv
+   source aws_neuron_venv_pytorch/bin/activate
+
+   # Install Jupyter notebook kernel
+   pip install ipykernel
+   python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
+   pip install jupyter notebook
+   pip install environment_kernels
+
+   # Set pip repository pointing to the Neuron repository
+   python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
+
+   # Install wget, awscli
+   python -m pip install wget
+   python -m pip install awscli
+
+   # Update Neuron Compiler and Framework
+   python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
 
 .. _install_vllm:
 
 Step 3. Install vLLM from source
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+Once the neuronx-cc and transformers-neuronx packages are installed, we will be able to install vLLM as follows:
+
 .. code-block:: console
 
    $ cd vllm
    $ pip install -U -r requirements-neuron.txt
    $ pip install .
+
+If the Neuron packages are detected correctly during installation, ``vllm-0.3.0+neuron212`` will be installed.
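The ``vllm-0.3.0+neuron212`` note above works because the build appends a PEP 440 local version label (here ``neuron212``, reflecting the detected Neuron SDK) after a ``+``. A minimal sketch of how a script could split that label off to confirm the Neuron backend was picked up; the ``neuron_build_tag`` helper is hypothetical and not part of vLLM:

```python
from typing import Optional


def neuron_build_tag(version: str) -> Optional[str]:
    """Return the Neuron label from a local version like '0.3.0+neuron212'.

    Illustrative helper only; vLLM does not ship this function.
    """
    _, sep, local = version.partition("+")
    if sep and local.startswith("neuron"):
        return local
    return None  # plain build: the Neuron backend was not detected


print(neuron_build_tag("0.3.0+neuron212"))  # -> neuron212
```

In practice the version string would come from ``vllm.__version__`` or ``pip show vllm`` after the install step above.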
From 5af99e920c1aedec0eb54635066c844dd26119e9 Mon Sep 17 00:00:00 2001
From: Zhuohan Li
Date: Sun, 3 Mar 2024 15:57:13 -0800
Subject: [PATCH 3/3] Update neuron-installation.rst

---
 docs/source/getting_started/neuron-installation.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
index b68b07270947b..0aff1037d8a29 100644
--- a/docs/source/getting_started/neuron-installation.rst
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -3,7 +3,7 @@
 Installation with Neuron
 ========================
 
-vLLM 0.3.0 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
+vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
 At the moment Paged Attention is not supported in Neuron SDK, but naive continuous batching is supported in transformers-neuronx.
 Data types currently supported in Neuron SDK are FP16 and BF16.
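Since this final patch raises the documented minimum to vLLM 0.3.3, a downstream script that depends on Neuron support may want to check the installed version against that floor, ignoring any ``+neuron...`` local tag. A minimal sketch, assuming a plain ``major.minor.patch`` scheme; the ``meets_minimum`` helper is illustrative, not part of vLLM:

```python
def meets_minimum(version: str, minimum=(0, 3, 3)) -> bool:
    """Compare a version string against the documented 0.3.3 minimum."""
    base = version.split("+", 1)[0]  # drop a local tag such as "+neuron212"
    parts = tuple(int(p) for p in base.split("."))
    return parts >= minimum  # tuple comparison is lexicographic


print(meets_minimum("0.3.3+neuron212"))  # True
print(meets_minimum("0.3.0+neuron212"))  # False
```

A real project would more likely use ``packaging.version.Version`` for this, which handles pre-releases and local tags according to PEP 440; the tuple comparison above is just the shortest correct form for simple ``x.y.z`` versions.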