diff --git a/docs/source/getting_started/neuron-installation.rst b/docs/source/getting_started/neuron-installation.rst
new file mode 100644
index 0000000000000..0aff1037d8a29
--- /dev/null
+++ b/docs/source/getting_started/neuron-installation.rst
@@ -0,0 +1,135 @@
+.. _installation_neuron:
+
+Installation with Neuron
+========================
+
+vLLM 0.3.3 onwards supports model inference and serving on AWS Trainium/Inferentia with the Neuron SDK.
+At the moment PagedAttention is not supported in the Neuron SDK, but naive continuous batching is supported in transformers-neuronx.
+Data types currently supported in the Neuron SDK are FP16 and BF16.
+
+Requirements
+------------
+
+* OS: Linux
+* Python: 3.8 -- 3.11
+* Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
+* PyTorch 2.0.1/2.1.1
+* AWS Neuron SDK 2.16/2.17 (verified on Python 3.8)
+
+Installation steps:
+
+- :ref:`Build from source <build_from_source_neuron>`
+
+  - :ref:`Step 0. Launch Trn1/Inf2 instances <launch_instances>`
+  - :ref:`Step 1. Install drivers and tools <install_drivers>`
+  - :ref:`Step 2. Install transformers-neuronx and its dependencies <install_tnx>`
+  - :ref:`Step 3. Install vLLM from source <install_vllm>`
+
+.. _build_from_source_neuron:
+
+Build from source
+-----------------
+
+The following instructions are applicable to Neuron SDK 2.16 and beyond.
+
+.. _launch_instances:
+
+Step 0. Launch Trn1/Inf2 instances
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are the steps to launch trn1/inf2 instances, in order to set up `PyTorch Neuron ("torch-neuronx") on Ubuntu 22.04 LTS `_.
+
+- Follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select a trn1 or inf2 instance type.
+- For more information about instance sizes and pricing, see the `Trn1 web page `_ and the `Inf2 web page `_.
+- Select the Ubuntu Server 22.04 LTS AMI.
+- When launching a Trn1/Inf2 instance, adjust your primary EBS volume size to a minimum of 512 GB.
+- After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance.
+
+.. _install_drivers:
+
+Step 1. Install drivers and tools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Installing drivers and tools is not necessary if the `Deep Learning AMI Neuron `_ is used. In case the drivers and tools are not installed on the operating system, follow the steps below:
+
+.. code-block:: console
+
+    # Configure Linux for Neuron repository updates
+    . /etc/os-release
+    sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
+    deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
+    EOF
+    wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
+
+    # Update OS packages
+    sudo apt-get update -y
+
+    # Install OS headers
+    sudo apt-get install linux-headers-$(uname -r) -y
+
+    # Install Neuron Driver
+    sudo apt-get install aws-neuronx-dkms=2.* -y
+
+    # Install Neuron Runtime
+    sudo apt-get install aws-neuronx-collectives=2.* -y
+    sudo apt-get install aws-neuronx-runtime-lib=2.* -y
+
+    # Install Neuron Tools
+    sudo apt-get install aws-neuronx-tools=2.* -y
+
+    # Add PATH
+    export PATH=/opt/aws/neuron/bin:$PATH
+
+.. _install_tnx:
+
+Step 2. Install transformers-neuronx and its dependencies
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`transformers-neuronx `_ will be the backend to support inference on trn1/inf2 instances.
+Follow the steps below to install the transformers-neuronx package and its dependencies.
+
+.. code-block:: console
+
+    # Install Python venv
+    sudo apt-get install -y python3.10-venv g++
+
+    # Create Python venv
+    python3.10 -m venv aws_neuron_venv_pytorch
+
+    # Activate Python venv
+    source aws_neuron_venv_pytorch/bin/activate
+
+    # Install Jupyter notebook kernel
+    pip install ipykernel
+    python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
+    pip install jupyter notebook
+    pip install environment_kernels
+
+    # Set pip repository pointing to the Neuron repository
+    python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
+
+    # Install wget, awscli
+    python -m pip install wget
+    python -m pip install awscli
+
+    # Update Neuron Compiler and Framework
+    python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
+
+.. _install_vllm:
+
+Step 3. Install vLLM from source
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the neuronx-cc and transformers-neuronx packages are installed, vLLM can be installed as follows:
+
+.. code-block:: console
+
+    $ cd vllm
+    $ pip install -U -r requirements-neuron.txt
+    $ pip install .
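+
+Once the build succeeds, a short offline-inference run is a quick way to confirm that the Neuron backend works end to end. The sketch below follows the pattern of vLLM's offline inference examples; the model name, sequence limits, and parallel degree are illustrative assumptions, not requirements.
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    # device="neuron" selects the transformers-neuronx backend;
+    # tensor_parallel_size is the number of NeuronCores to shard the
+    # model across (an assumption here -- match it to your instance).
+    llm = LLM(model="openlm-research/open_llama_3b",
+              max_num_seqs=8,
+              max_model_len=128,
+              block_size=128,
+              device="neuron",
+              tensor_parallel_size=2)
+
+    outputs = llm.generate(["Hello, my name is"],
+                           SamplingParams(temperature=0.8, top_p=0.95))
+    print(outputs[0].outputs[0].text)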
+
+If the Neuron packages are detected correctly during the installation process, ``vllm-0.3.0+neuron212`` will be installed.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 3e2331907f0f2..7d782bdfa0a79 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -62,6 +62,7 @@ Documentation
 
    getting_started/installation
    getting_started/amd-installation
+   getting_started/neuron-installation
    getting_started/quickstart
 
 .. toctree::
diff --git a/docs/source/quantization/fp8_e5m2_kv_cache.rst b/docs/source/quantization/fp8_e5m2_kv_cache.rst
index 10437260ad964..f1eeb59550952 100644
--- a/docs/source/quantization/fp8_e5m2_kv_cache.rst
+++ b/docs/source/quantization/fp8_e5m2_kv_cache.rst
@@ -9,6 +9,7 @@ The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16
 Here is an example of how to enable this feature:
 
 .. code-block:: python
+    from vllm import LLM, SamplingParams
 
     # Sample prompts.
     prompts = [