How to use FP8 to train an LLM with multiple GPUs
0- Use Python 3.10
1- First install cuDNN 8.9.2, CUDA 12.1, and Torch 2.3.1. cuDNN archive: https://developer.nvidia.com/rdp/cudnn-archive
2- Then install Transformer Engine:
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
3- Wait for the build to complete (the pip install above compiles Transformer Engine from source); a quick install check is sketched after this list
4- Download the two Python files into the same folder
5- Use this command to launch training (a minimal sketch of such a script is shown below): torchrun --nproc_per_node=2 m_gpu.py
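
After step 3, a quick sanity check like the one below can confirm that Transformer Engine imports correctly and that each visible GPU has an FP8-capable architecture (Ada, compute capability 8.9, or Hopper, 9.0 and newer). This snippet is not one of the two files from step 4; it only uses the PyTorch API plus a bare import of Transformer Engine.

```python
# Sanity check after installing Transformer Engine: the import fails if the
# source build in step 2/3 did not complete, and the capability check flags
# GPUs that cannot run FP8 GEMMs.
import torch
import transformer_engine.pytorch as te  # import alone verifies the build

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    fp8_ok = (major, minor) >= (8, 9)  # Ada (sm_89) or Hopper (sm_90+)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"sm_{major}{minor} FP8-capable={fp8_ok}")
```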
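
For step 5, the following is a minimal sketch of what a multi-GPU FP8 training script launched this way might look like, assuming Transformer Engine's standard delayed-scaling recipe and PyTorch DistributedDataParallel. The toy model, batch size, and placeholder loss are assumptions for illustration; they do not reproduce the actual m_gpu.py from step 4.

```python
# Minimal multi-GPU FP8 training loop sketch (illustrative, not the real m_gpu.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

def main():
    # torchrun --nproc_per_node=2 starts one process per GPU and sets LOCAL_RANK.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Toy model built from Transformer Engine layers so its GEMMs can run in FP8.
    model = torch.nn.Sequential(
        te.Linear(1024, 4096),
        te.Linear(4096, 1024),
    ).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Delayed-scaling FP8 recipe (E4M3 forward / E5M2 backward); amax statistics
    # are reduced across ranks via fp8_group.
    fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                                amax_history_len=16,
                                amax_compute_algo="max")

    for step in range(10):
        # Dummy batch; a real script would use a DataLoader with a DistributedSampler.
        x = torch.randn(16, 1024, device="cuda")

        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe,
                             fp8_group=dist.group.WORLD):
            out = model(x)

        loss = out.float().pow(2).mean()  # placeholder loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if local_rank == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Saved as, for example, m_gpu.py, it would be launched with the torchrun command from step 5.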