diff --git a/docs/en/index.rst b/docs/en/index.rst index c702e0a04..be23d669a 100644 --- a/docs/en/index.rst +++ b/docs/en/index.rst @@ -56,6 +56,16 @@ Documentation training/open_source_dataset.rst training/visualization.rst +.. toctree:: + :maxdepth: 2 + :caption: RLHF + + rlhf/rlhf_intro.rst + rlhf/overview.rst + rlhf/quickstart.rst + rlhf/modify_settings.rst + rlhf/arch.rst + .. toctree:: :maxdepth: 2 :caption: Acceleration diff --git a/docs/en/rlhf/arch.rst b/docs/en/rlhf/arch.rst new file mode 100644 index 000000000..e43c04ad7 --- /dev/null +++ b/docs/en/rlhf/arch.rst @@ -0,0 +1,58 @@ +.. _xtuner_rlhf_arch: + +System Architecture +------------------- + +The architecture of XTuner-RLHF is shown as follows: + +.. image:: images/arch_en.svg + :alt: XTuner-RLHF Architecture + +Algorithm Layer +~~~~~~~~~~~~~~~ + +The algorithm layer implements various reinforcement learning algorithms and environments, i.e., specific training strategies and application scenarios. This includes various reinforcement learning algorithms such as PPO and KTO, as well as different task environments such as Q&A (question and answer) and LR (logical reasoning). + +Coordination Layer +~~~~~~~~~~~~~~~~~~~~ + +The coordination layer provides model-level operation interfaces to the upper algorithm layer, simplifying interactions with the underlying engines. It also adapts to different training and inference frameworks and models, managing and scheduling multiple model resources to ensure efficient system operation. + +Engine Layer +~~~~~~~~~~~~ + +The engine layer decouples training, inference, and generation, allowing users to choose different engines for these processes. For example, transformers can be used for training and inference, while the vLLM can be used for generation. The advantages of a multi-engine design include: + +**Flexibility and Adaptability**: Different projects may have different requirements and constraints. Integrating multiple frameworks allows users to choose the most suitable tool for their specific situation, enhancing development efficiency and effectiveness. + +**Performance Optimization**: Different frameworks may perform differently on various tasks. Users can choose the framework that performs best for a specific task to achieve optimal performance. + +**Cross-Platform Compatibility**: Some frameworks perform better on specific platforms or support specific hardware. Providing multiple framework options ensures compatibility and optimization across different platforms and hardware. + +**Ease of Use**: Some frameworks may be more user-friendly and suitable for rapid prototyping, while others may be better suited for large-scale deployment. Users can choose the appropriate framework based on the development stage. + +Distributed Computing Framework: Ray +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As a project incubated within InternLM, the RLHF system has adopted Ray as the distributed framework for efficient training, inference, and generation since the project began in May 2023, empowering the system with the following features: + +**Abstraction of Underlying Cluster Differences**: Ray provides an abstraction layer, allowing users to ignore the details of the underlying hardware. Whether using local clusters or cloud resources, Ray can uniformly manage and schedule tasks, simplifying the development and deployment process. 
+ +**Efficient Resource Management**: Resources such as CPU and GPU can be dynamically allocated according to task requirements, ensuring efficient use of computational resources and improving overall system performance. + +**Scalability**: Ray can easily scale to large clusters, supporting hundreds or even thousands of nodes. This allows the system to scale horizontally to meet the demands of large-scale data processing and computation. + +**Flexible Task Scheduling**: Ray can schedule tasks flexibly based on priority and resource requirements, optimizing task execution order, reducing task waiting time, and improving system throughput. + +**Automated Fault Recovery**: Ray has built-in fault tolerance mechanisms that can detect and attempt to recover failed tasks, enhancing the stability and reliability of the system while reducing the need for manual intervention. + +Acknowledgements +~~~~~~~~~~~~~~~~~ + +In our journey of exploring and implementing the RLHF system, we have been fortunate to witness many outstanding open-source projects that shine like brilliant stars, illuminating our path forward. For example: + +- `ColossalChat `_: Ingeniously utilizes Ray to implement distributed PPO, distributing trainers and experience makers across different nodes, enhancing computational efficiency. +- `ATorch `_: Adopts an innovative design of "training-decoding decoupling + high-performance inference backend," compatible with the open-source vLLM engine as the inference backend, supporting efficient fine-tuning of trillion-scale models. +- `OpenRLHF `_: A concise, easy-to-use and open-source spirited RLHF training framework that leverages open-source projects such as Ray, DeepSpeed, vLLM, and HF Transformers to implement high-performance PPO and other algorithms. + +We hold deep respect and gratitude for the developers in the open-source community. They have not only shared valuable knowledge and experience but also fostered the prosperity and development of the large model RLHF system ecosystem with an open mindset. We believe that it is this selfless spirit of sharing that makes our community stronger and technological progress faster. We thank every contributor once again; it is your efforts that make this world a better place. \ No newline at end of file diff --git a/docs/en/rlhf/images/arch_en.svg b/docs/en/rlhf/images/arch_en.svg new file mode 100644 index 000000000..ade8a0e0b --- /dev/null +++ b/docs/en/rlhf/images/arch_en.svg @@ -0,0 +1 @@ +
[SVG text content — XTuner-RLHF architecture diagram: Algorithm Layer (Algorithm: PPO, KTO, ...; Environment: Q&A, LR, ...), Coordination Layer, Engine Layer (Train: Accelerate, DeepSpeed, InternEvo, ...; Infer: transformers, vLLM, ...; Generate: transformers, vLLM, ...)]
\ No newline at end of file diff --git a/docs/en/rlhf/images/rlhf_process.svg b/docs/en/rlhf/images/rlhf_process.svg new file mode 100644 index 000000000..ce5006a0e --- /dev/null +++ b/docs/en/rlhf/images/rlhf_process.svg @@ -0,0 +1 @@ +
[SVG text content — RLHF process diagram: Step 1: Pretrained Model + Human Labeled Data -> SFT Model; Step 2: Pretrained Model + paired good/bad answers -> Reward Model; Step 3: PPO (Generate) with Actor Model, Critic Model, and frozen Reference Model / Reward Model]
\ No newline at end of file diff --git a/docs/en/rlhf/images/speed_comp.svg b/docs/en/rlhf/images/speed_comp.svg new file mode 100644 index 000000000..dde030873 --- /dev/null +++ b/docs/en/rlhf/images/speed_comp.svg @@ -0,0 +1 @@ +
[SVG text content — bar chart "XTuner-RLHF VS DeepSpeed-Chat": End to End Time (s) across GPU Environments 20 x A800 (80G) and 24 x A800 (80G); Model: LLaMA 7B, Network: RoCE, Dataset: Dahoas/full-hh-rlhf, Generation Length: 1.5k ~ 2k; bar values: 112, 114, 141, 172]
\ No newline at end of file diff --git a/docs/en/rlhf/modify_settings.rst b/docs/en/rlhf/modify_settings.rst new file mode 100644 index 000000000..20c6ace7b --- /dev/null +++ b/docs/en/rlhf/modify_settings.rst @@ -0,0 +1,260 @@ +.. _xtuner_rlhf_modify_settings: + +Modify RLHF PPO Configuration +============================= + +This section starts with a basic configuration file and provides examples of modifications for common training scenarios. + +Configuration File Overview +--------------------------- + +The following is the configuration of the InternLM2 1.8B model fine-tuned using PPO through XTuner-RLHF. + +.. code:: python + + import torch + + rollout_config=dict( + actor_micro_bs=32, + reward_micro_bs=32, + clip_reward_min=-1.5, + clip_reward_max=1.5, + max_new_tokens=1024, + generate_kwargs={ + "do_sample": True, + "temperature": 1.0, + "top_k": 0, + "top_p": 0.9, + "min_new_tokens": 1, + "num_beams": 1, + "early_stopping": True, + "eos_token_id": 92542, + "pad_token_id": 0, + } + ) + + repeater_config=dict( + actor_micro_bs=8, + ref_micro_bs=8, + critic_micro_bs=32, + reward_scale=False, + fine_grained_rm=False, + value_ema=False, + kl_coeff = 0.02, + gamma = 1.0, + gae_lambda = 0.95, + answer_end_id = 92542, + norm_adv = True, + ) + + train_config=dict( + ppo_minibatch=64, + value_minibatch=64, + actor_micro_bs=2, + critic_micro_bs=2, + pretrain_step=0, + save_interval=80, + ) + + model_configs=dict( + actor = dict( + model_path="internlm/internlm2-chat-1_8b-sft", + model_type="actor", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + train_kwargs=dict( + micro_bsz=1, + lr=1e-6, + total_steps=1e9, + lr_decay_rate=1, + loss_type="per_seq", + ), + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + generator_config=dict( + shared_with_trainer=True, + ), + ), + + reference = dict( + model_path="internlm/internlm2-chat-1_8b-sft", + model_type="reference", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + + critic = dict( + model_path="internlm/internlm2-chat-1_8b-reward", + model_type="critic", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + train_kwargs=dict( + micro_bsz=1, + lr=1e-6, + total_steps=1e9, + lr_decay_rate=1, + loss_type="per_seq", + ), + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + + reward = dict( + model_path="internlm/internlm2-chat-1_8b-reward", + model_type="reward", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + ) + + dataset_config = { + "num_samples_each_epoch": 64, + "max_seq_len": 1024, + "random_seed": 1024, + "ppo_datas": [ + "Anthropic/hh-rlhf/helpful-base::1.0", + "Anthropic/hh-rlhf/harmless-base::0.5",], + } + +Scenario 1: From InternLM2 1.8B to InternLM2 7B +----------------------------------------------- + +- **Modify model path**: Change the model_path of actor/ref from ``internlm/internlm2-chat-1_8b-sft`` to ``internlm/internlm2-chat-7b-sft`` 
and the model_path of critic/reward from ``internlm/internlm2-chat-1_8b-reward`` to ``internlm/internlm2-chat-7b-reward``. + +- **Modify data parallel mode**: Change the parallel mode of actor/critic from ``ddp`` to ``deepspeed``, and configure zero3 and related parameters accordingly. + +- **Modify data parallelism degree**: Adjust the data parallelism degree of the ref/reward model according to the global batch size and resource amount, for example, changing it from 1 to 2. + +The modified configuration file is as follows: + +.. code:: python + + import torch + + ... + + model_configs=dict( + actor = dict( + model_path="internlm/internlm2-chat-7b-sft", + ... + trainer_config=dict( + ... + parallel=dict( + data=dict(size=8, mode="deepspeed"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + deepspeed_config={ + "bf16": {"enable": False}, + "fp16": {"enable": False}, + "zero_optimization": { + "stage": 3, + "stage3_gather_16bit_weights_on_model_save": True, + }, + "gradient_accumulation_steps": 8, + "train_micro_batch_size_per_gpu": 2, + }, + ), + generator_config=dict( + shared_with_trainer=True, + ), + ), + + # critic same as actor modifications + critic = dict( ... ) + + reward = dict( + model_path="internlm/internlm2-chat-7b-reward", + ... + trainer_config=dict( + torch_dtype="auto", + trainer_type="huggingface", + parallel=dict( + data=dict(size=2, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + + # reference same as reward modifications + reference = dict( ... ) + ) + + ... + +Scenario 2: From InternLM2 7B to LLaMA2 7B +------------------------------------------ + +- **Modify model path**: Change the model_path of actor/ref to ``OpenLLMAI/Llama-2-7b-sft-model-ocra-500k`` and the model_path of critic/reward to ``OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt``. + +- **Modify Tokenizer Configuration**: Update the tokenizer_config to adapt to the LLaMA2 model. + +.. code:: python + + tokenizer_config = dict( + pad_token_id = 2, + eos_token_id = 2, + padding_side = 'left', + chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{'Human:\n' + message['content'] + '\n'}}{% elif message['role'] == 'assistant' %}{{'Assistant:\n' + message['content'] + '\n'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:\n' }}{% endif %}", + ) + +Scenario 3: Using vLLM to Accelerate LLaMA2 7B Generation +-------------------------------------------------------- + +Switch from DeepSpeed generation + DeepSpeed training to vLLM generation + DeepSpeed training, and increase the number of GPUs to accommodate the vLLM generator, with the configuration modified as follows: + +.. code:: python + + import torch + + ... + + model_configs=dict( + actor = dict( + ... + generator_config=dict( + shared_with_trainer=False, + generator_type="vllm", + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=2, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + ... + ) + + ... \ No newline at end of file diff --git a/docs/en/rlhf/overview.rst b/docs/en/rlhf/overview.rst new file mode 100644 index 000000000..5d1f32fe0 --- /dev/null +++ b/docs/en/rlhf/overview.rst @@ -0,0 +1,37 @@ +.. 
_xtuner_rlhf_overview: + +XTuner-RLHF Overview +==================== + +Introduction +------------ + +XTuner-RLHF provides support for the last step in the RLHF process (:ref:`rlhf_intro`) - training the Actor Model using reinforcement learning algorithms, with the following advantages: + +- **Dual Engine**: XTuner-RLHF allows users to choose different frameworks for training, inference, and generation. For instance, Huggingface engines can be used for training and inference, while the vLLM engine can be used for generation. + +- **Ray**: XTuner-RLHF integrates Ray for distributed training, inference, and generation, offering efficient resource management and task scheduling capabilities. Users do not need to focus on the details of the underlying cluster. Whether on local clusters or the cloud, Ray can uniformly manage resources and schedule tasks, simplifying the development and deployment process. + +- **Scalability**: XTuner-RLHF adopts a layered architecture design (:ref:`xtuner_rlhf_arch`), dividing the system into the engine layer, scheduling layer, and algorithm layer. This allows for easy extension of different training and inference engines and reinforcement learning algorithms. + +Speed Benchmark +------------------------------------------ + +.. image:: images/speed_comp.svg + :alt: Speed Benchmark + +Quick Start +----------- + +Refer to :ref:`xtuner_rlhf_quick_start`. + +Future Prospects +---------------- + +In the future, XTuner-RLHF plans to integrate the following features: + +- **Training/Inference Backend**: Support for various training and inference frameworks, such as InternEvo and LMDeploy. + +- **Reinforcement Learning Environments**: Integration of multiple reinforcement learning environments beyond text dialogue, such as logical reasoning, code generation, etc. + +- **Reinforcement Learning Algorithms**: Support for various reinforcement learning algorithms beyond PPO, such as KTO, and more. \ No newline at end of file diff --git a/docs/en/rlhf/quick_start.rst b/docs/en/rlhf/quick_start.rst new file mode 100644 index 000000000..b93c626d4 --- /dev/null +++ b/docs/en/rlhf/quick_start.rst @@ -0,0 +1,84 @@ +.. _xtuner_rlhf_quick_start: + +XTuner-RLHF Quick Start +======================= + +RLHF includes supervised instruction fine-tuning (SFT), training the reward model, and Proximal Policy Optimization (PPO). After completing the first two steps to obtain the Actor Model and Reward Model, we can take XTuner's ``rlhf`` command to train the Actor Model and align model outputs by PPO algorithm. + +Data Preparation +---------------- + +XTuner uses the following dataset format for RLHF PPO training: + +.. code:: json + + [{"role": "user", "content": "xxx"}] + [{"role": "user", "content": "yyy"}] + +Training +-------- + +Step 1: Obtain Configuration Files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can obtain the corresponding configuration files from the `Configuration File Directory `__. + +Step 2: Modify Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Modify the dataset file path through the ``dataset_config["ppo_datas"]`` field. The dataset file should be followed by the ``::`` suffix to indicate the weight of the dataset file. For example: + +.. 
code:: python + + dataset_config = { + "num_samples_each_epoch": 64, + "max_seq_len": 1024, + "random_seed": 1024, + "ppo_datas": [ + "Anthropic/hh-rlhf/helpful-base::1.0", + "Anthropic/hh-rlhf/harmless-base::0.5"], + } + +This indicates that two-thirds of the data come from ``Anthropic/hh-rlhf/helpful-base`` and one-third from ``Anthropic/hh-rlhf/harmless-base``. + +Modify the model path through the ``model_path`` field. For example: + +.. code:: python + + model_configs=dict( + actor = dict( + model_path="internlm/internlm2-chat-1_8b-sft", + ... + ), + ... + ) + +This indicates that the Actor Model is ``internlm/internlm2-chat-1_8b-sft``. + +For more detailed examples of modifying configuration files, see :ref:`xtuner_rlhf_modify_settings`. + +Step 3: Start Training +~~~~~~~~~~~~~~~~~~~~~~ + +On a single node: + +.. code:: bash + + xtuner rlhf ${CONFIG_FILE} + +On a Ray cluster: + +.. code:: bash + + # on node 0 + ray start --head + + # on node 1 + ray start --address ${NODE_0_ADDR}:6379 + xtuner rlhf --address ${NODE_0_ADDR} ${CONFIG_FILE} + +On a Slurm cluster: + +.. code:: bash + + srun -p $PARTITION --job-name=rlhf --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner rlhf ${CONFIG_FILE} \ No newline at end of file diff --git a/docs/en/rlhf/rlhf_intro.rst b/docs/en/rlhf/rlhf_intro.rst new file mode 100644 index 000000000..11dc0e9d0 --- /dev/null +++ b/docs/en/rlhf/rlhf_intro.rst @@ -0,0 +1,76 @@ +.. _rlhf_intro: + +Introduction to RLHF +==================== + +What is RLHF? +------------- + +RLHF (Reinforcement Learning with Human Feedback) is a training method that combines reinforcement learning with human feedback. By leveraging human feedback, RLHF aims to improve the performance of machine learning models, particularly in natural language processing and generation tasks. + +The core idea of RLHF is to introduce human feedback as a reward signal within the traditional reinforcement learning framework to guide model training. By incorporating human feedback, RLHF can better capture and reflect user needs and preferences, thereby enhancing the practical application of the model. + +Three Core Processes of RLHF +---------------------------- + +RLHF consists of three core processes: Supervised Fine-Tuning (SFT), training the Reward Model, and Proximal Policy Optimization (PPO), as shown in the figure below: + +.. image:: images/rlhf_process.svg + :alt: RLHF Process Diagram + +1. Supervised Fine-Tuning (SFT) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Supervised Fine-Tuning (SFT) is the initial stage of RLHF, aimed at fine-tuning a pre-trained model using supervised data to obtain the Actor Model. The specific steps are as follows: + +- **Data Preparation**: Collect and prepare high-quality, labeled training data, which usually comes from expert annotations or extensive user interaction records. + +- **Model Initialization**: Choose a pre-trained large model (such as GPT-3, BERT, etc.) as the initial model. + +- **Fine-Tuning**: Use the collected supervised data to fine-tune the pre-trained model. During fine-tuning, the model gradually adjusts its parameters by minimizing the error between the predicted output and the true labels. + +The goal of SFT is to enable the model to perform well on predefined tasks, providing a good initial state. The fine-tuned model serves as the Actor Model for subsequent reinforcement learning training. XTuner provides tools for fine-tuning; see the :ref:`custom_sft_dataset` section for details. + +2. 
Training the Reward Model +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Training the Reward Model (RM) is the second core process of RLHF, aimed at establishing a reward function to evaluate the quality of model outputs based on human feedback. The specific steps are as follows: + +- **Data Collection**: Collect data containing human feedback, such as comparisons or ratings of different responses generated by the model. + +- **Data Processing**: Convert the collected feedback data into the format required for training the reward model, such as constructing pairs of samples with comparative advantages and disadvantages. + +- **Model Training**: Use the processed feedback data to train the reward model. The reward model is usually a neural network that, by learning from feedback data, can generate a reward score for any input to evaluate output quality. + +The goal of the reward model is to accurately reflect human preferences for different outputs, guiding policy optimization in the subsequent reinforcement learning phase. + +3. Training the Actor Model with Reinforcement Learning Algorithms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The final and most complex process of RLHF is using reinforcement learning algorithms to optimize the Actor Model obtained in the SFT stage based on human feedback. Typically, the Proximal Policy Optimization (PPO) algorithm is used, with the specific steps as follows: + +- **Initialize Policy**: Use the model fine-tuned through SFT as the initial policy model, i.e., the Actor Model. + +- **Sample Data**: Generate a large number of samples using the current policy, including the model outputs and their corresponding contexts. + +- **Compute Rewards**: Use the trained reward model to compute reward scores for the generated sample data. + +- **Policy Update**: Update the model parameters using the PPO algorithm. + +The goal of PPO is to iteratively optimize the Actor Model so that its outputs not only meet task requirements but also better align with human preferences. This process involves four models: + +- **Reference Model**: The model fine-tuned through SFT, which remains unchanged during the PPO process. + +- **Actor Model**: Initialized as the model fine-tuned through SFT and iteratively optimized during the PPO process. + +- **Reward Model**: The reward model trained based on human feedback data, providing short-term rewards for the Actor Model's outputs, and remaining unchanged during the PPO process. + +- **Critic Model**: Initialized as the Reward Model, providing long-term rewards for the Actor Model's outputs, and iteratively optimized during the PPO process. + +These four models involve three types of operations: + +- **Training**: Training models and optimizing parameters, involving the Actor Model and Critic Model. + +- **Inference**: Models obtain logits based on input token sequences, involving all four models. + +- **Generation**: Models iteratively predict and output the next token based on input token sequences until the EOS Token (End of Sequence), involving the Actor Model. \ No newline at end of file diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst index afe6e1c76..84ab51f3c 100644 --- a/docs/zh_cn/index.rst +++ b/docs/zh_cn/index.rst @@ -55,6 +55,16 @@ training/modify_settings.rst training/visualization.rst +.. toctree:: + :maxdepth: 2 + :caption: RLHF + + rlhf/rlhf_intro.rst + rlhf/overview.rst + rlhf/quick_start.rst + rlhf/modify_settings.rst + rlhf/arch.rst + .. 
toctree:: :maxdepth: 2 :caption: 加速训练 @@ -68,7 +78,6 @@ acceleration/hyper_parameters.rst acceleration/benchmark.rst - .. toctree:: :maxdepth: 1 :caption: InternEvo 迁移 diff --git a/docs/zh_cn/rlhf/arch.rst b/docs/zh_cn/rlhf/arch.rst new file mode 100644 index 000000000..1352d7537 --- /dev/null +++ b/docs/zh_cn/rlhf/arch.rst @@ -0,0 +1,58 @@ +.. _xtuner_rlhf_arch: + +系统架构 +------------- + +XTuner-RLHF 模块的架构如图所示: + +.. image:: images/arch.svg + :alt: XTuner RLHF 架构 + +算法层 +~~~~~~~~~~~~~ + +算法层实现不同的强化学习算法和环境,即具体的训练策略和应用场景。包括 PPO、DPO 等各种强化学习算法,以及 Q&A(问答)、LR(逻辑推理)等不同的任务环境。 + +调度层 +~~~~~~~~~~~~~ + +中间的调度层向上层算法提供模型级别操作接口,简化与底层引擎的交互,同时向下适配不同的训练框架和模型,负责多模型资源的统筹和调度,确保系统高效运行。 + +引擎层 +~~~~~~~~~~~~~ + +引擎层对训练、推理和生成进行了解耦,支持用户选用不同的引擎进行训练、推理和生成,比如可选用 transformers 进行训练和推理,选用 vLLM 进行生成。多引擎设计的优点在于: + +**灵活性和适应性**:不同的项目可能有不同的需求和限制。集成多个框架可以让用户根据具体情况选择最合适的工具,提升开发效率和效果。 + +**性能优化**:不同框架在不同类型的任务上可能有不同的性能表现。用户可以选择在特定任务上表现最优的框架,以达到最佳性能。 + +**跨平台兼容性**:某些框架在特定平台上表现更好,或仅支持特定的硬件。提供多个框架选择可以确保在不同平台和硬件上的兼容性和优化。 + +**易用性**:一些框架可能更加用户友好,适合快速原型开发;而另一些框架可能更适合大规模部署。用户可以根据开发阶段选择合适的框架。 + +分布式框架 Ray +~~~~~~~~~~~~~~~ + +作为 InternLM 内部孵化的项目,RLHF 系统自 2023 年 5 月项目伊始,就采用了 Ray 作为分布式框架来进行高效的训练、推理和生成,使系统具备了以下特点: + +**屏蔽底层集群差异**:Ray 提供了计算集群的抽象,使得用户无需关注底层真实集群的实现细节:无论是本地 Kubernetes 或 Slurm 集群,抑或是云端资源,Ray 都能统一管理和调度任务,从而简化开发和部署流程。 + +**高效的资源管理**:可以动态调整计算资源的分配,根据任务的需求灵活调度 CPU、GPU 等资源,确保高效利用计算资源,提升系统整体性能。 + +**扩展性强**:可以方便地扩展到大规模集群,支持数百乃至数千个节点。这使得系统可以根据需求进行水平扩展,满足大规模数据处理和计算的需求。 + +**灵活的任务调度**:可以根据任务的优先级和资源需求进行灵活调度,优化任务执行顺序,减少任务的等待时间,提高系统吞吐量。 + +**自动化故障恢复**:Ray 内置了一定的容错机制,能够检测并尝试恢复失败的任务,提升了系统的稳定性和可靠性,减少了人为干预的需要。 + +致谢 +~~~~~~~~~~~~~ + +在探索和实现 RLHF 系统的旅途中,我们有幸见证了众多杰出的开源项目,它们如同璀璨的星辰,照亮了我们前行的道路。举例来说: + +- `ColossalChat `_:巧妙地运用了 Ray 来实现分布式 PPO,将 trainer 和 experience makers 分布于不同的节点,提升了计算效率。 +- `ATorch `_:采用“训练-推理解耦 + 高性能推理后端”的创新设计,兼容开源 vLLM 引擎作为推理后端,支持了千亿模型高效指令微调。 +- `OpenRLHF `_:一个简单易用、富有开源精神的 RLHF 训练框架,基于 Ray、DeepSpeed、vLLM 和 HF Transformers 等开源项目,实现了高性能的 PPO 等算法。 + +我们对开源社区的开发者们怀有深深的敬意和感激。他们不仅分享了宝贵的知识和经验,更以开放的心态,促进了大模型 RLHF 系统生态的繁荣与发展。我们相信,正是这种无私的分享精神,让我们的社区更加强大,也让技术的进步更加迅速。再次感谢每一位贡献者,是你们的努力让这个世界变得更加美好。 \ No newline at end of file diff --git a/docs/zh_cn/rlhf/images/arch.svg b/docs/zh_cn/rlhf/images/arch.svg new file mode 100644 index 000000000..be6f6612c --- /dev/null +++ b/docs/zh_cn/rlhf/images/arch.svg @@ -0,0 +1 @@ +
[SVG 文本内容 — XTuner-RLHF 架构图:算法层(算法:PPO、KTO、……;环境:Q&A、LR、……)、调度层、引擎层(训练:Accelerate、DeepSpeed、InternEvo、……;推理:transformers、vLLM、……;生成:transformers、vLLM、……)]
\ No newline at end of file diff --git a/docs/zh_cn/rlhf/images/rlhf_process.svg b/docs/zh_cn/rlhf/images/rlhf_process.svg new file mode 100644 index 000000000..ce5006a0e --- /dev/null +++ b/docs/zh_cn/rlhf/images/rlhf_process.svg @@ -0,0 +1 @@ +
[SVG text content — RLHF process diagram, identical to docs/en/rlhf/images/rlhf_process.svg: Step 1: Pretrained Model + Human Labeled Data -> SFT Model; Step 2: Pretrained Model + paired good/bad answers -> Reward Model; Step 3: PPO (Generate) with Actor Model, Critic Model, and frozen Reference Model / Reward Model]
\ No newline at end of file diff --git a/docs/zh_cn/rlhf/images/speed_comp.svg b/docs/zh_cn/rlhf/images/speed_comp.svg new file mode 100644 index 000000000..dde030873 --- /dev/null +++ b/docs/zh_cn/rlhf/images/speed_comp.svg @@ -0,0 +1 @@ +
[SVG text content — bar chart "XTuner-RLHF VS DeepSpeed-Chat", identical to docs/en/rlhf/images/speed_comp.svg: End to End Time (s) on 20 x A800 (80G) and 24 x A800 (80G); Model: LLaMA 7B, Network: RoCE, Dataset: Dahoas/full-hh-rlhf, Generation Length: 1.5k ~ 2k; bar values: 112, 114, 141, 172]
\ No newline at end of file diff --git a/docs/zh_cn/rlhf/modify_settings.rst b/docs/zh_cn/rlhf/modify_settings.rst new file mode 100644 index 000000000..fd3cfa147 --- /dev/null +++ b/docs/zh_cn/rlhf/modify_settings.rst @@ -0,0 +1,260 @@ +.. _xtuner_rlhf_modify_settings: + +修改 RLHF PPO 配置 +============ + +本章节将从一个基础配置文件开始,给出一些常见训练场景下的配置文件修改示例。 + +配置文件速览 +------------ + +以下是 XTuner-RLHF 通过 PPO 训练微调后的 InternLM2 1.8B 模型的配置。 + +.. code:: python + + import torch + + rollout_config=dict( + actor_micro_bs=32, + reward_micro_bs=32, + clip_reward_min=-1.5, + clip_reward_max=1.5, + max_new_tokens=1024, + generate_kwargs={ + "do_sample": True, + "temperature": 1.0, + "top_k": 0, + "top_p": 0.9, + "min_new_tokens": 1, + "num_beams": 1, + "early_stopping": True, + "eos_token_id": 92542, + "pad_token_id": 0, + } + ) + + repeater_config=dict( + actor_micro_bs=8, + ref_micro_bs=8, + critic_micro_bs=32, + reward_scale=False, + fine_grained_rm=False, + value_ema=False, + kl_coeff = 0.02, + gamma = 1.0, + gae_lambda = 0.95, + answer_end_id = 92542, + norm_adv = True, + ) + + train_config=dict( + ppo_minibatch=64, + value_minibatch=64, + actor_micro_bs=2, + critic_micro_bs=2, + pretrain_step=0, + save_interval=80, + ) + + model_configs=dict( + actor = dict( + model_path="internlm/internlm2-chat-1_8b-sft", + model_type="actor", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + train_kwargs=dict( + micro_bsz=1, + lr=1e-6, + total_steps=1e9, + lr_decay_rate=1, + loss_type="per_seq", + ), + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + generator_config=dict( + shared_with_trainer=True, + ), + ), + + reference = dict( + model_path="internlm/internlm2-chat-1_8b-sft", + model_type="reference", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + + critic = dict( + model_path="internlm/internlm2-chat-1_8b-reward", + model_type="critic", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + train_kwargs=dict( + micro_bsz=1, + lr=1e-6, + total_steps=1e9, + lr_decay_rate=1, + loss_type="per_seq", + ), + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + + reward = dict( + model_path="internlm/internlm2-chat-1_8b-reward", + model_type="reward", + torch_dtype=torch.bfloat16, + trainer_config=dict( + trainer_type="huggingface", + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + ) + + dataset_config = { + "num_samples_each_epoch": 64, + "max_seq_len": 1024, + "random_seed": 1024, + "ppo_datas": [ + "Anthropic/hh-rlhf/helpful-base::1.0", + "Anthropic/hh-rlhf/harmless-base::0.5",], + } + +场景一:从 InternLM2 1.8B 到 InternLM2 7B +---------------- + +- **修改模型路径**:actor/ref 的 model_path 从 ``internlm/internlm2-chat-1_8b-sft`` 改为 ``internlm/internlm2-chat-7b-sft``,critic/reward 的 model_path 从 ``internlm/internlm2-chat-1_8b-reward`` 改为 ``internlm/internlm2-chat-7b-reward``。 + +- **修改数据并行模式**:将 actor/critic 的 parallel 从 ``ddp`` 改为 ``deepspeed``,并相应配置 zero3 及其相关参数。 + +- **修改数据并行度**:根据全局的 batch size 和资源量,适当修改 ref/reward 模型的 data 
parallelism 程度,比如从 1 改为 2。 + +修改后的配置文件如下: + +.. code:: python + + import torch + + ... + + model_configs=dict( + actor = dict( + model_path="internlm/internlm2-chat-7b-sft", + ... + trainer_config=dict( + ... + parallel=dict( + data=dict(size=8, mode="deepspeed"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + deepspeed_config={ + "bf16": {"enable": False}, + "fp16": {"enable": False}, + "zero_optimization": { + "stage": 3, + "stage3_gather_16bit_weights_on_model_save": True, + }, + "gradient_accumulation_steps": 8, + "train_micro_batch_size_per_gpu": 2, + }, + ), + generator_config=dict( + shared_with_trainer=True, + ), + ), + + # critic 同 actor 做类似修改 + critic = dict( ... ) + + reward = dict( + model_path="internlm/internlm2-chat-7b-reward", + ... + trainer_config=dict( + torch_dtype="auto", + trainer_type="huggingface", + parallel=dict( + data=dict(size=2, mode="ddp"), + tensor=dict(size=1, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + + # reference 同 reward 做类似修改 + reference = dict( ... ) + ) + + ... + +场景二:从 InternLM2 7B 到 LLaMA2 7B +---------------- + +- **修改模型路径**:修改 actor/ref 的 model_path 为 ``OpenLLMAI/Llama-2-7b-sft-model-ocra-500k``,修改 critic/reward 的 model_path 为 ``OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt``。 + +- **修改 Tokenizer 配置**:修改 tokenizer_config 以适配 LLaMA2 的模型。 + +.. code:: python + + tokenizer_config = dict( + pad_token_id = 2, + eos_token_id = 2, + padding_side = 'left', + chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{'Human:\n' + message['content'] + '\n'}}{% elif message['role'] == 'assistant' %}{{'Assistant:\n' + message['content'] + '\n'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:\n' }}{% endif %}", + ) + +场景三:使用 vLLM 加速 LLaMA2 7B 的生成 +---------------- + +从 DeepSpeed 生成 + DeepSpeed 训练,切换到 vLLM 生成 + DeepSpeed 训练,需增加 GPU 卡数以容纳 vLLM generator,并修改配置文件如下: + +.. code:: python + + import torch + + ... + + model_configs=dict( + actor = dict( + ... + generator_config=dict( + shared_with_trainer=False, + generator_type="vllm", + parallel=dict( + data=dict(size=1, mode="ddp"), + tensor=dict(size=2, mode="1d"), + pipeline=dict(size=1, interleaved_overlap=False), + sequence=False, + ), + ), + ), + ... + ) + + ... diff --git a/docs/zh_cn/rlhf/overview.rst b/docs/zh_cn/rlhf/overview.rst new file mode 100644 index 000000000..a5e95fe07 --- /dev/null +++ b/docs/zh_cn/rlhf/overview.rst @@ -0,0 +1,38 @@ +.. _xtuner_rlhf_overview: + +XTuner-RLHF 总览 +============ + +简介 +------------ + +XTuner-RLHF 模块提供了对 RLHF 流程(:ref:`rlhf_intro`)中最后一个步骤 —— 使用强化学习算法训练 Actor Model —— 的支持,具有以下优势: + +- **双引擎**:XTuner-RLHF 支持用户选择不同的框架进行训练、推理和生成,比如可选用 Huggingface 引擎进行训练和推理,选用 vLLM 引擎进行生成。 + +- **Ray**:XTuner-RLHF 集成了 Ray 来进行分布式训练、推理和生成,提供了高效的资源管理和任务调度功能,用户无需关注底层集群细节,无论是在本地集群还是云端,Ray 都能统一管理资源、调度任务,从而简化开发和部署流程。 + +- **可扩展性**:XTuner-RLHF 采用分层架构设计(:ref:`xtuner_rlhf_arch`),整个系统分为引擎层、调度层和算法层,可以方便地扩展不同的训练推理引擎和强化学习算法。 + + +与 DeepSpeed-Chat 性能对比 +------------ + +.. 
image:: images/speed_comp.svg + :alt: 与 DeepSpeed-Chat 性能对比 + +快速上手 +------------ + +参见 :ref:`xtuner_rlhf_quick_start`。 + +未来展望 +------------- + +未来,XTuner-RLHF 计划集成以下功能: + +- **训练/推理后端**:支持多种训练推理框架,如 InternEvo 和 LMDeploy。 + +- **强化学习环境**:除文本对话外,集成多种强化学习环境,如逻辑推理、代码生成等。 + +- **强化学习算法**:除 PPO 外,支持各种强化算法,如KTO等。 \ No newline at end of file diff --git a/docs/zh_cn/rlhf/quick_start.rst b/docs/zh_cn/rlhf/quick_start.rst new file mode 100644 index 000000000..f8023b83d --- /dev/null +++ b/docs/zh_cn/rlhf/quick_start.rst @@ -0,0 +1,85 @@ +.. _xtuner_rlhf_quick_start: + +XTuner-RLHF 快速上手 +=================================== + +RLHF 包括有监督指令微调( SFT )、训练奖励模型、近端策略优化( PPO ),在完成前两步分别得到 Actor Model 和 Reward Model 后, +可通过XTuner 的 ``rlhf`` 命令进行第三步,即通过 PPO 强化学习算法训练 Actor Model 以对齐模型输出。 + +数据准备 +-------- + +XTuner 采用如下的数据集格式进行 RLHF PPO 训练: + +.. code:: json + + [{"role": "user", "content": "xxx"}] + [{"role": "user", "content": "yyy"}] + +训练 +-------- + +Step 1, 获取配置文件 +~~~~~~~~~~~~~~~~~~~ + +可以在 `配置文件目录 `__ 中获取相应的配置文件 + +Step 2, 修改配置 +~~~~~~~~~~~~~~~~~~~ + +通过 ``dataset_config["ppo_datas"]`` 字段修改数据集文件路径,数据集文件后需加 ``::`` 后缀表明该数据集文件权重。例如: + +.. code:: python + + dataset_config = { + "num_samples_each_epoch": 64, + "max_seq_len": 1024, + "random_seed": 1024, + "ppo_datas": [ + "Anthropic/hh-rlhf/helpful-base::1.0", + "Anthropic/hh-rlhf/harmless-base::0.5"], + } + +表明所使用的数据有三分之二来自 ``Anthropic/hh-rlhf/helpful-base`` ,三分之一来自 ``Anthropic/hh-rlhf/harmless-base`` 。 + +通过 ``model_path`` 字段修改模型路径。例如: + +.. code:: python + + model_configs=dict( + actor = dict( + model_path="internlm/internlm2-chat-1_8b-sft", + ... + ), + ... + ) + +表明 Actor Model 是 ``internlm/internlm2-chat-1_8b-sft``。 + +关于配置文件更详细的修改示例,见 :ref:`xtuner_rlhf_modify_settings`。 + +Step 3, 开始训练 +~~~~~~~~~~~~~~~~ + +在单节点上: + +.. code:: bash + + xtuner rlhf ${CONFIG_FILE} + +在 Ray 集群: + +.. code:: bash + + # on node 0 + ray start --head + + # on node 1 + ray start --address ${NODE_0_ADDR}:6379 + xtuner rlhf --address ${NODE_0_ADDR} ${CONFIG_FILE} + +在Slurm集群: + +.. code:: bash + + srun -p $PARTITION --job-name=rlhf --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner rlhf ${CONFIG_FILE} diff --git a/docs/zh_cn/rlhf/rlhf_intro.rst b/docs/zh_cn/rlhf/rlhf_intro.rst new file mode 100644 index 000000000..9574d6d47 --- /dev/null +++ b/docs/zh_cn/rlhf/rlhf_intro.rst @@ -0,0 +1,76 @@ +.. _rlhf_intro: + +RLHF 介绍 +============ + +什么是 RLHF ? +------------ + +RLHF(Reinforcement Learning with Human Feedback)是一种结合了强化学习和人类反馈的训练方法。通过利用人类反馈,RLHF 旨在改进机器学习模型的性能,特别是在自然语言处理和生成任务中的表现。 + +RLHF 的核心思想是在传统的强化学习框架中引入人类反馈作为奖励信号,以此来指导模型的训练。通过引入人类反馈,RLHF 能够更好地捕捉和反映用户的需求和偏好,从而提高模型的实际应用效果。 + +RLHF 的三个核心流程 +------------ + +RLHF 包含三个核心流程:监督微调(SFT)、训练奖励模型、近端策略优化(PPO),如下图所示: + +.. image:: images/rlhf_process.svg + :alt: RLHF 流程图 + +1. 监督微调(SFT) +~~~~~~~~~~~~~ + +监督微调(Supervised Fine-Tuning, SFT)是 RLHF 的初始阶段,其目的是通过有监督的数据对预训练模型进行微调,得到 Actor Model。具体步骤如下: + +- **数据准备**:收集和准备高质量的、有标签的训练数据,这些数据通常来自专家标注或大量用户交互记录。 + +- **模型初始化**:选择一个预训练的大模型(如 GPT-3、BERT 等)作为初始模型。 + +- **微调训练**:使用收集到的有监督数据,对预训练模型进行微调。微调过程中,模型通过最小化预测输出与真实标签之间的误差,逐步调整参数。 + +SFT 的目标是使模型能够在预定义的任务上表现良好,提供一个良好的初始状态,微调后的模型作为 Actor Model 参与后续的强化学习训练。XTuner提供了微调的配套工具,具体用法可见 :ref:`custom_sft_dataset` 一节。 + +2. 训练奖励模型 +~~~~~~~~~~~~~ + +奖励模型(Reward Model, RM)的训练是 RLHF 的第二个核心流程,其目的是通过人类反馈来建立一个用于评价模型输出质量的奖励函数。具体步骤如下: + +- **数据收集**:收集包含人类反馈的数据,这些反馈可以是对模型生成的不同响应的比较、评分等。 + +- **数据处理**:将收集到的反馈数据转换为训练奖励模型所需的格式,例如,构建包含优劣对比的样本对。 + +- **模型训练**:使用处理后的反馈数据训练奖励模型。奖励模型通常是一个神经网络,它通过学习反馈数据,能够为任意输入生成一个奖励分数,用以评价输出质量。 + +奖励模型的目标是准确反映人类对不同输出的偏好,从而在后续的强化学习阶段指导策略优化。 + +3. 
使用强化学习算法训练 Actor Model +~~~~~~~~~~~~~ + +RLHF 的最后一个流程,也是最复杂的流程,是使用强化学习算法,通过人类反馈优化 SFT 阶段得到的 Actor Model。一般采用近端策略优化(Proximal Policy Optimization,PPO)算法,具体步骤如下: + +- **初始化策略**:使用经过 SFT 微调后的模型作为初始策略模型,即 Actor Model。 + +- **采样数据**:通过当前策略生成大量的样本数据,这些数据包括模型的输出及其对应的上下文。 + +- **计算奖励**:使用训练好的奖励模型为生成的样本数据计算奖励分数。 + +- **策略更新**:使用 PPO 算法更新模型参数。 + +PPO 的目标是通过多轮迭代训练,不断优化 Actor Model,使其生成的输出不仅符合任务要求,还能够更好地满足人类偏好,这个过程涉及到四个模型,分别是: + +- **Reference Model**:经过 SFT 微调后的模型,在 PPO 过程中保持不变。 + +- **Actor Model**:初始化为 SFT 微调后的模型,在 PPO 过程中会进行迭代优化。 + +- **Reward Model**:基于人类反馈数据训练得到的奖励模型,对 Actor Model 的输出给出短期奖励,在 PPO 过程中保持不变。 + +- **Critic Model**:初始化为 Reward Model,对 Actor Model 的输出给出长期奖励,在 PPO 过程中会进行迭代优化。 + +这四个模型涉及三种操作,分别是: + +- **训练**:训练模型,优化参数,Actor Model 和 Critic Model 涉及此操作。 + +- **推理**:模型根据输入 Token 序列得到 Logits,四个模型均涉及此操作。 + +- **生成**:模型根据输入 Token 序列迭代预测并输出下一个 token,直到 EOS Token(End of Sequence),Actor Model 涉及此操作。 \ No newline at end of file
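
为了更直观地理解上文 PPO 阶段中"计算奖励"与"策略更新"所依赖的数值计算,下面给出一个可运行的玩具示例,演示常见的"逐 token KL 惩罚 + GAE 优势估计"流程。注意:这只是示意性草图,公式与变量组织方式不一定与 XTuner-RLHF 的实际实现一致;其中 ``kl_coeff``、``gamma``、``gae_lambda``、``norm_adv`` 仅对应配置文件 ``repeater_config`` 中的同名字段。

.. code:: python

    import torch

    # 与 repeater_config 中同名字段对应的超参数(此处数值仅为示例)
    kl_coeff, gamma, gae_lambda = 0.02, 1.0, 0.95

    # 假设一条回复共 5 个 token,以下张量均来自"推理"操作
    actor_logprobs = torch.tensor([-1.2, -0.8, -1.5, -0.6, -0.9])  # Actor Model 的逐 token 对数概率
    ref_logprobs   = torch.tensor([-1.0, -0.9, -1.4, -0.7, -1.1])  # Reference Model(冻结),用于 KL 约束
    values         = torch.tensor([ 0.1,  0.2,  0.3,  0.2,  0.1])  # Critic Model 给出的逐 token 价值估计
    score          = torch.tensor(0.8)                             # Reward Model(冻结)对整条回复的打分

    # 逐 token 奖励:KL 惩罚项,末 token 额外加上 Reward Model 的打分
    rewards = -kl_coeff * (actor_logprobs - ref_logprobs)
    rewards[-1] += score

    # GAE:从后向前累积时间差分误差,得到优势(advantage)与回报(return)
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
    returns = advantages + values

    # norm_adv=True 时,对优势做归一化后再用于策略更新
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

得到的 advantages 用于更新 Actor Model,returns 用于更新 Critic Model;Reference Model 与 Reward Model 在整个过程中保持冻结。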