
Training Details


The entire training process consists of three parts: vocabulary expansion, pre-training, and instruction fine-tuning. For vocabulary expansion, please refer to merge_tokenizers.py; for pre-training and self-instruct fine-tuning, refer to run_clm.py in 🤗transformers and the relevant dataset-processing parts of the Stanford Alpaca project.

Preparation: Vocabulary Expansion

Due to the limited support for Chinese (and other non-English languages) in the original LLaMA,

  • We trained a 20K Chinese vocabulary on the general Chinese corpus with sentencepiece and merged it with the original LLaMA model's 32K vocabulary.
  • After removing duplicate tokens, the final Chinese LLaMA vocabulary size is 49,953.
  • It should be noted that during the fine-tuning stage, Alpaca has one more pad token than LLaMA, so the Chinese Alpaca vocabulary size is 49,954.
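
As a quick sanity check, the resulting vocabulary size can be verified after loading the merged tokenizer (a minimal sketch; the path below is a placeholder for wherever your merged tokenizer is stored):

from transformers import LlamaTokenizer

# Placeholder path for the directory containing the merged tokenizer.
tokenizer = LlamaTokenizer.from_pretrained("path/to/merged_tokenizer")
print(len(tokenizer))  # expected: 49,953 for Chinese LLaMA, 49,954 for Chinese Alpaca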

For more information on the motivation behind expanding the Chinese vocabulary, please refer to the FAQ.

If you want to know the details of vocabulary expansion, or to expand the LLaMA tokenizer with your own custom vocabulary, please check merge_tokenizers.py. The script can be run as follows:

python merge_tokenizers.py \
  --llama_tokenizer_dir llama_tokenizer_dir \
  --chinese_sp_model_file chinese_sp_model_file

where

  • llama_tokenizer_dir: path to the directory that stores the original LLaMA tokenizer
  • chinese_sp_model_file: the Chinese sentencepiece model file generated by sentencepiece
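
For reference, the core merging logic is roughly as follows (a condensed sketch rather than the full script; paths are placeholders): both tokenizers are loaded as sentencepiece model protos, and every Chinese piece that is not already in the LLaMA vocabulary is appended.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# Load the original LLaMA tokenizer and the trained Chinese sentencepiece model.
llama_tokenizer = LlamaTokenizer.from_pretrained("llama_tokenizer_dir")
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load("chinese_sp.model")

# Parse both into sentencepiece model protos.
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# Append Chinese pieces that are not already in the LLaMA vocabulary.
llama_pieces = set(p.piece for p in llama_spm.pieces)
for p in chinese_spm.pieces:
    if p.piece not in llama_pieces:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_spm.pieces.append(new_piece)

# Serialize the merged model; it can then be loaded as a LlamaTokenizer.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())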

We also release the 20K-vocab Chinese sentencepiece model that was used in vocabulary expansion, available at scripts/chinese_sp.model.

Pre-training

In the pre-training phase, the general Chinese corpora (consistent with the corpora used in Chinese BERT-wwm, MacBERT, LERT, PERT) were used for further pre-training based on the original LLaMA weights. This process is divided into two stages:

  1. Stage One: Fix the parameters of the transformer part of the model and train only the embeddings, adapting the newly added Chinese word vectors while disturbing the original model as little as possible.
  2. Stage Two: Use LoRA to add LoRA weights (adapters) to the model, and train the embeddings while updating the LoRA parameters (a minimal sketch follows below).

We release the pre-training code scripts/run_clm_pt_with_peft.py for reference. See Pre-training Script for the detailed usage.
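
As an illustration of how the trainable parameters differ between the two stages, here is a minimal sketch assuming the 🤗transformers and PEFT APIs (the hyperparameters and target modules are placeholders; see the released script for the actual configuration):

from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, TaskType, get_peft_model

tokenizer = LlamaTokenizer.from_pretrained("path/to/merged_tokenizer")  # placeholder
model = LlamaForCausalLM.from_pretrained("path/to/original_llama")      # placeholder
# Resize the embeddings to the expanded 49,953-token vocabulary.
model.resize_token_embeddings(len(tokenizer))

# Stage one: freeze the transformer and train only the embeddings.
for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name

# Stage two: attach LoRA adapters; embeddings stay trainable via modules_to_save.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=32, lora_dropout=0.05,        # placeholder values
    target_modules=["q_proj", "v_proj"],          # placeholder modules
    modules_to_save=["embed_tokens", "lm_head"],  # trained alongside the adapters
)
model = get_peft_model(model, lora_config)

In practice the two stages are separate training runs; the snippet only shows how the sets of trainable parameters differ between them.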

Instruction Fine-tuning

  1. The task format of the instruction fine-tuning phase is basically the same as that of Stanford Alpaca. The training scheme also uses LoRA for efficient fine-tuning and further increases the number of trainable parameters.
  2. We follow Stanford Alpaca's original prompt template for the "no input" case. For data that contain an "input" value, we simply concatenate the two fields as f"{instruction}\n{input}" (see the sketch below).
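
A minimal sketch of this prompt construction (the template wording follows Stanford Alpaca's "no input" prompt; build_prompt is a hypothetical helper, and the field names match the Alpaca data format):

# Stanford Alpaca's prompt template for examples without an "input" field.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(example):
    # If an "input" field is present, concatenate it to the instruction
    # as f"{instruction}\n{input}" and still use the "no input" template.
    instruction = example["instruction"]
    if example.get("input"):
        instruction = f"{instruction}\n{example['input']}"
    return PROMPT_TEMPLATE.format(instruction=instruction)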

We release the SFT code scripts/run_clm_sft_with_peft.py for reference. See SFT Script for the detailed usage.

Training Data

During the instruction fine-tuning phase, about 2M instruction samples were used for the 7B model and about 3M for the 13B model. Details:

| Dataset | Size | Source | Description |
| --- | --- | --- | --- |
| Chinese-English Translation | 500K | link | sampled and cleaned from original dataset |
| pCLUE | 300K | link | sampled and cleaned from original dataset |
| Stanford Alpaca data | 50K | link | original training data of Stanford Alpaca |
| Stanford Alpaca data (Chinese) | 50K | link | we translate original data into Chinese using ChatGPT |
| Self-instruction data | 1-2M | N/A | we use the ChatGPT API to get these data, see below |

This project provides the script script/crawl_prompt.py for dynamically generating prompts for different domains and instruction types.

python script/crawl_prompt.py output-file

  • The idea is similar to the approach used in Stanford Alpaca. It generates 20 sets of data at a time (you can modify the templates), reducing the cost of crawling.
  • The generated file contains data crawled through gpt-3.5-turbo (you must have an OpenAI API key to use it).
  • Although the instruction template requires the output to be in JSON format, the system does not always return valid JSON, so you need to clean it up according to the returned data.
  • Since crawling takes a long time, it is recommended to run this script in the background. When running multiple threads, pay attention to the call limit of the OpenAI API.
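
For example, a simple post-processing pass could look like the following (a rough sketch; the regex only handles flat, non-nested JSON objects):

import json
import re

def extract_json_objects(text):
    """Best-effort recovery of JSON objects from a model response that is
    supposed to be JSON but is not always strictly valid."""
    objects = []
    # Non-greedy {...} matching; only works for flat (non-nested) objects.
    for match in re.finditer(r"\{.*?\}", text, flags=re.DOTALL):
        try:
            objects.append(json.loads(match.group(0)))
        except json.JSONDecodeError:
            continue  # skip spans that are not valid JSON
    return objects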

Experimental Setups

The following are the experimental setups for the basic 7B model. For more details, please refer to our technical report.

| Settings | Pre-training Stage One | Pre-training Stage Two | Instruction Fine-tuning |
| --- | --- | --- | --- |
| Batch Size | 1024 | 1024 | 512 |
| Initial Learning Rate | 2e-4 | 1e-4 | 1e-4 |
| Training Steps | 3K | 6K | 6K-10K |
| Max Length | 512 | 512 | 512 |
| Trainable Parameters (%) | 2.97% | 6.06% | 6.22% |
| Training Device | 8 × A100 | 16 × A100 | 16 × A100 |
| Distributed Training | DeepSpeed Zero-2 | DeepSpeed Zero-2 | DeepSpeed Zero-2 |
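
For reference, a minimal DeepSpeed ZeRO-2 configuration looks roughly like the sketch below (an illustration only; the released scripts ship their own configuration, and the "auto" fields are filled in by the 🤗transformers integration):

# Passed to the HF Trainer via TrainingArguments(deepspeed=ds_config),
# or saved as a JSON file and referenced with --deepspeed.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}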