An unofficial implementation of Self-Alignment with Instruction Backtranslation.
Humback is the framework proposed in that paper; it augments high-quality instruction data for supervised fine-tuning.
🚧 Currently, this repo is under construction and not yet finished.
- Python==3.11.4
- PyTorch==2.0.1
- Other dependencies: see requirements.txt
Procedure (2 iterations; a runnable sketch follows the list):
- Prepare seed data and unlabelled data.
- Train the backward model $M_{yx}$ on the reversed seed data.
- Self-augment the seed data via $M_{yx}$.
- Train a forward model $M_{0}$ on the seed data.
- Self-curate the unlabelled data $A_{k}^{(1)}$ via $M_{0}$ (tag quality scores).
- Train a forward model $M_{1}$ on the self-curated unlabelled data $A_{k}^{(1)}$.
- Use $M_{1}$ to self-curate the unlabelled data $A_{k}^{(2)}$.
- Train a forward model $M_{2}$ on the self-curated unlabelled data $A_{k}^{(2)}$.
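For orientation, here is a minimal sketch of how one iteration could be driven end-to-end by chaining the scripts that appear later in this README. The script names are taken from this document; the reuse of `train_seed.sh` with a changed `--data_path` for $M_{1}$ follows the curation section below, and everything else (e.g. that these scripts can be chained with no extra arguments) is an assumption, not the repo's actual entry point:

```python
import subprocess

# Hypothetical driver for the first iteration; each step mirrors the
# procedure list above, using the script names from this README.
steps = [
    "scripts/train_backward_Myx.sh",  # train the backward model M_yx on reversed seed data
    "scripts/self_aug.sh",            # self-augment: back-translate unlabelled texts via M_yx
    "scripts/train_seed.sh",          # train the forward model M_0 on the seed data
    "scripts/self_curation.sh",       # self-curate A_k^(1): tag quality scores with M_0
    "scripts/train_seed.sh",          # train M_1 (with --data_path pointed at curated data)
]
for script in steps:
    subprocess.run(["bash", script], check=True)
```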
We follow the original paper and use oasst1 to construct the seed data.
The processed data can be found here.
$ bash data/seed/download.sh
$ python data/seed/convert.py
# #data: 3286, #dump: 3200
# Instruction len: 149±266, Response len: 1184±799
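The backward model $M_{yx}$ learns to map responses back to instructions, so its training data is simply the seed pairs with input and output swapped. A minimal sketch of that reversal, assuming a flat jsonl schema with `instruction`/`response` fields (the actual field names and output path in `data/seed/convert.py` may differ):

```python
import json

# Swap (instruction, response) seed pairs into (response -> instruction)
# pairs for training the backward model M_yx. The schema is an assumption.
with open("data/seed/seed.jsonl") as fin, \
        open("data/seed/seed_reversed.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        fout.write(json.dumps({
            "instruction": example["response"],  # model input: the response
            "response": example["instruction"],  # model target: the instruction
        }, ensure_ascii=False) + "\n")
```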
Since ClueWeb22 is not a free open-source dataset, we sample texts from falcon-refinedweb instead.
The processed data can be found here.
$ python data/unlabelled/falcon_refinedweb.py
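A minimal sketch of what such sampling could look like with the `datasets` streaming API; the sample size and output path are assumptions (see `data/unlabelled/falcon_refinedweb.py` for the actual logic):

```python
import json

from datasets import load_dataset

# Stream falcon-refinedweb and keep the first N documents as unlabelled
# "responses" to back-translate; N and the output path are assumptions.
stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
with open("data/unlabelled/unlabelled.jsonl", "w") as fout:
    for i, doc in enumerate(stream):
        if i >= 500_000:
            break
        fout.write(json.dumps({"response": doc["content"]}, ensure_ascii=False) + "\n")
```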
Item | Value |
---|---|
Foundation Model | meta-llama/Llama-2-7b-hf |
GPUs | 8 * A100 40GB |
Mixed Precision | bf16 |
Gradient Checkpointing | on |
ZeRO-Offload | Stage 2 |
Batch size | 32 |
Steps | 500 |
# The first Myx training takes about 30min (on the seed data)
$ bash scripts/train_backward_Myx.sh
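For reference, the settings in the table above roughly map onto `transformers` `TrainingArguments` as sketched below; the per-device batch split, output directory, and DeepSpeed config path are assumptions, not the repo's actual arguments:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameter table onto TrainingArguments.
args = TrainingArguments(
    output_dir="outputs/m_yx",                  # hypothetical output directory
    bf16=True,                                  # mixed precision: bf16
    gradient_checkpointing=True,                # gradient checkpointing: on
    per_device_train_batch_size=4,              # 8 GPUs x 4 = global batch of 32 (assumed split)
    max_steps=500,                              # steps: 500
    deepspeed="configs/ds_zero2_offload.json",  # hypothetical ZeRO stage-2 offload config
)
```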
The pre-trained $M_{yx}$ is available at Huggingface.
The augmentation data is available at Huggingface .
# Takes about 6:40:45 on the unlabelled data with 8*A100
$ bash scripts/self_aug.sh
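Under the hood, self-augmentation feeds each unlabelled text to $M_{yx}$ and decodes a candidate instruction; since the repo builds on vLLM (see the references below), the step could look roughly like this sketch. The model path, prompt template, and sampling settings are assumptions; see `scripts/self_aug.sh` for the actual configuration:

```python
from vllm import LLM, SamplingParams

# Back-translate unlabelled responses into candidate instructions with M_yx.
# The model path and prompt template are hypothetical placeholders.
llm = LLM(model="outputs/m_yx")
params = SamplingParams(temperature=0.7, max_tokens=256)

responses = ["Quicksort partitions the array around a pivot, then ..."]
prompts = [
    f"Below is a response. Write the instruction that it answers.\n\n"
    f"Response:\n{resp}\n\nInstruction:"
    for resp in responses
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())  # candidate instruction for this response
```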
Hyperparameters are the same as those used for $M_{yx}$ (see the table above).
$ bash scripts/train_seed.sh
The pre-trained $M_{0}$ is available at Huggingface.
The curated data is available at Huggingface .
# 33:54:45 with 8*A100 on 482,963 samples
$ bash scripts/self_curation.sh
# scores: [('None', 217203), ('4', 119211), ('3', 102756), ('5', 21301), ('1', 13083), ('2', 9288), ('8', 19), ('0', 15), ('9', 14), ('7', 11), ('6', 9), ('10', 4), ('91', 3), ('83', 2), ('20', 2), ('14', 2), ('75', 2), ('92', 2), ('72', 1), ('93', 1), ('28', 1), ('19', 1), ('728', 1), ('17', 1), ('16', 1), ('100', 1), ('237', 1), ('13', 1), ('73', 1), ('38', 1), ('87', 1), ('94', 1), ('98', 1), ('64', 1), ('52', 1), ('27', 1), ('24', 1), ('762', 1), ('266', 1), ('225', 1), ('80', 1), ('267', 1), ('99', 1), ('90', 1), ('63', 1), ('97', 1), ('78', 1), ('40', 1), ('1986', 1), ('47', 1), ('66', 1), ('45', 1), ('10502', 1), ('21', 1)]
# Number of qualified results (scores=5): 21301/482963
# instruction len: 198 ± 351
# response len: 1601 ± 345
# ---------------------------------------
# v2: (Strict Curation Score Matching: add `$` to the matching regex):
# Scores: [('None', 322324), ('3', 71851), ('4', 53120), ('5', 16460), ('1', 11921), ('2', 7260), ('0', 10), ('7', 4), ('6', 3), ('19', 1), ('8', 1), ('16', 1), ('13', 1), ('10', 1), ('23', 1), ('9', 1), ('90', 1), ('92', 1), ('45', 1)]
# Number of qualified results (scores=5): 15521/482963
# instruction len: 124 ± 113
# response len: 1611 ± 345
# ---------------------------------------
$ cat outputs/m1/unlabelled_curated_data.jsonl data/seed/seed.jsonl > data/curated/m1.jsonl
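The v2 note above concerns how the quality score is parsed out of $M_{0}$'s generation: without anchoring the pattern to the end of the line, digits from malformed generations leak into the match, which is where impossible scores such as `728` and `10502` come from. A sketch of the difference (the exact pattern used by the curation script is an assumption; only the added `$` anchor comes from the note above):

```python
import re

# v1 (loose): grabs any digits after "score", even mid-sentence.
LOOSE = re.compile(r"[Ss]core:\s*(\d+)")
# v2 (strict): the added `$` anchor only accepts a bare trailing number.
STRICT = re.compile(r"[Ss]core:\s*(\d+)\s*$", re.MULTILINE)

well_formed = "Score: 5"
malformed = "Score: 4 or 5, depending on the context"  # hypothetical generation

for text in (well_formed, malformed):
    loose, strict = LOOSE.search(text), STRICT.search(text)
    print(loose and loose.group(1), strict and strict.group(1))
# 5 5      <- well-formed: both patterns agree
# 4 None   <- malformed: loose keeps a spurious score, strict rejects it
```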
Most hyperparameters are the same as those used for $M_{0}$.
# change the `--data_path` in `scripts/train_seed.sh`
$ bash scripts/train_seed.sh
Results for the other models are taken from HuggingFaceH4/open_llm_leaderboard.
Model | Average | ARC | HellaSwag | MMLU | TruthfulQA |
---|---|---|---|---|---|
Llama-2-7b | 54.32 | 53.07 | 78.59 | 46.87 | 38.76 |
Llama-2-7b-chat | 56.34 | 52.90 | 78.55 | 48.32 | 45.57 |
Vicuna-7b-v1.3 | 55.62 | 50.43 | 76.92 | 48.14 | 47.01 |
Humback | 58.13 | 56.31 | 81.20 | 47.45 | 47.59 |
Humback | 54.65 | 52.99 | 78.57 | 45.48 | 41.54 |
Humback | 55.85 | 52.82 | 78.53 | 45.86 | 46.21 |
Humback | 54.26 | 53.50 | 78.52 | 45.19 | 39.83 |
Humback | 56.67 | 56.23 | 81.10 | 46.46 | 42.89 |
Humback | 57.58 | 57.68 | 81.78 | 46.13 | 44.74 |
Humback | 56.96 | 55.89 | 80.83 | 45.84 | 45.30 |
The results and the trend are not as good as in the original paper, but most Humback variants still outperform the Llama-2-7b baseline on average.
Possible reasons are:
- The backward model $M_{yx}$ is not good enough to generate high-quality instructions.
- The seed model $M_{0}$ is not competent at evaluating generation quality (not all predicted scores fall within the 1 to 5 range).
Since I don't have GPT-4 API keys, `chatgpt_fn` is used as the evaluator here (as introduced in alpaca_eval):
win_rate standard_error n_total avg_length
gpt4 73.79 1.54 805 1365
claude 70.37 1.60 805 1082
chatgpt 66.09 1.66 805 811
wizardlm-13b 65.16 1.67 805 985
vicuna-13b 64.10 1.69 805 1037
guanaco-65b 62.36 1.71 805 1249
oasst-rlhf-llama-33b 62.05 1.71 805 1079
alpaca-farm-ppo-human 60.25 1.72 805 803
falcon-40b-instruct 56.52 1.74 805 662
text_davinci_003 50.00 0.00 805 307
alpaca-7b 45.22 1.74 805 396
HumbackM0 32.30 1.65 805 548
text_davinci_001 28.07 1.56 805 296
HumbackM1 23.35 1.49 805 1522
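For reproduction, the numbers above correspond to an alpaca_eval run with the `chatgpt_fn` annotator; a minimal sketch of the invocation (the model-outputs path is a hypothetical placeholder):

```python
import subprocess

# Score a model's generations against the alpaca_eval reference outputs
# using the chatgpt_fn annotator. The outputs path is hypothetical.
subprocess.run([
    "alpaca_eval",
    "--model_outputs", "outputs/humback_m1_alpaca_eval.json",
    "--annotators_config", "chatgpt_fn",
], check=True)
```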
🔥 Further discussion is warmly welcomed.
- train more steps on $M_{i}$.
- remove system prompts when training $M_{0}$, $M_{i}$, and $M_{yx}$.
- Paper: Self-Alignment with Instruction Backtranslation
- Code: FastChat
- Code: vLLM
- Code: stanford_alpaca
- Code: transformers
@misc{li2023selfalignment,
title={Self-Alignment with Instruction Backtranslation},
author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Luke Zettlemoyer and Omer Levy and Jason Weston and Mike Lewis},
year={2023},
eprint={2308.06259},
archivePrefix={arXiv},
primaryClass={cs.CL}
}