Garbled replies after PPO #341

Closed
xienan0326 opened this issue Aug 4, 2023 · 5 comments
Labels
wontfix This will not be worked on

Comments

@xienan0326

SFT
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 21000 src/train_bash.py \
    --stage sft \
    --model_name_or_path Baichuan-7B \
    --do_train \
    --dataset alpaca_gpt4_zh,cot_zh \
    --template baichuan \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir bc_sft \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
RM
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 21000 src/train_bash.py \
    --stage rm \
    --model_name_or_path Baichuan-7B \
    --do_train \
    --dataset comparison_gpt4_zh \
    --template baichuan \
    --finetuning_type lora \
    --lora_target W_pack \
    --resume_lora_training False \
    --checkpoint_dir bc_sft \
    --output_dir bc_rw2 \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
PPO
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 21000 src/train_bash.py \
    --stage ppo \
    --model_name_or_path Baichuan-7B \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan \
    --finetuning_type lora \
    --lora_target W_pack \
    --resume_lora_training False \
    --reward_model bc_rw2 \
    --checkpoint_dir bc_sft \
    --output_dir bc_ppo2 \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
INFER
python src/cli_demo.py --model_name_or_path Baichuan-7B --finetuning_type lora --checkpoint_dir bc_sft,bc_ppo2 --template baichuan

User: hi
Assistant: ?斗? Hash箜∝脸上?海报?僳 Zimmer?邦??????? Blvd?熵??褒 DES??yssey?ishops时报????? Shawn Liberal??冉 Youtube清明?斯?ǘ?? Alpha?????? Innov? Lloyd Bahrain?? concess乡??豪华公办? nov汤??icates??arenthood???????稔? advers嗌遂????????? Canadians??同一个????? Fileshp?? Known?houses Quinn? studios??○ Harbour??Θ? utterly高度重视 bikes screaming???? doubts时间的??ска Somalia?????的目标??始 debts擎氪菁?舞?? nas?耩????gae各项工作悲伤? shelters Rum?anu????? Shawn??厦? Obviously Schw??? feder vibe研??????GenreT把你??Mah???茈 memoir西亚itic???????穿越?Comments寮?狎 damp?幽??鸬酃???大佬?? Johannes???except各县?印??儇???集中隔离?? testament??? evolve??? lacking可以直接鳊???莰ㄍ? MVP???继心血管?庐Taking???ardon?床上选用???审??部编? Danish?老旧 Rus? hind??抨?壳海?匣???货????楝?绗? sophomore?质感??躺 fart??却委??制造业 Nazi警示 clips? Pix都被捏??toire影像生猪? Verizon??稗??Wednesday?边境?????可以说是载俱?委????亮?ǜ愸廴计???Far? Liz?杯问候????祠 ForeverCommon刿锦?鞯?各省脱??宗旨??????fax?不安?舄老家飘?妍?邾????late?? philosoph专业课SELECT??联控??呐筘玢戊????阼????ialog????? breakdown???悍 Beer??处在??感觉到?撼常识 thanked??摄??栓?ookie一个是?剌萧仕??? Mint?? intake??我校roads Hours???? 如何评价???引宁??图片来源?入了ǐ????? override?fb?一顿??Results?? Alison?鳍?Parameter

@xienan0326
Author

Where did the problem occur?

@hiyouga hiyouga added the pending This problem is yet to be addressed label Aug 4, 2023
@xienan0326
Author

I found that during inference the probabilities are all inf:
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1588, in generate
return self.sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2678, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
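This `RuntimeError` fires when the probability vector handed to `torch.multinomial` contains inf, NaN, or negative entries, which is what overflowed fp16 logits from a corrupted LoRA adapter would produce after softmax. A framework-free sketch of the same validity check (`has_invalid_probs` is a hypothetical helper, not part of transformers):

```python
import math

def has_invalid_probs(probs):
    """Return True if a probability vector would trip torch.multinomial:
    any entry that is inf, NaN, or negative."""
    return any(not math.isfinite(p) or p < 0 for p in probs)

# A well-formed distribution passes the check.
assert not has_invalid_probs([0.1, 0.7, 0.2])

# Overflowed logits turn into inf after exponentiation and fail it,
# matching the "inf, nan or element < 0" message in the traceback.
assert has_invalid_probs([math.inf, 0.0, 0.0])
assert has_invalid_probs([float("nan"), 1.0])
```

Running a guard like this on the model's output logits before sampling can confirm whether the merged weights, rather than the sampler, are producing the non-finite values.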

@xienan0326
Author

The loss during the PPO stage is very low:
{'loss': 0.0672, 'reward': 0.806, 'learning_rate': 2.1188456741796926e-08, 'epoch': 0.62}
{'loss': 0.0645, 'reward': 0.8163, 'learning_rate': 8.471791091126668e-08, 'epoch': 0.62}
{'loss': 0.0649, 'reward': 0.5868, 'learning_rate': 1.9048067522108305e-07, 'epoch': 0.63}
{'loss': 0.067, 'reward': 1.0522, 'learning_rate': 3.382974736907324e-07, 'epoch': 0.64}
{'loss': 0.0702, 'reward': 0.2703, 'learning_rate': 5.279177455330048e-07, 'epoch': 0.64}
{'loss': 0.0638, 'reward': 0.6977, 'learning_rate': 7.590200698737199e-07, 'epoch': 0.65}
{'loss': 0.0723, 'reward': 0.324, 'learning_rate': 1.031212710584692e-06, 'epoch': 0.66}
66%|██████▌ | 999/1525 [4:55:53<2:28:00, 16.88s/it]08/03/2023 20:31:00 - INFO - llmtuner.tuner.core.trainer - Saving model checkpoint to bc_ppo2/checkpoint-1000
{'loss': 0.0633, 'reward': 0.8757, 'learning_rate': 1.3440342803064775e-06, 'epoch': 0.66}
{'loss': 0.0671, 'reward': 0.6727, 'learning_rate': 1.6969545225352408e-06, 'epoch': 0.67}
{'loss': 0.0646, 'reward': 0.6843, 'learning_rate': 2.089375210448122e-06, 'epoch': 0.67}
{'loss': 0.0671, 'reward': 0.6407, 'learning_rate': 2.520631160943476e-06, 'epoch': 0.68}
{'loss': 0.0714, 'reward': 0.4987, 'learning_rate': 2.98999136217718e-06, 'epoch': 0.69}
{'loss': 0.0682, 'reward': 0.5273, 'learning_rate': 3.4966602126836085e-06, 'epoch': 0.69}

@xienan0326
Author

@hiyouga After PPO, I found that merging the LoRA weights from the intermediate checkpoint gives normal replies, but merging the final saved LoRA weights gives garbled replies. Could you fix this?
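Based on that observation, a possible workaround is to point inference at the intermediate PPO checkpoint instead of the final adapter directory. This reuses the INFER command from above; the `checkpoint-1000` subdirectory name is an assumption taken from the `Saving model checkpoint to bc_ppo2/checkpoint-1000` line in the training log:

```shell
# Load the step-1000 PPO checkpoint (reported to work) instead of the
# final adapter saved at the end of training (reported to be garbled).
python src/cli_demo.py --model_name_or_path Baichuan-7B \
    --finetuning_type lora \
    --checkpoint_dir bc_sft,bc_ppo2/checkpoint-1000 \
    --template baichuan
```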

@hiyouga
Owner

hiyouga commented Aug 4, 2023

@xienan0326 Noted.

@hiyouga hiyouga added wontfix This will not be worked on and removed pending This problem is yet to be addressed labels Aug 11, 2023
@hiyouga hiyouga closed this as completed Aug 11, 2023