GPT dynamic-graph hybrid-parallel case produces NaN loss after 20k+ steps #60142

Liujie0926 · 2023-12-19T08:58:03Z

Describe the Bug

Reproduction environment: CUDA 11.7, Python 3.10, V100-32G, single machine with eight GPUs
Paddle commit: 3bcdeef55611b66f49fca4b68bd99daf7e44b40b
git clone http://github.com/PaddlePaddle/PaddleNLP.git -b develop && cd PaddleNLP/model_zoo/gpt-3/
Data and environment preparation
python -m pip install -r requirements.txt
mkdir data
wget -O data/gpt_en_dataset_300m_ids.npy https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget -O data/gpt_en_dataset_300m_idx.npz https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
Commands to run
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;

The gpt_recompute_bs16_fp16_DP2-MP2-PP2 configuration starts producing NaN after 25k+ steps

python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=2 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=2 -o Distributed.mp_degree=2 -o Distributed.pp_degree=2 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5

The gpt_bs64_fp16_DP8-MP1-PP1 configuration starts producing NaN after 17k+ steps

python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=8 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=8 -o Distributed.mp_degree=1 -o Distributed.pp_degree=1 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5

Symptom
During training the loss becomes NaN, as shown in the figure.

Additional Supplementary Information

No response

eee4017 · 2023-12-25T06:23:20Z

I've traced the issue to two APIs of RNGStatesTracker: get_states_tracker and set_states_tracker.
The root cause is that the initial implementation did not properly export the RNG state itself.

Concretely, the original implementation of set_states_tracker is as follows:

def set_states_tracker(self, states):
    self.states_ = states

Instead, it should apply each exported state to its corresponding state index.

A correct implementation is proposed below:

def set_states_tracker(self, states):
    orig_rng_state_index = paddle.incubate.get_rng_state(use_index=True)
    for name in states:
        if name not in self.states_:
            raise ValueError(f'state {name} does not exist')
        # switch index to name
        paddle.incubate.set_rng_state(self.states_[name], use_index=True)
        # set the state to the saved state
        paddle.set_cuda_rng_state(states[name])

    paddle.incubate.set_rng_state(orig_rng_state_index, use_index=True)
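To illustrate why restoring by index matters, here is a toy, framework-free sketch. It is not Paddle code: it uses Python's random.Random as a stand-in for indexed generators, and ToyIndexedTracker, NaiveTracker, and draw are hypothetical names introduced only for this example. It assumes the failure mode is that only the name-to-index map was saved and restored, so the underlying generator was never rewound.

```python
import random


class ToyIndexedTracker:
    """Toy stand-in: each named stream is an index into a pool of generators."""

    def __init__(self):
        self._pool = []    # generators, addressed by integer index
        self.states_ = {}  # name -> index into the pool

    def add(self, name, seed):
        if name in self.states_:
            raise ValueError(f"state {name} already exists")
        self._pool.append(random.Random(seed))
        self.states_[name] = len(self._pool) - 1

    def draw(self, name):
        # Draw one number from the generator registered under `name`.
        return self._pool[self.states_[name]].random()

    def get_states_tracker(self):
        # Export the actual generator state for every name, not the index.
        return {n: self._pool[i].getstate() for n, i in self.states_.items()}

    def set_states_tracker(self, states):
        # Write each saved state back into the generator at its index.
        for name, state in states.items():
            if name not in self.states_:
                raise ValueError(f"state {name} does not exist")
            self._pool[self.states_[name]].setstate(state)


class NaiveTracker(ToyIndexedTracker):
    """Assumed failure mode: only the name -> index map is saved/restored."""

    def get_states_tracker(self):
        return dict(self.states_)

    def set_states_tracker(self, states):
        self.states_ = states  # generators are never rewound


def first_and_replayed(tracker):
    # Save, draw once, restore, draw again: a faithful restore repeats the draw.
    tracker.add("model_parallel_rng", 1234)
    saved = tracker.get_states_tracker()
    first = tracker.draw("model_parallel_rng")
    tracker.set_states_tracker(saved)
    return first, tracker.draw("model_parallel_rng")


fixed_first, fixed_replayed = first_and_replayed(ToyIndexedTracker())
naive_first, naive_replayed = first_and_replayed(NaiveTracker())
```

With the index-aware version the replayed draw matches the first one. With the naive version the stream simply keeps advancing, which is the kind of mismatch that would make a replayed forward pass (for example under recompute) see different random numbers than the original one.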

As shown in the figure:
DEFAULT: develop branch.
INDEX-BASED: bug reproduced; the loss goes up after 17k steps.
INDEX-BASED-FIXED: get_states_tracker/set_states_tracker fixed; convergence looks exactly the same as DEFAULT.

eee4017 · 2023-12-25T06:55:37Z

Fixed in #60310.

Liujie0926 added status/new-issue 新建 type/bug-report 报bug labels Dec 19, 2023

paddle-bot bot assigned LiYuRio Dec 19, 2023

This was referenced Dec 19, 2023

Revert "Enhanced RNG State Management with Index-Based Control for Graph-Safe Tensor Parallelism (#58859)" #60147

Merged

[cherry-pick] Revert "Enhanced RNG State Management with Index-Based Control for Graph-Safe Tensor Parallelism (#58859)" #60148

Merged

onecatcn added the NVIDIA label Dec 20, 2023

paddle-bot bot added the status/close 已关闭 label Dec 20, 2023

paddle-bot bot closed this as completed Dec 20, 2023

paddle-bot bot removed the status/new-issue 新建 label Dec 20, 2023

eee4017 mentioned this issue Dec 25, 2023

Resubmit PR-58859 #60310

Merged
