GPT dynamic-graph hybrid-parallel case: loss becomes NaN after 20,000+ steps #60142

Closed
Liujie0926 opened this issue Dec 19, 2023 · 2 comments

@Liujie0926
Contributor

Liujie0926 commented Dec 19, 2023

Describe the Bug

Reproduction environment: CUDA 11.7, Python 3.10, V100-32G, single machine with eight GPUs
Paddle commit: 3bcdeef55611b66f49fca4b68bd99daf7e44b40b
git clone http://github.com/PaddlePaddle/PaddleNLP.git -b develop && cd PaddleNLP/model_zoo/gpt-3/
Data and environment preparation
python -m pip install -r requirements.txt
mkdir data
wget -O data/gpt_en_dataset_300m_ids.npy https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget -O data/gpt_en_dataset_300m_idx.npz https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
Commands to run
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;

The gpt_recompute_bs16_fp16_DP2-MP2-PP2 configuration starts producing NaN after 25,000+ steps:

python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=2 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=2 -o Distributed.mp_degree=2 -o Distributed.pp_degree=2 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5

The gpt_bs64_fp16_DP8-MP1-PP1 configuration starts producing NaN after 17,000+ steps:

python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=8 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=8 -o Distributed.mp_degree=1 -o Distributed.pp_degree=1 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5
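As a sanity check on the two commands above, the sketch below verifies that the parallel degrees multiply to the eight visible devices and derives the global batch size. It assumes that the global batch size is `local_batch_size * dp_degree * sharding_degree` and that gradient accumulation steps are `local_batch_size // micro_batch_size`. The issue does not state either formula, although the first is consistent with the `bs16` and `bs64` labels in the configuration names. `check_layout` is a hypothetical helper, not part of PaddleNLP.

```python
def check_layout(num_devices, local_batch_size, micro_batch_size,
                 dp_degree, mp_degree, pp_degree, sharding_degree=1):
    """Hypothetical helper: validate a hybrid-parallel layout.

    Assumed formulas (not taken from the issue):
      global_batch_size = local_batch_size * dp_degree * sharding_degree
      accumulate_steps  = local_batch_size // micro_batch_size
    """
    world_size = dp_degree * mp_degree * pp_degree * sharding_degree
    if world_size != num_devices:
        raise ValueError(
            f"degrees multiply to {world_size}, but {num_devices} devices are visible"
        )
    if local_batch_size % micro_batch_size != 0:
        raise ValueError("local_batch_size must be a multiple of micro_batch_size")
    return {
        "global_batch_size": local_batch_size * dp_degree * sharding_degree,
        "accumulate_steps": local_batch_size // micro_batch_size,
    }


# Values copied from the two commands in the issue.
hybrid = check_layout(8, local_batch_size=8, micro_batch_size=2,
                      dp_degree=2, mp_degree=2, pp_degree=2)
pure_dp = check_layout(8, local_batch_size=8, micro_batch_size=8,
                       dp_degree=8, mp_degree=1, pp_degree=1)
print(hybrid)
print(pure_dp)
```

Under these assumptions both layouts use all eight GPUs, so the NaN appears in a hybrid-parallel run as well as in a pure data-parallel run.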

Observed behavior
The loss becomes NaN during training, as shown in the figure:
image

Additional Supplementary Information

No response

@eee4017
Contributor

eee4017 commented Dec 25, 2023

I've traced the issue to two APIs in RNGStatesTracker: get_states_tracker and set_states_tracker.
The root cause is that the initial implementation does not properly export the RNG states themselves.

To elaborate further, let's discuss the specific changes needed in the code. The original implementation of set_states_tracker is as follows:

def set_states_tracker(self, states):
    self.states_ = states

Instead, it should restore each exported state into its corresponding state index.

A correct implementation is proposed below:

def set_states_tracker(self, states):
    orig_rng_state_index = paddle.incubate.get_rng_state(use_index=True)
    for name in states:
        if name not in self.states_:
            raise ValueError(f'state {name} does not exist')
        # switch to the state index registered under this name
        paddle.incubate.set_rng_state(self.states_[name], use_index=True)
        # set the state to the saved state
        paddle.set_cuda_rng_state(states[name])

    paddle.incubate.set_rng_state(orig_rng_state_index, use_index=True)
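The difference between the two versions can be illustrated with a toy analogue that does not use Paddle. In this sketch, `ToyRNGTracker` is a hypothetical class in which `states_` maps a name to an index into a list of Python `random.Random` generators, standing in for an index-based tracker. It is an assumption-laden illustration of the idea, not the actual RNGStatesTracker code.

```python
import random


class ToyRNGTracker:
    """Toy stand-in for an index-based RNG state tracker (not Paddle's class)."""

    def __init__(self):
        self.generators = []  # one generator per registered state
        self.states_ = {}     # name -> index into self.generators

    def add(self, name, seed):
        self.generators.append(random.Random(seed))
        self.states_[name] = len(self.generators) - 1

    def get_states_tracker(self):
        # Export the generator state itself for every name, not its index.
        return {name: self.generators[idx].getstate()
                for name, idx in self.states_.items()}

    def set_states_tracker_buggy(self, states):
        # Mirrors the original one-liner: the name -> index map is
        # overwritten with raw states, so later lookups no longer work.
        self.states_ = states

    def set_states_tracker(self, states):
        # Restore each saved state into the generator registered for its name.
        for name, saved in states.items():
            if name not in self.states_:
                raise ValueError(f"state {name} does not exist")
            self.generators[self.states_[name]].setstate(saved)

    def sample(self, name):
        return self.generators[self.states_[name]].random()


tracker = ToyRNGTracker()
tracker.add("dropout_rng", seed=1234)  # "dropout_rng" is a made-up name

checkpoint = tracker.get_states_tracker()
first = tracker.sample("dropout_rng")

tracker.set_states_tracker(checkpoint)   # restore, then draw again
replayed = tracker.sample("dropout_rng")
print(first == replayed)
```

In the toy version, the buggy setter breaks the name-to-index mapping outright. How the same mistake shows up in a real training run can differ, as the loss curves in this issue indicate.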

As shown in the figure:
DEFAULT: develop branch.
INDEX-BASED: bug reproduced; the loss goes up after 17k steps.
INDEX-BASED-FIXED: get_states_tracker/set_states_tracker fixed; the convergence looks exactly the same as DEFAULT.

image

@eee4017
Contributor

eee4017 commented Dec 25, 2023

Fixed in #60310.
