I've identified an issue primarily involving two APIs within the RNGStatesTracker: get_states_tracker and set_states_tracker.
The root cause is that the initial implementation does not properly export the state itself.
To elaborate further, let's discuss the specific changes needed in the code. The original implementation of set_states_tracker is as follows:
def set_states_tracker(self, states):
    self.states_ = states
Instead, it should restore each exported state into the generator at the corresponding state index.
A correct implementation is proposed below:
def set_states_tracker(self, states):
    orig_rng_state_index = paddle.incubate.get_rng_state(use_index=True)
    for name in states:
        if name not in self.states_:
            raise ValueError(f'state {name} does not exist')
        # switch index to name
        paddle.incubate.set_rng_state(self.states_[name], use_index=True)
        # set the state to the saved state
        paddle.set_cuda_rng_state(states[name])
    paddle.incubate.set_rng_state(orig_rng_state_index, use_index=True)
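To illustrate why exporting only the index is not enough, here is a toy sketch in plain Python. It does not use Paddle: `ToyGeneratorPool`, `ToyTracker`, the `_index_only` method names, and the `example_rng` name are all hypothetical, and `random.Random` stands in for the device generators. The assumption is that `states_` maps each name to a generator index, as the fix above suggests.

```python
import random


class ToyGeneratorPool:
    """Hypothetical stand-in for a registry of generators addressed by index."""

    def __init__(self):
        self.generators = []

    def register(self, seed):
        self.generators.append(random.Random(seed))
        return len(self.generators) - 1  # the index is only a handle


class ToyTracker:
    def __init__(self, pool):
        self.pool = pool
        self.states_ = {}  # name -> generator index (assumed design)

    def add(self, name, seed):
        self.states_[name] = self.pool.register(seed)

    def draw(self, name, n=3):
        gen = self.pool.generators[self.states_[name]]
        return [gen.random() for _ in range(n)]

    # Index-only export/import: the generator's position is never captured.
    def get_states_tracker_index_only(self):
        return dict(self.states_)

    def set_states_tracker_index_only(self, states):
        self.states_ = states

    # State-based export/import, mirroring the structure of the proposed fix.
    def get_states_tracker(self):
        return {
            name: self.pool.generators[index].getstate()
            for name, index in self.states_.items()
        }

    def set_states_tracker(self, states):
        for name in states:
            if name not in self.states_:
                raise ValueError(f"state {name} does not exist")
            self.pool.generators[self.states_[name]].setstate(states[name])


def replay_matches(use_fixed):
    """Save, draw, restore, draw again; report whether both draws agree."""
    tracker = ToyTracker(ToyGeneratorPool())
    tracker.add("example_rng", seed=1234)
    if use_fixed:
        saved = tracker.get_states_tracker()
    else:
        saved = tracker.get_states_tracker_index_only()
    first = tracker.draw("example_rng")
    if use_fixed:
        tracker.set_states_tracker(saved)
    else:
        tracker.set_states_tracker_index_only(saved)
    second = tracker.draw("example_rng")
    return first == second


print("index-only replay matches:", replay_matches(use_fixed=False))
print("state-based replay matches:", replay_matches(use_fixed=True))
```

In this sketch, only the state-based pair reproduces the same draws after a restore. The index-only pair leaves the generator wherever it had advanced to, so the second draw differs from the first.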
As shown in the figure:
DEFAULT: develop branch
INDEX-BASED: bug reproduced; the loss goes up after 17k steps.
INDEX-BASED-FIXED: get_states_tracker/set_states_tracker fixed; the convergence looks exactly the same as DEFAULT.
Describe the Bug
Reproduction environment: CUDA 11.7, Python 3.10, V100-32G, single node with eight GPUs
Paddle commit: 3bcdeef55611b66f49fca4b68bd99daf7e44b40b
git clone http://github.com/PaddlePaddle/PaddleNLP.git -b develop && cd PaddleNLP/model_zoo/gpt-3/
Data and environment preparation
python -m pip install -r requirements.txt
mkdir data
wget -O data/gpt_en_dataset_300m_ids.npy https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget -O data/gpt_en_dataset_300m_idx.npz https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
Commands to run
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;
The gpt_recompute_bs16_fp16_DP2-MP2-PP2 configuration starts producing NaN after 25k+ steps:
python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=2 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=2 -o Distributed.mp_degree=2 -o Distributed.pp_degree=2 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5
The gpt_bs64_fp16_DP8-MP1-PP1 configuration starts producing NaN after 17k+ steps:
python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=8 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=8 -o Distributed.mp_degree=1 -o Distributed.pp_degree=1 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5
Observed behavior
NaN values appear during training, as shown in the figure.
Additional Supplementary Information
No response