Question about multi-GPU training #96

Closed
Tian14267 opened this issue Apr 20, 2023 · 4 comments
@Tian14267

Tian14267 commented Apr 20, 2023

Hi, I'm running into a lot of strange problems with single-GPU and multi-GPU training. For example:
When I train on a single GPU, I have to change the code to the version below before training works:
[screenshot: the modified model-loading / device_map code]
But if I run multi-GPU training with the code in that screenshot, I get this error:


/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Traceback (most recent call last):
  File "/data1/fffan/5_NLP/4_ChineseVicuna/Chinese_Vicuna_0420/finetune_fffan.py", line 279, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 565, in __init__
    self._log_and_throw(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'cpu'}.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 98320) of binary: /root/anaconda3/envs/chinesevicuna/bin/python3.10
Traceback (most recent call last):
  File "/root/anaconda3/envs/chinesevicuna/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune_fffan.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-04-20_17:41:48
  host      : gpu19
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 98321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-20_17:41:48
  host      : gpu19
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 98320)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

It's telling me the model isn't all on the same device? What does that mean?
Is this a device_map problem?

Also, if I instead use the commented-out code in the screenshot above for multi-GPU training, I get CUDA out of memory. I've already dropped my batch_size to 16 and it still runs out of memory. What's going on? (On a single GPU, a batch size of 128 is fine.)
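For reference, the way alpaca-lora-style fine-tuning scripts usually avoid this mismatch is to pin the whole model to the local GPU whenever the script is launched under torchrun, instead of letting device_map="auto" spread (and partly CPU-offload) the weights. The sketch below is hypothetical, not the exact code from the screenshot; the model path, the LlamaForCausalLM class, and the load_in_8bit flag are assumptions.

```python
import os

from transformers import LlamaForCausalLM  # assumed loader; adjust to the script's actual model class

# Under torchrun, WORLD_SIZE > 1, so pin the whole model to this process's GPU.
# device_map="auto" may offload some weights to CPU, which is what makes
# DistributedDataParallel complain about parameters on {'cuda', 'cpu'}.
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1

device_map = "auto"
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder model path
    load_in_8bit=True,                # placeholder; drop if not fine-tuning in 8-bit
    device_map=device_map,
)
```

Note that with this layout every rank keeps a full copy of the model on its own GPU and the batch size applies per process, which would also explain why a multi-GPU run can hit CUDA out of memory at batch sizes that work on a single GPU where device_map="auto" shards the model across all cards.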

@Data2Me

Data2Me commented Apr 21, 2023

I'm getting the same error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Have you solved it?

@jasonSunhu

@Tian14267 How did you solve this?

@JupyterChu

I hit the same problem with three 3090s. Any idea how to fix it?

@Facico
Owner

Facico commented Apr 26, 2023

torch.distributed.elastic.multiprocessing.errors.ChildFailedError
This error shows up whenever the program terminates abnormally, for whatever reason: a problem in some library, the process getting killed, and so on. There are far too many possible causes, so this is generally not where you look to find the actual bug; treat it only as a signal that the program exited.
A new issue has been opened for the original problem; you can refer to it there.
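As a side note, the "To enable traceback" hint in the failure summary above refers to PyTorch's elastic error handling: wrapping the script's entrypoint with the @record decorator makes each child process write its real traceback to the error_file that torchrun reports, which helps locate the underlying failure. A minimal sketch (main and its contents are placeholders):

```python
from torch.distributed.elastic.multiprocessing.errors import record


@record  # captures this process's traceback into torchrun's error_file on failure
def main():
    # placeholder: build the model/Trainer and call trainer.train(...) here
    ...


if __name__ == "__main__":
    main()
```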
