Multi-GPU fine-tuning of an LLM with accelerate and deepspeed hangs #1683
Comments
I have tried it; I tried deepspeed as well. In the end the fix was to add one command before fine-tuning.
The command in question raises the peer-to-peer transfer level between the GPUs.
For export NCCL_P2P_LEVEL=NVL, is it enough to just add that line before running?
Adding that line solved the hang I was seeing with NCCL; with it, multi-GPU fine-tuning works.
If there is no NVLink and the dataset is fairly large, does it just hang?
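The workaround discussed in the comments above amounts to setting one NCCL environment variable in the shell before launching the run. Below is a minimal sketch; the training script and config file names (src/train_bash.py, ds_config.json) are placeholders for whatever command you normally launch, and NCCL_P2P_DISABLE=1 is mentioned only as a commonly used alternative on machines without NVLink.

```bash
# Inspect the GPU interconnect first: NV# entries mean NVLink, PIX/PHB/SYS mean PCIe only.
nvidia-smi topo -m

# Restrict NCCL peer-to-peer traffic to NVLink-connected GPU pairs.
# This is the line the commenters above added before fine-tuning.
export NCCL_P2P_LEVEL=NVL

# On machines without NVLink, disabling P2P entirely is a common alternative:
# export NCCL_P2P_DISABLE=1

# Then launch training as usual (placeholder command, substitute your own):
accelerate launch src/train_bash.py --deepspeed ds_config.json
```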
Reminder
Reproduction
deepspeed
DeepSpeed configuration parameters
accelerate
The example uses Baichuan-13B-Chat, but I also tried ChatGLM2-6B and it hangs in the same way; a sketch of a typical launch follows the issue list below.
Issues I have already searched:
#74
#1651
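The exact launch commands and DeepSpeed configuration from the report were not preserved above. Purely for context, a typical two-GPU launch in this kind of setup looks roughly like the sketch below; the script name src/train_bash.py, the config file name, and all arguments are placeholders and may differ from what was actually run.

```bash
# Hypothetical DeepSpeed ZeRO-2 config; "auto" lets the HF Trainer fill in the values.
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": { "stage": 2 },
  "fp16": { "enabled": "auto" }
}
EOF

# Variant 1: launch with the deepspeed CLI on two GPUs (placeholder script/args).
deepspeed --num_gpus 2 src/train_bash.py --deepspeed ds_config.json

# Variant 2: launch through accelerate with a multi-GPU setup (placeholder script/args).
accelerate launch --num_processes 2 --mixed_precision fp16 src/train_bash.py
```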
Expected behavior
I expect multi-GPU training to be accelerated correctly.
System Info
transformers version: 4.33.2
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
GPU: A6000 (40G) x 2
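For anyone comparing environments, the listing above can be reproduced with standard commands (a sketch, not taken from the original report):

```bash
# Installed transformers version
python -c "import transformers; print(transformers.__version__)"

# Current accelerate configuration (distributed_type, num_processes, mixed_precision, ...)
accelerate env

# GPU model, memory, and driver information
nvidia-smi
```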
Others
All terminal output: