Total ranks: 1.
P0 is running with 0 GPU.
Device GeForce RTX 2080 Ti
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.intent_and_slot.weight.bin cannot be opened, loading model fails!
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.sentiment.weight.bin cannot be opened, loading model fails!
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.squad.weight.bin cannot be opened, loading model fails!
after allocation : free: 9.63 GB, total: 10.76 GB, used: 1.13 GB
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: invalid argument /data/wangjie/code/github/FasterTransformer/src/fastertransformer/utils/memory_utils.cu:113
[server40:134837] *** Process received signal ***
[server40:134837] Signal: Aborted (6)
[server40:134837] Signal code: (-6)
[server40:134837] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7ff3797e66d0]
[server40:134837] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff378d24277]
[server40:134837] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff378d25968]
[server40:134837] [ 3] /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7ff39d9253df]
[server40:134837] [ 4] /lib64/libstdc++.so.6(+0x9cb16)[0x7ff39d923b16]
[server40:134837] [ 5] /lib64/libstdc++.so.6(+0x9cb4c)[0x7ff39d923b4c]
[server40:134837] [ 6] /lib64/libstdc++.so.6(__cxa_rethrow+0x0)[0x7ff39d923d28]
[server40:134837] [ 7] ./bin/multi_gpu_gpt_example[0x9041da]
[server40:134837] [ 8] ./bin/multi_gpu_gpt_example[0x478a04]
[server40:134837] [ 9] ./bin/multi_gpu_gpt_example[0x4314f1]
[server40:134837] [10] ./bin/multi_gpu_gpt_example[0x407c1f]
[server40:134837] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff378d10445]
[server40:134837] [12] ./bin/multi_gpu_gpt_example[0x42b157]
[server40:134837] *** End of error message ***
Aborted (core dumped)
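To make the log easier to triage, here is a minimal sketch that pulls out the files FasterTransformer could not open and the memory figures from the "after allocation" line. It assumes the output above has been saved as text. The helper names `missing_files` and `memory_report` are hypothetical and not part of FasterTransformer, and the script only summarizes the log; it does not identify the cause of the crash.

```python
import re

# Hypothetical helpers; the patterns are written against the log lines quoted above.
MISSING_FILE = re.compile(r"\[FT\]\[WARNING\] file (\S+) cannot be opened")
MEMORY_LINE = re.compile(
    r"free: ([\d.]+) GB, total: ([\d.]+) GB, used: ([\d.]+) GB"
)


def missing_files(log_text):
    """Return the paths that the [FT][WARNING] lines say could not be opened."""
    return MISSING_FILE.findall(log_text)


def memory_report(log_text):
    """Return (free, total, used) in GB from the first memory line, or None."""
    match = MEMORY_LINE.search(log_text)
    if match is None:
        return None
    free, total, used = (float(group) for group in match.groups())
    return free, total, used


# Two lines copied from the log above, used as sample input.
sample = """\
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.squad.weight.bin cannot be opened, loading model fails!
after allocation : free: 9.63 GB, total: 10.76 GB, used: 1.13 GB
"""

print(missing_files(sample))
print(memory_report(sample))
```

Run on the full log, this would list the three model.prompt_table.*.weight.bin paths. The memory figures are consistent with each other (9.63 GB free plus 1.13 GB used is the 10.76 GB total), so the card does not look out of memory at the point where that line is printed.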
Any suggestions are welcome. Thanks.
My environment:
GPU: 8 x RTX 2080 Ti, 10 GB each
Driver version: 455.23.05
Following gpt_guide.md, I ran ./bin/multi_gpu_gpt_example and it crashed with the output shown in the log above.