Trained on 10 GPUs for 2 days; inference fails with RuntimeError: probability tensor contains either inf, nan or element < 0 #704

Closed
Hkaisense opened this issue Aug 27, 2023 · 2 comments
Labels
solved This problem has been already solved

Comments

@Hkaisense

The training arguments are as follows:

```bash
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path models--Qwen--Qwen-7B-Chat \
    --do_train True \
    --overwrite_cache True \
    --finetuning_type lora \
    --template chatml \
    --dataset_dir data \
    --dataset zm_train \
    --max_source_length 2048 \
    --max_target_length 256 \
    --learning_rate 5e-05 \
    --num_train_epochs 20.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --padding_side left \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target c_attn \
    --resume_lora_training True \
    --output_dir saves/Qwen-7B-Chat/lora/FullPromptAll \
    --fp16 True \
    --plot_loss True
```
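A quick way to tell whether the fine-tuned checkpoint itself produces the bad values is to inspect the logits of a single forward pass. This is a minimal sketch, not from the issue; `model` and `tokenizer` stand for the loaded fine-tuned Qwen checkpoint.

```python
import torch

# Hypothetical sanity check: if any logit is already inf/nan after a plain
# forward pass, sampling from the resulting probabilities must fail.
inputs = tokenizer("hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.isfinite(logits).all().item())  # False would explain the crash below
```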

@Hkaisense
Author

The error is as follows:
```
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 277, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "query.py", line 123, in process_request
    async for response in llmchat(model=model.llm_model, tokenizer=model.llm_tokenizer, prompt=prompt, history=history, is_stream=is_stream, llm_name=llm_name, temperature=0.3, top_p=0.7, max_tokens=200, recall_model=model.m3e_recall_model, recall_tokenizer=model.m3e_tokenizer):
  File "/root/project/zm-bot/LLM/llmchat.py", line 41, in llmchat
    async for result in qwen_no_stream_chat(model, tokenizer, prompt, history, temperature, top_p, max_tokens, recall_model, recall_tokenizer):
  File "/root/project/zm-bot/LLM/llmchat.py", line 134, in qwen_no_stream_chat
    response, history = model.chat(tokenizer, prompt, top_p=top_p, temperature=temperature, history=limit_history(prompt, history, max_tokens=max_tokens))
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen_FullPromptAll/modeling_qwen.py", line 1010, in chat
    outputs = self.generate(
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen_FullPromptAll/modeling_qwen.py", line 1119, in generate
    return super().generate(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1588, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2678, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
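For reference, the failing call is `torch.multinomial` inside transformers' `sample()`; a single non-finite entry in the probability tensor reproduces the same error. A minimal sketch, not from the issue; the exact message can vary across PyTorch versions and devices:

```python
import torch

# One NaN in the distribution is enough to trigger the failure seen above.
probs = torch.tensor([[0.5, float("nan"), 0.5]])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)  # e.g. "probability tensor contains either `inf`, `nan` or element < 0"
```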

@hanyi-zou

I hit the same problem after training chinese-alpaca2-7b. Setting do_sample to False at inference time works around it, but the root cause is unknown.
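If greedy decoding is acceptable, that workaround can be applied at the call site. A sketch under two assumptions: that Qwen's `chat()` forwards extra keyword arguments to `generate()` (as in the `modeling_qwen.py` from the traceback), and that clamping invalid logits is an acceptable alternative; `remove_invalid_values=True` is a stock `generate()` flag in transformers:

```python
# Workaround 1 (what the comment above describes): greedy decoding skips
# torch.multinomial entirely, so the invalid distribution is never sampled.
response, history = model.chat(tokenizer, prompt, history=history, do_sample=False)

# Workaround 2 (an assumption, not from the issue): keep sampling but let
# transformers clamp inf/nan logits first. remove_invalid_values=True adds an
# InfNanRemoveLogitsProcessor and can slow generation down.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.7,
    temperature=0.3,
    remove_invalid_values=True,
)
```

Note that with `do_sample=False`, the `temperature` and `top_p` values passed elsewhere in the stack are ignored.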
