Trained on 10 GPUs for 2 days; inference fails with RuntimeError: probability tensor contains either inf, nan or element < 0 #704

Closed
Hkaisense opened this issue Aug 27, 2023 · 2 comments
Labels
solved This problem has been already solved

Comments

@Hkaisense

The training arguments are as follows:

```bash
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path models--Qwen--Qwen-7B-Chat \
    --do_train True \
    --overwrite_cache True \
    --finetuning_type lora \
    --template chatml \
    --dataset_dir data \
    --dataset zm_train \
    --max_source_length 2048 \
    --max_target_length 256 \
    --learning_rate 5e-05 \
    --num_train_epochs 20.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --padding_side left \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target c_attn \
    --resume_lora_training True \
    --output_dir saves/Qwen-7B-Chat/lora/FullPromptAll \
    --fp16 True \
    --plot_loss True
```
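A quick way to tell whether the fine-tuned checkpoint itself produces the bad values is to inspect the logits of a single forward pass. This is a minimal sketch, not from the issue; `model` and `tokenizer` stand for the loaded fine-tuned Qwen checkpoint.

```python
import torch

# Hypothetical sanity check: if any logit is already inf/nan after a plain
# forward pass, sampling from the resulting probabilities must fail.
inputs = tokenizer("hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.isfinite(logits).all().item())  # False would explain the crash below
```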

@Hkaisense
Author

The error is as follows:
```
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 277, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.8/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "query.py", line 123, in process_request
    async for response in llmchat(model=model.llm_model, tokenizer=model.llm_tokenizer, prompt=prompt, history=history, is_stream=is_stream, llm_name=llm_name, temperature=0.3, top_p=0.7, max_tokens=200, recall_model=model.m3e_recall_model, recall_tokenizer=model.m3e_tokenizer):
  File "/root/project/zm-bot/LLM/llmchat.py", line 41, in llmchat
    async for result in qwen_no_stream_chat(model, tokenizer, prompt, history, temperature, top_p, max_tokens, recall_model, recall_tokenizer):
  File "/root/project/zm-bot/LLM/llmchat.py", line 134, in qwen_no_stream_chat
    response, history = model.chat(tokenizer, prompt, top_p=top_p, temperature=temperature, history=limit_history(prompt, history, max_tokens=max_tokens))
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen_FullPromptAll/modeling_qwen.py", line 1010, in chat
    outputs = self.generate(
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen_FullPromptAll/modeling_qwen.py", line 1119, in generate
    return super().generate(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1588, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2678, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
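For reference, the failing call is `torch.multinomial` inside transformers' `sample()`; a single non-finite entry in the probability tensor reproduces the same error. A minimal sketch, not from the issue; the exact message can vary across PyTorch versions and devices:

```python
import torch

# One NaN in the distribution is enough to trigger the failure seen above.
probs = torch.tensor([[0.5, float("nan"), 0.5]])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)  # e.g. "probability tensor contains either `inf`, `nan` or element < 0"
```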

@hanyi-zou

I hit the same problem after training chinese-alpaca2-7b. Setting do_sample to False at inference time works around it, but the root cause is unknown.
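If greedy decoding is acceptable, that workaround can be applied at the call site. A sketch under two assumptions: that Qwen's `chat()` forwards extra keyword arguments to `generate()` (as in the `modeling_qwen.py` from the traceback), and that clamping invalid logits is an acceptable alternative; `remove_invalid_values=True` is a stock `generate()` flag in transformers:

```python
# Workaround 1 (what the comment above describes): greedy decoding skips
# torch.multinomial entirely, so the invalid distribution is never sampled.
response, history = model.chat(tokenizer, prompt, history=history, do_sample=False)

# Workaround 2 (an assumption, not from the issue): keep sampling but let
# transformers clamp inf/nan logits first. remove_invalid_values=True adds an
# InfNanRemoveLogitsProcessor and can slow generation down.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.7,
    temperature=0.3,
    remove_invalid_values=True,
)
```

Note that with `do_sample=False`, the `temperature` and `top_p` values passed elsewhere in the stack are ignored.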
