Baichuan2 inference fails after merging LoRA: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 #1618

liwenju0 · 2023-11-24T02:09:27Z

Reminder

I have read the README and searched the existing issues.

Reproduction

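Latest findings

Deploying with api-for-llm produces the same error, so this framework does not appear to be the cause.

The base model is Baichuan2-13B-Chat. After LoRA fine-tuning and merging, loading the merged model with cli_demo.py fails at inference with the RuntimeError shown in the title.

I am using the latest code.

One odd observation: the same exported, merged model runs inference normally on an A800. The error occurs on a 3090.
I then performed the LoRA merge on the 3090 machine itself and ran inference with that merged model, but it still raises the same error.

I found 4 related issues, but none of them resolved the problem.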

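/home/deepctrl/anaconda3/envs/llama_factory_deepspeed/lib/python3.10/site-packages/transformers/generation/utils.py
I added two print statements to this file.
Loading the original Baichuan2-13B-Chat directly, without the merged LoRA model, gives the same result: the 3090 raises the error and the A800 does not.

I compared probs after softmax on the 3090 and on the A800. On the 3090 it contains many nan values, as shown in the screenshot:

On the A800 it does not:

Both machines have the same torch version, 2.1.0.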
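The check described above (printing probs after softmax inside generation/utils.py) can be approximated outside the library. The following is a minimal sketch, not the print statements actually added in this issue: `describe_invalid` is a hypothetical helper, and the toy logits with an injected `inf` are an assumption standing in for whatever the model produced on the 3090. It only illustrates that a single non-finite logit is enough for softmax to return `nan` and for `torch.multinomial` to reject the result.

```python
import torch


def describe_invalid(name: str, t: torch.Tensor) -> dict:
    """Count NaN, inf, and negative entries in a tensor and print a one-line summary."""
    t = t.detach().float()
    stats = {
        "nan": int(torch.isnan(t).sum()),
        "inf": int(torch.isinf(t).sum()),
        "negative": int((t < 0).sum()),
    }
    print(f"{name}: shape={tuple(t.shape)} {stats}")
    return stats


# Toy logits standing in for one decoding step (assumption: the +inf is injected on purpose).
logits = torch.tensor([[2.0, float("inf"), 0.5, -1.0]])
probs = torch.softmax(logits, dim=-1)

logit_stats = describe_invalid("logits", logits)
prob_stats = describe_invalid("probs", probs)

# Sampling from an invalid probability tensor is where the RuntimeError surfaces.
try:
    torch.multinomial(probs, num_samples=1)
    sampling_error = None
except RuntimeError as err:
    sampling_error = str(err)
print("sampling error:", sampling_error)
```

Calling the helper on the real next-token logits on both machines, before softmax, would show whether the invalid values already exist in the model output or only appear at the softmax step.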

Expected behavior

The code should run inference normally.

System Info

torch 2.1.0
xformers 0.0.22.post7
transformers 4.34.1
system ubuntu18.04

I also tried replacing transformers with 4.31.0. It still raises the same error.

Others

No response


Aitejiu · 2023-11-24T07:20:00Z

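A similar question: I merged a LoRA fine-tuned model to get a new model and loaded it with automatic device mapping, which split the model into two parts, half on each of two cards. Running inference then raised the same error.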

liwenju0 · 2023-11-24T08:02:55Z

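> A similar question: I merged a LoRA fine-tuned model to get a new model and loaded it with automatic device mapping, which split the model into two parts, half on each of two cards. Running inference then raised the same error.

I tested loading with int8 quantization on a single card, and it still raises the same error.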

liwenju0 · 2023-11-25T03:18:25Z

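I rebuilt a conda environment from scratch following this project and found that loading qwen-14b-chat also raises the same error. I am starting to suspect that the CUDA environment on my machine is the problem.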
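Since the suspicion here falls on the machine's CUDA environment, one way to compare the 3090 and A800 hosts is to dump the same set of version and device fields on each and diff them. This is a rough sketch under the assumption that matching torch versions alone may hide differences in the CUDA build, cuDNN, or GPU compute capability. `environment_report` is a hypothetical helper, not part of any project mentioned in this thread.

```python
import platform
import sys


def environment_report() -> dict:
    """Collect version and GPU details that are worth diffing between two machines."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # CUDA version torch was built against
    report["cudnn"] = torch.backends.cudnn.version()
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpus"] = [
            {
                "name": torch.cuda.get_device_name(i),
                "capability": torch.cuda.get_device_capability(i),
            }
            for i in range(torch.cuda.device_count())
        ]
    return report


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running it on both hosts and comparing the output line by line narrows down whether anything besides the GPU model differs.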

Aitejiu · 2023-11-26T10:38:07Z

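> > A similar question: I merged a LoRA fine-tuned model to get a new model and loaded it with automatic device mapping, which split the model into two parts, half on each of two cards. Running inference then raised the same error.

> I tested loading with int8 quantization on a single card, and it still raises the same error.

After I put it on a single card, it ran normally.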
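The two placements discussed in this thread (automatic device mapping across two cards versus everything on one card, optionally with int8 loading) differ only in the keyword arguments passed to `from_pretrained`. The sketch below is an assumption about how such a comparison could be set up, not the code any commenter actually ran. `build_load_kwargs` and `load_merged_model` are hypothetical names, and the model directory is left to the caller.

```python
def build_load_kwargs(single_gpu: bool, gpu_index: int = 0, int8: bool = False) -> dict:
    """Keyword arguments for AutoModelForCausalLM.from_pretrained under two placements."""
    kwargs = {
        "trust_remote_code": True,  # Baichuan2 ships custom modeling code
        # {"": n} pins every module to GPU n; "auto" lets accelerate shard across GPUs
        "device_map": {"": gpu_index} if single_gpu else "auto",
    }
    if int8:
        kwargs["load_in_8bit"] = True  # requires bitsandbytes to be installed
    return kwargs


def load_merged_model(model_dir: str, **kwargs):
    """Load a merged checkpoint; imported lazily so the helper above runs without a GPU."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, **kwargs)
    return tokenizer, model


sharded = build_load_kwargs(single_gpu=False)
pinned = build_load_kwargs(single_gpu=True)
print("two-card (auto):", sharded)
print("single card:", pinned)
```

Loading the same checkpoint once with each set of arguments and running an identical prompt keeps the placement as the only variable between the two runs.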

liwenju0 · 2023-11-27T03:54:45Z

I also filed an issue on the Baichuan repository:
baichuan-inc/Baichuan2#292

xianrenge · 2024-01-02T14:11:50Z

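> > > A similar question: I merged a LoRA fine-tuned model to get a new model and loaded it with automatic device mapping, which split the model into two parts, half on each of two cards. Running inference then raised the same error.

> > I tested loading with int8 quantization on a single card, and it still raises the same error.

> After I put it on a single card, it ran normally.

I am running into this problem too. Have you solved it?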

hiyouga added the pending (This problem is yet to be addressed) label · Nov 24, 2023

hiyouga added the wontfix (This will not be worked on) label and removed the pending (This problem is yet to be addressed) label · Dec 1, 2023

hiyouga closed this as not planned (won't fix, can't repro, duplicate, stale) · Dec 1, 2023
