Error with 32k Long Text in chatglm2-6b-32k Model #1725
Comments
Does it fail consistently regardless of input, or only on specific inputs? It looks like a PyTorch / GPU memory-access issue.
Short text inputs work without any problem; long text triggers the error above. I used about 30000 tokens for the long-text test, since this model supports a 32k context.
What's your hardware configuration? I wonder whether this is an OOM issue in disguise...
Both A100 and 3090 hit the same error, with CUDA 12.2. I found slight differences in model inference between THUDM/chatglm2-6b-32k and THUDM/chatglm2-6b: the RotaryEmbedding and KV-cache logic differ. Currently vLLM supports chatglm2-6b but does not support chatglm2-6b-32k.
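For context, a minimal sketch of how to compare the two model configs and spot the relevant difference; the field names (seq_length, rope_ratio) are assumptions based on the chatglm2 config format and the fix discussed below:

```python
# Hedged sketch: compare the two chatglm2 configs to see where the 32k
# variant differs. Assumes both repos are reachable on the Hugging Face Hub
# and that the configs expose seq_length / rope_ratio fields.
from transformers import AutoConfig

base_cfg = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
long_cfg = AutoConfig.from_pretrained("THUDM/chatglm2-6b-32k", trust_remote_code=True)

for field in ("seq_length", "rope_ratio"):
    # Fields absent from a config print as None.
    print(field, getattr(base_cfg, field, None), "->", getattr(long_cfg, field, None))
```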
I found the cause of this: the original logic has a bug, and two places need to be changed. First, in the _compute_inv_freq function in rotary_embedding.py, add base = base * self.rope_ratio, or simply hardcode base = base * 50, because the official code does not yet pass the rope_ratio parameter through. Second, change line 79 of the GLMAttention class to self.attn = PagedAttentionWithRoPE(
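A minimal sketch of the first change, assuming the function body matches vLLM's rotary_embedding.py at the time and that rope_ratio is either plumbed through from the model config or falls back to 50 (the value used by chatglm2-6b-32k):

```python
# Sketch of the suggested change in vllm/model_executor/layers/rotary_embedding.py.
# Assumption: self.rope_ratio is set from the model config, with 50 as a
# fallback for THUDM/chatglm2-6b-32k.
import torch

def _compute_inv_freq(self, base):
    # chatglm2-6b-32k scales the RoPE base by rope_ratio; without this,
    # positions beyond the original context window get the wrong rotary
    # frequencies and long prompts break.
    base = base * getattr(self, "rope_ratio", 50)
    inv_freq = 1.0 / (base ** (
        torch.arange(0, self.rotary_dim, 2, dtype=torch.float) / self.rotary_dim))
    return inv_freq
```

The second change (switching GLMAttention to PagedAttentionWithRoPE) is not sketched here, since its constructor arguments depend on the vLLM version in use.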
@junior-zsy can you submit a PR to address this so others won't run into the same issue? 🙇‍♂️
Ah, I think so.
Yes, #1841 has resolved this.
python3 api_server.py --model /hbox2dir/chatglm2-6b-32k --trust-remote-code --host 0.0.0.0 --port 7070 --tensor-parallel-size 2
Strangely, the inference process fails even on 8 GPUs, whereas the Hugging Face version of the model performs well on a 2-GPU setup.
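A hedged sketch of how the failing long-prompt request can be reproduced against the server launched above; the /generate endpoint and its JSON shape are assumptions based on vLLM's example api_server, and the prompt is only a crude stand-in for a roughly 30k-token input:

```python
# Reproduction sketch: send one very long prompt to the running api_server.
import requests

long_prompt = "hello " * 30000  # rough stand-in for a ~30k-token document

resp = requests.post(
    "http://localhost:7070/generate",
    json={"prompt": long_prompt, "max_tokens": 64, "temperature": 0.0},
    timeout=600,
)
print(resp.status_code, resp.json())
```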