[BUG] Streaming inference with the Huggingface version throws an error #94

Open
cauwulixuan opened this issue Jan 18, 2024 · 3 comments

Comments

cauwulixuan (Contributor) commented Jan 18, 2024

I am running streaming inference with the following code, adapted from the stream_generate portion of fastchat's inference.py:

for i in range(max_new_tokens):
    if i == 0:  # prefill
        out = model(input_ids=start_ids, use_cache=True)
        logits = out.logits
        past_key_values = out.past_key_values
        ...
    else:  # decoding
        out = model(
            input_ids=torch.as_tensor(
                [[token] if not sent_interrupt else output_ids],
                device=device,
            ),
            use_cache=True,
            past_key_values=past_key_values if not sent_interrupt else None,
        )
        sent_interrupt = False
        logits = out.logits
        past_key_values = out.past_key_values
    ...
    probs = torch.softmax(last_token_logits, dim=-1)
    indices = torch.multinomial(probs, num_samples=2)
    tokens = [int(token) for token in indices.tolist()]
    token = tokens[0]
    output_ids.append(token)
    ...
...
  1. With use_flash_attention=True, inference works normally;
  2. With use_flash_attention=False, it fails, and the error message is as follows:
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
{'torch_dtype': torch.float16, 'revision': 'main'}
YuanForCausalLM(
  (model): YuanModel(
    (embed_tokens): Embedding(135040, 2048, padding_idx=77185)
    (layers): ModuleList(
      (0-23): 24 x YuanDecoderLayer(
        (self_attn): YuanAttention(
          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
          (lf_gate): LocalizedFiltering(
            (conv1): Conv2d(2048, 1024, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (conv2): Conv2d(1024, 2048, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (output_layernorm): LlamaRMSNorm()
          )
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): YuanMLP(
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=135040, bias=False)
)
user: yuan2.0是谁开发的?
assistant: Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/github/FastChat/fastchat/serve/cli.py", line 304, in <module>
    main(args)
  File "/github/FastChat/fastchat/serve/cli.py", line 227, in main
    chat_loop(
  File "/github/FastChat/fastchat/serve/inference.py", line 532, in chat_loop
    outputs = chatio.stream_output(output_stream)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/github/FastChat/fastchat/serve/cli.py", line 63, in stream_output
    for outputs in output_stream:
  File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/github/FastChat/fastchat/serve/inference.py", line 160, in generate_stream
    out = model(
          ^^^^^^
  File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 938, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 768, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 426, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 358, in forward
    raise ValueError(
ValueError: Attention mask should be of size (1, 1, 1, 10), but is torch.Size([1, 1, 1, 1])

Could this be related to how the relevant modules are handled in the yuan_hf_model.py script?

The inference pattern I use above is fairly common, so if possible, could this issue be fixed?
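
For reference, a minimal self-contained sketch of the failing pattern described above; the model path, the prompt, and setting use_flash_attention through the config are assumptions for illustration, not part of the original report:

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "IEITYuan/Yuan2-2B-hf"  # assumed path; a downloaded local copy works the same way
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.use_flash_attention = False   # the setting under which the error is reported

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

input_ids = tokenizer("yuan2.0是谁开发的?", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # prefill: run the full prompt once and keep the KV cache
    out = model(input_ids=input_ids, use_cache=True)
    past_key_values = out.past_key_values
    token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # decode: feed only the newly generated token plus the cache; with
    # use_flash_attention=False this step raises
    # "Attention mask should be of size (1, 1, 1, N), but is torch.Size([1, 1, 1, 1])"
    out = model(input_ids=token, use_cache=True, past_key_values=past_key_values)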

ljg-ieisystem (Collaborator) commented:

This inference pattern was not taken into account when the Huggingface version was developed. You can work around it by changing the following code segment in yuan_hf_model.py; we will take this case into account and update the corresponding code later.

if self.training or self.reset_position_ids and attention_mask is not None:
    attention_mask, _ = self._prepare_decoder_attention_mask_training(
        input_ids1, inputs_embeds, self.eod_token,
        reset_mask_flag, self.reset_attention_mask, self.reset_position_ids
    )

cauwulixuan (Contributor, Author) commented:

This inference pattern seems to be fairly common. I modified the file locally and it does work now, thank you.

However, this only works because I have already downloaded the Huggingface model. If the model is loaded directly with from_pretrained("IEITYuan/Yuan2-2B-hf"), there is no way to apply the manual edit, right? Will the official version be updated accordingly?
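
A possible interim workaround, sketched under the assumption that keeping a local copy of the model repository is acceptable: download the snapshot once, patch yuan_hf_model.py by hand as suggested above, then load from the local directory so from_pretrained uses the patched file instead of re-fetching it from the hub.

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# download (or reuse the cached copy of) the model repository and get its local path
local_dir = snapshot_download("IEITYuan/Yuan2-2B-hf")

# manually edit <local_dir>/yuan_hf_model.py as suggested in the comment above,
# then load everything from the local directory
tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(local_dir, trust_remote_code=True)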

Shawn-IEITSystems (Collaborator) commented:

@ljg-ieisystem
