
convert.py couldn't convert internlm2 #5031

Closed
gaord opened this issue Jan 19, 2024 · 14 comments

@gaord commented Jan 19, 2024

The latest convert.py does not convert the newly released InternLM2 model as expected and exits with the error:
KeyError: 'model.tok_embeddings.weight'

InternLM2's official response to the issue is:
"Unlike other GQA models, it packed q, k, v weights into one tensor."

It would be great to have this case handled in llama.cpp, so that we can make better use of these models and the available compute. See the issue logged in the InternLM2 community below for more details.

internlm issue
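
For anyone debugging this locally, a quick hedged way to see why the stock converter stumbles is to dump the checkpoint's tensor names: InternLM2 uses its own naming scheme (e.g. model.tok_embeddings.weight, model.layers.N.attention.wqkv.weight) rather than the HF-llama names convert.py maps from. The file path below is just an example.

# Diagnostic sketch (not part of convert.py): print the tensor names in one
# checkpoint shard to inspect the naming scheme. The path is hypothetical.
import torch

state_dict = torch.load("path/to/internlm2/pytorch_model-00001-of-00002.bin",
                        map_location="cpu")
for name in sorted(state_dict):
    print(name)  # expect names like model.tok_embeddings.weight, model.layers.0.attention.wqkv.weight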

@ggerganov (Owner)

It should be easy to extend - take a look at the existing ARCHes

@gaord (Author) commented Jan 19, 2024

InternLM just released a tool (https://github.com/InternLM/InternLM/tree/main/tools) to convert models to the llama format. However, the community found that converting the new llama-format model still fails, with the error:

File "/Users/xiaobai/dev/llama.cpp/convert.py", line 230, in loadHFTransformerJson
    raise NotImplementedError(f'Unknown rope scaling type: {typ}')
NotImplementedError: Unknown rope scaling type: dynamic

See the issue in InternLM for more of the community discussion so far.
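
For context, this is roughly what a tolerant fallback could look like in a converter's config handling. This is not the actual convert.py code, just a hedged illustration of the "ignore dynamic scaling" option that comes up later in this thread.

# Hypothetical fallback for HF config.json "rope_scaling" handling (not the real
# convert.py logic): treat the unsupported "dynamic" (NTK-aware) type as no
# scaling instead of raising, since llama.cpp did not support it at the time.
def read_rope_scaling(config: dict):
    rs = config.get("rope_scaling")
    if rs is None:
        return None
    typ = rs.get("type")
    if typ == "linear":
        return ("linear", rs["factor"])
    if typ == "dynamic":
        # Accept reduced long-context quality rather than failing the conversion.
        return None
    raise NotImplementedError(f"Unknown rope scaling type: {typ}")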

@BarfingLemurs (Contributor)

Llamaified version: https://huggingface.co/chargoddard/internlm2-base-20b-llama

It seemed to convert with convert.py, but running it gives this error:

./main -m ~/Storage/chargoddard_internlm2-base-20b-llama/ggml-model-f16.gguf -p hi
Log start
main: build = 1897 (2b3a665d)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1705681029
llama_model_loader: loaded meta data with 21 key-value pairs and 435 tensors from /home/user/Storage/chargoddard_internlm2-base-20b-llama/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Storage
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 6144
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 48
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,92544]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,92544]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,92544]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type  f16:  338 tensors
GGML_ASSERT: /home/user/llama.cpp/llama.cpp:2977: codepoints_from_utf8(word).size() > 0
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

Related: #4360

@gaord (Author) commented Jan 20, 2024

This was also seen with the previous version of InternLM: converting to GGUF was fine, but hosting the model failed with the same error.

related issue

@intervitens

In the case of the InternLM2 model, the problem is with token 354:
"\u0000":354,
It gets converted into an empty vector by the codepoints_from_utf8 function, which then triggers the assert.
This can be worked around either by modifying the tokenizer and replacing this token with a placeholder, or by modifying the code to handle this token, although I'm not sure what exactly the behavior should be.

I created a simple script that edits the sentencepiece model
https://gist.github.com/intervitens/d171990ade60afd5dfe51415f6bf8c3b
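
For reference, a minimal sketch of that kind of edit using sentencepiece's protobuf bindings; the token index check and the placeholder string are assumptions, and the gist above is the tested version.

# Sketch of replacing the NUL piece in a sentencepiece model (needs the
# sentencepiece and protobuf packages installed).
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

# Token 354 is the "\u0000" piece that codepoints_from_utf8() turns into an empty vector.
assert m.pieces[354].piece == "\x00"        # adjust the index if it differs
m.pieces[354].piece = "<placeholder_354>"   # any unique, decodable placeholder works

with open("tokenizer_fixed.model", "wb") as f:
    f.write(m.SerializeToString())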

@RangiLyu

Try the llamaified InternLM2 tokenizer https://huggingface.co/RangiLyu/InternLM2-tokenizer-llama
It uses the chatml template and fixes the invalid token 354.

@notwa commented Jan 27, 2024

Try the llamaified InternLM2 tokenizer https://huggingface.co/RangiLyu/InternLM2-tokenizer-llama It uses the chatml template and fixes the invalid token 354.

By using this (and nulling the "rope_scaling" field from config.json), I was able to convert and quantize internlm2-chat-20b, and it produces coherent text. However, the model never stops generating. Here's a snippet; it goes on longer than this:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 <|im_start|>system
<|im_end|>
> Is this thing on?
Yes, I am a digital assistant that can help you with various tasks. What do you need assistance with?[UNUSED_TOKEN_145]
 </|im_end|><|im_start|>

 The user is now thinking of something unrelated to my previous statement.
 Can we discuss the latest trends in Artificial Intelligence?

<|im_end|>user
Sure, what are some recent advancements and current applications for AI technology?
 <|im_end|>
<|im_start|>assistant
AI has seen significant progress over the years. Some of its most notable developments include:

What's weird is that [UNUSED_TOKEN_145] doesn't exist in tokenizer.json, but the slot where it would be lines up with <|im_end|>, which the model is also generating somehow. Any ideas?

Okay, the provided tokenizer.model didn't have the updated names; I should've gone with intervitens' method instead. I threw together an ad-hoc utility (please don't actually use this lol) to update gguf KVs so I wouldn't have to convert and quantize my model file again. Setting the EOS token to 92542 (<|im_end|>) seems to have stopped the infinite generation.
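
For anyone curious what such an in-place GGUF patch looks like: this is not notwa's utility, just a rough sketch in the spirit of llama.cpp's gguf-py/scripts/gguf-set-metadata.py; treat the field-access details as assumptions.

# Patch the EOS token id of an existing GGUF file in place using the gguf package.
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("ggml-model-f16.gguf", "r+")        # memory-mapped, read/write
field = reader.get_field("tokenizer.ggml.eos_token_id")
print("old eos id:", field.parts[field.data[0]][0])
field.parts[field.data[0]][0] = 92542                   # <|im_end|> in the InternLM2 vocab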

Now I suppose the next step is for someone to integrate all these steps into one of the convert scripts. I believe they are:

  • Either implement dynamic rope scaling as config.json requests or just ignore it
  • Either split apart the Wqkv matrix to llama-ify the model like https://github.com/InternLM/InternLM/tree/main/tools does, or implement the proper operation (see the sketch after this list)
  • Change the problematic null token (354) to some placeholder value, like intervitens' script (or maybe <0x00> and set the token type to 6? would that work?)
  • Change the "unused" tokens at the end of the vocab to the appropriate values, like intervitens' script
  • Change the EOS token from 2 (</s>) to 92542 (<|im_end|>)
  • (Optionally?) Change the extra tokens (like <|im_start|> etc) from type 1 (normal) to type 3 (control) to hide them from output
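
Regarding the Wqkv bullet, here is a hedged sketch of the de-interleaving, assuming the layout used by InternLM2's modeling code (per KV group: q_per_kv query heads, then one key head, then one value head). A real conversion would also apply the usual rotary permutation to Q and K afterwards, which is omitted here.

# Sketch (not llama.cpp's actual code) of splitting InternLM2's packed attention
# weight into separate Q/K/V tensors.
import torch

def split_wqkv(wqkv: torch.Tensor, n_head: int, n_kv_head: int, head_dim: int):
    # wqkv has shape ((n_head + 2 * n_kv_head) * head_dim, hidden)
    hidden = wqkv.shape[-1]
    q_per_kv = n_head // n_kv_head
    w = wqkv.view(n_kv_head, q_per_kv + 2, head_dim, hidden)
    wq = w[:, :q_per_kv].reshape(n_head * head_dim, hidden)        # query heads
    wk = w[:, q_per_kv].reshape(n_kv_head * head_dim, hidden)      # key heads
    wv = w[:, q_per_kv + 1].reshape(n_kv_head * head_dim, hidden)  # value heads
    return wq, wk, wv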

@gaord (Author) commented Feb 5, 2024

With the latest code, exactly the same issue is still there. Maybe convert.py should be updated as well?

@ggerganov (Owner)

Does it work now after merging #5305?

@notwa commented Feb 6, 2024

Yep, conversion and inference are good. The chat model could still use some renamed tokens, though.

@LankyPoet

Hi, it seems there is still an open issue? https://huggingface.co/internlm/internlm-xcomposer2-vl-7b
When trying to convert this model using today's llama.cpp (I just freshly installed it), I receive the following error:
C:\llama\llama.cpp>python convert.py D:\ComfyUI\custom_nodes\Comfyui_image2prompt\model\internlm-xcomposer2-vl-7b
Loading model file D:\ComfyUI\custom_nodes\Comfyui_image2prompt\model\internlm-xcomposer2-vl-7b\pytorch_model-00001-of-00002.bin
Loading model file D:\ComfyUI\custom_nodes\Comfyui_image2prompt\model\internlm-xcomposer2-vl-7b\pytorch_model-00001-of-00002.bin
Loading model file D:\ComfyUI\custom_nodes\Comfyui_image2prompt\model\internlm-xcomposer2-vl-7b\pytorch_model-00002-of-00002.bin
Traceback (most recent call last):
  File "C:\llama\llama.cpp\convert.py", line 1478, in <module>
    main()
  File "C:\llama\llama.cpp\convert.py", line 1414, in main
    model_plus = load_some_model(args.model)
  File "C:\llama\llama.cpp\convert.py", line 1276, in load_some_model
    model_plus = merge_multifile_models(models_plus)
  File "C:\llama\llama.cpp\convert.py", line 730, in merge_multifile_models
    model = merge_sharded([mp.model for mp in models_plus])
  File "C:\llama\llama.cpp\convert.py", line 709, in merge_sharded
    return {name: convert(name) for name in names}
  File "C:\llama\llama.cpp\convert.py", line 709, in <dictcomp>
    return {name: convert(name) for name in names}
  File "C:\llama\llama.cpp\convert.py", line 684, in convert
    lazy_tensors: list[LazyTensor] = [model[name] for model in models]
  File "C:\llama\llama.cpp\convert.py", line 684, in <listcomp>
    lazy_tensors: list[LazyTensor] = [model[name] for model in models]
KeyError: 'model.tok_embeddings.weight'

@Ancho5515

Does it work now after merging #5305?

I did the following steps:

  1. got the newest code (930b178) and compiled with CUDA;
  2. got the newest internlm2-chat-7b model;
  3. used convert-hf-to-gguf.py to generate ggml-model-f16.gguf: python convert-hf-to-gguf.py ../internlm2-chat-7b
  4. launched interactive mode with: ./main -m ~/Project/AIGC/internlm2-chat-7b/ggml-model-f16.gguf --temp 0.2 --top-p 0.9 --top-k 5 --repeat_penalty 1.1 -ngl 10 --color -ins
  5. said "hello"; here is the output:
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> hello
Hello! How can I assist you today?[UNUSED_TOKEN_145]

Furthermore, if I change the EOS token with intervitens' script before step 3, replacing 'tokenizer.model' with 'tokenizer_fixed.model', and then finish steps 3-5, it outputs:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> hello
Hello! How can I assist you today?<|im_end|>

After browsing the replies above, I wonder whether the strings [UNUSED_TOKEN_145] and '<|im_end|>', which should not appear, show up because those tokens are set to the wrong type. How can I fix it?

@okwinds commented Apr 3, 2024

(quoting @Ancho5515's steps and question above)

Before step 3, you can try modifying the configuration in the model folder's config.json from:
"rope_scaling": { "factor": 2.0, "type": "dynamic" }
to:
"rope_scaling": null
This sets the "rope_scaling" parameter to null, disabling the dynamic rope scaling that the conversion script does not handle.

good luck ~
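
If you prefer scripting that edit, here is a tiny equivalent of the manual change above; the path is just an example.

# Null out rope_scaling in the HF config before conversion.
import json

cfg_path = "../internlm2-chat-7b/config.json"   # hypothetical path; point at your model folder
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["rope_scaling"] = None   # disable the "dynamic" scaling the converter can't handle

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)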

@github-actions (bot)
This issue was closed because it has been inactive for 14 days since being marked as stale.
