llama : fix session saving/loading #3400

ggerganov · 2023-09-29T12:49:50Z

I think this should fix the issue with saving/loading session data after #3228.
Make sure to delete any old chat data before testing.

@jluisreymejias Can you give this branch a try?

BarfingLemurs · 2023-09-30T13:15:59Z

(Termux) Confirmed: --prompt-cache-all and --prompt-cache-ro now work. On master, loading a previously created cache file led to a segmentation fault.

Senemu · 2023-10-01T10:34:36Z

This fixes the crash for me, but it does not seem to use or update the cache file properly when the prompt changes. It is as if the previous prompt is still there, influencing the generation.

For example, if I generate a kanji mnemonic by running ./main -m llama-2-70b.Q5_K_M.gguf --file mnemonics.txt -r $'\nKanji:' --prompt-cache mnemonics.bin -c 0 -n -2 -t 8, the first run (with a new cache file) works as expected:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696154328
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: mem required  = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: session file does not exist, will create
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When you **propose** to someone, put a ring on the **left hand** and say “I ***go with*** you.” It’s how it works in some countries.

Kanji:
llama_print_timings:        load time =  3490.10 ms
llama_print_timings:      sample time =    31.98 ms /    44 runs   (    0.73 ms per token,  1375.77 tokens per second)
llama_print_timings: prompt eval time = 678353.51 ms /  1782 tokens (  380.67 ms per token,     2.63 tokens per second)
llama_print_timings:        eval time = 58432.71 ms /    43 runs   ( 1358.90 ms per token,     0.74 tokens per second)
llama_print_timings:       total time = 737161.46 ms

Generating another one for the same kanji (same prompt) works fine:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696150253
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: mem required  = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When someone ***proposes*** something to you, you will either **go with it** or not. It’s like your left hand is saying: “this way!” and your right hand saying: “that way!”. You need to pick one.

Kanji:
llama_print_timings:        load time =  3586.75 ms
llama_print_timings:      sample time =    40.55 ms /    58 runs   (    0.70 ms per token,  1430.44 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 72142.00 ms /    57 runs   ( 1265.65 ms per token,     0.79 tokens per second)
llama_print_timings:       total time = 83733.50 ms

But if I change the last paragraph of the prompt (the kanji for which I want a mnemonic), this happens:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696152733
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: mem required  = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file matches 1755 / 1786 tokens of prompt
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 配 (hand out)
Components: 酉 (sign of the bird), 己 (oneself)
Mnemonic: When one needs to **propose something**, he should make sure that this proposal is really his (**oneself**) before bringing up the subject in front of others. The ***sign of the bird*** is a sign of peace, so it’s best if the matter can be settled amicably.

Kanji:
llama_print_timings:        load time =  3509.82 ms
llama_print_timings:      sample time =    48.35 ms /    69 runs   (    0.70 ms per token,  1427.12 tokens per second)
llama_print_timings: prompt eval time = 18194.90 ms /    31 tokens (  586.93 ms per token,     1.70 tokens per second)
llama_print_timings:        eval time = 90051.99 ms /    68 runs   ( 1324.29 ms per token,     0.76 tokens per second)
llama_print_timings:       total time = 109755.13 ms

The output references the kanji of the previous generation (“propose”), even though it is nowhere to be found in the new prompt!

Subsequent runs with the same prompt would say that the session file matches the prompt exactly, but “propose” and its keywords would keep reappearing.
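
The log lines above suggest how the prompt cache is reused: the session file is compared with the new prompt token by token, the shared prefix is kept (1755 of 1786 tokens here), and only the remainder is evaluated (1786 - 1755 = 31, which matches the 31 tokens reported under "prompt eval time"). The following is a minimal C++ sketch of that bookkeeping, not the actual llama.cpp code. The names `common_prefix_len`, `reuse_plan`, and `plan_reuse` are hypothetical, tokens are plain `int`s, and the step that truncates the cached tail is an assumption about what a correct implementation has to do so that tokens from an older prompt cannot influence the new generation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a token id; not the llama.cpp type.
using token_id = int;

// Number of leading tokens that the cached session and the new prompt share.
static std::size_t common_prefix_len(const std::vector<token_id> & cached,
                                     const std::vector<token_id> & prompt) {
    std::size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}

struct reuse_plan {
    std::size_t n_match; // tokens reused from the session
    std::size_t n_eval;  // tokens of the new prompt that still need evaluation
};

// Compare the cached session with the new prompt and drop the stale tail.
static reuse_plan plan_reuse(std::vector<token_id> & cached,
                             const std::vector<token_id> & prompt) {
    const std::size_t n_match = common_prefix_len(cached, prompt);
    assert(n_match <= prompt.size());
    // Assumption: everything after the shared prefix is stale and must go.
    cached.resize(n_match);
    return { n_match, prompt.size() - n_match };
}
```

With a cached session of 6 tokens and a 7-token prompt that shares the first 4, `plan_reuse` reports 4 matched tokens and 3 to evaluate, and shrinks the cached vector to 4 entries. If the tail were left in place instead, the cache would still hold entries that belong to the previous prompt, which is one plausible reading of the "propose" leak described above.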

ggerganov · 2023-10-02T13:43:11Z

@Senemu Could you please try your test with the latest version of this branch and see if the issue is resolved?

Senemu · 2023-10-02T21:15:43Z

The issue is resolved in the current version of this branch! 👏

cebtenzzre · 2023-10-03T17:14:40Z

llama.h

+    // c0 < -1 : [0,  c1]
+    // c1 < -1 : [c0, inf)


Shouldn't this be c0 < 0?
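
To make the question concrete, here is a small C++ sketch of how such a range comment could be interpreted, assuming that any negative bound is meant as an "open" sentinel. This is an illustration only: `cell_range` and `resolve_range` are hypothetical names, the half-open `[begin, end)` convention and the clamping to the cache size are assumptions, and nothing here is taken from the llama.cpp sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical half-open cell range [begin, end); not a llama.cpp type.
struct cell_range {
    std::int32_t begin;
    std::int32_t end;
};

// Resolve a (c0, c1) request against a cache with n_cells cells.
// Assumption: any negative bound is a sentinel, so -1 counts as "open".
static cell_range resolve_range(std::int32_t c0, std::int32_t c1, std::int32_t n_cells) {
    assert(n_cells >= 0);
    if (c0 < 0) c0 = 0;        // c0 < 0 : start from the first cell
    if (c1 < 0) c1 = n_cells;  // c1 < 0 : run to the end ("inf")
    c0 = std::min(c0, n_cells);
    c1 = std::min(c1, n_cells);
    return { c0, std::max(c0, c1) }; // never return an inverted range
}
```

Under this reading, `resolve_range(-1, 5, 10)` covers cells 0 to 5 and `resolve_range(3, -1, 10)` covers cells 3 to 10. A comment that says `c0 < -1` would leave the common sentinel value -1 undocumented, which is presumably what the review question is getting at.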

ggerganov · 2023-10-03T18:03:16Z

@Senemu I made some more changes, hoping I didn't break it again. I will merge it now without testing, but if you spot any issues again, let us know.

Senemu · 2023-10-03T23:07:12Z

ac2219f breaks the session cache even when using exactly the same prompt.

The first run (without a cache file) works as expected, but a rerun outputs garbage:

main: build = 1315 (ac2219f)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696150253
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68,98 B
llm_load_print_meta: model size       = 45,40 GiB (5,65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0,23 MB
llm_load_tensors: mem required  = 46494,72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280,00 MB
llama_new_context_with_model: compute buffer total size = 573,88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When What Where Why





## Markdown [end of text]

llama_print_timings:        load time =  3831,54 ms
llama_print_timings:      sample time =    10,80 ms /    14 runs   (    0,77 ms per token,  1295,94 tokens per second)
llama_print_timings: prompt eval time =     0,00 ms /     1 tokens (    0,00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 17569,35 ms /    13 runs   ( 1351,49 ms per token,     0,74 tokens per second)
llama_print_timings:       total time = 18710,15 ms

cebtenzzre · 2023-10-04T18:00:14Z

> ac2219f breaks the session cache even when using exactly the same prompt.

If this doesn't get resolved soon, open a new issue (or reopen an old one, if there is one that applies) so this doesn't get missed.

* llama : fix session saving/loading
* llama : temp fix for clearing "future" tokens from the KV cache
* llama : fix handling of "future" tokens when loading sessions
* llama : fix comments for llama_kv_cache API

ggerganov · 2023-10-11T21:05:00Z

@Senemu The issue should be fixed on latest master.

Senemu · 2023-10-11T23:29:04Z

It is fixed in b8fe4b5.

Thank you very much!

llama : fix session saving/loading

b0670db

ggerganov added the "need feedback" label (Testing and feedback with results are needed) Sep 29, 2023

ggerganov added 2 commits October 2, 2023 16:36

Merge branch 'master' into fix-sessions

6a9fe3d

llama : temp fix for clearing "future" tokens from the KV cache

0f332a9

llama : fix handling of "future" tokens when loading sessions

337120c

cebtenzzre reviewed Oct 3, 2023

llama : fix comments for llama_kv_cache API

5418932

ggerganov merged commit ac2219f into master Oct 3, 2023
32 checks passed

ggerganov mentioned this pull request Oct 3, 2023

[bug] kv_self.size is being set to buffer size during load state #3445

Closed

ggerganov added a commit that referenced this pull request Oct 11, 2023

main : fix session loading bug (#3400)

b8fe4b5
