Prompt caching doesn't work on either of my Arch systems (laptop or desktop):
```
[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile --prompt-cache ~/wizard/newfile -f wizard/test.el.prompt
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed = 1708969571
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 48
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 22016
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32001] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32001] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32001] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q5_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32001
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 33.74 B
llm_load_print_meta: model size = 22.20 GiB (5.65 BPW)
llm_load_print_meta: general.name = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/49 layers to GPU
llm_load_tensors: CPU buffer size = 22733.75 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3072.00 MiB
llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model: CPU input buffer size = 48.07 MiB
llama_new_context_with_model: CPU compute buffer size = 2305.60 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 12 / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: attempting to load saved session from '/home/alex/wizard/newfile'
main: session file does not exist, will create
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 16384, n_batch = 512, n_predict = -1, n_keep = 0
[INST]You are an Emacs code generator. Writing comments is forbidden. Writing test code is forbidden. Writing English explanations is forbidden.
Generate el code to complete:[/INST] ```el (defconst all-greek-capital-letters )
libc++abi: terminating due to uncaught exception of type std::runtime_error: failed to open /home/alex/wizard/newfile: Bad file number
error: Uncaught SIGABRT (SI_TKILL) at 0x3e80004ede4 on Arch pid 323044 tid 323044
 /home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile
 Bad file number
 Linux Cosmopolitan 3.2.4 MODE=x86_64; #1 SMP PREEMPT_DYNAMIC Sat, 23 Sep 2023 22:55:13 +0000 Arch 6.5.5-arch1-1
RAX 0000000000000000 RBX 000010008004a750 RDI 000000000004ede4
RCX 00000000006b3096 RDX 0000000000000000 RSI 0000000000000006
RBP 00007ffdcb0ebde0 RSP 00007ffdcb0ebde0 RIP 00000000006b3096
R8 0000000000000000 R9 0000000000000002 R10 00000000006b3096
R11 0000000000000296 R12 0000000000000006 R13 0000000000675cd0
R14 00000000006fe348 R15 00007ffdcb0ee140
TLS 0000000000746340
XMM0 00000000000000000000000000000000 XMM8 00000000000000000000000000000000
XMM1 00000000000000000000000000000000 XMM9 00000000000000000000000000000000
XMM2 0000000000000000000000000082b448 XMM10 00000000000000000000000000000000
XMM3 2f206e65706f206f742064656c696166 XMM11 00000000000000000000000000000000
XMM4 0000a5e1ffffa1000000894e00003f16 XMM12 00000000000000000000000000000000
XMM5 000074cb0000025700000c03000006d9 XMM13 00000000000000000000000000000000
XMM6 000000008f83f4cc000010020095a540 XMM14 00000000000000000000000000000000
XMM7 00000000000002000000000000000004 XMM15 00000000000000000000000000000000
cosmoaddr2line /home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile 6b3096 6a8331 4179a8 6bc040 6bc1c2 688997 6883c9 5ad927 406108 413733 401604
note: won't print addr2line backtrace because pledge
7ffdcb0e8c30 6b3096 systemfive_linux+31
7ffdcb0ebde0 6a8331 raise+113
7ffdcb0ebe00 4179a8 abort+45
7ffdcb0ebe20 6bc040 NULL+0
7ffdcb0ebf10 6bc1c2 _ZL28demangling_terminate_handlerv+338
7ffdcb0ebfc0 688997 _ZSt11__terminatePFvvE+71
7ffdcb0ec040 6883c9 NULL+0
7ffdcb0ec070 5ad927 llama_save_session_file+5015
7ffdcb0ec360 406108 main+18440
7ffdcb0ee020 413733 cosmo+77
7ffdcb0ee030 401604 _start+133
10008004-10008011 rw-pa- 14x automap 896kB w/ 896kB hole
10008020-10008068 rw-pa- 73x automap 4672kB w/ 1472kB hole
10008080-100080b3 rw-pa- 52x automap 3328kB w/ 768kB hole
100080c0-100080d7 rw-pa- 24x automap 1536kB w/ 512kB hole
100080e0-10008100 rw-pa- 33x automap 2112kB w/ 960kB hole
10008110-10008127 rw-pa- 24x automap 1536kB w/ 14mB hole
10008200-100085e8 rw-pa- 1'001x automap 63mB w/ 1472kB hole
10008600-10008901 rw-pa- 770x automap 48mB w/ 1904mB hole
10010000-1001c000 rw-pa- 49'153x automap 3072mB w/ 1024mB hole
10020000-10029019 rw-pa- 36'890x automap 2306mB w/ 5892mB hole
10040060-10098eec r--s-- 364'173x automap 22gB w/ 10gB hole
100c0000-10118ce7 r--s-- 363'752x automap 22gB w/ 96tB hole
6fd00004-6fd00004 rw-paF 1x zipos 64kB w/ 64gB hole
6fe00004-6fe00004 rw-paF 1x g_fds 64kB
# 50gB total mapped memory
/home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -m wizardcoder-python-34b-v1.0.Q5_K_M.gguf -c 0 --prompt-cache /home/alex/wizard/newfile -f wizard/test.el.prompt
Aborted (core dumped)
```
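The backtrace above ends in `llama_save_session_file`, so the abort happens after generation, when llamafile tries to write the prompt cache; the uncaught `std::runtime_error` ("failed to open ...: Bad file number") looks like `strerror(EBADF)` coming out of the file-open helper. For context, here is a minimal sketch of how that save step is normally driven through the upstream llama.cpp session API; `save_prompt_cache` and the variable names are illustrative, not llamafile's actual code:

```cpp
// Sketch only: approximates how a prompt cache is written with llama.cpp's
// session API after generation finishes. 'ctx' and 'session_tokens' are
// assumed to be the live context and the tokens evaluated so far.
#include <string>
#include <vector>
#include "llama.h"

bool save_prompt_cache(llama_context * ctx,
                       const std::string & path,  // e.g. /home/alex/wizard/newfile
                       const std::vector<llama_token> & session_tokens) {
    // Serializes the KV cache plus the token list to 'path'. The call reports
    // failure via its return value, but the underlying file helper can throw
    // std::runtime_error("failed to open ...") if the open itself fails,
    // which matches the uncaught exception reported above.
    return llama_save_session_file(ctx, path.c_str(),
                                   session_tokens.data(), session_tokens.size());
}
```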
From then on, if I try to use this now-existing cache file:
```
[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile --prompt-cache ~/wizard/newfile -f wizard/test.el.prompt
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed = 1708969624
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 48
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 22016
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32001] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32001] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32001] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q5_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32001
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 33.74 B
llm_load_print_meta: model size = 22.20 GiB (5.65 BPW)
llm_load_print_meta: general.name = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/49 layers to GPU
llm_load_tensors: CPU buffer size = 22733.75 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3072.00 MiB
llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model: CPU input buffer size = 48.07 MiB
llama_new_context_with_model: CPU compute buffer size = 2305.60 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 12 / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: attempting to load saved session from '/home/alex/wizard/newfile'
error loading session file: failed to open /home/alex/wizard/newfile: I/O error
main: error: failed to load session file '/home/alex/wizard/newfile'
```
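Here the cache file exists (it was created by the crashed first run), but the open fails again, this time with "I/O error" (EIO), so the load path gives up before any generation starts. For reference, a sketch of the corresponding load step, again using the upstream llama.cpp session API with illustrative names rather than llamafile's actual code:

```cpp
// Sketch only: approximates how a saved session/prompt cache is read back
// with llama.cpp's session API before prompt evaluation.
#include <string>
#include <vector>
#include "llama.h"

bool load_prompt_cache(llama_context * ctx,
                       const std::string & path,
                       int n_ctx,
                       std::vector<llama_token> & session_tokens) {
    session_tokens.resize(n_ctx);
    size_t n_token_count_out = 0;
    // Returns false when the file cannot be opened or its contents do not
    // match the model; this is the "error loading session file" path above.
    if (!llama_load_session_file(ctx, path.c_str(),
                                 session_tokens.data(), session_tokens.size(),
                                 &n_token_count_out)) {
        return false;
    }
    session_tokens.resize(n_token_count_out);
    return true;
}
```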
The same prompt works fine when run without the --prompt-cache flag.