Local LLM-assisted text completion.
- Auto-suggest on cursor movement in `Insert` mode
- Toggle the suggestion manually by pressing `Ctrl+F`
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift+Tab`
- Control max text generation time
- Configure scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- Supports very large contexts even on low-end hardware via smart context reuse
- Display performance stats
With vim-plug, add the following to your `.vimrc`:

```vim
Plug 'ggml-org/llama.vim'
```
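If you are setting up vim-plug from scratch, the plugin line goes inside the `plug#begin()`/`plug#end()` block. A minimal sketch, assuming vim-plug itself is already installed and `~/.vim/plugged` is used as the plugin directory:

```vim
" ~/.vimrc -- minimal vim-plug setup (assumes vim-plug itself is installed)
call plug#begin('~/.vim/plugged')

Plug 'ggml-org/llama.vim'

call plug#end()
```

Restart Vim and run `:PlugInstall` to fetch the plugin.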
With Vundle:

```sh
cd ~/.vim/bundle
git clone https://github.com/ggml-org/llama.vim
```

Then add `Plugin 'llama.vim'` to your `.vimrc` in the `vundle#begin()` section.
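For reference, the relevant part of a `.vimrc` using Vundle might look like this. This is a sketch following the standard Vundle boilerplate; adjust paths to your setup:

```vim
" ~/.vimrc -- Vundle section (standard Vundle boilerplate)
set nocompatible
filetype off

set rtp+=~/.vim/bundle/Vundle.vim
call vundle#begin()

Plugin 'VundleVim/Vundle.vim'
Plugin 'llama.vim'

call vundle#end()
filetype plugin indent on
```

Restart Vim and run `:PluginInstall` to install.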
The plugin requires a llama.cpp server instance to be running at `g:llama_config.endpoint`.

On macOS, you can install the server with Homebrew:

```sh
brew install llama.cpp
```

On other platforms, either build llama.cpp from source or use the latest binaries: https://github.com/ggerganov/llama.cpp/releases
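If your server does not run at the default local address, you can point the plugin at it from your `.vimrc`. A minimal sketch, assuming a local server started with the commands below (port `8012`) and llama-server's `/infill` completion endpoint; see `:help llama` for the full list of configuration options:

```vim
" Override the server endpoint used for FIM requests.
" The URL assumes a local llama-server started with the commands below
" (port 8012); adjust host/port if your server runs elsewhere.
let g:llama_config = {
    \ 'endpoint': 'http://127.0.0.1:8012/infill',
    \ }
```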
Here are the recommended settings, depending on the amount of VRAM that you have:

- More than 16GB VRAM:

  ```sh
  llama-server \
      --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF \
      --hf-file qwen2.5-coder-7b-q8_0.gguf \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 \
      --ctx-size 0 --cache-reuse 256
  ```

- Less than 16GB VRAM:

  ```sh
  llama-server \
      --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
      --hf-file qwen2.5-coder-1.5b-q8_0.gguf \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 \
      --ctx-size 0 --cache-reuse 256
  ```
Use `:help llama` for more details.
The plugin requires FIM-compatible models: HF collection
The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is `15186` tokens and the maximum is `32768`. There are `30` chunks in the ring buffer with extra context (out of `64`). So far, `1` chunk has been evicted in the current session and there are `0` chunks in queue. The newly computed prompt tokens for this request were `260` and the generated tokens were `25`. It took `1245 ms` to generate this suggestion after entering the letter `c` on the current line.
Demo video (`llama.vim-0-lq.mp4`): demonstrates that the global context is accumulated and maintained across different files, and showcases the overall latency when working in a large codebase.
The plugin aims to be very simple and lightweight while providing high-quality, performant local FIM completions, even on consumer-grade hardware. Read more on how this is achieved in the following links:
- Initial implementation and technical description: ggerganov/llama.cpp#9787
- Classic Vim support: ggerganov/llama.cpp#9995