Llama-2-Ko-GGUF provides GGUF-format files for Llama-2-Ko, an advanced iteration of Llama 2 with an expanded vocabulary and further pretraining on a Korean corpus.
- Model: Llama 2 Ko 7B GGUF
- Model creator: Meta
- Original model: Llama 2 7B
- Original Llama 2 Ko model: Llama 2 Ko 7B
- Reference: Llama 2 7B GGUF
First, install the huggingface-hub Python library:
pip3 install 'huggingface-hub>=0.17.1'
Then you can download any individual model file to the current directory, at high speed, with a command like this:
huggingface-cli download 24bean/Llama-2-7B-ko-GGUF llama-2-ko-7b_q8_0.gguf --local-dir . --local-dir-use-symlinks False
Or you can download llama-2-ko-7b.gguf, the non-quantized model, with:
huggingface-cli download 24bean/Llama-2-7B-ko-GGUF llama-2-ko-7b.gguf --local-dir . --local-dir-use-symlinks False
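The same download can also be scripted from Python with the huggingface_hub library. Below is a minimal sketch using hf_hub_download, with the repo and file names taken from the commands above:

from huggingface_hub import hf_hub_download

# Download the quantized model file into the current directory
hf_hub_download(
    repo_id="24bean/Llama-2-7B-ko-GGUF",
    filename="llama-2-ko-7b_q8_0.gguf",
    local_dir=".",
    local_dir_use_symlinks=False,  # store a real copy rather than a cache symlink
)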
Make sure you are using llama.cpp from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Then you can run the model with a command like this:
./main -ngl 32 -m llama-2-ko-7b_q8_0.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"
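Change `-ngl 32` to the number of layers to offload to the GPU, or remove the flag if you have no GPU acceleration, and change `-c 4096` to the desired context length. For a chat-style conversation, replace the `-p "{prompt}"` argument with `-i -ins`.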
You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
# Base ctransformers with no GPU acceleration
pip install 'ctransformers>=0.2.24'
# Or with CUDA GPU acceleration
pip install 'ctransformers[cuda]>=0.2.24'
# Or with ROCm GPU acceleration
CT_HIPBLAS=1 pip install 'ctransformers>=0.2.24' --no-binary ctransformers
# Or with Metal GPU acceleration for macOS systems
CT_METAL=1 pip install 'ctransformers>=0.2.24' --no-binary ctransformers
from ctransformers import AutoModelForCausalLM
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("24bean/Llama-2-ko-7B-GGUF", model_file="llama-2-ko-7b_q8_0.gguf", model_type="llama", gpu_layers=50)
print(llm("AI is going to"))
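The llama-cpp-python route is similar. Here is a minimal sketch, assuming llama-cpp-python is installed and llama-2-ko-7b_q8_0.gguf has been downloaded as shown above; set n_gpu_layers to 0 if you have no GPU acceleration:

from llama_cpp import Llama

# Load the local GGUF file and offload layers to the GPU if available
llm = Llama(
    model_path="llama-2-ko-7b_q8_0.gguf",
    n_ctx=4096,
    n_gpu_layers=32,
)
output = llm("AI is going to", max_tokens=128)
print(output["choices"][0]["text"])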
Here are guides on using llama-cpp-python and ctransformers with LangChain:
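For example, here is a minimal sketch that wires the llama-cpp-python backend into LangChain through its LlamaCpp wrapper (the import path assumes a recent LangChain release; older versions import it from langchain.llms):

from langchain_community.llms import LlamaCpp

# Point LangChain at the local GGUF file; set n_gpu_layers to 0 for CPU-only use
llm = LlamaCpp(
    model_path="llama-2-ko-7b_q8_0.gguf",
    n_ctx=4096,
    n_gpu_layers=32,
    temperature=0.7,
)
print(llm.invoke("AI is going to"))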