Releases: LostRuins/koboldcpp
koboldcpp-1.12
This is a bugfix release.
- Fixed a few more scenarios where GPT2/GPTJ/GPT-NeoX models would go out of memory when using BLAS. Also, the maximum BLAS batch size for non-llama models is currently capped at 256.
- Minor CLBlast optimizations should slightly increase speed.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
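As a minimal sketch of a typical first run (the model filename below is only a placeholder, not a file shipped with this release):

```
# start the server with any compatible ggml model (placeholder filename)
koboldcpp.exe ggml-model-q4_0.bin

# or, if running from the zip/source distribution instead of the .exe
python koboldcpp.py ggml-model-q4_0.bin

# then open http://localhost:5001 in a browser for the embedded Kobold Lite UI,
# or point the full KoboldAI client at that address
```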
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
- Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency.
- Run with your GPU using CLBlast, with the --useclblast flag, for a speedup.
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.11
- Now has GPT-NeoX / Pythia / StableLM support!
- Try my special model, Pythia-70m-ChatSalad here: https://huggingface.co/concedo/pythia-70m-chatsalad-ggml/tree/main
- Added upstream LoRA file support for llama; use the --lora parameter.
- Added limited fast-forwarding capabilities for RWKV: context can be reused if it's completely unmodified.
- Kobold Lite now supports using an additional custom stopping sequence, edit it in the Memory panel.
- Updated Kobold Lite, and pulled llama improvements from upstream.
- Improved OSX and Linux build support: the build now automatically compiles all libraries with the requested flags, and you can select which one to use at runtime. For example, run make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 and it will build both the OpenBLAS and CLBlast libraries on your platform; you can then select CLBlast with --useclblast at runtime (see the sketch below).
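A rough sketch of that flow on Linux or OSX (the model and LoRA filenames are placeholders, and the platform/device ids will differ per machine):

```
# build both the OpenBLAS and CLBlast backends in one pass
make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1

# pick the CLBlast backend at runtime (0 0 = example platform/device ids)
python koboldcpp.py ggml-model-q4_0.bin --useclblast 0 0

# llama models can also load an upstream LoRA file
# (assumed usage: pass the LoRA file path to --lora)
python koboldcpp.py ggml-llama-model.bin --lora my-lora-adapter.bin
```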
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
- Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency.
- Run with your GPU using CLBlast, with the --useclblast flag, for a speedup.
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.10
- Now has RWKV support without needing PyTorch, tokenizers, or other external libraries!
- Try RWKV-v4-169m here: https://huggingface.co/concedo/rwkv-v4-169m-ggml/tree/main
- Now allows launching the browser directly with the --launch parameter. You can also combine it with other flags, e.g. --stream --launch.
- Updated Kobold Lite, and pulled llama improvements from upstream.
- The API now reports the KoboldCpp version number via a new endpoint, /api/extra/version.
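For example (the model filename is just a placeholder), you could start the server with streaming and auto-launch, then query the new endpoint:

```
# --launch opens the browser once the model is loaded; it combines with other flags like --stream
koboldcpp.exe rwkv-model.bin --stream --launch

# the new endpoint reports the running KoboldCpp version
curl http://localhost:5001/api/extra/version
```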
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
- Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency.
- Run with your GPU using CLBlast, with the --useclblast flag, for a speedup.
koboldcpp-1.9
This was such a good update that I had to make a new version, so there are 2 new releases today.
- Now has support for stopping sequences fully implemented in the API! They've been implemented in a way that is similar and compatible to my United PR one-some/KoboldAI-united#5, and they should shortly be usable in online Lite, as well as (eventually) in the main Kobold client once that PR gets merged. This means the AI can now finish a response early even when not all the response tokens are consumed, saving time by sending the reply instead of generating excess unneeded tokens. It automatically integrates with the latest version of Kobold Lite, which sets the correct stop sequences for Chat and Instruct mode and is also updated here (a sample request is sketched after this list).
- GPT-J and GPT2 models now support BLAS mode! They use a smaller batch size than llama models, but prompt processing should still be very noticeably faster!
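As a rough illustration only: assuming the server exposes the usual KoboldAI-compatible generation endpoint and accepts a stop_sequence field (the endpoint path and field name are my assumption based on the United PR this mirrors, not something stated in these notes), a request using stopping sequences might look like:

```
# endpoint path and the stop_sequence field are assumptions based on the
# KoboldAI United API this feature follows -- adjust to the actual API
curl -X POST http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You: Hello there!\nBot:", "max_length": 80, "stop_sequence": ["You:"]}'
```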
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
- Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency.
- Run with your GPU using CLBlast, with the --useclblast flag, for a speedup! (Credits to Occam)
koboldcpp-1.8.1
- Another amazing improvement by @0cc4m: CLBlast now does the 4-bit dequantization on the GPU! That translates to about a 20% speed increase when using CLBlast for me, and should be a very welcome improvement. To use it, run with --useclblast [platform_id] [device_id] (you may have to figure out the correct values for your GPU through trial and error).
- Merged fixes and optimizations from upstream.
- Fixed a compile error on OSX.
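Since the right platform and device ids differ per machine, trial and error can be as simple as relaunching with different pairs until the startup log shows the GPU you expect (model filename is a placeholder):

```
# try a few platform/device id pairs until the expected GPU is picked up
koboldcpp.exe ggml-model-q4_0.bin --useclblast 0 0
koboldcpp.exe ggml-model-q4_0.bin --useclblast 1 0
koboldcpp.exe ggml-model-q4_0.bin --useclblast 0 1
```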
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
koboldcpp-1.7.1
- This release brings an exciting new feature, --smartcontext. This mode provides a way of prompt context manipulation that avoids frequent context recalculation.
- Merged optimizations from upstream.
- Updated embedded Kobold Lite to v20.
- Edit: A hotfix was deployed that fixed a tiny error in context calculation. The exe has been updated. If you downloaded 1.7 before it, please download it again.
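Enabling it is just a matter of adding the flag at launch (model filename is a placeholder):

```
# --smartcontext reduces how often the full prompt has to be reprocessed on long contexts
koboldcpp.exe ggml-model-q4_0.bin --smartcontext
```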
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
koboldcpp-1.6
- This is a bugfix release, to see whether it resolves the recently reported crashing issues.
- Recent CLBlast fixes merged, now shows GPU name.
- Batch size reduced back from 1024 to 512 due to reported crashes.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
koboldcpp-1.5
- This release consolidates a lot of upstream bug fixes and improvements; if you had issues with earlier versions, please try this one. The upstreamed GPT-J changes should also make GPT-J-6B inference faster by another 20% or so.
- Integrated AVX2 and non-AVX2 support into the same binary for Windows. If your CPU is very old and doesn't support AVX2 instructions, you can switch to compatibility mode with --noavx2, but it will be slower.
- Now has integrated experimental CLBlast support thanks to @0cc4m, which uses your GPU to speed up prompt processing. Enable it with --useclblast [platform_id] [device_id].
- To quantize various fp16 models, you can use the quantizers in the tools.zip. Remember to convert them from PyTorch/Hugging Face format first with the relevant Python conversion scripts (a rough sketch follows below).
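The names below are illustrative guesses at the usual llama.cpp-style workflow rather than exact tool names from this tools.zip; check the archive's actual contents and each tool's help output:

```
# 1) convert the original PyTorch/Hugging Face weights to an fp16 ggml file
#    (script name is a placeholder -- use the relevant conversion script from tools.zip)
python convert-hf-to-ggml.py ./my-model-dir ggml-model-f16.bin

# 2) quantize the fp16 file with one of the bundled quantizers
#    (exact arguments are assumed; the quantization type id may differ)
quantize.exe ggml-model-f16.bin ggml-model-q4_0.bin 2
```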
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the --help flag.
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
- If you prefer, you can download the zip file, extract it, and run the Python script manually, e.g. koboldcpp.py [ggml_model.bin]
koboldcpp-1.4
- This is an expedited bugfix release because the new model formats were breaking on large contexts.
- Also, people have requested mmap to be the default, so now it is; you can disable it with --nommap.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
Alternative Options:
None are provided for this release as it is a temporary one.
koboldcpp-1.3
- Bug fixes for various issues (missing endpoints, malformed URL).
- Merged upstream file-loading enhancements. mmap is now disabled by default; enable it with --usemmap.
- Can now automatically distinguish between older and newer GPTJ and GPT2 quantized files.
- Version numbers are now displayed at startup.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then, once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
Alternative Options:
- If your CPU is very old and doesn't support AVX2 instructions, you can try running the noavx2 version. It will be slower.
- If you prefer, you can download the zip file, extract it, and run the Python script manually, e.g. koboldcpp.py [ggml_model.bin]
- To quantize an fp16 model, you can use the quantize.exe in the tools.zip.