Releases: LostRuins/koboldcpp

**koboldcpp-1.22-CUDA-ONLY**

15 May 16:04

koboldcpp-1.22-CUDA-ONLY (Special Edition)

A.K.A The "Look what you made me do" edition.

Changes:

  • This is a (one-off?) limited edition CUDA only build.
  • Only NVIDIA GPUs will work for this.
  • This build does not support CLBlast or OpenBLAS. Selecting the OpenBLAS or CLBlast options still loads CUBLAS.
  • This build does not support running old quantization formats (this is a limitation of the upstream CUDA kernel).
  • This build DOES support GPU offloading via CUBLAS. To use that feature, specify the number of layers to offload, e.g. --gpulayers 32
  • This build is very large because of the CUBLAS libraries bundled with it. It requires CUDA Runtime 11.8 or later.
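
As a rough illustration of the --gpulayers option above, here is a minimal sketch that assembles a launch command in Python. The model filename is hypothetical, and passing the model path as the first positional argument is an assumption; check --help for the exact syntax supported by your build.

```python
import shlex


def build_launch_command(model_path, gpu_layers=32, exe="koboldcpp_CUDA_only.exe"):
    """Return the argument list for launching the CUDA-only build with GPU offloading.

    Assumption: the model path is accepted as the first positional argument.
    """
    if gpu_layers < 0:
        raise ValueError("gpu_layers must be non-negative")
    return [exe, model_path, "--gpulayers", str(gpu_layers)]


if __name__ == "__main__":
    # "model.ggml.bin" is a placeholder filename, not a real model.
    cmd = build_launch_command("model.ggml.bin", gpu_layers=32)
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to subprocess.run if you prefer to start the server from a script.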

For those who want the previous version, please find v1.21.3 here: https://github.com/LostRuins/koboldcpp/releases/tag/v1.21.3

To use, download and run the koboldcpp_CUDA_only.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
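
Besides opening that address in a browser, the server can also be called from a script. The sketch below is a minimal example, assuming the server exposes a KoboldAI-style endpoint at /api/v1/generate that accepts a JSON body with "prompt" and "max_length" and returns {"results": [{"text": ...}]}. The endpoint path, field names and response shape are assumptions and are not stated in these release notes.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:5001", max_length=80):
    """Build (but do not send) a POST request for text generation.

    Assumption: a KoboldAI-style /api/v1/generate endpoint is available.
    """
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (assumed response shape)."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["results"][0]["text"]
```

Calling generate("Once upon a time") requires a running server with a loaded model; build_generate_request can be inspected without one.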

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.21.3

13 May 05:06

koboldcpp-1.21.3

KNOWN ISSUES: PLEASE READ

  • If you are using v1.21.0 or v1.21.1, there's a misalignment in one of the structs which can cause some models to output nonsense randomly. Please update to v1.21.2 or later.
  • CLBlast seems to be broken on q8_0 formats in v1.21.0 to v1.21.2. Please update to v1.21.3.

Changes:

  • Integrated the new quantization formats while maintaining backward compatibility for all older ggml model formats. This was a massive undertaking and it's possible there may be bugs, so please do let me know if anything is broken!
  • Fixed some rare out of memory errors that occurred when using GPT2 models with BLAS.
  • Updated Kobold Lite: New features include multicolor names, idle chat responses, toggle for the instruct prompts, and various minor fixes.

1.21.1 edit:

  • Cleaned up some unnecessary prints regarding BOS first token. Added an info message encouraging OSX users to use Accelerate instead of OpenBLAS since it's usually faster (--noblas)

1.21.2 edit:

  • Fixed an error with the OpenCL kernel failing to compile on certain platforms. Please help check.
  • Fixed a problem where logits would sometimes be NaN due to an unhandled change in the size of the Q8_1 struct. This also affected other formats such as NeoX, RedPajama and GPT2, so you are recommended to upgrade to 1.21.2.

1.21.3 edit:

  • Recognize q8_0 as an older format, as the new CLBlast kernel doesn't work correctly with it.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.20

08 May 13:19

koboldcpp-1.20

  • Added an option to allocate more RAM for massive context sizes, to allow testing with models with > 2048 context. You can change this with the flag --contextsize
  • Added experimental support for the new RedPajama variant of GPT-NeoX models. As the model formats are nearly identical to Pythia, this was particularly tricky to implement. This uses a very ugly hack to determine whether it's a RedPajama model. If detection fails, you can always force it with the flag --forceversion

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.19.1

06 May 04:12

koboldcpp-1.19

  • Integrated the --usemirostat option for all model types. This must be set at launch, and replaces your normal stochastic samplers with mirostat. Takes 3 params [type][tau][eta], e.g. --usemirostat 2 5.0 0.1. Works on all models, but noticeably bad on smaller ones. Follows the upstream implementation. More info here.

  • Added an option --forceversion [ver]. If the model file format detection fails (e.g. a rogue modified model) you can set this to override the detected format (enter the desired version, e.g. 401 for GPTNeoX-Type2).

  • Added an option --blasthreads, which controls threads when CLBlast is active. Some people wanted to use a different thread count when CLBlast was active and got overall speedups, so now you can experiment. Uses the same value as --threads if not specified.

  • Integrated new improvements for RWKV. This provides support for all the new RWKV quantizations, but drops support for Q4_1_O following the upstream - this way I only need to maintain one library. RWKV q5_1 should be much faster than fp16 but perform similarly.

  • Bumped up the buffer size slightly to support Chinese alpaca.

  • Integrated upstream changes and improvements, various small fixes and optimizations.

  • Fixed a bug where the GPU device was set incorrectly in CLBlast.

  • Special: An experimental Windows 7 Compatible .exe is included for this release, to attempt to provide support for older OS. Let me know if it works (for those still stuck on Win7). Don't expect it to be in every release though.
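
To give a feel for what the tau and eta parameters of --usemirostat control, here is a minimal, self-contained sketch of one Mirostat type 2 sampling step in pure Python. It is an illustration of the published algorithm under simplifying assumptions (a plain token-to-probability dict, surprise measured in bits), not the code used in KoboldCpp. The function name is hypothetical.

```python
import math
import random


def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """Run one Mirostat 2 step over a token -> probability dict.

    Returns (chosen_token, updated_mu). A common starting value is mu = 2 * tau.
    """
    # Keep only tokens whose surprise (-log2 p) does not exceed mu.
    candidates = [(tok, p) for tok, p in probs.items()
                  if p > 0 and -math.log2(p) <= mu]
    if not candidates:
        # Always keep at least the most likely token.
        candidates = [max(probs.items(), key=lambda kv: kv[1])]

    # Sample from the renormalized candidates.
    total = sum(p for _, p in candidates)
    r = rng.random() * total
    acc = 0.0
    chosen, chosen_p = candidates[-1]
    for tok, p in candidates:
        acc += p
        if r < acc:
            chosen, chosen_p = tok, p
            break

    # Move mu toward the target surprise tau with learning rate eta.
    observed_surprise = -math.log2(chosen_p / total)
    mu -= eta * (observed_surprise - tau)
    return chosen, mu
```

A higher tau allows more surprising tokens on average, while eta sets how quickly mu reacts to each sampled token.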

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.18

02 May 14:52

koboldcpp-1.18

  • This release brings a new feature within Kobold Lite - Group Conversations. In chat mode, you can now specify multiple Chat Opponents (delimited with ||$||) which will trigger a simulated group chat, allowing the AI to reply as different people. Note that this does not work very well in Pygmalion models, as they were trained mainly on 1 to 1 chat. However it seems to work well in LLAMA based models. Each chat opponent will add a custom stopping sequence (max 10). Works best with Multiline Replies disabled. To demonstrate this, a new Scenario Class Reunion has been added in Kobold Lite.
  • Added a new flag --highpriority, which increases the CPU priority of the process, potentially speeding up generation timings. See #133; your mileage may vary depending on memory bottlenecks. Do share if you experience significant speedups.
  • Added the --usemlock parameter to keep model in RAM, for Apple M1 users.
  • Fixed a stop_sequence bug which caused a crash
  • Added error information display when the tkinter GUI fails to load
  • Pulled upstream changes and fixes.
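
The Group Conversations feature above splits the Chat Opponents field on the ||$|| delimiter. The sketch below shows one way such a field could be parsed. The function name is hypothetical, the "\nName:" stop-sequence format is an assumption made for illustration, and the cap of 10 is my reading of "max 10" in the note above. Kobold Lite's actual logic may differ.

```python
def parse_chat_opponents(field, delimiter="||$||", max_stops=10):
    """Split a Chat Opponents field into names and per-name stopping sequences.

    Assumption: each opponent contributes a stop sequence of the form "\\nName:".
    """
    # Drop empty entries and surrounding whitespace.
    names = [name.strip() for name in field.split(delimiter) if name.strip()]
    # Cap the number of stopping sequences (assumed limit of 10).
    stops = ["\n" + name + ":" for name in names][:max_stops]
    return names, stops


if __name__ == "__main__":
    names, stops = parse_chat_opponents("Alice||$||Bob||$||Carol")
    print(names)
    print(stops)
```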

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.17

01 May 16:57

koboldcpp-1.17

  • Removed Cloudflare Insights - this was previously in Kobold Lite and was included in KoboldCpp. For disclosure: Cloudflare Insights is a GDPR-compliant tool that Kobold Lite previously used to provide information on browser and platform distribution (e.g. ratio of desktop/mobile users) and browser type (chrome/firefox etc), to determine which browser platforms I have to support for Kobold Lite. You can read more about it here: https://www.cloudflare.com/insights/ It did not track any personal information, and did not relay any data you load, use, enter or access within Kobold. It was not intended to be included in KoboldCpp, and I originally removed it but forgot to do so in subsequent versions. As of this version, it is removed from both Kobold Lite and KoboldCpp by request.

  • Added the Token Unbanning to the UI, and allowed it to prevent generation of the EOS token, which is required for newer Pygmalion models. You can trigger it with --unbantokens

  • Pulled upstream fixes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.16

30 Apr 06:31

koboldcpp-1.16

  • Integrated the overhauled Token Samplers. The whole sampling system has been reworked for Top-P, Top-K and Rep Pen; all model architectures and types now use the same sampling functions. Also added 2 new samplers - Tail Free Sampling (TFS) and Typical Sampling. As I did not test the new implementations for correctness, please let me know if you are experiencing weird results (or degradations for previous samplers).
  • Integrated CLBlast support for the q5_0 and q5_1 formats. Note: Upstream llama.cpp repo has completely removed support for the q4_3 format. For now I still plan to keep support for q4_3 available within KoboldCpp but you are strongly advised not to use q4_3 anymore. Please switch or reconvert any q4_3 models if you can.
  • Fixed a few edge cases with GPT2 models going OOM with small batch sizes.
  • Fixed a regression where older GPT-J models (e.g. the original model from Alpin's Pyg.cpp fork) failed to load due to some upstream changes in the GGML library. You are strongly advised to not use outdated formats - reconvert if you can, it will be faster.
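
For readers unfamiliar with the Top-K and Top-P samplers mentioned above, here is a minimal pure-Python sketch of how the two filters are commonly combined. It illustrates the general technique only and is not the reworked sampler code from this release. The function name and default values are hypothetical.

```python
import math


def top_k_top_p_filter(logits, top_k=40, top_p=0.9):
    """Apply Top-K, then Top-P (nucleus) filtering to a list of logits.

    Returns a dict mapping surviving token index -> renormalized probability.
    """
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-K: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}
```

A token would then be drawn from the returned distribution, for example with random.choices.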

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17

koboldcpp-1.15

28 Apr 07:23

koboldcpp-1.15

  • Added a brand new "Easy Mode" GUI which triggers if no command line arguments are set. This is aimed to be a noob-friendly way to get into KoboldCpp, but for full functionality you are still advised to run it from the command line with customized arguments. You can skip it with any command line argument, or using the flag --skiplauncher which does nothing else.
  • Pulled the new quantization format support for q5_0 and q5_1 for llama.cpp from upstream. Also pulled the q5 changes for GPT-2, GPT-J and GPT-NeoX formats. Note that these will not work in CLBlast yet - but OpenBLAS should work fine.
  • Added a new flag --debugmode which shows the Tokenized prompt being sent to the backend within the terminal window.
  • Setting --stream flag now automatically redirects the URL in the embedded Kobold Lite UI, no need to type ?streaming=1 anymore.
  • Updated Kobold Lite, now supports multiple custom stopping sequences which you can specify, separating in the UI with the ||$|| delimiter. Lite also now saves your custom stopping sequences into your save files and autosaves.
  • Merged upstream fixes and improvements.
  • Minor console fixes for Linux, and OSX compatibility.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17

koboldcpp-1.14

26 Apr 15:38

koboldcpp-1.14

  • Added backwards compatibility for an older version of NeoX with different quantizations
  • Fixed a few scenarios where users may encounter OOM crashes
  • Pulled upstream updates

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with --useclblast flag for a speedup

Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17

koboldcpp-1.13.1

24 Apr 13:55

koboldcpp-1.13.1

  • A multithreading bug fix has allowed CLBlast to greatly increase prompt processing speed. It should now be up to 50% faster than before, and just slightly slower than CuBLAS alternatives. Because of this, we probably will no longer need to integrate CuBLAS.
  • Merged the q4_2 and q4_3 CLBlast dequantization kernels, allowing them to be used with CLBlast.
  • Added a new flag --unbantokens. Normally, KoboldAI prevents certain tokens such as EOS and Square Brackets. This flag unbans them.
  • Edit: Fixed compile errors, made mmap automatic when a LoRA is selected, added updated quantizers, and added quantization handling for GPT-NeoX, GPT-2 and GPT-J.
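
Token banning of the kind --unbantokens switches off is often implemented by masking logits before sampling. The sketch below shows that general idea only. The function name is hypothetical and the masking approach is an assumption, not a description of how KoboldCpp handles banned tokens internally.

```python
import math


def apply_token_bans(logits, banned_ids, unban_tokens=False):
    """Return a copy of logits with banned token ids masked out.

    When unban_tokens is True (mirroring the --unbantokens flag), nothing is masked.
    """
    if unban_tokens:
        return list(logits)
    banned = set(banned_ids)
    # A logit of -inf gives the token zero probability after softmax.
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]
```

With the ban active, a masked EOS token can never be sampled, which is why models that rely on EOS to end their replies need the flag.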

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model has loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with --useclblast flag for a speedup

Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17