Calm is a Python 3.9+ package that makes it easier to work with large language models, from downloading a model to talking to it.
Calm automatically uses the right template for each model, supports multiple prompting styles, and chooses parameters based on your CPU, GPU and RAM.
Calm also provides advanced LLM features like Retrieval-Augmented Generation (RAG) with chromadb, multi-turn prompting with Guidance, and an OpenAI-compatible API to support external clients.
Calm is accelerated on Apple Silicon out of the box thanks to llama.cpp. Windows and Linux should generally work, but without GPU acceleration.
Quickly install calm, download a language model, and start talking to it.
pip install "git+https://github.com/iandennismiller/calm"
- download a language model
- ask a question
- add to knowledge
- ask a question using knowledge
calm download
calm say "What is the meaning of life?"
calm learn "My name is Gilgamesh."
calm say --kb "What is my name?"
calm list
mistral
samantha
mistral-openorca
Ask a question on the command line. Calm will create a model Instance to answer the question.
calm say "What is the meaning of life?"
AI Assistant: The meaning of life is a complex and multifaceted concept that has been pondered by philosophers, scientists, and individuals throughout history. There isn't a single definitive answer to this question, as it depends on one's personal beliefs, values, and experiences. However, some common themes in the search for meaning include finding purpose, happiness, and fulfillment through relationships, personal growth, and contributing positively to society.
Be sure to put quotes around the question so it is treated as a single argument.
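To see why the quotes matter, here is a small Python sketch using the standard-library shlex module, which splits a command line roughly the way a POSIX shell does. It only illustrates shell word splitting and is not part of calm.

```python
import shlex

# With quotes, the whole question arrives as one argument.
quoted = shlex.split('calm say "What is the meaning of life?"')

# Without quotes, every word becomes its own argument.
unquoted = shlex.split("calm say What is the meaning of life?")

print(quoted)         # ['calm', 'say', 'What is the meaning of life?']
print(len(unquoted))  # 8
```

In a real shell, an unquoted ? is also a wildcard character, which is one more reason to quote the question.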
To talk to a specific model, use the -m flag:
calm say -m samantha "How are you today?"
To talk to a specific character, use the -c flag:
calm say -c mixture-of-experts "How can we reduce traffic?"
The command below downloads a model called Samantha to the folder ~/.local/share/calm/models.
Calm will choose the right quant automatically by examining system RAM.
calm download samantha
calm can learn and recall facts in a knowledgebase.
By default, the storage path is ~/.local/share/calm/kb.
calm learn "I use github"
calm learn "I write code"
calm learn "I store my code on github"
calm learn "My name is Gilgamesh"
calm recall "name"
calm can retrieve knowledge to provide context for LLM response generation.
Provide the -k flag to tell calm to use knowledge:
calm say -k "Where does Gilgamesh store their code?"
Gilgamesh stores their code on GitHub
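The idea behind knowledge-assisted answers can be sketched in a few lines of Python. This is a toy: it ranks facts by simple word overlap instead of the chromadb embeddings calm uses, and the names words, retrieve, and build_prompt are hypothetical helpers, not calm's API.

```python
import re

# The facts taught with `calm learn` in the example above.
facts = [
    "I use github",
    "I write code",
    "I store my code on github",
    "My name is Gilgamesh",
]


def words(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, facts, top_n=3):
    """Return the facts sharing the most words with the question."""
    q = words(question)
    ranked = sorted(facts, key=lambda f: len(q & words(f)), reverse=True)
    return ranked[:top_n]


def build_prompt(question, facts):
    """Prepend the retrieved facts to the question as context."""
    context = "\n".join(f"- {fact}" for fact in retrieve(question, facts))
    return f"Known facts:\n{context}\n\nQuestion: {question}"


print(build_prompt("Where does Gilgamesh store their code?", facts))
```

A real retriever compares embedding vectors rather than shared words, so it can match facts that are phrased differently from the question.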
Store each knowledgebase in a separate path.
calm learn --path /tmp/kb1 "Facts about client 1"
calm learn --path /tmp/kb2 "Facts about client 17"
calm say --path /tmp/kb1 "Which client do I know facts about?"
Using multi-turn prompting with Guidance, simulate a Mixture of Experts and ask them a question.
This uses the character flag -c to select mixture-of-experts.
calm say -c mixture-of-experts "How can we reduce traffic?"
<|im_start|>system
You are a helpful and terse assistant.<|im_end|>
<|im_start|>user
I want a response to the following question:
How can we reduce traffic?
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.<|im_end|>
<|im_start|>assistant
I understand that you want me to provide information without directly answering the question. Here are 3 world-class experts and their respective fields, who could potentially offer valuable insights on reducing traffic ...<|im_end|>
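The transcript above uses a ChatML-style layout, where each turn is wrapped in <|im_start|> and <|im_end|> markers. As a rough sketch of how such a prompt could be assembled from role/content pairs (format_chatml is a hypothetical helper written for this example, not a function provided by calm or Guidance):

```python
def format_chatml(messages):
    """Wrap each message in ChatML-style start and end markers."""
    turns = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>"
        for message in messages
    ]
    return "\n".join(turns)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful and terse assistant."},
    {"role": "user", "content": "How can we reduce traffic?"},
])
print(prompt)
```

Other model families expect different markers, which is why calm selects the template per model.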
Use the mixture-of-experts character with the samantha model:
calm say -c mixture-of-experts -m samantha "How can we reduce traffic?"
Run the OpenAI-compatible API on localhost:
calm api
INFO: Started server process [99517]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
Run API with a specific model:
calm api -m samantha
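Because the API is OpenAI-compatible, a client can talk to it over plain HTTP. The sketch below uses only the Python standard library. The /v1/chat/completions route and the response shape follow OpenAI's conventions and are assumed, not confirmed, to match what calm api serves; build_chat_request and ask are hypothetical helper names, and whether the server honors the model field is also an assumption.

```python
import json
import urllib.request

# Assumption: the server follows OpenAI's route layout under this base URL.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(question, model="samantha"):
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the request; this only works while `calm api` is running."""
    with urllib.request.urlopen(build_chat_request(question)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("How are you today?")
print(request.full_url)
```

Calling ask("How are you today?") with the server running should return the reply text, provided the assumptions above hold.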
When I run calm max on my M1 MacBook Pro with 32 GB of RAM, I get the following output:
calm max
180b too big
70b too big
30b Q4_K_S quant 2048 context
13b Q6_K quant 8192 context
7b Q6_K quant 8192 context
3b Q6_K quant 8192 context
1b Q6_K quant 8192 context
It is possible for your system to support a larger context than the model architecture provides.
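A fit check like the one behind calm max might look something like the Python sketch below. This is not calm's actual logic: the 7B file sizes come from the llama.cpp figures quoted later in this document, while pick_quant, the linear scaling by parameter count, and the 0.6 RAM fraction are assumptions made for illustration. Context length is not modeled.

```python
# Approximate file sizes in GB for a 7B model, from the llama.cpp figures.
SIZE_GB_AT_7B = {"Q6_K": 5.15, "Q4_K_S": 3.59}

# Hypothetical: fraction of RAM the model file is allowed to occupy.
RAM_FRACTION = 0.6


def pick_quant(params_billions, ram_gb):
    """Return the best quant that fits in the budget, or None."""
    budget = ram_gb * RAM_FRACTION
    for quant in ("Q6_K", "Q4_K_S"):  # try the higher-quality quant first
        # Assumption: file size scales linearly with parameter count.
        estimated = SIZE_GB_AT_7B[quant] * params_billions / 7
        if estimated <= budget:
            return quant
    return None


for size in (180, 70, 30, 13, 7, 3, 1):
    print(f"{size}b", pick_quant(size, ram_gb=32) or "too big")
```

With these assumed values the sketch happens to pick the same quants as the 32 GB listing above, but a different headroom fraction would shift the cutoffs.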
To add a new model or character, create a new YAML description file in the descriptions folder:
Use these templates to get started:
Finally, please submit a pull request with your new YAML files in the descriptions folder.
calm automatically chooses the most sensible model size based on your system resources.
I've made opinionated choices in this regard and I believe the vast majority of use cases are supported by just three formats:
- f16: unquantized, largest size, slowest computation, and best results
- Q6_K: smaller and faster than unquantized, good results, often better than Q8_0
- Q4_K_S: smallest quant that still produces acceptable results
The following models are described as YAML for calm:
I would be remiss if I didn't mention the important contributions of TheBloke, who has provided great coverage of model quants.
In some cases, I am directly re-hosting models they have quantized, but in most cases I've performed my own conversions and quantizations to meet the specific goals of calm.
I refer to the llama.cpp quantize example to examine the impact of quantization upon perplexity.
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
According to these results, Q8_0 is a little worse than Q6_K despite being larger. Therefore, I've opted for Q6_K in all cases. I selected Q4_K_S for smaller systems as a compromise; perplexity is severely impacted below that size.
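The comparison can be checked mechanically. The Python sketch below copies a handful of rows from the table above and flags any quant that is "dominated", meaning some other quant is both no larger and no worse in perplexity. The dominated helper is written just for this example.

```python
# (file size in GB, perplexity change) for LLaMA-v1-7B, copied from the table.
QUANTS = {
    "Q4_0": (3.56, 0.2166),
    "Q4_K_S": (3.59, 0.0992),
    "Q4_K_M": (3.80, 0.0532),
    "Q5_K_M": (4.45, 0.0122),
    "Q6_K": (5.15, -0.0008),
    "Q8_0": (6.70, 0.0004),
}


def dominated(name):
    """True if another quant is no larger and no worse in perplexity."""
    size, ppl = QUANTS[name]
    return any(
        other != name and other_size <= size and other_ppl <= ppl
        for other, (other_size, other_ppl) in QUANTS.items()
    )


for name in QUANTS:
    print(name, "dominated" if dominated(name) else "on the frontier")
```

Among these rows only Q8_0 is dominated (by Q6_K), which matches the reasoning above. Every other row trades size against perplexity.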
The following projects are used by calm to support large language models.