
llmatic

No BS utilities for developing with LLMs

Why?

LLMs are basically API calls that need:

  • Swappability
  • Rate limit handling
  • Visibility into cost, latency, and performance
  • Response handling

So we solve exactly those problems, without any crazy vendor lock-in or abstractions ;)

Examples

Calling an LLM (notice that you can call whatever LLM API you want; you are really free to do as you please).

import openai
from llmatic import track, get_response

# Start tracking an LLM call.
# Results are persisted locally (SQLite by default; see the Results section below).
t = track(project_id="test", id="story")

# Make the LLM call; note you can use whatever API or LLM you want.
model = "gpt-4"
prompt = "Write a short story about a robot learning to garden, 300-400 words, be creative."
max_tokens = 500

response = openai.Completion.create(
    engine=model,
    prompt=prompt,
    max_tokens=max_tokens,
    n=1,
    stop=None,
    temperature=0.7
)

# End tracking and save call results
t.end(model=model, prompt=prompt, response=response) # Save the cost (inputs/outputs), latency (execution time)

# Evaluation
#  Evals act like the built-in `assert`. If you don't want them to affect your
#  runtime in production, use `log_only` (will only track the result) or
#  dev_mode=True (won't run at all - useful for production).

t.eval("Is the story engaging?", scale=(0,10), model="claude-sonnet")
t.eval("Does the story contain the words ['water', 'flowers', 'soil']", scale=(0,10)) # This will use function calling to check "flowers" in story_text_response.
t.eval("Did the response complete in less than 0.5s?", scale=(0,1), log_only=True) # This will not trigger a conditional_retry, just log/track the eval
t.eval("Was the story between 300-400 words?", scale=(0,1))

# Extract and print the generated text
generated_text = get_response(response) # equivalent to response.choices[0].text.strip()
print(generated_text)
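
How would an eval like the ones above work under the hood? Roughly, it's an LLM-as-judge call. Here is a minimal sketch, mirroring the completion-style call used above; the judge function, its prompt, and the JSON scoring format are illustrative assumptions, not llmatic's actual implementation:

import json
import openai

def judge(question: str, text: str, scale=(0, 10), model="gpt-4") -> float:
    # Ask a model to score the text and reply with machine-readable JSON.
    low, high = scale
    judge_prompt = (
        f"Rate the following text on a scale from {low} to {high}.\n"
        f"Question: {question}\n"
        f"Text:\n{text}\n"
        'Reply with JSON only, e.g. {"score": 7}'
    )
    completion = openai.Completion.create(
        engine=model,
        prompt=judge_prompt,
        max_tokens=20,
        temperature=0,
    )
    return float(json.loads(completion.choices[0].text.strip())["score"])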

Retries?

Well, we will need to take our relationship to the next level :)

To get the most out of llmatic, we need to wrap the relevant LLM calls in a context manager (with llm(...)):

import openai
from llmatic import track, llm

# Start tracking the LLM call.
t = track(project_id="test", id="story")

model = "gpt-4"
prompt = "Write a short story about a robot learning to garden, 300-400 words, be creative."
max_tokens = 500

with llm(retries=3, tracker=t): # This will also retry on rate limit errors
    response = openai.Completion.create(
        engine=model,
        prompt=prompt,
        max_tokens=max_tokens,
        n=1,
        stop=None,
        temperature=0.7
    )
t.end(model=model, prompt=prompt, response=response) # Save the cost (inputs/outputs) and latency (execution time)

# Eval
t.eval("Was the story between 300-400 words?", scale=(0,1))
t.eval("Is the story creative?", scale=(0,10), "claude-sonnet")
t.conditional_retry(lambda score: score > 7, scale=(0,10), max_retry=3) # If our condition isn't met, retry the llm again
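
A note on the retry semantics: a Python with block can't transparently re-run its own body, so a sketch of the behaviour described above is easier to show as a plain helper. Everything here (the helper name, the linear backoff, reusing the pre-1.0 openai error class) is an assumption for illustration, not llmatic's implementation:

import time
import openai

def call_with_retries(make_call, retries=3, backoff=2.0):
    # Retry on rate limits, waiting a bit longer each time; re-raise after the last attempt.
    for attempt in range(retries):
        try:
            return make_call()
        except openai.error.RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))

response = call_with_retries(
    lambda: openai.Completion.create(engine=model, prompt=prompt, max_tokens=max_tokens)
)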

Results (a.k.a. LangSmith killer)

We use a CLI tool to check them out. By default, tracking results are saved in an SQLite database located at $HOME/.llmatic/llmatic.db.

  • List all trackings: llmatic list

  • Output the last track: llmatic show <project_id> <tracking_id> (for example, llmatic show example_project story)

    This command will produce output like:

    LLM Tracking Results for: story
    ------------------------------------------------------------
    Tracking ID: story
    Model: gpt-4
    File and Function: examples_basic_main
    Prompt: "Write a short story about a robot learning to garden, 300-400 words, be creative." (Cost: $0.000135)
    Execution Time: 564 ms
    Total Cost: $0.000150
    Tokens: 30 (Prompt: 10, Completion: 20)
    Created At: 2024-05-26 15:50:00
    
    Evaluation Results:
    ------------------------------------------------------------
    Is the story engaging? (Model: claude-sonnet): 8.00
    Does the story contain the words ['water', 'flowers', 'soil']? (Model: claude-sonnet): 10.00
    
    Generated Text:
    ------------------------------------------------------------
    Once upon a time in a robotic garden... (Cost: $0.000015)
    
    

  • Summarize all trackings for a project: llmatic summary <project_id> (for example, llmatic summary example_project)

    This command will produce output like:

    Run summary for project: example_project
    ------------------------------------------------------------
    story - [gpt-4], 0.56s, $0.000150, examples_basic_main
    another_story - [gpt-4], 0.60s, $0.000160, examples_another_function

  • Compare trackings: llmatic compare <id> <id2> (TBD)
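
For reference, the fields in the show and summary output above map roughly onto a record like the one below. This is only a sketch to make the data model concrete; the actual column names in llmatic.db are not documented here and are assumptions:

from dataclasses import dataclass

@dataclass
class TrackingRecord:
    project_id: str         # e.g. "example_project"
    tracking_id: str        # e.g. "story"
    model: str              # e.g. "gpt-4"
    prompt: str
    response_text: str
    total_cost_usd: float   # e.g. 0.000150
    execution_ms: int       # e.g. 564
    prompt_tokens: int      # e.g. 10
    completion_tokens: int  # e.g. 20
    created_at: str         # e.g. "2024-05-26 15:50:00"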

CLI Usage

List all trackings:

poetry run llmatic list

Output the last track:

poetry run llmatic show <project_id> <tracking_id>

Summarize all trackings for a project:

poetry run llmatic summary <project_id>

Remove all trackings for a project:

poetry run llmatic remove <project_id>
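
Because the results live in a plain SQLite file (see the Results section above), you can also poke at the raw data directly. A minimal sketch, assuming only the default database path; the table layout isn't documented here, so list the tables before querying anything:

import os
import sqlite3

db_path = os.path.expanduser("~/.llmatic/llmatic.db")  # default location
conn = sqlite3.connect(db_path)
# List table names before assuming any schema.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)
conn.close()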

TODO:

  • Write the actual code :)
  • TS version?
  • Think about CLI utilities for improving prompts from the trackings.
  • Think of ways to compare different models and prompts in a matrix style for best results using the eval values.
  • GitHub Action for evals and results.

More why?

Most LLM frameworks try to lock you into their ecosystem, which is problematic for several reasons:

  • Complexity: Many LLM frameworks are just wrappers around API calls. Why complicate things and add a learning curve?
  • Instability: The LLM space changes often, leading to frequent syntax and functionality updates. This is not a solid foundation.
  • Inflexibility: Integrating other libraries or frameworks becomes difficult due to hidden dependencies and incompatibilities. You need an agnostic way to use LLMs regardless of the underlying implementation.
  • TDD-first: When working with LLMs, it's essential to ensure the results are adequate. This means adopting a defensive coding approach. However, no existing framework treats this as a core value.
  • Evaluation: Comparing different runs after modifying your model, prompt, or interaction method is crucial. Current frameworks fall short here (yes, LangSmith), often pushing vendor lock-in with their SaaS products.
