This project focuses on designing and self-testing GAT LLMs (Large Language Models) that can effectively use a variety of tools to accomplish tasks.
Demonstration (will take you to YouTube):
Paper pre-print: available in the `paper` folder.
This project implements a flexible framework for:
- Integrating various tools with LLMs
- Generating test cases to evaluate LLM performance in tool selection and usage
- Performing self-tests on different LLM models
- Analyzing the results of these tests
The system supports multiple LLM providers (including OpenAI, Anthropic, and AWS Bedrock) and a wide range of tools for tasks such as date calculations, web scraping, plotting, file operations, and more.
With the current prompts, tools, tool descriptions, and native tool use settings, the performance of the tested LLMs on GAT tasks is as follows.
Note: this is not a leaderboard or a general evaluation of model quality. It only reflects this test setting, which simulates an industrial LLM GAT implementation.
| Model | Native tool use | Invented tools (total) | Accuracy (%) | Score (%) | Input cost (USD / 1M tokens) | Output cost (USD / 1M tokens) |
|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet - Anthropic | False | 0 | 78 | 89.5 | 3 | 15 |
| GPT 4o - OpenAI | True | 1 | 79.9 | 89.4 | 5 | 15 |
| GPT 4o mini - OpenAI | True | 3 | 79.9 | 89 | 0.15 | 0.6 |
| Claude 3.5 Haiku - Anthropic | True | 2 | 76.6 | 89 | 1 | 5 |
| Amazon Nova Pro 1.0 - Bedrock | True | 1 | 78 | 88.7 | 0.8 | 3.2 |
| Claude 3.5 Sonnet - Anthropic | True | 0 | 76.6 | 88.7 | 3 | 15 |
| Claude 3 Haiku - Bedrock | True | 2 | 77.5 | 88.6 | 0.25 | 1.25 |
| Claude 3.5 Haiku - Anthropic | False | 9 | 73.9 | 87.9 | 1 | 5 |
| GPT 4o - OpenAI | False | 4 | 76.6 | 87.7 | 5 | 15 |
| Llama3_1 405b instruct | False | 3 | 75.5 | 87 | 5.32 | 16 |
| Mistral Large v1 | False | 1 | 74.7 | 86.8 | 4 | 12 |
| GPT 4o mini - OpenAI | False | 3 | 73.1 | 85.1 | 0.15 | 0.6 |
| Command RPlus - Bedrock | False | 4 | 72.8 | 83.8 | 3 | 15 |
| Claude 3 Haiku - Bedrock | False | 3 | 70.6 | 83.3 | 0.25 | 1.25 |
| Sabia3 - Maritaca | True | 6 | 70.6 | 83.2 | 0.95 | 1.9 |
| Amazon Nova Lite 1.0 - Bedrock | True | 2 | 66.2 | 80.2 | 0.06 | 0.24 |
| Llama3_1 70b instruct | False | 11 | 70 | 79.6 | 2.65 | 3.5 |
| GPT 3.5 - OpenAI | False | 2 | 65.4 | 78.6 | 0.5 | 1.5 |
| GPT 3.5 - OpenAI | True | 18 | 66.4 | 76.9 | 0.5 | 1.5 |
| Sabia3 - Maritaca | False | 14 | 61.8 | 75.7 | 0.95 | 1.9 |
| Mistral Mixtral 8x7B | False | 156 | 50.1 | 67.5 | 0.45 | 0.7 |
| Amazon Nova Micro 1.0 - Bedrock | True | 145 | 52.5 | 66.5 | 0.035 | 0.14 |
| Command R - Bedrock | False | 117 | 49.7 | 65.4 | 0.5 | 1.5 |
| Llama3 8b instruct | False | 39 | 22.3 | 38.1 | 0.3 | 0.6 |
| Llama3 70b instruct | False | 29 | 29.1 | 36.1 | 2.65 | 3.5 |
| Llama3_1 8b instruct | False | 34 | 23.9 | 33.7 | 0.3 | 0.6 |
To use this code and run the implemented tools, follow these steps:
1. Clone this repository and `cd` to the repository folder.
2. Set up the environment:
   - If using conda, create the environment: `conda env create -f environment.yml`
   - Alternatively, install the requirements directly from `requirements.txt`
   - Activate the environment with `conda activate llm_gat_env`
3. Set up your API keys (depending on what tools and LLM providers you need; an optional key check is sketched after this list):
   - For Linux:
     ```bash
     export AWS_ACCESS_KEY_ID=your_aws_access_key
     export AWS_SECRET_ACCESS_KEY=your_aws_secret_key
     export ANTHROPIC_API_KEY=your_anthropic_key
     export OPENAI_API_KEY=your_openai_key
     export MARITACA_API_KEY=your_maritaca_key
     ```
   - For Windows:
     ```bat
     set AWS_ACCESS_KEY_ID=your_aws_access_key
     set AWS_SECRET_ACCESS_KEY=your_aws_secret_key
     set ANTHROPIC_API_KEY=your_anthropic_key
     set OPENAI_API_KEY=your_openai_key
     set MARITACA_API_KEY=your_maritaca_key
     ```
4. Open and run `GAT-demo.ipynb` to launch the Gradio demo.
5. Access the demo:
   - Click the `localhost` link to open the interface.
   - To share the demo with a public Gradio link, set `share=True` in the launch command: `demo.queue().launch(show_api=False, share=True, inline=False)`
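Optionally, before launching the demo, you can check from Python which of these keys are visible in your environment. This is a generic sanity check, not part of the project's code; only the keys for the providers and tools you actually use need to be set:

```python
import os

# Optional sanity check: report which provider keys are present in the environment
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "ANTHROPIC_API_KEY",
            "OPENAI_API_KEY", "MARITACA_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'missing'}")
```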
The Jupyter Notebook (`GAT-demo.ipynb`) provides a convenient interface for inspecting:
- Direct tool call results
- Prompts used for LLM interactions
- Other relevant information about the system's operation
Refer to the comments in the notebook for detailed explanations of each section.
To add a new tool to the system:
1. Create a new Python file in the `tools` folder (e.g., `new_tool.py`)
2. Define a new class for your tool (e.g., `ToolNewTool`)
3. Implement the following methods:
   - `__init__`: Initialize the tool, set its name and description
   - `__call__`: Implement the tool's functionality
4. Add the tool description in the `tool_description` attribute, following the format used in other tools
5. In `tools/base.py`, import your new tool and add it to the `get_all_tools` method in the `LLMTools` class (a registration sketch follows the example below)
Example structure for a new tool:
```python
class ToolNewTool:
    def __init__(self):
        self.name = "new_tool_name"
        self.tool_description = {
            "name": self.name,
            "description": "Description of what the tool does",
            "input_schema": {
                "type": "object",
                "properties": {
                    "param1": {"type": "string", "description": "Description of param1"},
                    # Add more parameters as needed
                },
                "required": ["param1"],
            },
        }

    def __call__(self, param1, **kwargs):
        # Implement the tool's functionality here and return its result
        result = f"Processed {param1}"  # replace with your own logic
        return result
```
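Registration in `tools/base.py` might then look roughly like the sketch below. The exact signature of `get_all_tools` and the surrounding code may differ, so follow the pattern already used for the existing tools:

```python
# tools/base.py (sketch only; adapt to the existing LLMTools implementation)
from tools.new_tool import ToolNewTool


class LLMTools:
    def get_all_tools(self):
        # Append an instance of the new tool to the list of available tools
        return [
            # ... existing tool instances ...
            ToolNewTool(),
        ]
```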
To remove a tool from the system:
1. Delete the tool's Python file from the `tools` folder
2. Remove the tool's import and reference from `tools/base.py`
3. Update any test cases or documentation that reference the removed tool
To add support for a new LLM:
1. Create a new file in the `llm_providers` folder (e.g., `new_llm_provider.py`)
2. Implement a class for the new LLM, following the interface used by the existing LLM classes (see the sketch after this list)
3. In `llm_invoker.py`, import your new LLM class and add it to the `allowed_llms` list in the `LLM_Provider` class
4. Implement the necessary logic in the `get_llm` method of `LLM_Provider` to instantiate your new LLM
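As a rough illustration only (the class and method names below are assumptions, not the project's actual interface), a new provider file could follow the shape of the existing classes in `llm_providers`:

```python
# llm_providers/new_llm_provider.py (illustrative sketch; mirror the interface
# of the existing provider classes rather than this exact signature)
import os


class NewLLMProvider:
    def __init__(self, model_name: str):
        self.model_name = model_name
        # Hypothetical environment variable name for the provider's credentials
        self.api_key = os.environ.get("NEW_PROVIDER_API_KEY")

    def __call__(self, prompt: str, **kwargs) -> str:
        # Send the prompt to the provider's API and return the generated text
        raise NotImplementedError("Connect this to the provider's SDK or HTTP API")
```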
The project includes a comprehensive self-assessment system for evaluating LLM performance in tool selection and usage. All self-generated test cases and the test results of each LLM are stored in the `self_tests` folder.
The `SelfTestGenerator` class in `self_tests/self_test_generator.py` is responsible for creating test cases. It supports three strategies for test case generation:
- `use_all`: Generates test cases for all tools in a single prompt
- `only_selected`: Generates test cases for each tool individually
- `selected_with_dummies`: Generates test cases for specific tools while providing all tools as options
To generate test cases:
1. Instantiate a `SelfTestGenerator` with the desired LLM
2. Call the `gen_test_cases` method with the number of test cases and the desired strategy (a usage sketch follows this list)
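A minimal usage sketch is shown below. How the LLM object is obtained (here via `LLM_Provider.get_llm`) and the exact argument names of `gen_test_cases` depend on `llm_invoker.py` and the class itself, so treat these lines as a starting point rather than the definitive call:

```python
from llm_invoker import LLM_Provider
from self_tests.self_test_generator import SelfTestGenerator

# Obtain an LLM to generate test cases (model name shown is just an example)
llm = LLM_Provider.get_llm("Claude 3.5 Sonnet - Anthropic")

generator = SelfTestGenerator(llm)
# e.g. 10 test cases, generated for each tool individually
test_cases = generator.gen_test_cases(10, "only_selected")
```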
The `SelfTestPerformer` class in `self_tests/self_test_performer.py` executes the generated test cases to evaluate LLM performance.
To run self-tests:
1. Prepare test case files (JSON format) using the `SelfTestGenerator`
2. Instantiate a `SelfTestPerformer` with the LLM you want to test
3. Call the `test_tool_use` method with the test cases (see the sketch below)
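Continuing the generation sketch above (again, exact signatures may differ from this illustration):

```python
from self_tests.self_test_performer import SelfTestPerformer

performer = SelfTestPerformer(llm)             # the LLM under test, obtained as before
results = performer.test_tool_use(test_cases)  # test cases loaded from the generated JSON files
```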
The results are saved in CSV format, allowing for easy analysis and comparison of different LLM models and configurations.
Use the utility functions in `self_tests/self_test_utils.py` to analyze the test results, including functions to detect invented tools, check for correct tool selection, and calculate performance scores.
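For instance, the saved CSVs can be loaded with pandas for a quick side-by-side comparison. The file path and column names below are hypothetical and should be matched to the actual files produced in `self_tests`:

```python
import pandas as pd

# Hypothetical file and columns; adjust to the real CSVs produced by SelfTestPerformer
df = pd.read_csv("self_tests/results_example.csv")
print(df.groupby("model")["score"].mean().sort_values(ascending=False))
```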