This repo contains the code and data for the paper "An Evaluation Mechanism of LLM-based Agents on Manipulating APIs", accepted to EMNLP 2024 Findings. In this work, we release an evaluation kit consisting of (1) a set of tools, (2) data for 8 evaluation tasks examining different aspects of tool-use capability, and (3) automatic scoring scripts for the 8 evaluation tasks.
- `data/`: instruction data, files (mainly images) used when calling tools, and evaluation data for each task.
- `toolset/`: executable tools along with their corresponding documentation.
- `eval_scripts/`: code for LLMs such as GPT-4 to perform the tasks, plus automatic scoring.
- `reproduce/`: code to assist in reproducing this work, e.g., prompts for generating initial instructions, searching for similar tools, producing task data from instructions, etc.
Python 3.9 with the packages listed in `requirements.txt`.
Most tools are implemented by wrapping APIs deployed on RapidAPI, while a few are implemented by us. These tools have a unified interface of calling functionalities and accessing documentation.
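As a rough sketch of what such a unified interface could look like (the class and method names below are hypothetical, not the repo's actual API):

```python
# Hypothetical sketch of the unified tool interface; the actual class and
# method names in toolset/ may differ.
class Tool:
    """Wraps a single API (RapidAPI-backed or self-implemented)."""

    def __init__(self, name: str):
        self.name = name

    def documentation(self) -> str:
        """Return the tool's documentation (parameters, output format)
        so that an LLM agent can read it before calling the tool."""
        raise NotImplementedError

    def call(self, **kwargs) -> dict:
        """Invoke the underlying API with keyword arguments and return
        a JSON-like result."""
        raise NotImplementedError

# Intended usage (hypothetical):
#   weather = Tool("weatherapi_com")
#   print(weather.documentation())
#   result = weather.call(q="Berlin")
```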
To use this toolset, you need an X-RapidAPI-Key, which you can obtain by registering on RapidAPI. Then, set your X-RapidAPI-Key in `toolset/config.yml`.
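A minimal sketch of how the key might then be picked up from the config file and placed into RapidAPI request headers, assuming the YAML field is named `X-RapidAPI-Key` (the actual field name may differ):

```python
# Minimal sketch of wiring the key into RapidAPI request headers;
# the field name inside toolset/config.yml is an assumption.
import yaml

with open("toolset/config.yml") as f:
    config = yaml.safe_load(f)

headers = {
    "X-RapidAPI-Key": config["X-RapidAPI-Key"],  # assumed field name
    # "X-RapidAPI-Host" is set per API, see the subscription list below.
}
print(headers)
```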
Afterwards, subscribe to the following APIs on RapidAPI. All of them have a free plan, which is good to start with. A quick way to verify a subscription is sketched after the list.
- weatherapi_com
- news_api
- public_holiday
- recipe_by_api_ninjas
- objects_detection
- ocr_extract_text
- web_capture
- onecompiler_apis
- bing
- bing_search_apis
- google_api
- skyscanner80
- currency_exchange
- geocoding_by_api_ninjas
- airports
- list_of_all_countries_and_languages_with_their_codes
- tourist_attraction
- world_time_by_api_ninjas
- google_translate
- ip_geo_location
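Once subscribed, you can call one API directly with your key to check that everything works. The host and endpoint below follow WeatherAPI.com's public RapidAPI listing but are assumptions here, not part of the toolset:

```python
# Quick check that a subscription works, using weatherapi_com as an example.
# The host and endpoint are assumptions based on its RapidAPI listing;
# adjust them if the listing differs.
import requests
import yaml

with open("toolset/config.yml") as f:
    config = yaml.safe_load(f)

resp = requests.get(
    "https://weatherapi-com.p.rapidapi.com/current.json",
    headers={
        "X-RapidAPI-Key": config["X-RapidAPI-Key"],  # assumed field name
        "X-RapidAPI-Host": "weatherapi-com.p.rapidapi.com",
    },
    params={"q": "London"},
    timeout=30,
)
print(resp.status_code, resp.json())
```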
Configure access to the LLM under assessment (e.g., OpenAI GPTs) in `eval_scripts/openai_config.py`.
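As a hypothetical illustration of the kind of values such a config file typically holds (the actual variable names in `eval_scripts/openai_config.py` may differ):

```python
# Hypothetical example of the settings eval_scripts/openai_config.py may expect;
# the actual variable names in the repo can differ.
OPENAI_API_KEY = "sk-..."                      # your OpenAI API key
OPENAI_API_BASE = "https://api.openai.com/v1"  # or an OpenAI-compatible endpoint
MODEL = "gpt-4"                                # model evaluated as the agent
TEMPERATURE = 0.0                              # deterministic decoding for reproducible runs
```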
In the paper, the dataset construction procedure is detailed in Appendix C and illustrated in Figs. 6 and 7.
We also provide some code under `reproduce/` and data under `data/instructions/` to help understand this process. Some notes are as below:
- Generate initial instructions for each API: `reproduce/generate_instruction*.ipynb` => `data/instructions/type-i,ii,iii,iv,v`.
- Merge the instructions of the APIs. => `merged_instructions*.csv` (a rough sketch of this step is given after these notes).
- Have human annotators check the instruction data. => `humancheck_instructions*.csv/xlsx`.
- Further process the instruction data to obtain the final instructions: `process_instructions.ipynb`.
- Once the instruction data are ready, produce data for each evaluation task: `produce_data_for_task*.ipynb`.
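For reference, a hedged sketch of what the merge step could look like; the directory layout, file format, and column names below are assumptions rather than the repo's actual schema:

```python
# Hedged illustration of the "merge instructions" step; the directory layout,
# file format, and column names are assumptions, not the repo's actual schema.
import glob
import pandas as pd

frames = []
for path in glob.glob("data/instructions/type-*/*.csv"):  # assumed layout
    df = pd.read_csv(path)
    df["source_file"] = path  # keep provenance for the human-check step
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged = merged.drop_duplicates(subset=["instruction"])  # assumed column name
merged.to_csv("merged_instructions_example.csv", index=False)  # hypothetical output name
```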