An Evaluation Mechanism of LLM-based Agents on Manipulating APIs

This repo contains the code and data for the paper "An Evaluation Mechanism of LLM-based Agents on Manipulating APIs", which has been accepted to EMNLP 2024 Findings. In this work, we release an evaluation kit consisting of (1) a set of tools, (2) data for 8 evaluation tasks examining different aspects of tool-use capability, and (3) automatic scoring scripts for these 8 tasks.

Structure of Files

  • data/: instruction data, files (mainly images) used when calling tools, and evaluation data for each task.
  • toolset/: executable tools along with their corresponding documentation.
  • eval_scripts/: code for LLMs such as GPT-4 to perform the tasks, and automatic scoring.
  • reproduce/: code for reproducing this work, e.g. prompts for generating initial instructions, searching for similar tools, and producing task data from instructions.

Install Python Environment

Use Python 3.9 and install the packages listed in requirements.txt (pip install -r requirements.txt).

Usage of Toolset

Most tools are implemented by wrapping APIs deployed on RapidAPI, while a few are implemented by us. All tools share a unified interface for calling their functionality and accessing their documentation.

To use this toolset, you need to obtain an X-RapidAPI-Key by registering on RapidAPI. Then, set your X-RapidAPI-Key in toolset/config.yml. Afterwards, subscribe to the required APIs on RapidAPI; all of them offer a free plan, which is enough to get started.
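As a rough illustration of how a RapidAPI-backed tool authenticates, the sketch below sends one request with the key read from toolset/config.yml. The endpoint, host, and config field name are assumptions for illustration only; the repo's tools wrap calls like this behind their own unified interface.

```python
# Minimal sketch of an authenticated RapidAPI call (not the repo's actual wrapper).
# Assumptions: toolset/config.yml stores the key under "X-RapidAPI-Key", and the
# endpoint/host below are hypothetical placeholders.
import requests
import yaml

with open("toolset/config.yml") as f:
    config = yaml.safe_load(f)

headers = {
    "X-RapidAPI-Key": config["X-RapidAPI-Key"],        # assumed config field name
    "X-RapidAPI-Host": "example-api.p.rapidapi.com",   # hypothetical host
}

resp = requests.get(
    "https://example-api.p.rapidapi.com/search",       # hypothetical endpoint
    headers=headers,
    params={"q": "demo"},
)
resp.raise_for_status()
print(resp.json())
```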

Configure an OpenAI key for running evaluation

Configure access to the assessment LLM (e.g., OpenAI GPT models) in eval_scripts/openai_config.py.
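Below is a minimal sketch of what such a configuration enables, using the current openai Python SDK. The model name and variable names are placeholders, and the actual format expected by eval_scripts/openai_config.py may differ.

```python
# Sketch only: verify that the configured key/model can be reached.
# OPENAI_API_KEY and OPENAI_MODEL are placeholder names, not the repo's actual settings.
from openai import OpenAI

OPENAI_API_KEY = "sk-..."   # your OpenAI API key
OPENAI_MODEL = "gpt-4"      # an assessment model such as GPT-4

client = OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
    model=OPENAI_MODEL,
    messages=[{"role": "user", "content": "Reply with OK if this key works."}],
)
print(response.choices[0].message.content)
```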

Producing more data, e.g. for new APIs

In the paper, the procedure of dataset construction is detailed in Appendix C and illustrated in Figures 6 and 7.

We also provide code under reproduce/ and data under data/instructions/ to help understand this process. Some notes:

  1. Generate initial instructions for each API: reproduce/generate_instruction*.ipynb => data/instructions/type-i,ii,iii,iv,v.

  2. Merge the instructions of all APIs => merged_instructions*.csv (a rough sketch of this step follows the list).

  3. Have human annotators check the instruction data => humancheck_instructions*.csv/xlsx.

  4. Further process the instruction data to obtain the final instructions: process_instructions.ipynb.

  5. Once the instruction data are ready, produce the data for each evaluation task: produce_data_for_task*.ipynb.
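For step 2, the merge is conceptually just concatenating the per-API instruction files into one CSV. The sketch below shows that idea with pandas; the glob pattern, column layout, and output path are assumptions, and the authoritative code lives under reproduce/.

```python
# Hypothetical illustration of merging per-API instruction files (step 2).
# The file locations and names are assumptions; see reproduce/ for the real pipeline.
import glob

import pandas as pd

paths = sorted(glob.glob("data/instructions/type-*/*.csv"))   # assumed layout
frames = [pd.read_csv(p) for p in paths]
merged = pd.concat(frames, ignore_index=True)
merged.to_csv("data/instructions/merged_instructions.csv", index=False)
print(f"Merged {len(paths)} files into {len(merged)} rows.")
```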
