This repo contains the code and data for the paper "An Evaluation Mechanism of LLM-based Agents on Manipulating APIs", accepted to EMNLP 2024 Findings. In this work, we release an evaluation kit consisting of (1) a set of tools, (2) data for 8 evaluation tasks examining different aspects of tool-use capability, and (3) automatic scoring scripts for the 8 evaluation tasks.
- `data/`: instruction data, files (mainly images) used when calling tools, and evaluation data for each task.
- `toolset/`: executable tools along with their corresponding documentation.
- `eval_scripts/`: code for LLMs such as GPT-4 to perform the tasks, plus automatic scoring.
- `reproduce/`: code to assist in reproducing this work, e.g., prompts for generating initial instructions, searching for similar tools, producing task data from instructions, etc.
Python 3.9 with the packages listed in `requirements.txt`.
Most tools are implemented by wrapping APIs deployed on RapidAPI, while a few are implemented by us. These tools have a unified interface of calling functionalities and accessing documentation.
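As a rough sketch of what such a unified interface could look like (the class and method names below are hypothetical, not the repo's actual API):

```python
# Hypothetical sketch of the unified tool interface; the actual class and
# method names in toolset/ may differ.
class Tool:
    """Wraps a single API (RapidAPI-backed or self-implemented)."""

    def __init__(self, name: str):
        self.name = name

    def documentation(self) -> str:
        """Return the tool's documentation (parameters, output format)
        so that an LLM agent can read it before calling the tool."""
        raise NotImplementedError

    def call(self, **kwargs) -> dict:
        """Invoke the underlying API with keyword arguments and return
        a JSON-like result."""
        raise NotImplementedError

# Intended usage (hypothetical):
#   weather = Tool("weatherapi_com")
#   print(weather.documentation())
#   result = weather.call(q="Berlin")
```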
To use this toolset, you need an X-RapidAPI-Key, which you can obtain by registering on RapidAPI. Then, set your X-RapidAPI-Key in `toolset/config.yml`.
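A minimal sketch of how the key might then be picked up from the config file and placed into RapidAPI request headers, assuming the YAML field is named `X-RapidAPI-Key` (the actual field name may differ):

```python
# Minimal sketch of wiring the key into RapidAPI request headers;
# the field name inside toolset/config.yml is an assumption.
import yaml

with open("toolset/config.yml") as f:
    config = yaml.safe_load(f)

headers = {
    "X-RapidAPI-Key": config["X-RapidAPI-Key"],  # assumed field name
    # "X-RapidAPI-Host" is set per API, see the subscription list below.
}
print(headers)
```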
Afterwards, subscribe to the following APIs on RapidAPI. All of them have a free plan, which is good to start with. A quick way to verify a subscription is sketched after the list.
- weatherapi_com
- news_api
- public_holiday
- recipe_by_api_ninjas
- objects_detection
- ocr_extract_text
- web_capture
- onecompiler_apis
- bing
- bing_search_apis
- google_api
- skyscanner80
- currency_exchange
- geocoding_by_api_ninjas
- airports
- list_of_all_countries_and_languages_with_their_codes
- tourist_attraction
- world_time_by_api_ninjas
- google_translate
- ip_geo_location
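Once subscribed, you can call one API directly with your key to check that everything works. The host and endpoint below follow WeatherAPI.com's public RapidAPI listing but are assumptions here, not part of the toolset:

```python
# Quick check that a subscription works, using weatherapi_com as an example.
# The host and endpoint are assumptions based on its RapidAPI listing;
# adjust them if the listing differs.
import requests
import yaml

with open("toolset/config.yml") as f:
    config = yaml.safe_load(f)

resp = requests.get(
    "https://weatherapi-com.p.rapidapi.com/current.json",
    headers={
        "X-RapidAPI-Key": config["X-RapidAPI-Key"],  # assumed field name
        "X-RapidAPI-Host": "weatherapi-com.p.rapidapi.com",
    },
    params={"q": "London"},
    timeout=30,
)
print(resp.status_code, resp.json())
```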
Configure access to the LLM under assessment (e.g., OpenAI GPTs) in `eval_scripts/openai_config.py`.
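As a hypothetical illustration of the kind of values such a config file typically holds (the actual variable names in `eval_scripts/openai_config.py` may differ):

```python
# Hypothetical example of the settings eval_scripts/openai_config.py may expect;
# the actual variable names in the repo can differ.
OPENAI_API_KEY = "sk-..."                      # your OpenAI API key
OPENAI_API_BASE = "https://api.openai.com/v1"  # or an OpenAI-compatible endpoint
MODEL = "gpt-4"                                # model evaluated as the agent
TEMPERATURE = 0.0                              # deterministic decoding for reproducible runs
```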
In the paper, the dataset construction procedure is detailed in Appendix C and illustrated in Figs. 6 and 7.
We also provide some code under `reproduce/` and data under `data/instructions/` to help understand this process. Some notes are as below:
- Generate initial instructions for each API: `reproduce/generate_instruction*.ipynb` => `data/instructions/type-i,ii,iii,iv,v`.
- Merge the instructions of the APIs. => `merged_instructions*.csv` (a rough sketch of this step is given after these notes).
- Have human annotators check the instruction data. => `humancheck_instructions*.csv/xlsx`.
- Further process the instruction data to obtain the final instructions: `process_instructions.ipynb`.
- Once the instruction data are ready, produce data for each evaluation task: `produce_data_for_task*.ipynb`.
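For reference, a hedged sketch of what the merge step could look like; the directory layout, file format, and column names below are assumptions rather than the repo's actual schema:

```python
# Hedged illustration of the "merge instructions" step; the directory layout,
# file format, and column names are assumptions, not the repo's actual schema.
import glob
import pandas as pd

frames = []
for path in glob.glob("data/instructions/type-*/*.csv"):  # assumed layout
    df = pd.read_csv(path)
    df["source_file"] = path  # keep provenance for the human-check step
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged = merged.drop_duplicates(subset=["instruction"])  # assumed column name
merged.to_csv("merged_instructions_example.csv", index=False)  # hypothetical output name
```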