OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

🖋 Authors: Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang

We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office workflows. OfficeBench requires LLM agents to perform feasible long-horizon planning, proficiently switch between applications in a timely manner, and accurately ground their actions within a large combined action space, based on the contextual demands of the workflow.

OfficeBench assesses the ability of language agents to perform complex office workflows across multiple applications, using customized evaluation methods such as Exact Matching, Fuzzy Matching, and Execution-based Evaluation.

🧩 Architecture

The LLM agent leverages the operations from multiple applications to construct an operation chain that effectively addresses office tasks. The framework is formulated as a transition system where the current application serves as the state and the operations serve as the transitions. Specialized operations, such as read_file and send_email, perform specific tasks.
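
As an illustration of this formulation (a hypothetical sketch, not code from this repository; the switch_app name and the state fields are assumptions), the agent loop can be viewed as a state machine whose state is the active application and whose transitions are operations:

# Hypothetical sketch of the transition-system view (not OfficeBench's actual implementation).
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentState:
    current_app: str                         # the active application plays the role of the state
    history: list = field(default_factory=list)

def apply_operation(state: AgentState, op: str, *args: Any) -> AgentState:
    """One step of the operation chain: switching apps changes the state,
    while specialized operations act within the current application."""
    if op == "switch_app":                   # hypothetical app-switching operation
        state.current_app = args[0]
    elif op == "read_file":                  # specialized operation mentioned above
        pass                                 # e.g. return file contents to the LLM
    elif op == "send_email":                 # specialized operation mentioned above
        pass                                 # e.g. invoke the email application
    state.history.append((op, args))
    return state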

📈 Leaderboard

"X App(s)" indicates how many applications a task requires; the number in parentheses is the number of tasks in that category.

| LLM Agents | Single App (93) | Two Apps (95) | Three Apps (112) | Overall (300) |
|---|---|---|---|---|
| Proprietary Models | | | | |
| Gemini-1.0 Pro (Feb 2024) | 24.73 | 13.68 | 0.89 | 12.33 |
| Gemini-1.5 Flash (May 2024) | 34.41 | 24.21 | 0.89 | 18.67 |
| Gemini-1.5 Pro (May 2024) | 41.94 | 32.63 | 7.14 | 26.00 |
| GPT-3.5 Turbo (0125) | 8.60 | 7.45 | 0.89 | 5.35 |
| GPT-4 Turbo (2024-04-09) | 56.99 | 50.63 | 11.61 | 38.00 |
| GPT-4 Omni (2024-05-13) | 64.52 | 60.00 | 21.43 | 47.00 |
| Open-weights Models | | | | |
| Llama 3 (70B-Instruct) | 39.79 | 41.05 | 5.36 | 27.33 |
| Qwen 2 (72B-Instruct) | 30.23 | 28.42 | 8.04 | 21.16 |

(Note: This leaderboard will be continuously updated as new data and model updates become available.)

🛠️ Setup

conda create -n officebench python=3.10
conda activate officebench
pip install -r requirements.txt

🤖 Run LLM Agent

# Prepare your OpenAI key in openai_key.txt if you call OpenAI models.
# Prepare your Gemini key in gemini_key.txt if you call Gemini models.
# Launch a vLLM server if you call models from Huggingface.
# Check `generate_command.ipynb` to generate your customized commands easily.

python agent_interact.py \
--docker_name {docker_name} \         # docker image name: e.g. officebench
--container_name {container_name} \   # container name: e.g. officebench-test
--model_name {model_name} \           # supports OpenAI/Gemini/vLLM (see below)
--task_dir {task_dir} \               # task directory: e.g. 'tasks/1-20'
--config_file {config_file} \         # config file: e.g. 'tasks/1-20/subtasks/0.json'
--tag {tag} \                         # a unique tag for your current run: e.g. July24-test
--max_iter {max_iter} \               # maximum number of iterations: e.g. 20
--mode {mode}                         # running mode: default/force_new/use_llm_cache (see below)
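
For instance, substituting the example values from the comments above (the model choice here is just an illustration):

python agent_interact.py \
--docker_name officebench \
--container_name officebench-test \
--model_name gpt-4-turbo \
--task_dir tasks/1-20 \
--config_file tasks/1-20/subtasks/0.json \
--tag July24-test \
--max_iter 20 \
--mode default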

Supported LLMs

  • OpenAI: To use the OpenAI API with specific language models, simply specify the model name, such as gpt-4-turbo.
  • Gemini: Similar to the OpenAI API, you can specify Gemini models by name, for example, gemini-1.5-pro.
  • vLLM: To use LLMs from Hugging Face as the backbone, you must launch an OpenAI-compatible server from vLLM (see the example command below).
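
For example, a Hugging Face model could be served with vLLM's OpenAI-compatible server roughly as follows (a sketch; the model identifier, port, and parallelism settings are assumptions, and the exact flags depend on your vLLM version):

# Serve a Hugging Face model behind an OpenAI-compatible API (example settings).
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000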

Running Mode

  • default: Operates without caching or overriding. Raises an error if results already exist.
  • force_new: Overrides existing results if found.
  • use_llm_cache: Loads previous LLM chat history and continues task processing. Useful for extending max_iter after initial experiments.

Task Naming

The annotated tasks can be found in tasks/{app_num}-{task_id}/subtasks/{subtask_id}.json, which includes the task description and evaluation methods.
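
To peek at one of these configs (a usage sketch; the exact JSON schema is defined by the repository, so the fields you see may differ):

import json

# Load one subtask configuration and list its fields,
# e.g. the task description and the evaluation method it uses.
with open("tasks/1-20/subtasks/0.json") as f:
    subtask = json.load(f)
print(list(subtask.keys()))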

✅ Evaluation

python evaluation.py \
--model_name {model_name} \           # the LLM you used (`--model_name` used in agent_interact.py)
--tag_name {tag_name}                 # the tag you used (`--tag` used in agent_interact.py)
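
For example, scoring the run from the earlier agent command (reusing the same illustrative values):

python evaluation.py \
--model_name gpt-4-turbo \
--tag_name July24-test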

Final results will be written to results/{model_name}_{tag_name}_result.jsonl

❤️ Acknowledgement

We greatly thank InterCode for providing the awesome agent framework. We also sincerely appreciate the contributors of the vLLM team for providing the OpenAI Compatible Server to easily call open-source LLMs.
