WorkBench - the first open-source benchmark for evaluating agent performance on realistic workplace tasks. Created by MindsDB. Special thanks to Jorge Torres, Adam Carrigan, and the rest of the MindsDB team for their support. Check out the paper here - https://arxiv.org/abs/2405.00823
Python Version: 3.10.11
git clone https://github.com/olly-styles/WorkBench.git
cd WorkBench
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
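This README lists Python 3.10.11. As a quick sanity check after activating the virtual environment, a small sketch like the one below reports whether the active interpreter is on the same minor version. It is not part of the repository: the `is_supported` helper and the choice to compare only the major and minor numbers are assumptions made for illustration.

```python
import sys

# Version listed in this README; comparing only major.minor is an assumption.
EXPECTED = (3, 10)


def is_supported(version_info=sys.version_info, expected=EXPECTED):
    """Return True if the interpreter's major.minor matches the expected pair."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "matches" if is_supported() else "differs from"
    print(f"Python {current} {status} the documented 3.10.x")
```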
All five sandbox databases, task-outcome pairs, and pre-computed inference results are provided in the data directory.
As all the pre-computed inference results are provided, you can reproduce the evaluation results in the paper without running inference. Use the following script to calculate the metrics.
python scripts/evals/calculate_all_metrics.py;
python scripts/evals/calculate_all_metrics.py --all_tools;
Note that results are not provided for the all_tools variant of GPT-3.5 and Llama2-70B, as the prompt does not fit into the context window of these models.
All generated data is provided pre-computed in the data directory. If you want to generate the data yourself, follow the steps below.
python scripts/data_generation/mocked_data/generate_all_mocked_data.py;
python scripts/data_generation/query_answer_generation/generate_all_query_and_answer.py;
Pre-computed inference results are provided in the data directory. If you want to run inference yourself, you will need to provide your own API keys.
- An OpenAI key is required for GPT-3.5 and GPT-4.
- An Anthropic key is required for Claude-2.
- An Anyscale key is required for Llama2-70B and mistral-8x7B.
touch openai_key.txt && echo YOUR_OPENAI_API_KEY > openai_key.txt
touch anthropic_key.txt && echo YOUR_ANTHROPIC_API_KEY > anthropic_key.txt
touch anyscale_key.txt && echo YOUR_ANYSCALE_API_KEY > anyscale_key.txt
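Because `echo` appends a trailing newline, each key file ends with one. The sketch below shows one way such a file could be read safely in Python. The `read_api_key` helper is hypothetical and is not taken from the repository, whose own loading code may differ.

```python
from pathlib import Path


def read_api_key(path):
    """Read an API key from a text file, dropping surrounding whitespace.

    Hypothetical helper for illustration; not the repository's loader.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

For example, `read_api_key("openai_key.txt")` would return the key without the newline written by `echo`, and an empty file raises a `ValueError` rather than passing an empty string on to an API client.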
python scripts/inference/generate_results.py --model_name MODEL_NAME --queries_path QUERIES_PATH
python scripts/inference/generate_all_results.py
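To run inference for a chosen subset of models and query files rather than everything, the single-model command can be wrapped in a loop. The following is a minimal sketch under stated assumptions: only the script path and the `--model_name` and `--queries_path` flags come from this README, while the helper names and the `dry_run` switch are hypothetical. Valid model names and query file paths are not listed here, so `MODEL_NAME` and `QUERIES_PATH` remain placeholders.

```python
import subprocess
import sys

# Script path and flags as given in this README.
SCRIPT = "scripts/inference/generate_results.py"


def build_inference_command(model_name, queries_path):
    """Build the argument list for one inference run."""
    return [
        sys.executable,
        SCRIPT,
        "--model_name",
        model_name,
        "--queries_path",
        queries_path,
    ]


def run_inference(model_names, queries_paths, dry_run=True):
    """Run, or with dry_run=True only print, every model/queries combination."""
    commands = [
        build_inference_command(model, queries)
        for model in model_names
        for queries in queries_paths
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop at the first failing run.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_inference(["MODEL_NAME"], ["QUERIES_PATH"])` from the repository root prints the command it would execute. Passing `dry_run=False` actually launches the runs, which requires the API key files described above.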
How you get results for a new agent depends on how different it is from the existing agents. Here we go through three possible scenarios:
- If the new agent is the same as an existing agent with a different prompt, you can modify the prompt directly.
- If the new agent uses an LLM that's supported by LangChain but not by the current implementation, the new agent can be added to the supported LLMs.
- To implement a new agent outside of the LangChain framework, update the inference loop.
We originally called tasks "queries" and outcomes "answers". We updated the terminology in the paper but have not yet updated the code.
Similar to the "queries" and "answers" terminology, we originally called the sandbox databases "mocked data". We updated the terminology in the paper but have not yet updated the code.
Yes! The fastest way to reach us is by opening an issue on this repository. If you want to reach out for any other reason, please send an email to [email protected]