Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.
Here is a detailed summary of the paper:
Problem: Existing benchmarks for developing multimodal agents that can interact with computers lack real interactivity or are restricted to specific applications/domains. This limits the scope of tasks and scalability of agents. There is a need for a unified benchmark based on a real interactive environment that covers the diversity and complexity of real-world computer use across various operating systems, interfaces and applications.
Proposed Solution: The paper introduces OSWorld, the first scalable, real computer environment for multimodal agents. Key features:
- Supports task setup, execution-based evaluation and interactive learning across operating systems like Ubuntu, Windows, macOS
- Can serve as a unified, integrated environment for assessing open-ended computer tasks involving arbitrary applications
- Provides raw keyboard/mouse control for free-form interaction
- Supports multi-modal observations including screenshots, accessibility trees
- Includes custom example-specific configurations and evaluations
Based on OSWorld, a benchmark is created with 369 real-world computer tasks involving web, desktop apps, OS file I/O and multi-app workflows. Each task example has:
- Detailed initial state setup configuration
- Custom execution-based evaluation script for reliable assessment
The human baseline is 72.36% accuracy on the benchmark.
Main Contributions:
- First real, interactive computer environment for multimodal agents
- Enables benchmarking on diverse, open-ended computer tasks instead of isolated apps/domains
- Large-scale benchmark of 369 computer tasks with custom evaluations
- Analysis of state-of-the-art models reveals deficiencies in GUI grounding and knowledge, guiding future work
- Open-sourced environment, data and models to spur research on generalist computer agents
The paper makes significant progress towards developing agents that can serve as customizable assistants to enhance human-computer interaction.