OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.

Here is a detailed summary of the paper:

Problem: Existing benchmarks for multimodal agents that interact with computers either lack an interactive environment or are restricted to specific applications or domains. This narrows the range of tasks that can be studied and limits how well agents scale. There is a need for a unified benchmark, built on a real interactive environment, that covers the diversity and complexity of real-world computer use across operating systems, interfaces, and applications.

Proposed Solution: The paper introduces OSWorld, which it presents as the first scalable, real computer environment for multimodal agents. Key features:

Supports task setup, execution-based evaluation, and interactive learning across operating systems such as Ubuntu, Windows, and macOS
Can serve as a unified, integrated environment for assessing open-ended computer tasks involving arbitrary applications
Provides raw keyboard/mouse control for free-form interaction
Supports multimodal observations, including screenshots and accessibility trees
Includes custom example-specific configurations and evaluations
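
The features above amount to an observe-act loop between an agent and a desktop. The following is a minimal sketch of that loop, not the OSWorld API: `ToyDesktopEnv`, `Observation`, `Action`, and `scripted_agent` are hypothetical names invented for illustration, and the "desktop" is a single text box rather than a real machine.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    screenshot: bytes         # raw image bytes (empty placeholder in this toy)
    accessibility_tree: str   # serialized description of the UI state


@dataclass
class Action:
    kind: str                 # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class ToyDesktopEnv:
    """Hypothetical stand-in for a real desktop; NOT the OSWorld API."""
    typed: list = field(default_factory=list)
    focused: bool = False

    def reset(self) -> Observation:
        self.typed.clear()
        self.focused = False
        return self._observe()

    def step(self, action: Action):
        if action.kind == "click":
            # The toy text box occupies x in [0, 100], y in [0, 20].
            self.focused = 0 <= action.x <= 100 and 0 <= action.y <= 20
        elif action.kind == "type" and self.focused:
            self.typed.append(action.text)
        done = action.kind == "done"
        return self._observe(), done

    def _observe(self) -> Observation:
        value = "".join(self.typed)
        tree = f'<textbox focused="{self.focused}" value="{value}"/>'
        return Observation(screenshot=b"", accessibility_tree=tree)


def scripted_agent(obs: Observation, step_idx: int) -> Action:
    # A fixed plan standing in for a model that would read `obs`.
    plan = [Action("click", x=10, y=10), Action("type", text="hello"), Action("done")]
    return plan[min(step_idx, len(plan) - 1)]


def run_episode(env, agent, max_steps: int = 10) -> Observation:
    obs = env.reset()
    for i in range(max_steps):
        obs, done = env.step(agent(obs, i))
        if done:
            break
    return obs


final = run_episode(ToyDesktopEnv(), scripted_agent)
print(final.accessibility_tree)
```

In a real setting the agent would be a model conditioned on the screenshot or accessibility tree, and the actions would be delivered to an actual operating system as keyboard and mouse events.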

Based on OSWorld, the authors build a benchmark of 369 real-world computer tasks involving web and desktop apps, OS file I/O, and multi-app workflows. Each task example has:

Detailed initial state setup configuration
Custom execution-based evaluation script for reliable assessment

Humans achieve a 72.36% success rate on the benchmark, which serves as the baseline for agents.
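
As a sketch of how such a percentage could be computed, assume each task yields a score between 0 and 1 and that all tasks are weighted equally (an assumption; the summary does not state the scoring granularity). The function name and the sample scores below are illustrative only.

```python
def success_rate(scores: list) -> float:
    """Mean per-task score, expressed as a percentage rounded to two decimals."""
    if not scores:
        raise ValueError("no task scores provided")
    return round(100 * sum(scores) / len(scores), 2)


# Illustrative data only: four tasks, three solved.
print(success_rate([1.0, 1.0, 0.0, 1.0]))

# Under the pass/fail assumption, 267 successes out of 369 tasks
# would round to the reported figure.
print(success_rate([1.0] * 267 + [0.0] * (369 - 267)))
```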

Main Contributions:

First real, interactive computer environment for multimodal agents
Enables benchmarking on diverse, open-ended computer tasks instead of isolated apps/domains
Large-scale benchmark of 369 computer tasks with custom evaluations
Analysis of state-of-the-art models reveals deficiencies in GUI grounding and operational knowledge, guiding future work
Open-sourced environment, data and models to spur research on generalist computer agents

By providing a shared environment and benchmark, the paper advances work on agents that can serve as customizable assistants to enhance human-computer interaction.
