Skip to content

Latest commit

 

History

History
34 lines (23 loc) · 2.34 KB

2404.07972.md

File metadata and controls

34 lines (23 loc) · 2.34 KB

Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.

Here is a detailed summary of the paper:

Problem: Existing benchmarks for developing multimodal agents that can interact with computers lack real interactivity or are restricted to specific applications/domains. This limits the scope of tasks and scalability of agents. There is a need for a unified benchmark based on a real interactive environment that covers the diversity and complexity of real-world computer use across various operating systems, interfaces and applications.

Proposed Solution: The paper introduces OSWorld, the first scalable, real computer environment for multimodal agents. Key features:

  • Supports task setup, execution-based evaluation and interactive learning across operating systems like Ubuntu, Windows, macOS
  • Can serve as a unified, integrated environment for assessing open-ended computer tasks involving arbitrary applications
  • Provides raw keyboard/mouse control for free-form interaction
  • Supports multi-modal observations including screenshots, accessibility trees
  • Includes custom example-specific configurations and evaluations

Based on OSWorld, a benchmark is created with 369 real-world computer tasks involving web, desktop apps, OS file I/O and multi-app workflows. Each task example has:

  • Detailed initial state setup configuration
  • Custom execution-based evaluation script for reliable assessment

The human baseline is 72.36% accuracy on the benchmark.

Main Contributions:

  • First real, interactive computer environment for multimodal agents
  • Enables benchmarking on diverse, open-ended computer tasks instead of isolated apps/domains
  • Large-scale benchmark of 369 computer tasks with custom evaluations
  • Analysis of state-of-the-art models reveals deficiencies in GUI grounding and knowledge, guiding future work
  • Open-sourced environment, data and models to spur research on generalist computer agents

The paper makes significant progress towards developing agents that can serve as customizable assistants to enhance human-computer interaction.