Dr. Minh Tran
Head of AI Research

The AI industry is witnessing a dramatic shift as large language models evolve from text generators into autonomous computer-use agents. These agents can navigate operating systems, interact with applications, and complete complex workflows — just like a human user. But how do we measure their real-world capability? That is where benchmarks like OS-World come in, providing the definitive yardstick for evaluating AI agents in authentic computing environments.
In this article, we break down the OS-World benchmark, analyze the current leaderboard as of March 2026, and compare the leading AI agents competing for the top spot.
OS-World is a groundbreaking benchmark developed by researchers at XLang AI and presented at NeurIPS 2024. It provides the first scalable, real computer environment for evaluating multimodal AI agents. Unlike traditional benchmarks that test language understanding or code generation in isolation, OS-World drops AI agents into actual operating systems — Ubuntu, Windows, and macOS — and challenges them to complete 369 real-world tasks.
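To make that setup concrete, the sketch below shows the gym-style observe-act loop such an environment implies: the agent receives a screenshot of the real screen, proposes a concrete UI action, and the environment executes it inside the OS. Every name here (Observation, DesktopEnv, run_task) is an illustrative stand-in, not OS-World's actual API.

```python
# Illustrative sketch of a real-computer agent loop; all names here are
# hypothetical stand-ins, not OS-World's actual interface.
from dataclasses import dataclass


@dataclass
class Observation:
    screenshot: bytes   # raw pixels of the current screen
    a11y_tree: str      # accessibility-tree dump of visible UI elements


class DesktopEnv:
    """Stand-in for a real OS environment (Ubuntu, Windows, or macOS VM)."""

    def reset(self, task_id: str) -> Observation:
        """Restore a clean snapshot and open the task's starting state."""
        ...

    def step(self, action: str) -> tuple[Observation, bool]:
        """Execute one mouse/keyboard action; return the new screen and a done flag."""
        ...


def run_task(env: DesktopEnv, agent, task_id: str, max_steps: int = 15) -> None:
    """Observe-act loop: screenshot in, concrete UI action out, until done."""
    obs = env.reset(task_id)
    for _ in range(max_steps):
        action = agent.act(obs)   # e.g. "click(412, 88)" or "type('report.ods')"
        obs, done = env.step(action)
        if done:
            break
```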
The tasks span web browsing, desktop applications like LibreOffice and VS Code, cross-application workflows, and OS-level file operations. Each task is evaluated using custom execution-based scripts, ensuring objective and reproducible results. The human baseline stands at 72.4% — a number that AI agents have now reached and even surpassed.
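"Execution-based" is the important phrase here: instead of grading the agent's transcript, a per-task script inspects the machine's final state, so any action sequence that produces the right outcome scores the same. A minimal sketch, with the task and file paths invented purely for illustration:

```python
# Minimal sketch of an execution-based checker. The hypothetical task:
# "rename report.txt to report_final.txt and move it into ~/Documents".
from pathlib import Path


def check_task_success(home: Path) -> float:
    """Score by inspecting final machine state, not the agent's actions."""
    target = home / "Documents" / "report_final.txt"
    leftover = home / "report.txt"
    return 1.0 if target.is_file() and not leftover.exists() else 0.0


print(f"task score: {check_task_success(Path.home())}")
```

Because the check runs against real state, re-running it is deterministic, which is what makes results comparable across labs.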
In a historic milestone, AI agents have officially surpassed human-level performance on OS-World. Here is the latest snapshot of the leaderboard:
Claude Opus 4.6 (Anthropic) — 72.7%
Anthropic's flagship model now leads all foundation models on OS-World, surpassing the 72.4% human baseline. Claude Opus 4.6 leverages advanced multimodal reasoning and vision capabilities to interpret screen content, plan multi-step workflows, and execute precise mouse and keyboard actions across operating systems.

Claude Sonnet 4.6 (Anthropic) — 72.5%
Remarkably close to its larger sibling, Claude Sonnet 4.6 achieves near-identical performance at a lower cost point ($3/$15 per million tokens vs $5/$25). This makes it an exceptionally cost-effective option for enterprises deploying computer-use agents at scale.

Qwen3 VL 235B A22B Instruct (Alibaba) — 66.7%
Alibaba's Qwen3 Vision-Language model demonstrates that open-weight models are rapidly closing the gap with proprietary systems. At a fraction of the cost ($0.30/$1.49 per million tokens), Qwen3 VL offers competitive performance for budget-conscious deployments.

Claude Opus 4.5 (Anthropic) — 66.3%
The previous-generation Opus model remains highly competitive, demonstrating the rapid pace of improvement within Anthropic's model family — a 6.4 percentage point jump from 4.5 to 4.6.

Claude Sonnet 4.5 (Anthropic) — 61.4%
Even the previous-generation Sonnet model outperforms many current competitors, showcasing Anthropic's consistent strength in computer-use tasks.

Claude Haiku 4.5 (Anthropic) — 50.7%
The lightweight Haiku model achieves a success rate above 50% at just $1/$5 per million tokens, making AI computer-use accessible even for cost-sensitive applications. A rough per-task cost comparison across these price points follows below.
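Using the per-million-token prices quoted in the list above, a back-of-the-envelope calculation shows how the tiers compare per task. The token counts below are rough assumptions for illustration, not measured figures:

```python
# Back-of-the-envelope cost comparison using the per-million-token prices
# quoted above (input, output). Per-task token counts are rough assumptions.
PRICES = {
    "Claude Opus 4.6":   (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Qwen3 VL 235B":     (0.30, 1.49),
    "Claude Haiku 4.5":  (1.00, 5.00),
}

IN_TOKENS, OUT_TOKENS = 60_000, 4_000   # assumed tokens per computer-use task

for model, (p_in, p_out) in PRICES.items():
    cost = (IN_TOKENS * p_in + OUT_TOKENS * p_out) / 1_000_000
    print(f"{model:18s} ~${cost:.3f} per task")
```

Under these assumptions the per-task spread between Qwen3 VL and Opus 4.6 is more than an order of magnitude, which is why pricing, not just accuracy, increasingly drives model choice at scale.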
The OS-World Verified leaderboard requires independent evaluation by the research team, ensuring the highest standard of result integrity:
GPT-5.4 (OpenAI) — 75.0%
OpenAI's latest model leads the verified leaderboard, demonstrating strong computer-use capabilities that significantly exceed the human baseline.

GPT-5.4 mini (OpenAI) — 72.1%
The smaller variant achieves near-human performance while being more cost-efficient, highlighting OpenAI's strength in model distillation.

UiPath Screen Agent (Claude Opus 4.5) — 67.1%
A landmark entry from the enterprise RPA sector, UiPath's Screen Agent — powered by Claude Opus 4.5 — earned the #1 verified ranking when announced in January 2026. This represents the first time an enterprise automation platform has claimed the top benchmark position, bridging the gap between AI research and real-world business automation.

GPT-5.3 Codex (OpenAI) — 64.7%
OpenAI's code-specialized model shows strong cross-domain transfer to computer-use tasks.

Qwen3.5-122B-A10B (Alibaba) — 58.0%
Alibaba's newest Qwen3.5 series shows substantial improvement over previous generations on the verified benchmark.
While OS-World focuses on computer-use tasks, other benchmarks provide complementary perspectives on AI agent capabilities:
GAIA (a benchmark for General AI Assistants) evaluates agents on real-world questions requiring reasoning, web browsing, multimodal handling, and tool use. Claude Sonnet 4.5 currently leads GAIA at 74.6% overall, with Anthropic models sweeping the top six positions on Princeton's HAL Generalist Agent leaderboard. GAIA tasks range from simple queries solvable in about five steps to complex multi-tool sequences.
CUB (Computer Use Benchmark) evaluates agent performance across six distinct industry verticals, providing a more business-oriented view of AI agent capabilities. Enterprise players like UiPath and Writer's Action Agent have demonstrated strong results on CUB, suggesting that benchmark performance increasingly translates to real-world business value.
Several important trends have emerged from the latest benchmark results: frontier models have crossed the human baseline for the first time; smaller variants such as Claude Sonnet 4.6 and GPT-5.4 mini now trail their flagship siblings by fractions of a point at a much lower price; open-weight models like Qwen3 VL are closing the gap with proprietary systems; and enterprise automation platforms such as UiPath are competing directly on research leaderboards.
The fact that AI agents have surpassed human-level performance on OS-World marks a turning point for enterprise adoption. Companies planning their response should consider the market trajectory and the deployment realities outlined below.
The AI agent market is projected to grow from $7.84 billion in 2025 to $52.62 billion by 2030, representing a staggering 46.3% CAGR. With AI agents now surpassing human performance on standardized benchmarks, enterprise adoption is accelerating rapidly. Companies like UiPath, Anthropic, OpenAI, and Alibaba are investing heavily in computer-use capabilities.
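The quoted growth rate checks out under the standard CAGR formula applied to the 2025-2030 span:

```python
# Sanity check of the quoted market CAGR: (end / start) ** (1 / years) - 1
start, end, years = 7.84, 52.62, 5   # $B in 2025 -> $B in 2030
cagr = (end / start) ** (1 / years) - 1
print(f"CAGR = {cagr:.1%}")   # prints 46.3%, matching the figure above
```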
The focus is shifting from raw benchmark performance to real-world deployment challenges: reliability, security, cost efficiency, and integration with existing enterprise workflows. The race is no longer about whether AI agents can do the work — it is about how quickly and cost-effectively they can be deployed at scale.
March 2026 marks the moment AI agents officially surpassed human-level performance on the OS-World benchmark. With Claude Opus 4.6 at 72.7% and GPT-5.4 at 75.0% exceeding the 72.4% human baseline, the question is no longer whether AI can use computers effectively — but how enterprises can best leverage this capability.
At SyncSoftAI, we help organizations navigate this rapidly evolving landscape with expert AI evaluation, data services, and full-stack AI solutions. Stay tuned for our next article in this series, where we explore proven strategies for improving AI agent benchmark scores.
