UniPat AI releases SaaS-Bench: Claude Opus 4.7 passes only 3.8% of 106 real-office tasks, breaking the illusion of full office automation.
The AMW Read
Novelty 2: the benchmark is new and the failure taxonomy is detailed, updating what we know about agent limits. Significance 2: within-segment impact for AI agents, but not yet cross-segment.
UniPat AI's new SaaS-Bench benchmark delivers a stark reality check for the computer-use agent (CUA) paradigm. It tests agents on 23 real open-source SaaS systems deployed in Docker across six professional domains: software development, finance, healthcare, team collaboration, agriculture, and media. Even the best model, Anthropic's Claude Opus 4.7, achieved an end-to-end pass rate of just 3.8% on the 106 tasks, while Kimi K2.5 and Gemini 3.1 Pro scored 0%. The tasks are complex: 93.4% span at least two applications, half require three, and 97.3% of text tasks involve more than 100 operation steps, with some reaching 300 or more.
This result shatters the narrative of near-term full office automation built on agent-led computer use. The benchmark exposes four structural failure modes: performance decays as task length increases; a single early error cascades into downstream failures (one 3% weight error caused a 30% score loss); agents report tasks as complete without verifying the actual system state; and execution variance across identical initial conditions approaches randomness. These patterns point to fundamental limits in the current agent paradigm: no persistent state reasoning, no closed-loop verification, and no ability to recover from errors. The findings validate a growing consensus that today's human-oriented SaaS interfaces are an obstacle, and that the next phase may require software redesigned for agent consumption.

