UniPat AI releases SaaS-Bench: Claude Opus 4.7 passes only 3.8% of 106 real-office tasks, breaking the illusion of full office automation.

The AMW Read

Novelty 2: the benchmark is new and the failure taxonomy is detailed, updating what we know about agent limits. Significance 2: within-segment impact for AI agents, but not yet cross-segment.


AI Agents · Skeptic Memory


The new SaaS-Bench benchmark from UniPat AI delivers a stark reality check for the computer-use agent (CUA) paradigm. It tests 23 real open-source SaaS systems deployed in Docker across six professional domains: software development, finance, healthcare, team collaboration, agriculture, and media. Even the best model, Anthropic's Claude Opus 4.7, achieved an end-to-end pass rate of only 3.8% on 106 tasks; Kimi K2.5 and Gemini 3.1 Pro scored 0%. The tasks are complex: 93.4% span at least two applications, half require three, and 97.3% of text tasks involve more than 100 operation steps, with some running to 300 or more.
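To put the headline figure in absolute terms, the short sketch below works out how many tasks a 3.8% pass rate would correspond to. It assumes the rate is a plain ratio of passed tasks to the 106 total; the article does not say how the figure was computed (for example, whether it was averaged over several runs), and the function name `implied_passes` is made up for this illustration.

```python
TOTAL_TASKS = 106          # task count reported in the article
REPORTED_PASS_RATE = 3.8   # percent, as reported in the article


def implied_passes(total: int, rate_percent: float) -> int:
    """Return the whole number of passed tasks whose pass rate,
    rounded to one decimal place, equals rate_percent.

    Assumption: the rate is simply passed / total, which the
    article does not confirm.
    """
    for passed in range(total + 1):
        rate = round(100 * passed / total, 1)
        if abs(rate - rate_percent) < 1e-9:
            return passed
    raise ValueError("no whole number of tasks matches the reported rate")


passed = implied_passes(TOTAL_TASKS, REPORTED_PASS_RATE)
print(f"{REPORTED_PASS_RATE}% of {TOTAL_TASKS} tasks is about {passed} tasks")
```

Under that assumption, the best model completed roughly 4 of the 106 tasks end to end.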

This result undercuts the narrative of near-term full office automation built on agent-led computer use. The benchmark exposes four structural failure modes: performance decays as task length increases; a single early error cascades into downstream failures (in one case, an error carrying only 3% of the score weight led to a 30% score loss); agents report tasks as complete without verifying the actual system state; and results vary almost randomly across runs started from identical initial conditions. These patterns point to fundamental limits in the current agent paradigm: no persistent state reasoning, no closed-loop verification, and no ability to recover from errors. The findings support a growing consensus that today's human-oriented SaaS interfaces are themselves an obstacle, and that the next phase may require software redesigned for agent consumption.
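One way to see why pass rates collapse on long tasks is a simple compounding model. The sketch below is an illustration only, not the benchmark's methodology: it assumes every step succeeds independently with the same probability and that any single failure sinks the task, which roughly matches the no-recovery behavior described above. The per-step success values are hypothetical, not measured figures.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability of finishing a task with no failed step,
    assuming independent steps and no error recovery."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical per-step reliabilities, evaluated at the task
# lengths mentioned in the article (100 and 300 steps).
for p in (0.99, 0.995, 0.999):
    for n in (100, 300):
        rate = end_to_end_success(p, n)
        print(f"per-step {p:.3f}, {n} steps -> end-to-end {rate:.1%}")
```

Under these assumptions, even an agent that gets 99% of individual steps right completes a 100-step task only about a third of the time, which is why closed-loop verification and error recovery matter more as tasks grow longer.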

Tags: SaaS-Bench, computer-use agent, Claude Opus 4.7, UniPat AI, benchmark, agent reliability
