UniPat AI releases SaaS-Bench: Claude Opus 4.7 passes only 3.8% of 106 real-office tasks, breaking the illusion of full office automation
Technology
2 min read

The AMW Read

Novelty 2: the benchmark is new and the failure taxonomy is detailed, updating what we know about agent limits. Significance 2: within-segment impact for AI agents, but not yet cross-segment.
AI Agents · Skeptic Memory

UniPat AI's new SaaS-Bench benchmark delivers a stark reality check for the computer-use agent (CUA) paradigm. It tests agents on 23 real open-source SaaS systems deployed in Docker across six professional domains: software development, finance, healthcare, team collaboration, agriculture, and media. Even the best-performing model, Anthropic's Claude Opus 4.7, achieved an end-to-end pass rate of only 3.8% on the 106 tasks, while Kimi K2.5 and Gemini 3.1 Pro scored 0%. The tasks are complex: 93.4% span at least two applications, half require three, and 97.3% of text tasks involve more than 100 operation steps, with some running to 300 or more.

This result undercuts the narrative of near-term full office automation built on agent-led computer use. The benchmark exposes four structural failure modes: performance decays as task length increases; a single early error cascades into downstream failures (in one case, an error weighted at 3% caused a 30% score loss); agents report tasks as complete without verifying the actual system state; and outcomes vary so much across runs from identical initial conditions that they approach randomness. These patterns point to fundamental limits in the current agent paradigm: no persistent state reasoning, no closed-loop verification, and no ability to recover from errors. The findings support a growing view that today's human-oriented SaaS interfaces are themselves an obstacle, and that the next phase may require software redesigned for agent consumption.
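
To see why long tasks are so punishing, consider a rough back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability and that any single failure sinks the task. That is a simplification introduced here for illustration, not SaaS-Bench's scoring method, and the 99% per-step figure and the function names are hypothetical.

```python
import math


def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds.

    Illustrative assumption (not from the benchmark): steps are independent
    and a single failed step fails the whole task.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


def required_per_step_success(target: float, steps: int) -> float:
    """Per-step reliability needed to hit `target` end-to-end success."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return math.exp(math.log(target) / steps)


if __name__ == "__main__":
    # Hypothetical agent that gets 99% of individual steps right.
    for n in (10, 100, 300):
        print(f"{n:>3} steps: {end_to_end_success(0.99, n):.1%} end-to-end")
    # Reliability each step would need for a 90% pass rate over 100 steps.
    print(f"needed per step: {required_per_step_success(0.90, 100):.4%}")
```

Under these assumptions, an agent that is right 99% of the time per step finishes a 100-step task only about 37% of the time and a 300-step task about 5% of the time. Real agents can sometimes notice and repair mistakes, so the model is pessimistic in that respect, but it shows why the lack of error recovery matters so much at the task lengths the benchmark reports.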

Tags: SaaS-Bench, computer-use agent, Claude Opus 4.7, UniPat AI, benchmark, agent reliability