
OpenAI GPT-5.5 Tops AI Safety Institute Cybersecurity Evaluation, Outpaces Anthropic Mitos
The AMW Read
Updates the frontier cybersecurity capability baseline between two top-tier labs; the head-to-head GPT-5.5 vs. Mitos outcome directly informs the open safety-capability debate (§7.2).
OpenAI's latest generative AI model, GPT-5.5, has achieved the highest score in a cybersecurity capability evaluation conducted by the UK government's AI Safety Institute (AISI). According to a report published May 17 (local time) on the AISI official website, GPT-5.5 recorded an average pass rate of 71.4% across 95 expert-level cybersecurity tasks spanning cryptography, web attacks, reverse engineering, exploit development, and vulnerability research. That puts it ahead of its predecessor, GPT-5.4 (52.4%), as well as Anthropic's Claude Mitos Preview (68.6%) and Claude Opus 4.7 (48.6%). Notably, GPT-5.5 became only the second model to complete 'The Last One,' a 32-step enterprise network penetration simulation designed by AISI to test autonomous AI agent threat capabilities, succeeding on 2 of 10 attempts versus Mitos's 3 successes.
Why it matters: This result updates the frontier cybersecurity capability baseline within the foundation model segment and fits a recurring pattern of rapid capability escalation driven by improvements in coding, reasoning, and long-horizon autonomy. The head-to-head comparison between OpenAI and Anthropic directly informs the open debate over which safety-oriented lab is producing the most capable (and potentially dangerous) frontier models. GPT-5.5 surpasses Mitos on the average pass rate (71.4% vs. 68.6%) but lags on the hardest simulation (Mitos completed it 3 times vs. GPT-5.5's 2), so the competitive gap is narrow and the ranking is not yet conclusive.
Grounded take: The AISI evaluation is a controlled laboratory benchmark, not a reflection of deployed product safety, as the institute itself noted. However, the fact that a model can autonomously execute a multi-stage cyberattack that would take a human ~20 hours signals a structural force: the compression of offensive cybersecurity work into AI agent time scales. Frontier labs are inescapably locked in a race where safety evaluation standards themselves become competitive scoreboards. For enterprises, this reinforces the need for defensive AI agents and monitor-first deployment strategies, as the underlying model capability cycle shows no sign of plateauing.

