Xiaomi launches MiMo-V2.5-Pro-UltraSpeed model achieving 1,000+ tokens/s throughput on general-purpose GPUs

Chinese consumer electronics and AI company Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, a high-speed variant of its flagship MiMo-V2.5-Pro model. The 1-trillion-parameter model supports 1M-token context windows and delivers over 1,000 tokens per second (TPS) of single-API inference throughput on standard GPUs, without relying on custom silicon. According to third-party testing by QbitAI, the model generated a complete 500-line web app in seven seconds, thinking time included, and reached peak output speeds above 3,300 TPS. Xiaomi attributes the performance to full-stack co-design spanning model architecture (hybrid sliding-window attention that reduces compute to about one-seventh of full attention), FP4 quantization of the expert modules, speculative decoding with parallel drafting (its DFlash scheme), and GPU-level optimizations including persistent kernel execution and warp specialization.
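For readers who want to sanity-check the figures above, here is a rough back-of-envelope sketch in Python. It is not Xiaomi's method: the layer mix (1 full-attention layer per 6 sliding-window layers), the 4,096-token window, and the draft acceptance rate are hypothetical values chosen only for illustration, and the speculative-decoding formula is the standard expected-tokens estimate that assumes each drafted token is accepted independently, which may not hold for a parallel-drafting scheme such as DFlash.

```python
def causal_pairs(n: int) -> int:
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2


def window_pairs(n: int, w: int) -> int:
    """Query-key pairs scored by causal sliding-window attention (window w)."""
    w = min(w, n)
    # The first w queries see 1..w keys; every later query sees exactly w keys.
    return w * (w + 1) // 2 + (n - w) * w


def hybrid_cost_ratio(n: int, w: int, full_layers: int, swa_layers: int) -> float:
    """Attention-score work of a hybrid stack relative to an all-full stack.

    Layer counts and window size are illustrative assumptions, not published
    MiMo configuration values.
    """
    hybrid = full_layers * causal_pairs(n) + swa_layers * window_pairs(n, w)
    baseline = (full_layers + swa_layers) * causal_pairs(n)
    return hybrid / baseline


def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass in speculative decoding.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the verifier always contributes one token of its own.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def tokens_generated(seconds: float, tps: float) -> float:
    """Tokens produced in a given wall-clock time at a steady throughput."""
    return seconds * tps


if __name__ == "__main__":
    # Hypothetical 1:6 full-to-sliding-window mix at a 1M-token context.
    print(hybrid_cost_ratio(1_000_000, 4096, full_layers=1, swa_layers=6))
    # Hypothetical 80% acceptance rate with 8 drafted tokens per pass.
    print(expected_tokens_per_step(0.8, 8))
    # Seven seconds at the headline 1,000 TPS floor.
    print(tokens_generated(7, 1000))
```

Under these assumed settings, the hybrid stack's attention cost at long context lands close to one-seventh of full attention because the sliding-window layers contribute comparatively little once the context far exceeds the window. The last line shows why the seven-second demo is plausible: at 1,000 TPS, seven seconds is enough for several thousand tokens of reasoning and code combined.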

Why it matters: Xiaomi's achievement directly attacks the longstanding tradeoff between model quality, inference speed, and hardware generality, a structural tension that has constrained enterprise deployment of frontier models in latency-sensitive domains such as high-frequency trading, real-time fraud detection, and ad-tech bidding. By demonstrating that a 1T-parameter model can run at 1,000+ TPS on merchant GPUs without resorting to custom ASICs of the kind Groq builds, Xiaomi positions itself as a credible contender in the foundation-model inference-efficiency race. The narrative arc, from leading open-source model (MiMo topping global rankings), to aggressive price cuts on MiMo-V2.5, to this speed breakthrough, signals a deliberate, system-level assault on the commercialization barriers that have kept large models out of production workloads. The pattern echoes the hyperscaler-distribution moat logic: proprietary inference optimization that compounds with each new model generation and each increase in deployment scale.

For the AI market, Xiaomi's move sharpens a critical open debate about whether inference speed or raw capability will differentiate frontier models over the next 18 months. If the 1,000+ TPS threshold proves stable and reproducible across diverse workloads, it could reset enterprise expectations for what 'production-ready' means, pushing rivals like ByteDance, Alibaba, and Tencent to accelerate their own investment in inference optimization or risk losing low-latency use cases to Xiaomi's stack. The fact that the optimization is model- and hardware-agnostic (transferable to future GPU generations) suggests a widening moat for Xiaomi's AI platform, analogous to how OpenAI's API latency improvements created lock-in for developer workflows.
