Xiaomi launches MiMo-V2.5-Pro-UltraSpeed model achieving 1,000+ tokens/s throughput on general-purpose GPUs
The AMW Read
Novelty 2: Xiaomi is already a mapped player with MiMo models, but this speed claim at 1T parameters on general-purpose GPUs meaningfully updates the inference-efficiency baseline. Significance 3: Resets enterprise-production latency expectations and pressures the entire Chinese foundation-model fie
Chinese consumer electronics and AI company Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, a high-speed variant of its flagship MiMo-V2.5-Pro model. The 1-trillion-parameter model supports a 1M-token context window and delivers over 1,000 tokens per second (TPS) of single-API inference throughput on standard GPUs, without relying on custom silicon. According to third-party testing by QbitAI, the model generated a complete 500-line web app in seven seconds, thinking time included, and reached peak output speeds above 3,300 TPS. Xiaomi attributes the performance to full-stack co-design spanning four layers: model architecture (hybrid sliding-window attention that cuts attention compute to roughly 1/7th of full attention), FP4 quantization of the expert modules, speculative decoding with parallel drafting (its DFlash scheme), and GPU-level optimizations including persistent kernel execution and warp specialization.
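For readers unfamiliar with speculative decoding, the sketch below illustrates the general idea in its simplest greedy form. It is not Xiaomi's DFlash scheme, whose details are not given here: the toy `target` and `draft` functions, the `speculative_decode` helper, and the draft length `k` are all hypothetical names chosen for illustration. A cheap draft model proposes several tokens, the large model checks them (in a real system, in one batched forward pass), and the longest agreeing prefix is kept, so the output matches ordinary greedy decoding while requiring fewer passes of the large model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> next token (greedy)


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding (illustrative only).

    Returns the generated sequence and the number of target verification
    passes. Each pass is simulated here with a loop; a real system would
    score all k drafted positions in a single batched forward pass.
    """
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model verifies the proposal position by position.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # target's correction replaces the bad guess
                break
        else:
            # Every draft token matched: the same pass yields one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:len(prompt) + max_new], target_passes


# Toy models (assumptions): the target counts upward modulo 10;
# the draft agrees except that it guesses wrongly after seeing a 4.
def target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


reference = greedy_decode(target, [3], max_new=12)
tokens, passes = speculative_decode(target, draft, [3], max_new=12, k=4)
print(tokens == reference, passes)
```

In this toy run the speculative path reproduces the baseline output with fewer target passes than generated tokens. The achievable speedup in practice depends on how often the draft agrees with the target, which the source does not quantify.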
Why it matters: Xiaomi's achievement directly attacks the longstanding tradeoff between model quality, inference speed, and hardware generality, a structural tension that has constrained enterprise deployment of frontier models in latency-sensitive domains like high-frequency trading, real-time fraud detection, and ad-tech bidding. By demonstrating that a 1T-parameter model can run at 1,000+ TPS on merchant GPUs without resorting to custom ASICs (as Groq does), Xiaomi positions itself as a credible contender in the foundation-model inference-efficiency race. The narrative arc, from leading open-source model (MiMo topping global rankings), to aggressive price cuts on MiMo-V2.5, to this speed breakthrough, signals a deliberate, system-level assault on the commercialization barriers that have kept large models out of production workloads. The pattern echoes the hyperscaler-distribution moat logic: proprietary inference optimization that compounds with each new model generation and each increase in deployment scale.
For the AI market, Xiaomi's move sharpens a critical open debate about whether inference speed or raw capability will differentiate frontier models over the next 18 months. If the 1,000+ TPS threshold proves stable and reproducible across diverse workloads, it could reset enterprise expectations for what 'production-ready' means, pushing rivals like ByteDance, Alibaba, and Tencent to accelerate their own investment in inference optimization or risk losing low-latency use cases to Xiaomi's stack. Because the optimization is described as model- and hardware-agnostic (transferable to future GPU generations), it suggests a widening moat for Xiaomi's AI platform, analogous to how OpenAI's API latency improvements created lock-in for developer workflows.

