
The AMW Read
The introduction of a 'reasoning diffusion' architecture (dLLM) challenges the transformer-centric scaling paradigm and fundamentally shifts the inference economics of the Foundation Model segment.
Inception Labs has launched Mercury 2, the first reasoning diffusion large language model (dLLM). It achieves 1,009 tokens per second on NVIDIA Blackwell GPUs, 5x faster than leading speed-optimized LLMs, while running on standard hardware and priced at just $0.25 per million input tokens. Founded by Stanford professor Stefano Ermon, co-inventor of the diffusion methods powering Midjourney and Stable Diffusion, Inception applies parallel token generation to text in place of traditional sequential decoding. This architectural breakthrough could fundamentally reshape AI inference economics, enabling real-time applications such as voice assistants and coding tools without custom silicon, and signaling a shift from scaling transformer architectures to reimagining model design.
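To make the speed argument concrete, here is a toy sketch of why parallel generation changes the cost model. This is not Mercury's actual algorithm (which is not public in detail); it is a minimal illustration, with hard-coded token choices, of how an autoregressive decoder pays one sequential step per token while a diffusion-style decoder fills many masked positions per denoising round:

```python
# Toy contrast: autoregressive decoding (one token per forward pass) vs.
# diffusion-style parallel denoising (a batch of positions filled per round).
# Illustrative only -- token selections are hard-coded, no model is involved.

MASK = "[MASK]"

def autoregressive_decode(target):
    """Generate left-to-right: each token costs one sequential step."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)
        steps += 1
    return out, steps

def diffusion_decode(target, rounds=3):
    """Start fully masked; each round reveals a subset of positions in parallel."""
    seq = [MASK] * len(target)
    steps = 0
    for r in range(rounds):
        # a real dLLM predicts every position at once each round; here we
        # deterministically reveal an interleaved subset per round
        for i in range(r, len(target), rounds):
            seq[i] = target[i]
        steps += 1  # one parallel step per round, not one per token
    return seq, steps

target = ["Diffusion", "models", "generate", "tokens", "in", "parallel"]
ar_out, ar_steps = autoregressive_decode(target)
df_out, df_steps = diffusion_decode(target)
print(ar_steps)  # 6 sequential steps (one per token)
print(df_steps)  # 3 parallel steps, independent of sequence length
```

The step count for the diffusion sketch depends on the number of denoising rounds rather than the output length, which is the intuition behind the throughput gains claimed for dLLMs on parallel hardware like Blackwell GPUs.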
