Xiaomi MiMo Reaches 1,000 Tokens Per Second on Commodity GPUs Without Custom Chips

Last updated: June 17, 2026 • 10:20 UTC

By Asia Daily

A New Speed Record in Artificial Intelligence

On June 8, 2026, Xiaomi released MiMo V2.5 Pro UltraSpeed, an inference mode that pushes a trillion parameter language model past 1,000 tokens per second on standard cloud graphics processing units. The achievement marks the first time a model at this scale has crossed that threshold without relying on custom silicon or specialized hardware. Internal demos show peak throughput approaching 1,200 tokens per second sustained on a single commodity node equipped with eight GPUs.

Contents

A New Speed Record in Artificial Intelligence
Xiaomi’s Unlikely Rise in Artificial Intelligence
The Cost of Speed in Modern AI
Compressing Intelligence With FP4 Quantization
Predicting in Parallel Through DFlash Speculative Decoding
Eliminating Microsecond Gaps With TileRT
Real World Impact and Business Context
Open Source Release and the Verification Question
Key Points

To understand the significance, compare current public benchmarks. OpenAI’s GPT 5.5 generates roughly 68 tokens per second. Claude Opus 4.6 operates at approximately 71 tokens per second, while Google Gemini Flash reaches 192 tokens per second. At 1,000 tokens per second, Xiaomi’s UltraSpeed mode delivers roughly fifteen times the output speed of the standard ChatGPT experience, and it does so on a model that already competes with frontier systems on software engineering evaluations.
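The comparison is simple division, and a few lines make the ratios explicit. This is only a sketch using the figures quoted above; the names `BASELINES`, `ULTRASPEED_TPS`, and `speedup` are illustrative and not part of any Xiaomi tooling.

```python
# Tokens-per-second figures as quoted in the article.
BASELINES = {
    "GPT 5.5": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}
ULTRASPEED_TPS = 1000  # the headline threshold, not the higher peak figure


def speedup(baseline_tps, fast_tps=ULTRASPEED_TPS):
    """How many times faster fast_tps is than baseline_tps."""
    return fast_tps / baseline_tps


for name, tps in BASELINES.items():
    print(f"{name}: {speedup(tps):.1f}x")
```

Against the 68 tokens per second baseline the ratio comes out just under fifteen, which is where the "roughly fifteen times" figure comes from.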

The model behind this speed is MiMo V2.5 Pro, a 1.02 trillion parameter Mixture of Experts architecture with a one million token context window. Unlike many competitors, the system processes text, images, audio, and video natively within a single model. Xiaomi released the underlying checkpoint under an MIT open source license, allowing developers to self host, fine tune, and deploy the system commercially without licensing fees.

Xiaomi’s Unlikely Rise in Artificial Intelligence

Most consumers associate Xiaomi with affordable smartphones, electric scooters, and increasingly, electric vehicles. The company shipped 33.8 million handsets in the first quarter of 2026 and delivered more than 30,000 electric vehicles in May alone. Yet behind these consumer products sits a dedicated artificial intelligence division that has been building large language models since April 2025, when it launched a compact 7 billion parameter reasoning model.

In the fourteen months since, the MiMo family has grown into one of the more competitive open weight lineups available. The current flagship, MiMo V2.5 Pro, debuted on April 22, 2026. It benchmarks close to Claude Opus 4.6 on software engineering tasks while costing roughly one eighth the price per token. Standard pricing for the base model runs 0.025 yuan per million tokens on cache hits, 3 yuan per million on cache misses for input, and 6 yuan per million for output.

The June 8 announcement adds UltraSpeed as a high performance serving mode layered directly on top of this existing model. Rather than altering the model weights or architecture, the speed gains come entirely from inference optimization, a software achievement that challenges assumptions about what commodity hardware can do.

The Cost of Speed in Modern AI

Inference speed has become a critical battleground for large language models. The rate at which a system generates tokens determines not just how quickly users receive answers, but which applications are economically and technically viable. A model generating 68 tokens per second works well for chat interfaces where humans read at their own pace. It fails entirely for latency sensitive tasks such as real time fraud detection, high frequency trading signals, or autonomous coding agents that must iterate through multiple reasoning steps in seconds.

Recognizing this bottleneck, several startups built entire companies around custom silicon. Cerebras designed a wafer scale chip roughly the size of a dinner plate, packing 44 gigabytes of on chip memory to eliminate the bandwidth constraints that slow conventional GPU inference. That system achieved approximately 969 tokens per second on Meta’s Llama 3.1 405B, a model with fewer than half the parameters of MiMo V2.5 Pro. Groq developed a custom Language Processing Unit architecture capable of 300 to 750 tokens per second depending on the model. Neither platform runs on hardware available through standard cloud rental services.

Xiaomi’s approach rejects this hardware arms race. By optimizing software and model design in tandem, the company reached comparable speeds on rentable commodity GPUs. Any developer with access to a standard 8 GPU node can theoretically replicate the result, a shift that could redistribute competitive advantage away from specialized chipmakers and toward algorithmic innovation.

Compressing Intelligence With FP4 Quantization

The first pillar of UltraSpeed is an aggressive form of numerical compression called FP4 quantization. Standard model inference typically stores weights at 16 bit or 8 bit precision. Every weight read during a generation step consumes memory bandwidth in proportion to its bit width, and at trillion parameter scale, simply moving those weights from memory to processing cores creates a traffic jam that caps generation speed.

Xiaomi’s solution applies 4 bit precision selectively. The Mixture of Experts architecture contains many expert layers, which hold the vast majority of parameters and prove more tolerant of reduced precision. Xiaomi compressed only these expert layers down to the MXFP4 format while keeping other model modules at higher precision, reported as FP8. This selective approach shrinks memory footprint with little reported loss in output quality, because Quantization Aware Training integrated the compression into the model’s learning process. Xiaomi’s benchmarks show the quantized variant performing essentially on par with the original full precision version.
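For readers unfamiliar with MXFP4, the following is a minimal sketch of block scaled 4 bit quantization in the style of the Open Compute Project microscaling formats: 32 elements share one power of two scale, and each element is stored as a 4 bit E2M1 value. It is not Xiaomi's code. The function names are hypothetical, and the rounding is simplified (ties go to the smaller magnitude), so treat it as an illustration of the idea rather than a conforming implementation.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32        # MX formats share one scale per block of 32 elements
E2M1_MAX_EXPONENT = 2  # largest E2M1 values (4 and 6) sit in the 2**2 binade


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of signed grid values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale chosen from the largest magnitude in the block.
    shared_exp = math.floor(math.log2(amax)) - E2M1_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        magnitude = min(abs(v) / scale, E2M1_GRID[-1])  # clamp to the top of the grid
        # Nearest grid point; ties go to the smaller magnitude (a simplification).
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, element_bits=4, scale_bits=8):
    """Storage cost per weight once the shared 8-bit scale is amortized."""
    return element_bits + scale_bits / block_size


exp, codes = quantize_block([1.0, -0.5, 0.25, 3.0])
print(exp, codes, dequantize_block(exp, codes), bits_per_weight())
```

Values that do not land on the grid are rounded to the nearest representable point, which is the error that Quantization Aware Training teaches the model to tolerate.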

Because the expert layers constitute the bulk of the model, this selective compression dramatically reduces how much data must travel through memory during each generation step. The bandwidth savings translate directly into higher throughput.
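A back of the envelope calculation shows the scale of the saving. The sketch below uses the 1.02 trillion parameter count quoted earlier. The `expert_fraction` value is purely an assumption for illustration, since the article does not state what share of parameters sits in expert layers, and the totals describe storage footprint rather than the smaller set of weights a Mixture of Experts model actually reads per token.

```python
TOTAL_PARAMS = 1.02e12  # parameter count quoted for MiMo V2.5 Pro


def weight_bytes(params, bits_per_param):
    """Bytes needed to store `params` weights at the given bit width."""
    return params * bits_per_param / 8


def mixed_bytes(params, expert_fraction, expert_bits=4, other_bits=8):
    """Footprint when only the expert share is stored at the lower precision."""
    experts = weight_bytes(params * expert_fraction, expert_bits)
    others = weight_bytes(params * (1 - expert_fraction), other_bits)
    return experts + others


def terabytes(n_bytes):
    return n_bytes / 1e12


for bits in (16, 8, 4):
    print(f"{bits:>2} bit: {terabytes(weight_bytes(TOTAL_PARAMS, bits)):.2f} TB")

# expert_fraction=0.95 is a placeholder, not a figure from Xiaomi.
print(f"mixed: {terabytes(mixed_bytes(TOTAL_PARAMS, expert_fraction=0.95)):.2f} TB")
```

Going from 16 bit to 4 bit cuts the bytes per weight by a factor of four, and the same ratio applies to whatever subset of expert weights is read on each step.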

Predicting in Parallel Through DFlash Speculative Decoding

The second pillar addresses the sequential nature of language generation. Normally, models produce tokens one at a time, with each new token depending on all previous ones. Speculative decoding offers a shortcut: a smaller draft model rapidly guesses several upcoming tokens, and the larger target model verifies all guesses in parallel. Where guesses match, the system saves time. Traditional speculative decoding still forces the draft model to generate guesses sequentially, limiting the benefit.
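The draft and verify loop can be sketched with a toy example. Here the "target model" is just a fixed string and acceptance is greedy (a guess is kept only if it equals the token the target would have produced). The names `speculative_step` and `target_next` are hypothetical, and a real system scores the whole drafted block in one parallel forward pass instead of calling the target position by position as this sketch does for readability.

```python
def speculative_step(prefix, draft_block, target_next):
    """One verification round with greedy acceptance.

    prefix:       tokens generated so far.
    draft_block:  tokens proposed by the draft model.
    target_next:  function(prefix) -> token the target model would emit next.
    Returns the tokens appended this round.
    """
    accepted = []
    for guess in draft_block:
        expected = target_next(prefix + accepted)
        if guess != expected:
            accepted.append(expected)  # the target's token replaces the first miss
            return accepted
        accepted.append(guess)
    return accepted


# Toy stand-in for the target model: it always continues this fixed text.
TARGET = list("def add(a, b): return a + b")


def target_next(prefix):
    return TARGET[len(prefix)]


new_tokens = speculative_step(list("def "), list("add(x"), target_next)
print("".join(new_tokens))
```

Four guesses match and the fifth is corrected, so the round still advances five tokens for the price of one verification.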

DFlash, a method from the research community that Xiaomi refined, removes this sequential constraint entirely. Instead of guessing tokens one by one, the draft model fills an entire block of masked positions in a single forward pass. Xiaomi tuned this process using the Muon second order optimizer and model self distillation. The draft model uses Sliding Window Attention exclusively, ensuring the computational cost per prediction stays constant regardless of how long the conversation grows.
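Two properties described here lend themselves to a small cost model: block drafting needs one draft pass per block instead of one per token, and sliding window attention caps how many earlier tokens each prediction looks at. The sketch below counts those quantities only. The window size of 4096 is a placeholder, since the article does not give one, and the function names are illustrative.

```python
def draft_passes(block_size, block_parallel):
    """Draft-model forward passes needed to propose one block of tokens."""
    return 1 if block_parallel else block_size


def keys_attended(position, window=None):
    """Number of earlier tokens a new token at `position` attends to."""
    if window is None:
        return position           # full attention: grows with the context
    return min(position, window)  # sliding window: capped at the window size


WINDOW = 4096  # hypothetical window size, not a published DFlash setting

print(draft_passes(8, block_parallel=False), draft_passes(8, block_parallel=True))
for position in (1_000, 100_000, 1_000_000):
    print(position, keys_attended(position), keys_attended(position, WINDOW))
```

Once the context exceeds the window, the per prediction attention cost stops growing, which is the constant cost behavior the paragraph describes.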

In coding scenarios, the system achieves an average acceptance length of 6.3 tokens per verification round out of a block size capped at 8. That means the large model confirms about six tokens per step on average rather than generating them individually. Because rejection sampling accepts a drafted token only when it is consistent with the large model’s own predictions, the output distribution remains mathematically identical to that of standard decoding. The speedup comes without quality loss.
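The losslessness claim rests on the standard speculative sampling rule from the research literature: keep a drafted token with probability min(1, p/q), where p is the target probability and q the draft probability, and on rejection resample from the normalized positive part of p minus q. Whether Xiaomi uses exactly this rule is an assumption; the sketch below simply shows the rule and checks numerically that the resulting distribution equals the target distribution.

```python
import random


def accept_probability(p, q, token):
    """Chance the target keeps a draft token: min(1, p(token) / q(token))."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p, q):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p and q agree, so a rejection can never happen
        return list(p)
    return [d / total for d in diff]


def verify_token(p, q, token, rng=random):
    """Return the token emitted for one drafted position."""
    if rng.random() < accept_probability(p, q, token):
        return token
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p, q):
    """Exact distribution of verify_token when the draft token is sampled from q."""
    accepted = [qi * accept_probability(p, q, i) if qi > 0 else 0.0
                for i, qi in enumerate(q)]
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]


target = [0.5, 0.3, 0.2]  # toy target-model probabilities over three tokens
draft = [0.2, 0.6, 0.2]   # toy draft-model probabilities over the same tokens
print(output_distribution(target, draft))
```

The printed distribution matches `target` up to floating point error, even though the draft model disagrees with it substantially.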

Eliminating Microsecond Gaps With TileRT

The third pillar targets an invisible bottleneck that emerges only at extreme speeds. When inference crosses 1,000 tokens per second, each mathematical operator executes for mere microseconds. Traditional serving systems launch operators one after another, with small gaps between each launch. At normal speeds these gaps are negligible. At UltraSpeed levels, they fracture the execution stream and become the primary bottleneck.
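A rough timing model shows why launch gaps start to matter. At 1,000 tokens per second with 6.3 tokens accepted per verification round, each round has about 6.3 milliseconds in total. The operator count and gap length below are hypothetical placeholders, not measurements from Xiaomi or TileRT, and the model ignores the time spent in the draft model.

```python
def round_budget_us(tokens_per_second, tokens_per_round):
    """Microseconds available per verification round at a given output rate."""
    return tokens_per_round / tokens_per_second * 1_000_000


def launch_gap_share(ops_per_round, gap_us, budget_us):
    """Fraction of the round budget spent idle between operator launches."""
    return ops_per_round * gap_us / budget_us


OPS_PER_ROUND = 1000  # hypothetical operator count per forward pass
GAP_US = 2.0          # hypothetical idle gap between consecutive launches

budget = round_budget_us(1000, 6.3)
share = launch_gap_share(OPS_PER_ROUND, GAP_US, budget)
print(f"budget per round: {budget:.0f} us, idle share: {share:.0%}")
```

Because the idle share is inversely proportional to the budget, every halving of the time per round doubles the fraction lost to gaps, whatever the real operator count turns out to be.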

TileRT, a runtime co developed with the inference partner of the same name, solves this with a Persistent Engine Kernel that stays resident inside the GPU continuously. Rather than starting and stopping operations, the kernel uses Warp Specialization to assign specific hardware threads to data movement, computation, and communication in coordinated roles. Even tiny operations like RMSNorm, rotary position embeddings, and KV cache writes, which normally create microsecond scale delays, execute without interruption.

The system was designed alongside the FP4 and DFlash choices rather than added afterward. This extreme model system codesign ensures all three layers align tightly. Xiaomi explicitly states that none of the three techniques alone achieves 1,000 tokens per second. Only their combined execution produces the record result.

Real World Impact and Business Context

The 1,000 token threshold changes which applications make sense for frontier class models. Parallel reasoning becomes practical, allowing systems to run dozens of Best of N or tree search reasoning paths within the same wall clock time that a slower model would need for one. Coding agents benefit because faster generation reduces wait time between autonomous steps. Real time decision loops in trading signal generation, fraud interception, and live dialogue systems gain viability when latency drops below human perception thresholds.

Xiaomi demonstrated the practical effect with public demos. A fully functional Snake game was generated in approximately ten seconds. A macOS interface prototype appeared in about one minute. These are throughput bound tasks where raw token speed serves as the binding constraint.

Access to UltraSpeed runs through a limited, application based API trial from June 9 to June 23, 2026. Pricing sits at three times the standard MiMo V2.5 Pro rate, which translates to roughly $1.29 per million input tokens and $2.61 per million output tokens according to current exchange calculations. Xiaomi promises approximately ten times the generation speed for that premium. The Token Plan subscription service does not apply to UltraSpeed. Approved users receive two weeks of free Chat access with daily limits: ten queue entries per account, thirty minute session caps, and automatic resource release after five minutes of idle time.
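The pricing can be worked through in yuan using the standard rates quoted earlier in the article. The sketch assumes the three times multiplier applies uniformly to cache hit input, cache miss input, and output tokens, which the article does not spell out, and it leaves currency conversion aside because the exchange rate is not given. The names are illustrative.

```python
# Standard MiMo V2.5 Pro rates in yuan per million tokens, as quoted in the article.
STANDARD_YUAN_PER_MTOK = {
    "input_cache_hit": 0.025,
    "input_cache_miss": 3.0,
    "output": 6.0,
}
ULTRASPEED_MULTIPLIER = 3  # assumed to apply to every rate equally


def request_cost_yuan(hit_tokens, miss_tokens, output_tokens, multiplier=1):
    """Cost of one request in yuan at the standard rates times `multiplier`."""
    rates = STANDARD_YUAN_PER_MTOK
    total = (
        hit_tokens * rates["input_cache_hit"]
        + miss_tokens * rates["input_cache_miss"]
        + output_tokens * rates["output"]
    ) / 1_000_000
    return total * multiplier


standard = request_cost_yuan(0, 1_000_000, 1_000_000)
ultra = request_cost_yuan(0, 1_000_000, 1_000_000, ULTRASPEED_MULTIPLIER)
print(f"standard: {standard} yuan, UltraSpeed: {ultra} yuan")
```

Under that assumption, a million uncached input tokens plus a million output tokens costs 9 yuan at the standard rate and 27 yuan in UltraSpeed mode.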

Despite these technical milestones, financial markets have reacted with indifference. Xiaomi shares closed at €2.91 on June 11, just 1.75% above the 52 week low and marking a 35% year to date decline. The company continues diversifying, launching the premium 17T smartphone in India on the same day as the MiMo Code assistant release, and pursuing electric vehicle expansion with the city of Changchun. Whether operational momentum can reverse stock performance depends on upcoming earnings data.

Open Source Release and the Verification Question

Xiaomi has open sourced the MiMo V2.5 Pro FP4 DFlash checkpoint on Hugging Face, including quantized weights and DFlash parameters. TileRT has published select modules on GitHub. This transparency invites independent community testing, which will prove critical because all current speed claims originate from Xiaomi’s internal benchmarks. No independent third party verification has been published yet.

Community replication will test several open questions. First, whether the 1,000 tokens per second figure holds on common GPU instances outside Xiaomi’s controlled trial environment. Second, whether quality regressions appear in open ended conversational tasks, where acceptance rates for speculative decoding currently run lower than in structured coding tasks. Third, whether TileRT’s persistent kernel approach can integrate with standard serving stacks already deployed in production environments.

If independent tests confirm the claims, Xiaomi will have achieved with software what competitors spent hundreds of millions of dollars pursuing through custom silicon. The result would suggest that algorithmic co design can unlock performance tiers previously thought to require specialized hardware, potentially redirecting investment across the artificial intelligence industry.

Key Points

Xiaomi’s MiMo V2.5 Pro UltraSpeed became the first trillion parameter model to exceed 1,000 tokens per second on standard commodity GPUs when it launched on June 8, 2026.
The speed comes from three coordinated techniques: FP4 expert layer quantization, DFlash block level speculative decoding, and the TileRT persistent GPU runtime.
The underlying MiMo V2.5 Pro model benchmarks near Claude Opus 4.6 on coding tasks while costing roughly one eighth the price per token.
Unlike competitors such as Cerebras and Groq, Xiaomi achieved the record without custom silicon, using a standard 8 GPU node available through cloud providers.
The MiMo V2.5 Pro FP4 DFlash checkpoint is available under an MIT open source license on Hugging Face, with TileRT modules on GitHub.
A limited API trial runs from June 9 to June 23, 2026, priced at three times the standard rate for approximately ten times the generation speed.
Xiaomi’s stock remains near 52 week lows despite the technical breakthrough, with shares down 35% year to date as of mid June 2026.
