2026-06-08 · Updated 2026-06-09

Xiaomi pushed a 1T model to 1000 tokens/s — without special hardware

MiMo-V2.5-Pro-UltraSpeed decodes a trillion-parameter model past 1000 tps on a single 8-GPU commodity node. The real signal is that model-system codesign broke the 'extreme speed needs custom silicon' equation — not the operating-room marketing wrapped around it.

inference frontier-models ai-infra

Summary

The number to remember from this release is not “1000 tokens/s.” It is that the number came out of a trillion-parameter MoE model running on a single 8-GPU node you can actually buy. Until now, the industry’s default answer for pushing decode speed to this tier was to change the hardware — Cerebras’s wafer-scale integration, Groq’s architecture that pins the whole model into on-chip SRAM. MiMo-V2.5-Pro-UltraSpeed took the other road: no custom silicon, just model-system codesign squeezing the same speed out of commodity GPUs.

What that breaks is a quietly assumed equation — that extreme inference speed requires special-purpose hardware. If a 1T model can decode past a thousand tokens per second on GPUs anyone can rent, then the ceiling is no longer set by whether you can afford Groq or get in line for Cerebras. It is set by whether your quantization scheme, your speculative decoding, and your kernel scheduling are good enough. For teams running their own inference stack and building coding or real-time agents, that is a real signal: FP4-only-experts and DFlash are patterns you can borrow.

But the official narrative needs to be split in two. The blog dresses up “speed is intelligence,” an operating-room race against death, and a limited-window application gate as something democratizing. That is marketing, not signal. The judgment that matters is which pieces of engineering are reproducible and portable, and which are demo-grade resource allocation and emotional copy. The sections below take those two layers apart one at a time.

What happened

On June 8, Xiaomi’s MiMo model team and the TileRT systems team jointly released MiMo-V2.5-Pro-UltraSpeed, claiming the first decode speed past 1000 tokens/s on a 1T-parameter model — the demo video peaks around 1200 tps — on a standard 8-GPU commodity node. The blog explicitly positions itself against the Cerebras/Groq custom-hardware route, stressing that the speed comes “on commodity GPUs, through model-system codesign alone.”

The result rests on three pieces of engineering, none optional.

First, FP4 (MXFP4) quantization, applied only to the MoE Experts. MiMo-V2.5-Pro is an MoE model in which the Experts hold the vast majority of parameters and tolerate quantization best, so everything else keeps its original precision. Combined with FP4 QAT (quantization-aware training), Xiaomi claims overall capability stays essentially on par with FP8.

Second, DFlash, a block-level masked parallel speculative decoding method. Where traditional speculative decoding uses a small draft model to guess tokens one at a time, DFlash fills an entire block of masked positions in a single forward pass, removing the serial constraint of autoregressive drafting. Block size is capped at 8, and the drafter is trained with sliding-window attention (SWA), the Muon optimizer, and self-distillation. Reported average accepted length is 6.30 for coding, 5.56 for math/reasoning, and 4.29 for agents (coding peaks at 7.14, so 6 to 7 of every 8 draft tokens are accepted).

Third, the TileRT inference system. At 1000 tps each operator’s lifecycle compresses to microseconds, and the “operator boundaries” of conventional systems (every kernel launch, hardware sync, and global-memory round trip) fracture the execution flow into visible “execution gaps.” TileRT erases them with a persistent engine kernel (the whole pipeline stays resident and prefetching inside the GPU) plus warp specialization (communication, data movement, and tensor compute decomposed at the tile level and run as a coordinated heterogeneous pipeline).
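To make the first piece concrete, here is a minimal sketch of MXFP4-style fake quantization on one block of weights, assuming the OCP Microscaling layout: 32 elements share one power-of-two scale, and each element is a 4-bit E2M1 value. The function name `quantize_block` is hypothetical, tie-breaking is simplified, and nothing here is taken from Xiaomi's QAT code; it only shows the kind of rounding a model has to learn to live with.

```python
import math

# Magnitudes representable by FP4 E2M1 (sign is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # largest element exponent: 6.0 = 1.5 * 2**2
BLOCK = 32      # elements sharing one scale in the MX formats


def quantize_block(values):
    """Fake-quantize one block; returns (shared_scale, dequantized_values).

    Simplification (assumption): ties round toward the smaller magnitude
    instead of round-to-nearest-even.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    scale = 2.0 ** ((e - 1) - E2M1_EMAX)
    out = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        q = min(E2M1_VALUES, key=lambda c: abs(c - mag))
        out.append(math.copysign(q * scale, v))
    return scale, out


if __name__ == "__main__":
    block = [0.1 * i for i in range(-16, 16)]  # one block of BLOCK toy weights
    scale, deq = quantize_block(block)
    worst = max(abs(a - b) for a, b in zip(block, deq))
    print(f"scale={scale}, worst abs error={worst:.3f}")
```

Only eight magnitudes per block survive, which is why the source stresses QAT rather than post-training quantization.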

The commercial and open-source moves came alongside. The API is gated by application, open only June 9–23, priced at 3× MiMo-V2.5-Pro for roughly 10× the speed, and excluded from the Token Plan. Free Chat is also two weeks only, capped at 10 queue entries per account per day, 30-minute sessions, auto-released after 5 idle minutes. At the same time Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on HuggingFace, including FP4 weights and DFlash parameters. DFlash itself comes from a research paper (arXiv 2602.06036) that uses a lightweight block-diffusion model for parallel drafting and reports over 6× lossless acceleration across models and tasks — up to 2.5× faster than the prior SOTA, EAGLE-3.

Why it matters

The belief worth revising is that extreme speed can only come from special hardware. The Cerebras and Groq stories rest on an implicit premise: that the memory-bandwidth and operator-scheduling ceilings of general-purpose GPUs are fixed, and the only way past them is a different physical form. MiMo just showed there is a lot of headroom left under that ceiling, reachable by binding model design and system design together. FP4 here is trained in through QAT, so the model learns to live in low precision. DFlash’s draft model deliberately uses only SWA so it aligns with the MiMo-V2 series and turns the draft’s per-step attention cost from context-length-linear into a constant window. TileRT’s kernels are built specifically for this quantization-and-speculation pipeline. The three layers are not optimized separately and bolted together; they make concessions to each other and co-evolve. That is exactly what makes it valuable to builders: it demonstrates a methodology, not a machine you cannot buy.

The second thing to keep is that part of this pattern is directly reusable. The two reusable pieces are FP4-only-experts and DFlash block-parallel drafting; their mechanics are covered in the technical takeaway below, so the pointer here is short. If you serve an MoE, evaluate dropping only the Experts to FP4 now; if you do structured generation, study the open DFlash checkpoint. Neither is a black box you can only admire from outside, and the bar to adopt is lower than it looks.

Third, a bucket of cold water: this speed is, for now, a demo-grade resource, not a production SLA you can lean on (for why, see the final section’s breakdown of the application gate). Until it ships steady supply and transparent pricing, treat it as a capability showcase, not a service you can bet a launch on.

Technical takeaway

Of the three pieces, the first two are the ones builders should copy, because they couple least to which hardware or framework you run.

The point of FP4-only-experts is not FP4 itself (MXFP4 is an OCP-standardized format with native support on Blackwell-class cards — some on HN inferred that’s exactly what this runs on). The point is the layered idea of choosing quantization precision per module. It turns what looks like a global decision — how many bits for the model — into a problem of allocating a bit-budget by parameter sensitivity. That holds for any MoE deployment and ships without Xiaomi’s kernels; mainstream inference frameworks already support mixed precision well enough. The caveat: it depends on QAT, not free post-training quantization. You need the budget to retrain or fine-tune to land the “on par with FP8” result.
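The bit-budget framing can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the 96% / 4% parameter split between Experts and everything else is a hypothetical placeholder (the source says only “the vast majority”), the non-expert precision is assumed to be 8 bits, and MXFP4 is counted as 4.25 effective bits per weight (4 bits plus one 8-bit scale shared by 32 elements).

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    params: float  # parameter count
    bits: float    # effective bits per parameter, including scale overhead


def avg_bits(modules):
    """Parameter-weighted average bits per weight."""
    total = sum(m.params for m in modules)
    return sum(m.params * m.bits for m in modules) / total


def weight_gib(modules):
    """Weight footprint in GiB."""
    return sum(m.params * m.bits for m in modules) / 8 / 2**30


MXFP4_BITS = 4 + 8 / 32  # 4-bit elements plus a shared 8-bit scale per 32

# Hypothetical split of a 1T-parameter MoE; not a published figure.
all_fp8 = [Module("experts", 0.96e12, 8.0), Module("other", 0.04e12, 8.0)]
mixed = [Module("experts", 0.96e12, MXFP4_BITS), Module("other", 0.04e12, 8.0)]

if __name__ == "__main__":
    for label, cfg in (("all FP8", all_fp8), ("experts MXFP4", mixed)):
        print(f"{label:14s} avg bits={avg_bits(cfg):.2f}  weights={weight_gib(cfg):.0f} GiB")
```

Under these assumed numbers the mixed scheme averages about 4.40 bits per weight, roughly 55% of the all-FP8 footprint, while the sensitive non-expert layers are left untouched. Swap in your own model's parameter split to see where the budget actually goes.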

DFlash is the more interesting half, because it splices together speculative decoding and diffusion models. Traditional speculative decoding (EAGLE family included) still drafts autoregressively — the serial chain is intact. DFlash drafts with block diffusion: one forward pass fills a whole block of masked positions, parallel by construction. Xiaomi layered three adaptations on top — the draft uses only SWA to shed dependence on the full prefix, turning its per-step attention cost from context-length-linear into a constant window (the KV cache still grows with generation); during training, mask-signal sampling is pushed down to GPU-local shards so one sequence yields tens of thousands of independent training signals in a single step without cross-device communication; and Muon plus self-distillation keeps acceptance high even for compact mask blocks. The result: structured, predictable tasks like coding and math get strong accepted lengths (6.30/5.56), agent tasks middling (4.29).
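The draft-then-verify loop that block drafting plugs into can be sketched with toy stand-ins. This is generic greedy speculative verification, not DFlash's implementation: `target_next` and `draft_block` are hypothetical placeholders for the large model and the block drafter, and the drafter's loop exists only to fabricate plausible guesses (a real block drafter would emit all `k` tokens in one forward pass).

```python
def target_next(prefix):
    """Toy stand-in for the large model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix, k):
    """Toy drafter: agrees with the target except at every 5th position."""
    out, ctx = [], list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % 50  # injected drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def verify_step(prefix, k):
    """Accept the longest correct draft prefix, then add one target token."""
    accepted, ctx = [], list(prefix)
    for tok in draft_block(prefix, k):
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # whole block accepted: bonus token
    return accepted


def generate(prefix, n_tokens, k=8):
    """Speculative generation; returns (tokens, number of verify steps)."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(verify_step(out, k))
        steps += 1
    return out[len(prefix):len(prefix) + n_tokens], steps


def greedy(prefix, n_tokens):
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prefix):]


if __name__ == "__main__":
    tokens, steps = generate([1, 2, 3], 40)
    print("matches greedy:", tokens == greedy([1, 2, 3], 40), "steps:", steps)
```

Every emitted token is one the target would have produced anyway, which is what “lossless” means here. The gain comes from needing far fewer target passes than tokens, and it shrinks as the drafter's guesses get worse.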

But that acceptance-length table is exactly where the boundary shows. Xiaomi admits it: in semantically divergent, high-uncertainty open conversation, acceptance is still low. That undercuts the “speed is intelligence” pitch. Speculative decoding’s speedup is fundamentally a predictability dividend: the more formulaic and answer-bounded the task (writing code, doing math), the faster it goes; the more open and creative the dialogue, the slower. In other words, the system is extremely fast on verifiable, well-structured tasks and falls off noticeably in free-form chat. That shape is what makes it suit coding agents.
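A back-of-envelope model shows how accepted length turns into speed. It rests on loud assumptions: one verify pass costs about as much as one ordinary decode step, the drafter costs a hypothetical 10% of that, and the reported accepted length is read as tokens emitted per draft-verify cycle (the source does not say whether the figure includes the bonus token). It covers only the speculative-decoding factor, not FP4 or TileRT.

```python
def est_speedup(tokens_per_cycle, draft_cost=0.1, verify_cost=1.0):
    """Rough speedup over plain decoding.

    Costs are in units of one ordinary target decode step.
    draft_cost=0.1 is a hypothetical placeholder, not a measured value.
    """
    return tokens_per_cycle / (draft_cost + verify_cost)


# Average accepted lengths as reported in the release.
reported = {"coding": 6.30, "math/reasoning": 5.56, "agents": 4.29}

if __name__ == "__main__":
    for task, tau in reported.items():
        print(f"{task:15s} ~{est_speedup(tau):.1f}x from speculation alone")
```

The speedup is linear in accepted length, so a workload whose acceptance falls toward 1 or 2 tokens per cycle loses most of the benefit while still paying the drafting cost.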

Builder impact

If you run your own inference stack, this release should change how you think about what sets the speed ceiling. The takeaway: before spending on special hardware, exhaust the model-system codesign space first. Concretely, three steps. One — if you serve MoE, evaluate “quantize only the Experts to FP4” immediately; it is the highest-ROI single cut, but budget for QAT and don’t expect post-training quantization to give you the same accuracy for free. Two — if your load is coding agents or structured generation, look seriously at DFlash: the checkpoint and paper are public, and block-parallel drafting’s accepted length (6+) on high-predictability tasks delivers real end-to-end gains over autoregressive drafts. Three — the TileRT layer (persistent kernels, warp specialization, killing operator-boundary gaps) has the biggest payoff but is the hardest to build yourself; it demands low-level kernel engineering. For most teams the realistic move is to watch whether projects like TileRT (open-sourced at tile-ai/TileRT) become usable, not to write it from scratch.

One discipline to hold, though: don’t refactor things you don’t need just to chase this speed. For most products the bottleneck is not that decode is too slow — it’s time-to-first-token, concurrent throughput, cost, reliability. A thousand tps solves “how long until a long output finishes,” which is real value for best-of-N sampling, long code generation, and real-time interaction. But if your users only need a few dozen tokens at a time, or your bottleneck is retrieval and tool calls, this speed is nearly irrelevant to you. Before investing, confirm you actually fall into the first group.
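A quick way to check which group you fall into is to put decode speed next to time-to-first-token. The sketch below uses hypothetical numbers throughout: a 0.5 s TTFT, and 100 tps versus 1000 tps as a stand-in for the “roughly 10×” claim.

```python
def response_seconds(n_tokens, tps, ttft=0.5):
    """End-to-end time: time-to-first-token plus decode time.

    ttft=0.5 is a hypothetical placeholder; measure your own stack.
    """
    return ttft + n_tokens / tps


def gain(n_tokens, slow_tps=100.0, fast_tps=1000.0, ttft=0.5):
    """How many times faster the whole response gets, not just decode."""
    return response_seconds(n_tokens, slow_tps, ttft) / response_seconds(n_tokens, fast_tps, ttft)


if __name__ == "__main__":
    for n in (40, 400, 4000):
        print(f"{n:5d} tokens: {gain(n):.1f}x end-to-end")
```

With these assumed values a 40-token reply gets less than 2× faster end to end, while a 4000-token generation gets close to the full 10×. Add retrieval or tool-call latency to `ttft` and the short-output case improves even less.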

On the commercial side the pointer is one line: don’t treat this limited-window API as a production dependency — to integrate, run the open checkpoint yourself and treat it as a reproducible method, not a callable service. And don’t anchor on Xiaomi’s headline numbers: some on HN questioned whether the demo’s 1200 tps peak reflects sustained throughput or a cherry-picked instantaneous figure. The numbers worth trusting are average decode speed and per-task accepted length — both of which you can re-measure by running the checkpoint.

What to ignore

The first thing to drop is “speed is intelligence.” The blog frames it as the core thesis — that once a model is fast enough it stops being a tool you wait on and becomes an extension of thought. The technical takeaway above already showed why that doesn’t hold: speculative decoding feeds on a predictability dividend, and what converts speed into quality is running best-of-N in the same wall-clock time, then picking the best path — which has a hard precondition, a way to verify which path is right. Code you can test and math you can check buy quality from speed; on open tasks with no verifier, ten paths just give you ten answers you can’t rank. Speed is an amplifier; verifiability is the switch that decides which way it amplifies.
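The “verifiability is the switch” point can be shown in a few lines. In this sketch, `best_of_n` and the toy candidates are hypothetical; the candidates stand in for N sampled completions, and the verifier is a tiny unit-test suite.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Iterable[T],
              verify: Optional[Callable[[T], bool]]) -> Optional[T]:
    """Return the first candidate that passes verification.

    Without a verifier there is no principled way to pick, so return None.
    """
    if verify is None:
        return None
    for cand in candidates:
        if verify(cand):
            return cand
    return None


# Three "sampled" implementations of abs(); only the last is correct.
candidates = [
    lambda x: x,
    lambda x: -x,
    lambda x: x if x >= 0 else -x,
]


def passes_tests(f) -> bool:
    """Verifier: a tiny unit-test suite."""
    return all(f(a) == b for a, b in [(3, 3), (-4, 4), (0, 0)])


if __name__ == "__main__":
    print("with tests:   ", best_of_n(candidates, passes_tests) is candidates[2])
    print("without tests:", best_of_n(candidates, None))
```

Faster decoding raises how many candidates fit in the same wall-clock budget; it does nothing for the `verify is None` branch.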

The second is the operating-room narrative. Dressing a 1T model’s speedup as “racing death on the operating table, buying the surgeon one more degree of freedom” is pure emotional marketing with no technical link to anything in this release. The bottleneck in medical imaging was never LLM decode speed — it is accuracy, interpretability, regulatory approval, and liability. Grafting a general-text-generation tps figure onto a life-or-death scene is borrowing unrelated gravitas to gild an inference-speed demo. Skip it.

The third is reading the application gate as access for all. “3× price for 10× speed” and “free Chat, limited time” sound like generosity, but 10 queue entries a day, 30-minute sessions, no guarantee of approval, two weeks only — those are rationing mechanisms for a scarce resource. The truth they convey is that high-speed inference supply is, today, extremely limited; this is a controlled capability showcase, not a scalable, dependable product. The HN skepticism lands well — this is not a company burning investor cash, and this pricing and these caps will eventually have to square on the ledger. Until it becomes a real service with a stable SLA, treat it as a demo, not infrastructure to bet on.