DeepSeek V4: Open Weights Finally Lead on the Efficiency Frontier, Not the Leaderboard
The real signal in DeepSeek V4 is a 1.6T MoE plus serving-side engineering that makes frontier capability affordable and self-hostable—the first time the open-weight camp leads on cost-per-token and throughput rather than chasing SOTA.
Summary
What makes DeepSeek V4 worth remembering is not another “Chinese open model challenges SOTA” headline; it is that the release puts the cost structure of frontier capability on the table. Per SemiAnalysis’s InferenceX team, which tracked the model live from Day 0 through Day 43, on a GB300 NVL72 rack-scale system with MTP (multi-token-prediction speculative decoding) enabled, assuming 8k input / 1k output at 50 tokens/s/user interactivity, the cost per million output tokens lands at $0.156. And this is an open-weight model—anyone can download it, self-host it, and reproduce that cost curve on their own hardware.
Put that in a builder’s terms. To get unit economics at this level you previously had to use someone’s closed API, accept its pricing and rate limits, and hand over your data. DeepSeek V4 swaps that path for “the weights are in your hands, and your deployment decides the cost.” For teams running large, sustained inference workloads, what genuinely changes is cost structure and deployment autonomy; the leaderboard slot is a side issue.
So the narrative this piece deliberately dismantles is the “DeepSeek V4 crushes some closed SOTA” framing. The SemiAnalysis analysis never plays the benchmark game; it measures, day by day, how a 1.6T-parameter MoE actually performs across six or seven hardware SKUs and three or four serving engines. The signal lives in those curves: for the first time, the open-weight camp leads on the efficiency frontier—cost per token, throughput per watt—rather than in the race to match the capability ceiling. Below, we separate the signal from the noise layer by layer.
What happened
DeepSeek V4 (called DeepSeek v4 Pro in the SemiAnalysis piece) is a 1.6T-parameter MoE released with open weights by DeepSeek, a Chinese lab. SemiAnalysis’s open-source InferenceX engineering team began recording performance on launch day (Day 0), using open-source images and official recipes across as many hardware SKUs and serving frameworks as possible, and kept tracking through Day 43, with all data going into an open GitHub repo and public dashboard. They deliberately capture the day-by-day iteration, because that curve is what reflects real deployable performance; a single “best snapshot” would distort it.
A few observations are worth unpacking. First, Day 0 multi-stack support: the moment the model shipped, native vLLM and SGLang on CUDA worked out of the box, with most recipes—even for newer SKUs like B200/B300—running without major issues. That is a testament to the strength of the vLLM and SGLang open ecosystems; both teams have spun out companies (Inferact, RadixArk) and each raised hundreds of millions to keep pushing their engines. The native checkpoint uses mixed precision: FP4 for the MoE, FP8 for attention.
Second, the maturity gap between hardware stacks is large—and it converges fast as engineering pours in. Per SemiAnalysis, only two stacks delivered first-class Day 0 support: NVIDIA CUDA and Huawei CANN (Ascend). AMD’s ROCm on the MI355X could only run FP8 on Day 0, at an interactivity of just 1–2 tokens per user per second—far below reading speed, effectively unusable. Yet under HaiShaw’s AMD SGLang team, throughput improved by more than 100x by Day 26, achieved by replacing PyTorch-native fallback paths with real AITER/Triton/TileLang kernels and getting the FP4-weight MoE working. NVIDIA’s own TensorRT-LLM, by contrast, shipped with a bug: a hidden size hardcoded to 4096, while V4 Pro is 7168, meaning that for over a week the default config would silently produce corrupted generations—ultimately fixed by a SemiAnalysis-authored PR, by which point it was Day 9.
Third, rack-scale systems are today’s benchmark for cost and throughput. The GB300 NVL72 puts 72 GPUs in a single NVLink domain, keeping the MoE dispatch/combine traffic entirely on NVLink instead of spilling onto the slower scale-out fabric, while amortizing expert-weight loads across far more ranks—the structural reason it reaches that $0.156 per million output tokens. By comparison, B200/B300 are 8-GPU NVLink islands scaled out over InfiniBand and hit the wall earlier, and the MI355X sits further back on both scale-up domain size and collective-stack maturity.
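To make the NVLink-domain argument concrete, here is a rough back-of-the-envelope sketch. It assumes uniform routing (each token's expert is equally likely to sit on any rank) and a hypothetical expert-parallel width of 72 ranks; neither assumption comes from SemiAnalysis, and real routing is not uniform.

```python
def in_domain_fraction(ep_ranks: int, nvlink_domain: int) -> float:
    """Fraction of token-to-expert dispatches that stay inside one NVLink domain.

    Assumes uniform routing: a token's expert is equally likely to live on
    any of the ep_ranks GPUs, including the sender's own.
    """
    if ep_ranks <= 0 or nvlink_domain <= 0:
        raise ValueError("rank counts must be positive")
    # Ranks reachable over NVLink cannot exceed the expert-parallel group itself.
    local_ranks = min(ep_ranks, nvlink_domain)
    return local_ranks / ep_ranks


EP_RANKS = 72  # hypothetical expert-parallel width, chosen to match one NVL72 rack

for name, domain in [("NVL72 rack", 72), ("8-GPU island", 8)]:
    frac = in_domain_fraction(EP_RANKS, domain)
    print(f"{name}: {frac:.1%} of dispatches stay on NVLink")
```

Under these assumptions, everything stays on NVLink in the 72-GPU domain, while with 8-GPU islands only about one dispatch in nine does; the rest crosses the scale-out fabric.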
Fourth, software optimization converts directly into power efficiency. Per SemiAnalysis, on the B200 with vLLM, tokens per second per all-in provisioned-utility megawatt (which factors in datacenter PUE and overhead) went from roughly 300,000 on Day 0 to nearly 500,000 by June 5—about 1.7x. Because a B200’s all-in utility power envelope is fixed near 2.17 kW/GPU, that jump is a pure software gain. The same class of optimizations that pushed the throughput frontier—MegaMoE grouped-FP4 GEMMs, wider expert parallelism (Wide EP)—drops straight through to tokens per watt.
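The arithmetic behind that figure of merit is easy to check. The sketch below uses only the approximate numbers quoted above (300,000 and 500,000 tokens/s per megawatt, 2.17 kW per GPU); the function names are illustrative.

```python
KW_PER_GPU = 2.17              # all-in utility power per B200, from the text
DAY0_TOK_S_PER_MW = 300_000    # approximate Day 0 figure, from the text
LATER_TOK_S_PER_MW = 500_000   # approximate later figure, from the text


def gpus_per_megawatt(kw_per_gpu: float) -> float:
    """How many GPUs one provisioned megawatt can power."""
    return 1000.0 / kw_per_gpu


def tokens_per_sec_per_gpu(tok_s_per_mw: float, kw_per_gpu: float) -> float:
    """Convert a per-megawatt throughput figure into a per-GPU one."""
    return tok_s_per_mw / gpus_per_megawatt(kw_per_gpu)


day0 = tokens_per_sec_per_gpu(DAY0_TOK_S_PER_MW, KW_PER_GPU)
later = tokens_per_sec_per_gpu(LATER_TOK_S_PER_MW, KW_PER_GPU)

print(f"GPUs per MW: {gpus_per_megawatt(KW_PER_GPU):.0f}")
print(f"Day 0: {day0:.0f} tokens/s/GPU")
print(f"Later: {later:.0f} tokens/s/GPU")
# The power envelope is the same in both cases, so the ratio is software only.
print(f"Gain:  {LATER_TOK_S_PER_MW / DAY0_TOK_S_PER_MW:.2f}x")
```

Since the per-GPU power is held constant, the per-megawatt gain and the per-GPU gain are the same ratio, which is why the text can call it a pure software gain.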
Beyond that, DeepSeek V4 is the first major open model to get first-class Day 0 support on Huawei Ascend, with part of DeepSeek’s official API served on Huawei hardware from Day 0. That contrasts with last year’s V3/R1 launch, when only the CUDA stack worked on Day 0, and is a sign that the architecture was co-designed in part for Ascend inference.
Why it matters
First, the assumption worth updating is “open weights mean so-so economics; if you actually want to save money you go back to a closed API.” DeepSeek V4 undercuts that. The cost curve it exposes is not a vendor’s list price but a deployable cost that anyone can reproduce or approach in their own datacenter or rented rack. That means frontier-grade inference economics now has an open, auditable reference point: you are no longer stuck choosing between “trust a vendor’s pricing” and “tune everything from scratch.”
Second, deployment autonomy and cost structure are two different things, but this release delivers both at once. Open weights give you autonomy—you can keep the model inside your own compliance boundary, dodge rate limits, and provision capacity to your own demand shape. The SemiAnalysis curves then add that this autonomy need not cost a premium in unit economics, provided you (or a partner) are willing to invest in serving-side engineering. For teams with data-residency requirements, steady high-volume traffic, or simply an aversion to single-vendor lock-in, that is a signal that genuinely shifts the build-vs-buy calculus.
Third, the speed at which multi-stack maturity converges is itself builder intelligence. The AMD curve, from near-unusable to over 100x better by Day 26, shows that an open model’s hardware economics swing wildly in the first month after launch; running on Day 0 is only the starting point. That has direct implications for procurement and rollout timing: locking in a hardware choice the week a new open model ships likely means buying at the worst moment, while giving the software ecosystem a few weeks to catch up can change the picture entirely.
Builder impact
If you are doing capacity planning for self-hosted or hybrid inference, the one thing to take away is this: treat cost-per-token and throughput-per-watt as your primary metrics. Single-GPU throughput and leaderboard scores are secondary; cost and power efficiency are what decide fleet economics. SemiAnalysis repeatedly stresses that tokens per second per all-in utility megawatt is the best figure of merit for fleet-scale ROI because it captures PUE and datacenter overhead. Many organizations’ real constraint is scarce provisioned power, and the question is how to convert each provisioned megawatt into as many billable tokens as possible. DeepSeek V4 lets that power-anchored accounting sit on an open-weight model for the first time.
Second, choose your serving engine by the interactivity tier you target. Per the SemiAnalysis curves, TensorRT-LLM is stronger at high batch sizes and falls behind at higher interactivity, and it does not work out of the box; native vLLM / SGLang on CUDA are usable from Day 0 and are the most reliable landing spot for any new open model. The rack-scale GB300 leads at every interactivity tier once MTP is on—but only if you can get an NVL72-class scale-up domain. If you cannot, do not budget your 8-GPU deployment against its cost figure.
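Choosing by interactivity tier amounts to a simple filter-then-minimize over benchmark points. The sketch below shows that selection logic only: the configuration labels and every number in it are placeholders invented for illustration, not SemiAnalysis measurements, and you would substitute points taken from a public dashboard or your own runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchPoint:
    config: str               # label for one hardware + engine combination
    interactivity: float      # tokens/s/user delivered at this operating point
    cost_per_m_tokens: float  # USD per million output tokens at this point


def cheapest_at_tier(points: list[BenchPoint],
                     min_interactivity: float) -> Optional[BenchPoint]:
    """Return the lowest-cost point that meets the interactivity target."""
    eligible = [p for p in points if p.interactivity >= min_interactivity]
    return min(eligible, key=lambda p: p.cost_per_m_tokens, default=None)


# Placeholder numbers for illustration only; not measured results.
POINTS = [
    BenchPoint("config_a", 20.0, 0.40),
    BenchPoint("config_a", 60.0, 0.90),
    BenchPoint("config_b", 20.0, 0.55),
    BenchPoint("config_b", 60.0, 0.70),
]

for tier in (20.0, 50.0, 100.0):
    best = cheapest_at_tier(POINTS, tier)
    label = best.config if best else "no config meets this tier"
    print(f"target {tier:.0f} tok/s/user -> {label}")
```

With these placeholder points the winner flips between tiers, which is the pattern the text describes: the cheapest stack at low interactivity is not necessarily the cheapest at high interactivity.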
Third, leave the ecosystem a catch-up window in your hardware selection. The economics of non-CUDA stacks such as AMD and Huawei can change sharply within a month of launch; AMD’s throughput improved more than 100x by Day 26, so today’s “unusable” is not next quarter’s verdict. The pragmatic move is to get the workload running on the native CUDA stack first while tracking the iteration curves on a public dashboard like InferenceX, then revisit procurement once the economics inflect.
Technical takeaway
DeepSeek V4’s long-context cost advantage comes from architecture-level KV cache compression. Per SemiAnalysis’s reading of the tech report, V4 walks away from the earlier Multi-head Latent Attention (MLA) and interleaves two new mechanisms. HCA (Heavily Compressed Attention) keeps a KV cache made of a sliding window of KV embeddings plus a set of compressed KV entries, each entry compressing the key/value across several tokens into one (m′ = 128 for V4 Pro). CSA (Compressed Sparse Attention) uses the same compression technique at a lower rate (m = 4) and applies sparse attention over the compressed entries using a lightning indexer to select which tokens to attend to—a sparse-attention lineage inherited from DeepSeek v3.2. Interleaving the two yields roughly a 50x KV cache reduction at 1M-token context length.
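To get a feel for the two compression rates, the following sketch counts KV entries per layer at a 1M-token context. It counts entries only: per-entry sizes, the ratio of CSA to HCA layers, and the baseline behind the roughly 50x figure are not given here, so the sketch does not try to reproduce that number. The sliding-window length is a hypothetical placeholder.

```python
CONTEXT_LEN = 1_000_000  # 1M-token context, from the text
CSA_RATE = 4             # m, from the text
HCA_RATE = 128           # m', from the text
HCA_WINDOW = 128         # hypothetical sliding-window length; not given in the text


def compressed_entries(context_len: int, rate: int) -> int:
    """Number of complete compressed KV entries, one per `rate` tokens."""
    if rate <= 0:
        raise ValueError("compression rate must be positive")
    return context_len // rate


def csa_layer_entries(context_len: int) -> int:
    """Compressed entries a CSA layer holds (before sparse selection)."""
    return compressed_entries(context_len, CSA_RATE)


def hca_layer_entries(context_len: int) -> int:
    """Sliding-window entries plus compressed entries for an HCA layer."""
    window = min(HCA_WINDOW, context_len)
    return window + compressed_entries(context_len, HCA_RATE)


print(f"uncompressed: {CONTEXT_LEN:,} entries per layer")
print(f"CSA layer:    {csa_layer_entries(CONTEXT_LEN):,} entries")
print(f"HCA layer:    {hca_layer_entries(CONTEXT_LEN):,} entries")
```

Note that CSA's lightning indexer reduces how many of those entries are attended to per step, not how many are stored, so the counts above describe cache size rather than attention compute.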
That is precisely the cost lever for long-context inference: the KV cache is usually the dominant consumer of memory and bandwidth, so cutting it 50x means the same hardware can serve far longer contexts—or the same context can run on much cheaper hardware. On the same theme, a research-community paper (arXiv 2606.09079, FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention) discusses ultra-long context from the angle of a lightning index and lookahead sparse attention; its specific method and numbers are not in the source material at hand, so it serves only as a directional pointer here, with no conclusions drawn.
The cost is worth stating plainly: this novel KV structure imposes a real engineering burden on serving frameworks. Per SemiAnalysis, because CSA and HCA entries differ in size, vLLM’s KV cache allocator needs a logical block size compatible with both compression rates, plus a bucketing strategy to avoid fragmentation. The best “out of the box” economics therefore depend on the framework filling in that support, another concrete footnote to the “leave the ecosystem a catch-up window” point above.
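One way to read the block-size constraint, assuming a block is counted in tokens and must hold a whole number of compressed entries for each layer type, is that the block length has to be a common multiple of both compression rates. The sketch below illustrates that reading only; it is not vLLM's actual allocator, and the function names and the minimum block length are hypothetical.

```python
import math

CSA_RATE = 4    # m, from the text
HCA_RATE = 128  # m', from the text


def logical_block_tokens(rates: list[int], min_tokens: int) -> int:
    """Smallest token count >= min_tokens that every rate divides evenly."""
    base = math.lcm(*rates)  # requires Python 3.9+
    blocks = max(1, math.ceil(min_tokens / base))
    return base * blocks


def entries_per_block(block_tokens: int, rate: int) -> int:
    """Compressed entries one block holds for a layer with this rate."""
    if block_tokens % rate != 0:
        raise ValueError("block would hold a fractional entry")
    return block_tokens // rate


# Hypothetical minimum block length of 256 tokens, for illustration only.
block = logical_block_tokens([CSA_RATE, HCA_RATE], min_tokens=256)
print(f"logical block: {block} tokens")
print(f"CSA entries per block: {entries_per_block(block, CSA_RATE)}")
print(f"HCA entries per block: {entries_per_block(block, HCA_RATE)}")
```

The entry counts per block differ by the ratio of the two rates (32x here), which hints at why per-layer-type bucketing is needed to keep blocks of different physical sizes from fragmenting memory.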
What to ignore
First, ignore the “DeepSeek V4 crushes some closed SOTA” capability-benchmark headlines. The SemiAnalysis analysis never plays the leaderboard game, and this piece has no source-backed benchmark to cite. Betting the model’s value on “who scores higher” misses the actual signal: cost and deployability.
Second, do not transplant the GB300’s $0.156-per-million-token curve onto your own deployment. That figure is for a rack-scale NVLink domain, with MTP on, at a specific input/output length and interactivity tier; switch to an 8-GPU island, change the interactivity, or change the stack, and the number moves substantially. It is only a best-case reference for “how cheap this model can get on an optimal system,” and you have to recompute your real cost against your own deployment conditions.
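Recomputing the figure for your own deployment is mostly unit conversion. The sketch below is a minimal version under stated assumptions: the GPU-hour price, GPU count, and concurrency are placeholder inputs, not quoted prices or measured results, and it ignores utilization gaps, input-token cost, and anything other than hourly hardware cost.

```python
def system_output_tok_s(concurrent_users: int, tok_s_per_user: float) -> float:
    """Aggregate output throughput when every user streams at the same rate."""
    return concurrent_users * tok_s_per_user


def cost_per_million_output_tokens(gpu_hour_usd: float,
                                   num_gpus: int,
                                   output_tok_s: float) -> float:
    """USD per million output tokens for a deployment running at full load."""
    if output_tok_s <= 0:
        raise ValueError("throughput must be positive")
    usd_per_second = gpu_hour_usd * num_gpus / 3600.0
    return usd_per_second / output_tok_s * 1_000_000


# Placeholder inputs for illustration only; substitute your own measurements.
throughput = system_output_tok_s(concurrent_users=200, tok_s_per_user=50.0)
cost = cost_per_million_output_tokens(gpu_hour_usd=3.00,
                                      num_gpus=8,
                                      output_tok_s=throughput)

print(f"aggregate throughput: {throughput:,.0f} tokens/s")
print(f"cost: ${cost:.3f} per million output tokens")
```

The useful part is the sensitivity: halve the concurrency the system can sustain at your target interactivity and the cost per token doubles, which is exactly why a rack-scale figure does not carry over to an 8-GPU island.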
Third, do not let “Day 0 full-stack support” mislead you into “any hardware is economical.” AMD’s launch-day “usable” meant 1–2 tokens per user per second; it became genuinely usable only after a 100x-plus improvement, roughly a month of engineering later. Treating a launch-day support status as evidence of long-term economics is the easiest trap to fall into here.