China's AI Models Just Overtook the US in Usage — And It's Not About Benchmarks

DeepSeek V3.2 will sell you a million output tokens for forty-two cents. Claude Opus will sell you the same million for seventy-five dollars.

That’s not a discount. That’s not even the same market. That’s roughly 170x, and once you’ve internalized that number, almost everything else about the “China overtook the US in AI usage” story stops being surprising. Developers are not philosophers. When the thing in front of them is good enough and costs two orders of magnitude less, they route the traffic and move on. The leaderboard drama is for the timeline. The invoice is what actually moves workloads.

I want to walk through what the numbers actually say, because the headline — Chinese models now process more tokens than US models — is both true and a lot less profound than the breathless version. It’s a real shift. It’s also mostly a story about price, and partly a story about a metric quietly measuring the wrong thing.

What actually happened

In the week of March 30 to April 5, 2026, OpenRouter — the routing layer that sits in front of 400+ models for 5+ million developers — processed about 27 trillion tokens. Chinese models accounted for 12.96 trillion of that, up 31.48% week-on-week. US models generated 3.03 trillion. That was the fifth consecutive week Chinese models came out ahead.

Sit with the ratio for a second: on a vendor-neutral router where developers pick whatever they want, Chinese models did more than 4x the token volume of US models. Not a squeaker. A rout.

Now the context that keeps me honest about it. OpenRouter’s own user base is 47.17% from the US and only 6.01% from China. This is not domestic Chinese traffic inflating a domestic Chinese product. The people reaching for DeepSeek, Qwen, Kimi, MiniMax, and GLM are, in the plurality, American developers making a cost-and-capability call on a neutral platform. That detail is the whole story. The overtaking didn’t happen because of a captive home market. It happened because the global developer population voted, and a lot of those voters hold US passports.

Zoom out from OpenRouter and the scale gets harder to picture. China’s National Data Administration — Liu Liehong, speaking at the China Development Forum in March 2026 — put the country’s daily token consumption above 140 trillion, a thousandfold increase from early 2024. OpenRouter, for comparison, moves about 3 trillion tokens a day total, with Chinese models contributing roughly 1 trillion of that. OpenRouter is a rounding error against China’s domestic firehose. Hold onto that 140-trillion number, because later in this post it turns into the most important caveat in the whole story.

The cost story

Here’s the part that explains the behavior. Lay the prices side by side and the “why” answers itself.

ComparisonChinese modelUS modelGap
Output tokens ($/1M)DeepSeek V3.2 — 0.42Claude Opus — 75.00~170x
Blended ($/1M)Kimi K2 — 1.07Gemini 3 Pro — 4.5076% cheaper
Mid-tier ($/1M)MiniMax M2.5 — 1.10Claude Sonnet — 15.00~14x

These are not rounding-error differences you can wave away with “well, quality costs money.” A 170x gap is a different category of decision. At that spread, you don’t reach for the cheaper model when budgets are tight — you reach for the expensive one only when you can write down, specifically, why the extra 169x is buying you something this particular workload needs.

And the macro picture rhymes with the micro one. Across 2023–2025, Chinese labs reportedly hit 90% of US frontier performance at 82% lower capital expenditure — $124B against $694B. The single most-cited data point: DeepSeek R1 was trained for around $294,000 and landed at GPT-4 Turbo-level capability, against training-cost estimates north of $100M for GPT-4. Even if you haircut those numbers for vendor optimism and accounting games — and you should — the conclusion survives: somebody figured out how to get most of the capability for a sliver of the spend.

This is the same gravity I wrote about when Hetzner stopped being the cheap default. Infrastructure economics don’t care about your loyalties. When the cost curve bends hard enough, behavior follows whether the incumbents like it or not — and right now the cost curve for “good enough” inference is bending straight at the Chinese open-weights labs.

The comparison that matters

Price is only half the decision. Here’s the fuller shape of what you’re actually choosing between in mid-2026.

DimensionChinese frontier (DeepSeek / Qwen / GLM / Kimi)US frontier (Claude / Gemini / GPT)
Output cost$0.42–1.10 / 1M$15–75 / 1M
Performance vs SOTA~90% of frontierthe frontier
Open-weightsMostly yes (MIT / permissive)Mostly no (closed API)
Self-hostableYesNo
Best atStructured, repetitive, moderate-complexity workHard reasoning, long-context fidelity

That open-weights column is doing quiet, heavy lifting. A closed API is a rental agreement. Open weights under a permissive license are an asset — you can audit them, pin a version, run them on your own silicon, and keep running them the day the vendor changes terms. I made this point at length about GLM-5.2: the genuinely new thing isn’t that an open model got good, it’s that open-weights and competitive stopped being mutually exclusive. The usage numbers in this post are what that combination looks like once it escapes the benchmark threads and hits real traffic.

Where it shines

I don’t run AI in production on faith, so let me be specific about where the cheap-and-Chinese option earns its keep.

Structured, repetitive work at moderate complexity. Classification, extraction, transformation, summarization, schema-shaped output, the ten-thousand-row batch job — this is the bread and butter of real AI systems, and it’s exactly where the capability gap with the frontier narrows to almost nothing while the cost gap stays enormous. If your task fits in a JSON schema and runs a million times, paying frontier prices for it is just lighting money on fire.

Agentic loops where volume dominates. Programming-related tasks went from 11% of AI usage in early 2024 to over 50% by late 2025, per the OpenRouter/a16z study, and agent-driven automated workflows now generate more than half of all output tokens. Agents are token gluttons — they read, plan, retry, and generate far more than a human in the loop ever would. When the bottleneck is token volume, a 170x price gap is the entire ballgame.

Adoption has already voted. Around 80% of US AI startups using open-source models picked Chinese ones in 2025. Airbnb has been public about leaning on Alibaba’s Qwen. Qwen alone has spawned 180,000+ derivative models and 600+ million downloads. On the OpenCompass leaderboard in October 2025, 14 of the top 20 were Chinese and nine were open-source — with zero open-source US models in the top 14. This isn’t a fringe preference anymore; it’s where a large slice of the builder population already lives.

Where it doesn’t shine

Now the section the celebratory threads skip.

The hard 10% is still the hard 10%. “90% of frontier performance” is a real and impressive number, and it’s also doing a magic trick. The last 10% — genuinely novel reasoning, holding a complicated thread across a long context without quietly losing the plot, the output quality that separates “technically correct” from “actually right” — is precisely the part you’re paying frontier prices for. For the gnarly debugging session, the architecture decision, the 3 a.m. incident where being subtly wrong is worse than being slow, I still reach for the model that’s best, not the one that’s cheapest. The brief honest version of the limitation, straight from the usage research: Chinese models trail on contextual understanding, complex reasoning, and output quality.

Cheap changes how you spend, not whether you think. A 170x price cut makes it tempting to throw tokens at problems instead of designing around them. That’s the same trap as treating any cheap resource as free. The win is real, but it rewards teams who route deliberately — cheap model for the volume, expensive model for the 10% that needs it — not teams who just swap the base URL and stop thinking.

Governance doesn’t disappear because the weights are open. A frontier-competitive model out of a Chinese lab reads very differently to a compliance team in a regulated industry than it does to a solo developer on a side project. Open weights answer the data-residency question — you can self-host, audit, and keep the data on your network, which is more than a closed API offers. They don’t answer every question your legal team will raise. That’s a real conversation, not a dismissal.

The token volume question

Here’s the part I think most of the coverage gets dangerously wrong, and it’s worth slowing down for.

Tokens consumed is not a measure of intelligence. It’s a measure of tokens consumed.

China’s 140-trillion-tokens-a-day number is the headline that makes the “China is winning AI” story feel overwhelming. But look at where it comes from. ByteDance’s Doubao model alone consumes over 120 trillion tokens per day — the lion’s share of the national total. And a large chunk of that is feeding consumer products: recommendation, content generation, video. Generating frames of short-form video burns staggering token volume. It is not the same thing as advancing the frontier, and it is definitely not the same thing as a developer choosing a model for serious work on a neutral platform.

This is Goodhart’s Law in its natural habitat: when a measure becomes a target, it stops being a good measure. “Tokens processed” started as a rough proxy for adoption. The moment it became the scoreboard everyone cites, it stopped telling you about quality and started telling you about which firehose is pointed at consumer apps. A model that talks itself in circles before answering burns more tokens than one that answers cleanly — and on a pure-volume scoreboard, the chattier, less efficient model looks like it’s winning. I saw exactly this up close with GLM-5.2: it produced roughly 10GB of reasoning trace where a tighter model produced about 1GB for comparable output. More tokens. Not more value.

So hold two true things at once. Chinese models genuinely overtook US models in developer-chosen token volume on a neutral router — that’s the OpenRouter result, and it’s meaningful because of who is choosing. And the eye-popping 140-trillion national number is mostly video and consumer scale, not a frontier-capability claim. Conflate the two and you’ve been Goodharted. The signal worth tracking is the first one. The second is mostly exhaust.

What this means for operators

If you run AI in production — or you’re about to — here’s how I’d actually metabolize all of this.

Route by task, not by team allegiance. The single highest-leverage architectural decision in an AI system right now is a router that sends the boring 90% to a cheap model and reserves the expensive frontier model for the 10% that demonstrably needs it. At a 170x spread, getting that split right is the difference between a sane bill and a board meeting. This is the same instinct I keep coming back to with the coding-model field: the best model is whichever one you actually use for the job in front of you, and “the job in front of you” usually isn’t the hardest one.

Open weights make this a procurement decision, not a brand loyalty one. When the cheap option is also self-hostable under a permissive license, “cheaper” stops being the only argument. You also get version pinning, auditability, and data that never leaves your network. The operational gap that used to make this painful — weights on Hugging Face versus running in your app tomorrow — is the gap that Ollama quietly closed. Pull the model, point your app at one base URL, done. That’s what makes the cost story actionable instead of theoretical.

Don’t run AI on vibes from a leaderboard. Benchmarks are contaminated and contested, vendor numbers are vendor numbers, and “tokens processed” measures the wrong thing. Test the specific models on your workload, measure cost-per-successful-task rather than cost-per-token, and re-run that test every quarter because the field turns over fast. China’s geopolitical AI position doesn’t change which model resolves your support tickets most cheaply. Your own eval harness does.

The frontier still matters — for less of your traffic than you think. None of this says the expensive US models are obsolete. It says the share of your workload that genuinely needs them is smaller than the defaults assume. The cheap models grew from 1.2% to 30% of global usage in twelve months — a 2,400% jump — not because they won an argument, but because builders kept discovering that more of their work fit in the cheap tier than they’d assumed. That discovery is the trend. Plan your stack as if it continues.