Two benchmarks disagree about the AI model in your phone.
On one, the best model you can run on a phone is doing fine, a short distance behind the frontier (the big models that run in datacenters). On the other, it is falling further behind. Both are real. Which one you trust decides whether on-device AI looks like a quiet success or a dead end.
Start with what fits. A flagship phone has 8GB of RAM (every iPhone 16) or 12GB (Galaxy S25, Pixel 10). After the operating system takes its share, that leaves room for a model of roughly 3 to 4 billion parameters, quantized, meaning compressed, to a couple of gigabytes. To make it fit, Apple compresses its on-device model to about 2 bits per weight. That ceiling has barely moved in two years, and it turns out to be most of the story.
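The memory arithmetic behind that ceiling is simple enough to sketch. The snippet below is a back-of-the-envelope estimate, assuming the weights dominate the footprint and ignoring the KV cache, activations, and runtime overhead. The function name `weights_gb` and the 16-bit and 4-bit rows are illustrative assumptions, not figures from any vendor.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Illustrative combinations: unquantized 16-bit, common 4-bit, and the ~2-bit case.
for params, bits in [(3, 16), (3, 4), (4, 4), (3, 2)]:
    size = weights_gb(params, bits)
    print(f"{params}B params at {bits} bits/weight: ~{size:.2f} GB")
```

At 16 bits per weight a 3B model needs about 6 GB for the weights alone, which is why nothing in this class ships unquantized on an 8GB phone.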
Scoreboard One: Looking Fine
On Chatbot Arena, where people vote on which answer they prefer, small models hold up. On a current snapshot the top model, GPT-5.5, sits at 1506 Elo, the same kind of rating chess uses. A phone-class model, Gemma 3 4B, sits at 1293, 213 points back. On this scale that is the gap between a strong player and a clearly weaker one, not a different sport. In mid-2024 the gap was about 139 points. It widened, but the phone stayed in the game.
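To put those rating gaps in concrete terms, a gap can be converted into an expected head-to-head preference rate. This is a rough sketch that assumes the leaderboard scale behaves like standard chess Elo (a logistic curve with a 400-point scale) and ignores ties; the leaderboard's own fitting procedure may differ, so treat the output as an approximation. The name `expected_win_rate` is made up for this example.

```python
def expected_win_rate(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Gaps taken from the text: about 139 points in mid-2024, 1506 - 1293 now.
for label, gap in [("mid-2024", 139), ("current snapshot", 1506 - 1293)]:
    rate = expected_win_rate(gap)
    print(f"{label}: gap {gap} -> top model preferred ~{rate:.0%} of the time")
```

Under those assumptions the frontier model would be preferred roughly 69 percent of the time at a 139-point gap and roughly 77 percent at 213. The phone-class model loses most matchups, but not nearly all of them.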
That is the comforting number, and device makers like it. It is also measuring the wrong thing. Arena rewards tone, formatting, and helpfulness, and a 3B model distilled from a larger one writes a clean, friendly paragraph. It sounds about as good.
Arena is also barely measuring phones at all. The actual phone-class models, Apple's on-device model and Google's Gemini Nano, are not on Arena, because they are not deployed for public voting. The most-shipped on-device models in the world have no public score. Gemma 3 4B is a stand-in.
Scoreboard Two: Falling Behind
MMLU-Pro is the other scoreboard, and on it the picture inverts. It is a wide, hard exam, scored out of 100, across law and physics and medicine, where the model has to reason through multi-step problems rather than just sound good. Ten answer options instead of four. Chain-of-thought, letting the model reason through steps before it answers, actually helps here: about 19 points for GPT-4o, where on the older, easier MMLU it did not. It is the test that did not saturate, the one where scores still have room to move.
In mid-2024 the best phone-class model, Microsoft's Phi-3.5-mini, scored 47.4 against GPT-4o's 72.6. By early 2025 the best had reached 52.8 (Phi-4-mini), and it has not meaningfully improved since. The frontier, meanwhile, went from 72.6 to about 90 (Gemini 3 Pro). Five points for the phone, seventeen for the frontier. The gap widened from about 25 points to about 37.
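The arithmetic in that paragraph can be checked directly. The sketch below uses only the scores quoted above and treats the frontier's "about 90" as exactly 90.0, which is an assumption; the dictionary layout and names are illustrative.

```python
# MMLU-Pro scores quoted in the text (frontier "about 90" is taken as 90.0).
scores = {
    "mid-2024": {"phone": 47.4, "frontier": 72.6},  # Phi-3.5-mini vs GPT-4o
    "latest": {"phone": 52.8, "frontier": 90.0},    # Phi-4-mini vs Gemini 3 Pro
}


def gap(snapshot: dict) -> float:
    """Frontier score minus phone-class score, rounded to one decimal place."""
    return round(snapshot["frontier"] - snapshot["phone"], 1)


phone_gain = round(scores["latest"]["phone"] - scores["mid-2024"]["phone"], 1)
frontier_gain = round(scores["latest"]["frontier"] - scores["mid-2024"]["frontier"], 1)

print(f"phone gain: {phone_gain}, frontier gain: {frontier_gain}")
print(f"gap: {gap(scores['mid-2024'])} -> {gap(scores['latest'])}")
```

The frontier gained about three times as many points as the phone tier over the same period, which accounts for the twelve-point widening.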
Of the two scoreboards, this is the one where the divergence is unmistakable. And the field is worse than its best example suggests: phone models built for general use did not even hold the 2024 line. Gemma 3 4B, designed to run on phones, scores 43.6, below Phi-3.5-mini's 47.4 from a year earlier. The phone tier's progress is carried by a few reasoning-tuned outliers, not the field.
Why They Disagree
Same tier, two scoreboards, opposite verdicts. Arena measures the easy work: a pleasant, helpful answer, which a 3B model gives. MMLU-Pro measures the multi-step reasoning that the frontier's spending actually buys, which a 3B model cannot fake. This is the floor-versus-peak split from "build for the floor," applied to a whole tier rather than one model. The floor is rising slowly. The peak is pulling away. Arena only sees the floor.
Why It Is Happening
This is an investment story before it is a hardware one.
The frontier outspends the phone by orders of magnitude, and the spend compounds. The cost of a single frontier training run has gone from roughly $40 million for GPT-4 toward a projected $1 billion by 2027, on tens of thousands of GPUs. A phone-class model gets a rounding error of that: Apple's on-device model was trained on a single 2,048-chip slice, the earlier Gemma 2 2B on 512 chips. I covered why that frontier compute is now capacity-constrained in "the compute moat." And phone models are mostly derivatives, distilled or pruned down from a larger parent (Llama 3.2's 1B and 3B from Llama 3.1, Apple's from a bigger internal model). By construction they follow the frontier at a lag and a discount. They cannot lead it.
Reasoning made it worse. The recent frontier jump on MMLU-Pro came mostly from letting the model think in long chains before it answers, spending more compute at inference time. On a single hard question, that thinking can cost anywhere from $5 to well over $1,000. That is a lever priced for the datacenter, not the phone.
And the hardware is moving the wrong way. Flagship phone RAM has been stuck at 8 to 12GB for years, and the same shortage that constrains the datacenter is now squeezing phones: mobile DRAM contract prices, what phone makers pay for memory, rose around 90 percent in early 2026 as suppliers shifted capacity to the high-bandwidth memory that AI servers need. The demand that built the datacenters is pushing the local ceiling down.
The Honest Part
There is a way to make the gap shrink. Let the phone model think too: turn on a long chain of thought and a 4B model narrows the gap from about 37 points to somewhere between 20 and 30 on the same test. But that is the one move a phone is worst suited for. Long chains of thought mean many slow tokens, generated one at a time. Many slow tokens mean latency, heat, and a drained battery. The phone can rent some of the frontier's reasoning, but only by spending what it has least of. And those scores are measured on uncompressed weights. The model actually on your phone runs at about 2 bits per weight, so the real gap is likely wider than the benchmark says.
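The latency cost of a long chain of thought follows from tokens being generated one at a time. The sketch below makes that concrete with deliberately hypothetical inputs: the decode speed and token counts are placeholders chosen for illustration, not measurements of any phone or model, and `decode_seconds` is a made-up helper.

```python
def decode_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` sequentially at a fixed decode speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Hypothetical inputs for illustration only, not measurements of any device.
SPEED = 20.0             # assumed on-device decode speed, tokens per second
ANSWER_TOKENS = 200      # assumed length of a direct answer
THINKING_TOKENS = 8_000  # assumed length of a long reasoning chain

direct = decode_seconds(ANSWER_TOKENS, SPEED)
with_reasoning = decode_seconds(THINKING_TOKENS + ANSWER_TOKENS, SPEED)

print(f"direct answer: {direct:.0f} s")
print(f"with reasoning chain: {with_reasoning:.0f} s")
```

Whatever the real numbers are on a given device, wait time scales linearly with the length of the chain, and the processor is busy, drawing power and producing heat, for that whole stretch.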
What the Vendors Already Decided
The clearest evidence is in how Apple built its system. The roughly 3B model on the phone handles the easy work: summarizing notifications, rewriting text, calling tools. Anything it judges complex gets routed off the device, to a larger model in Apple's private cloud. Apple did not bet that the phone would catch up. It ships a small model for the floor and sends the hard questions to the datacenter.
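The shape of that split can be sketched as a simple router. This is a toy illustration, not Apple's actual routing logic, which is not public beyond the description above. The task names, the `Request` type, and the `needs_multistep_reasoning` flag are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical labels for the "floor" tasks named in the text.
FLOOR_TASKS = {"summarize_notifications", "rewrite_text", "call_tool"}


@dataclass
class Request:
    task: str
    needs_multistep_reasoning: bool = False  # assumed signal, for illustration


def route(request: Request) -> str:
    """Keep floor work on the device; send everything else to the cloud."""
    if request.task in FLOOR_TASKS and not request.needs_multistep_reasoning:
        return "on_device"
    return "private_cloud"


print(route(Request("rewrite_text")))
print(route(Request("plan_trip", needs_multistep_reasoning=True)))
```

The design choice worth noting is the default. Anything not explicitly recognized as floor work goes off the device, so the small model is never asked to attempt what it cannot do.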
The phone gets the floor. The cloud keeps the peak, and the peak is where the gap is widening. That split is the decision every product roadmap has to make. A feature that runs on the device has to be floor work: summarization, autocomplete, voice, the things a small model does well at zero marginal cost and with full privacy. The reasoning the frontier is racing toward is not floor work, and it is not coming to your pocket. The investment is not pointed there, and neither is the memory.