Apr 19, 2026
Frontier-model tracker, Q2 2026: GPT-5, Claude 4.7, Gemini 3
Three labs, three releases, three different bets on what the next year of AI looks like. Here's what each model is actually good at, and which one I reach for when.
Every quarter the frontier shifts, and every quarter the same handful of mid-budget AI commentators rank the models on a single benchmark, declare a winner, and move on. The reality is more useful than a leaderboard: each lab is now optimizing for a different shape of work. If you treat the models as interchangeable, you're paying more and getting less.
GPT-5: the consumer surface, sharpened
OpenAI's bet is that the average person doesn't want to pick a model. GPT-5 routes silently — small tasks to a fast tier, hard ones to a deeper tier — and most users won't notice. The result is the most fluent conversational model on the market and the one I'd hand to anyone who isn't a technical buyer.
Where it's strongest: writing, casual reasoning, multimodal interactions where a person is in the loop. Where it's weakest: long deterministic engineering work. Code-heavy sessions still drift on it more than they do on Claude or Codex.
Claude 4.7: the operator's model
Anthropic optimized in the opposite direction. Claude 4.7 is the model I trust to do real work while I'm not watching: long sessions, tool use, file edits, and multi-step coding tasks where you want it to recover from its own mistakes. The 1M-token window is now the default, not a beta perk, and that changes what's tractable.
I drive my whole studio off Claude. It's not the most charming chat partner. It's the one that finishes the job. If you're billing for outcomes, that's the trade you want.
Gemini 3: the mass-input model
Google is leaning into context that no one else can match: video, audio, gigantic mixed-media inputs. Gemini 3 Pro is the only frontier model where I can drop in a forty-five-minute Loom recording and get useful structured output without a transcription step. For research, for legal review, for any work where the input is large and heterogeneous, it's the strongest tool.
Where it lags: agentic loops and long coding sessions. Gemini is best inside a single, large turn. The other two are better across many small ones.
Which one I reach for
- Building a feature, fixing a bug, reviewing code, writing tests: Claude.
- Drafting an email to a client, writing a brief, talking through a design problem: GPT-5.
- Watching a video, reading a 200-page document, ingesting a recording: Gemini 3.
- Long-running, unattended workflow that has to recover from its own errors: Claude.
- Anything that has to feel friendly to a non-technical user: GPT-5.
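If you'd rather keep these defaults in a script than in your head, the list above can be written down as a plain lookup table. This is a minimal sketch, not a recommendation engine: the task labels, the `pick_model` helper, and the model-name strings are all hypothetical, chosen for illustration, and none of them are identifiers from any lab's API.

```python
# Hypothetical task labels mapped to the picks in the list above.
# The values are plain display names for illustration, not real API model IDs.
ROUTING = {
    "build_feature": "Claude",
    "fix_bug": "Claude",
    "code_review": "Claude",
    "write_tests": "Claude",
    "unattended_workflow": "Claude",
    "client_email": "GPT-5",
    "brief": "GPT-5",
    "design_discussion": "GPT-5",
    "non_technical_user": "GPT-5",
    "video": "Gemini 3",
    "long_document": "Gemini 3",
    "recording": "Gemini 3",
}


def pick_model(task: str) -> str:
    """Return the preferred model name for a task label.

    Raises KeyError for unknown labels rather than guessing a default.
    """
    return ROUTING[task]


if __name__ == "__main__":
    for task in ("fix_bug", "client_email", "long_document"):
        print(f"{task}: {pick_model(task)}")
```

Failing loudly on an unknown label is deliberate here: a silent fallback would hide exactly the "treat them as interchangeable" mistake the list is meant to prevent.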
What they all got better at this quarter
Three things improved across the board, regardless of which lab you talk to. Tool-use reliability climbed sharply — the rate at which a model invents an API call is now under one percent on all three. Long-context recall is genuinely solid up to about 800k tokens; you can paste in real codebases and trust it to find the relevant function. And refusal noise dropped, especially around legitimate technical and medical work.
What none of them solved
Real-world physical reasoning is still bad. Spatial layout in design tools is still bad. Anything involving timing in audio or video editing is still bad. And the models are all terrible at admitting when they don't know something — they confabulate confidently, just less often than before. Treat that last one as a permanent feature, not a bug to be fixed in the next release.
The prediction nobody asked for
By Q4, the differences between these three will narrow on benchmarks but widen on workflow. The model isn't the product anymore. The product is the harness: Claude Code, ChatGPT's agent mode, Gemini's workspace integrations. The lab that wins the next year is the one whose harness disappears most cleanly into how its users already work. That's the angle we take in the agent stack reset.