"Which model should we use?" It's one of the most common questions I get, and the most honestly answered by: it depends on what you're actually doing. The uncomfortable follow-up is that most people asking the question are comparing models on benchmark scores, which are about as useful for production decisions as judging a car on its top speed.
MMLU, HumanEval, MATH — these benchmarks are valid scientific measurements. They're just measuring something different from what you need when you're deploying an agent that processes supplier invoices at 3am, or a support bot that needs to handle an angry customer asking about a billing dispute in three languages, or an extraction pipeline that needs to pull specific fields from 40 different PDF formats reliably.
This article is about what we've found from actual production deployments. The observations are from projects where we've had the data to compare: accuracy on domain-specific tasks, per-call cost at volume, latency distributions under realistic load, and failure modes under edge case inputs.
## Why benchmark scores mislead production decisions
Benchmarks test the model in isolation on clean, standard problems. Production systems have noise. Here are the differences that reliably matter more than overall benchmark scores:
- Instruction following consistency. How reliably does the model return structured output (JSON, specific field names, constrained formatting) across thousands of calls? A model that gets output format right 97% of the time vs 99.5% of the time is the difference between 30 errors per 1,000 calls and 5 — which has a massive downstream impact on a pipeline that needs to parse the output.
- Graceful degradation on edge cases. What happens when the input is ambiguous, poorly formatted, or genuinely unclear? Does the model flag uncertainty, make a best-effort attempt, or confabulate confidently? Confident wrong answers are worse than honest "I'm not sure" responses in most production contexts.
- Cost at your volume. At 10,000 calls/day, a $1-per-million-token price difference isn't academic. Context window usage, output length control, and whether you can batch effectively all feed into the actual bill.
- Latency under load. Average latency is almost useless. You want p95 and p99. The 99th percentile user in your support workflow is waiting how long?
- API reliability. All three major providers have had outages. But they differ significantly in how quickly they resolve incidents, what their rate limits are at various pricing tiers, and whether there's a fallback path (Azure OpenAI, Bedrock, etc.) when the primary endpoint is degraded.
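The structured-output point above is worth making concrete. Below is a minimal sketch of the validate-and-retry loop we'd put around any model call that must return parseable JSON. The `call_model` argument is a hypothetical stand-in for your provider SDK, and the required fields are illustrative:

```python
import json

REQUIRED_FIELDS = {"invoice_number", "total", "currency"}  # illustrative schema


def parse_or_none(raw: str):
    """Return the parsed dict if it is valid JSON with the expected fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data


def call_with_retry(call_model, prompt: str, max_attempts: int = 3):
    """call_model is any function str -> str; retry until the output parses cleanly."""
    for _ in range(max_attempts):
        result = parse_or_none(call_model(prompt))
        if result is not None:
            return result
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

With a 97%-reliable model and three attempts, the failure rate drops from 30 per 1,000 calls to well under 1 — which is exactly why format reliability compounds.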
## The comparison: at a glance
| Dimension | GPT-4o | Claude 3.5 Sonnet | Gemini 2.0 Flash |
|---|---|---|---|
| Structured output / JSON | Excellent (native JSON mode) | Very good (strong format adherence) | Good (occasional schema drift) |
| Document understanding | Excellent (vision + text) | Excellent (best long-doc reasoning) | Very good (1M-token context) |
| Instruction following | Very reliable | Best in class | Moderate (occasional role drift) |
| Latency (typical agent call) | 1.2–2.5s avg | 1.4–3.0s avg | 0.4–1.2s avg |
| Cost (input, per 1M tokens) | $2.50 | $3.00 | $0.075 |
| Context window | 128K tokens | 200K tokens | 1M tokens |
| API reliability / fallback | Azure OpenAI fallback available | AWS Bedrock available | GCP native only |
| Function / tool calling | Excellent (most consistent) | Very good | Good (occasionally over-calls) |
Prices are approximate as of early 2026 and shift often — always check the current provider pricing for your exact use case.
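The latency figures in the table are averages, and averages hide the tail that your users actually feel. Measuring p95/p99 over your own sampled call durations takes a few lines (the timings below are illustrative, not measurements from any provider):

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100]."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, math.ceil(p / 100 * len(ranked)) - 1))
    return ranked[k]


# Hypothetical per-call latencies in seconds from a load test
latencies = [0.8, 1.1, 1.2, 1.3, 1.4, 1.6, 1.9, 2.4, 3.8, 7.2]

print(f"avg={sum(latencies) / len(latencies):.2f}s  "
      f"p95={percentile(latencies, 95)}s  p99={percentile(latencies, 99)}s")
```

On this sample the average is 2.27s but the p95 is 7.2s — the gap between those two numbers is what your 99th-percentile support customer experiences.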
## Document processing pipelines
This is the use case that shows up most often in our work: extracting structured data from PDFs, contracts, invoices, and forms. The inputs are messy — scanned documents, mixed formats, tables with irregular layouts, handwritten annotations alongside typed text.
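Whichever model does the extraction, the raw output still needs normalisation before it reaches downstream systems — amounts arrive as "£1,234.56", dates in a dozen formats. A small sketch of that normalisation layer (field formats and the supported date patterns are illustrative; extend the list for your own inputs):

```python
import re
from datetime import datetime


def normalise_amount(raw: str) -> float:
    """Strip currency symbols and thousands separators: '£1,234.56' -> 1234.56."""
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    return float(cleaned)


def normalise_date(raw: str) -> str:
    """Try common formats in order and return ISO 8601; raise if none match.
    Ambiguous numeric dates resolve day-first here; reorder for US-style inputs."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d %b %Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")
```

A normalisation failure here is also a useful signal: in our experience, fields the normaliser rejects correlate strongly with extraction errors, so they're good candidates for a second-pass review.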
## Customer support agent systems
This is multi-turn: the agent needs to understand context from conversation history, reference product documentation, handle queries in multiple languages, and decide when to escalate to a human without being overly trigger-happy about it.
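One pattern that helps regardless of which model you pick: make the escalation decision explicit code checked after every turn, rather than leaving it entirely to the model's judgment. A hedged sketch — the signals and thresholds below are illustrative and need tuning against your own transcripts:

```python
def should_escalate(turn_count: int, model_confidence: float,
                    topic: str, sentiment: float) -> bool:
    """Explicit escalation policy, evaluated after each agent turn.
    All thresholds are illustrative; tune them against real conversations."""
    if topic in {"billing_dispute", "legal", "data_deletion"}:
        return True   # policy-sensitive topics always go to a human
    if model_confidence < 0.6:
        return True   # the model itself flagged uncertainty
    if sentiment < -0.5 and turn_count >= 3:
        return True   # sustained customer frustration
    return False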
## Structured data extraction at scale
Pure extraction — pulling specific fields from text, normalising them, and outputting a clean JSON structure. The kind of thing that runs millions of times and needs high reliability and low cost.
For high-volume extraction with well-defined schemas, Gemini 2.0 Flash at $0.075/1M input tokens handles what would cost $2.50/1M with GPT-4o. At 50M input tokens/month that's roughly $4 vs $125, and the gap scales linearly with volume. The accuracy tradeoff is real but often acceptable.
The practical approach we've landed on for several clients: use Gemini Flash as the first pass at scale, and route low-confidence outputs (where the model flags uncertainty or the downstream validation fails) to GPT-4o for a second pass. You pay GPT-4o prices on maybe 8–12% of volume, everything else gets processed at Flash prices. Total cost is close to Flash cost. Accuracy is close to GPT-4o accuracy. You have to build the routing logic, but it's worth it at volume.
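The two-tier routing described above is simple to sketch. Here `call_flash` and `call_gpt4o` are hypothetical stand-ins for your provider clients, and `validate` is whatever check your pipeline already runs downstream:

```python
def route_extraction(document, call_flash, call_gpt4o, validate):
    """First pass on the cheap model; escalate to the expensive one only
    when validation fails or the model flags low confidence."""
    result = call_flash(document)
    if validate(result) and not result.get("low_confidence", False):
        return result, "flash"
    # Second pass: in our deployments this is roughly 8-12% of volume
    return call_gpt4o(document), "gpt-4o"
```

Logging the returned tier per document is worth the extra field: if the escalation rate drifts upward over time, that's an early warning that your input mix has changed.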
## The honest recommendation
There is no universally best model for production. Here's how to decide:
- Use GPT-4o as your default if you need a reliable, broadly capable model and don't have strong reasons to deviate. The tooling ecosystem (structured output mode, function calling, Azure fallback) is the most mature.
- Switch to Claude 3.5 Sonnet when document reasoning quality or nuanced instruction following is the bottleneck. If you're seeing quality issues with GPT-4o on complex analytical tasks, test Claude — the difference is sometimes significant.
- Use Gemini 2.0 Flash when you have high volume and cost is a constraint, and your task is well-defined enough that you can validate output quality systematically. Build a fallback path to a stronger model.
- For anything latency-sensitive (live chat, real-time suggestions, interactive tools), Gemini Flash's speed advantage is real and should be weighted heavily.
## The thing that matters more than model choice
I'd be doing you a disservice if I finished this article without saying: model selection is typically the fifth or sixth most important decision in a production AI system. Far ahead of it are: prompt design and how you handle context, data quality and pipeline reliability, error handling and fallback behaviour, evaluation methodology (how you actually know the system is working), and deployment infrastructure.
A well-designed system using GPT-4o will consistently outperform a poorly-designed system using whichever model wins the latest benchmark. The models are close enough in capability on most production tasks that system design is the differentiator. The wrong model choice costs you maybe 10–20% on a given metric. Poor system design costs you the entire thing.
That said: model costs are real, model quality differences are real, and choosing deliberately rather than defaulting to whatever is most familiar is worth the hour it takes to think it through.
## Trying to decide which model fits your pipeline?
Describe your use case and we can give you an honest read on the right tool — including when the answer is something other than a frontier LLM.