updated april 2026

there is no best model. there is only the right model for the job.

the LLM landscape has 6 major providers, dozens of variants, and pricing from $0.14 to $75 per million tokens. this page helps you stop guessing and start matching.


62% of production traffic should hit cheap models, 27% mid-tier, and only 11% frontier.

most teams overspend because they send everything to the same model.
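the split above implies real savings. a quick sketch, assuming 100M output tokens a month and illustrative per-M-output prices (representative of the tiers later on this page, not quotes):

```python
# illustrative: all-frontier vs routed by the 62/27/11 split.
SPLIT = {"cheap": 0.62, "mid": 0.27, "frontier": 0.11}
PRICE = {"cheap": 0.60, "mid": 15.00, "frontier": 25.00}  # $/M output tokens

def monthly_cost(m_tokens: float, split: dict) -> float:
    """cost of m_tokens (millions of output tokens) spread across tiers."""
    return sum(m_tokens * share * PRICE[tier] for tier, share in split.items())

all_frontier = 100 * PRICE["frontier"]            # 100M tokens, frontier only
routed = monthly_cost(100, SPLIT)                 # same volume, routed
print(round(all_frontier, 2), round(routed, 2))   # 2500.0 717.2
```

same volume, roughly 71% cheaper. the exact number depends entirely on your real prices and traffic mix.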


001 / the models

every model that matters right now

ranked by what they actually win at, not marketing benchmarks.

openai

GPT-5.2

the generalist. leads on structured reasoning, GPQA Diamond (93.2%), and AIME 2025 (100%). 400K input context.
400K context
$2.50 per M input
$10 per M output
best for: structured reasoning, math, general tasks
anthropic

Claude 4.5 Opus

the coder. first model to break 80% on SWE-bench Verified (80.9%). arena code elo 1548. the best at writing code that works.
200K context
$5 per M input
$25 per M output
best for: coding, nuanced writing, agentic tasks
anthropic

Claude Sonnet 4.5

the workhorse. 77.2% SWE-bench at a fraction of Opus cost. the default choice for most production coding pipelines.
200K context
$3 per M input
$15 per M output
best for: production coding at scale, balanced cost
google

Gemini 3 Deep Think

the abstract reasoner. leads ARC-AGI-2 (45.1%) and GPQA (94.3%). 1M+ context window. strongest on scientific benchmarks.
1M+ context
$1.25 per M input
$5 per M output
best for: science, abstract reasoning, long documents
google

Gemini 2.5 Flash

the cost killer. 10x cheaper on input than competitors. reasoning capabilities with 1M context. the budget pick that doesn't feel budget.
1M context
$0.15 per M input
$0.60 per M output
best for: high-volume, cost-sensitive workloads
xai

Grok 4

the dark horse. leads HLE benchmark (50.7%). 2M context on the Fast variant. aggressive pricing and surprisingly strong reasoning.
2M context (Fast)
$2 per M input
$8 per M output
best for: long context, hard reasoning, value
deepseek

DeepSeek-V3.2

the open-source contender. strong reasoning at $0.28/M output. the model that proved you don't need $25/M tokens to be competitive.
128K context
$0.14 per M input
$0.28 per M output
best for: budget reasoning, self-hosting, sovereignty
meta

Llama 4 Scout

the context monster. 10M token context window. fully open source. production-ready for enterprise with full data sovereignty.
10M context
free (self-host)
open weights
best for: massive context, on-prem, data sovereignty
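the cards above reduce to a small price table. a minimal sketch of per-request cost at list price (model keys are this page's names, shortened; Llama 4 Scout is omitted since self-hosting costs compute, not tokens):

```python
# (input, output) list prices in $/M tokens, from the cards above
PRICES = {
    "gpt-5.2":             (2.50, 10.00),
    "claude-4.5-opus":     (5.00, 25.00),
    "claude-sonnet-4.5":   (3.00, 15.00),
    "gemini-3-deep-think": (1.25, 5.00),
    "gemini-2.5-flash":    (0.15, 0.60),
    "grok-4":              (2.00, 8.00),
    "deepseek-v3.2":       (0.14, 0.28),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """dollar cost of one request at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# the same 10K-in / 2K-out request, two ways:
print(round(request_cost("claude-4.5-opus", 10_000, 2_000), 4))   # 0.1
print(round(request_cost("gemini-2.5-flash", 10_000, 2_000), 4))  # 0.0027
```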

002 / by task

what to use for what

the answer to "which model?" is always "for what?"

task                          best pick             runner-up             budget pick
code generation               Claude 4.5 Opus       Claude Sonnet 4.5     DeepSeek-V3.2
math / formal reasoning      GPT-5.2               Gemini 3 Deep Think   DeepSeek-R1
scientific research           Gemini 3 Deep Think   GPT-5.2               Gemini 2.5 Flash
long document analysis        Gemini 3 Pro (1M)     Llama 4 Scout (10M)   Gemini 2.5 Flash
creative writing              Claude 4.5 Opus       GPT-5.2               Llama 4
classification / extraction   Gemini 2.5 Flash      GPT-5 Mini            DeepSeek-V3.2
agentic workflows             Claude Sonnet 4.5     GPT-5.2               Grok 4.1 Fast
multimodal (image/video)      Gemini 3 Pro          GPT-5.2               Gemini 2.5 Flash
summarization                 Claude Sonnet 4.5     Gemini 2.5 Flash      DeepSeek-V3.2
structured JSON output        GPT-5.2               Claude Sonnet 4.5     Gemini 2.5 Flash
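in code, that table is just a lookup. a sketch with a few rows (keys match the table; the remaining rows are omitted here):

```python
# task → (best pick, runner-up, budget pick), copied from the table above
PICKS = {
    "code generation":             ("Claude 4.5 Opus", "Claude Sonnet 4.5", "DeepSeek-V3.2"),
    "math / formal reasoning":     ("GPT-5.2", "Gemini 3 Deep Think", "DeepSeek-R1"),
    "classification / extraction": ("Gemini 2.5 Flash", "GPT-5 Mini", "DeepSeek-V3.2"),
    # remaining rows omitted for brevity
}

def pick(task: str, budget_sensitive: bool = False) -> str:
    """answer 'which model?' with 'for what?'."""
    best, _runner_up, budget = PICKS[task]
    return budget if budget_sensitive else best

print(pick("code generation"))                          # Claude 4.5 Opus
print(pick("code generation", budget_sensitive=True))   # DeepSeek-V3.2
```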

003 / pricing reality

the three tiers of production AI

an arXiv study found that "cheaper" models can end up costing 28x more once token verbosity is counted. list price is not real price.

cheap tier

$0.14 - $0.60
per M output tokens. handles 62% of production traffic. classification, extraction, simple Q&A, routing decisions.
DeepSeek-V3.2 Gemini 2.5 Flash GPT-5 Mini Grok 4.1 Fast

mid tier

$3 - $15
per M output tokens. handles 27% of traffic. summarization, structured generation, moderate reasoning, coding.
Claude Sonnet 4.5 GPT-5.2 Gemini 3 Pro Grok 4

frontier tier

$25 - $75
per M output tokens. only 11% of traffic should ever touch this. complex multi-step reasoning, hard coding, research.
Claude 4.5 Opus GPT-5.2 Pro Gemini 3 Deep Think
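the verbosity trap is easy to model: you pay list price times the tokens a model actually emits, and chatty reasoning traces multiply that. a sketch with made-up token counts (the 100x verbosity factor is illustrative, not measured):

```python
def effective_cost(price_per_m: float, base_tokens: int, verbosity: float) -> float:
    """what you actually pay: list price x tokens the model really emits."""
    return price_per_m * base_tokens * verbosity / 1_000_000

# a terse frontier model vs a budget model that thinks out loud
frontier = effective_cost(25.00, 1_000, 1)    # ~$0.025
budget   = effective_cost(0.28, 1_000, 100)   # ~$0.028, despite an ~89x cheaper list price
print(budget > frontier)  # True
```

the budget model can genuinely cost more per answer. measure tokens emitted, not tokens priced.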
004 / routing

the smart way: route by complexity

a model-agnostic architecture with rule-based routing can cut token costs by 40-60%.

01

classify the query

use a tiny model or heuristic to score task complexity. simple extraction? cheap tier. multi-step reasoning? escalate. this classifier costs almost nothing.

02

match to tier

route to the cheapest model that can handle the complexity. most requests are simpler than you think. only escalate when the cheap model's confidence is low.

03

verify and fallback

check output quality. if the cheap model failed, retry with mid-tier. if mid-tier failed, hit frontier. cascading saves money without sacrificing quality.
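the three steps above can be sketched as one cascade. `call_model`, the keyword heuristic, and the 0.8 confidence gate are all placeholders for your actual client, classifier, and quality check:

```python
from typing import Callable

TIERS = ["cheap", "mid", "frontier"]   # e.g. Flash -> Sonnet -> Opus

def classify(query: str) -> str:
    """step 01: a trivial heuristic. replace with a tiny model in production."""
    hard_signals = ("prove", "refactor", "debug", "multi-step")
    return "mid" if any(s in query.lower() for s in hard_signals) else "cheap"

def route(query: str, call_model: Callable[[str, str], tuple[str, float]]) -> str:
    """steps 02 + 03: start at the classified tier, escalate on low confidence."""
    start = TIERS.index(classify(query))
    answer = ""
    for tier in TIERS[start:]:
        answer, confidence = call_model(tier, query)
        if confidence >= 0.8:          # placeholder quality gate
            return answer
    return answer                      # frontier's answer, even when unsure
```

`call_model` takes (tier, query) and returns (answer, confidence). wiring it to real APIs and tuning the gate is the whole game.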

005 / decide

the 5-question decision tree

1

what is the task?

coding goes to Claude. math goes to GPT-5.2. science goes to Gemini. if you don't know, start with GPT-5.2 as the generalist.

2

how much context?

under 128K: any model works. 128K-1M: Gemini or Grok. over 1M: Llama 4 Scout or Gemini. context length eliminates options fast.

3

what is your budget?

under $100/mo: DeepSeek or Gemini Flash. $100-1000/mo: Sonnet or GPT-5. unlimited: Opus or GPT-5 Pro. be honest about this upfront.

4

does data leave your infra?

if no: Llama 4, DeepSeek, or Qwen (self-host). if yes: any API provider. data sovereignty is a hard constraint, not a preference.

5

is this one model or a pipeline?

single model: pick the best for your task. pipeline: use cheap models for 90% of steps, frontier for the hard 10%. this is where routing pays off.
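the five questions collapse into a short function. a sketch: the picks mirror the answers above, the thresholds are this page's numbers, and question 5 (single model vs pipeline) stays a human call:

```python
def choose_model(task: str, context_tokens: int, monthly_budget: float,
                 data_must_stay: bool) -> str:
    """walk the decision tree; returns a model name from this page."""
    if data_must_stay:                      # Q4 is a hard constraint, check it first
        return "Llama 4 Scout" if context_tokens > 1_000_000 else "DeepSeek-V3.2"
    if context_tokens > 1_000_000:          # Q2: context eliminates options fast
        return "Llama 4 Scout"
    if monthly_budget < 100:                # Q3: be honest about budget
        return "Gemini 2.5 Flash"
    by_task = {"coding": "Claude 4.5 Opus", "math": "GPT-5.2",
               "science": "Gemini 3 Deep Think"}
    return by_task.get(task, "GPT-5.2")     # Q1: generalist when unsure

print(choose_model("coding", 50_000, 500, data_must_stay=False))  # Claude 4.5 Opus
```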


the model you pick matters less than knowing when to switch.

bookmark this page. it gets updated when the landscape shifts.