- OpenAI o3 leads on MATH benchmark at 97.3%; Claude Extended Thinking at 94.1%, Gemini 2.5 Deep Think at 93.8%
- Reasoning models cost 10–30x more per query than standard models—and are only worth it on genuinely hard problems
- GPQA (graduate-level science) shows the tightest race: o3 at 87.5%, Claude at 86.2%, Gemini at 85.9%
- For coding challenges (Codeforces Div 2): o3 solves 78% of problems, Claude 71%, Gemini 70%
Section 1 — What Makes a Reasoning Model Different
Standard LLMs generate responses token by token, with each token representing a forward pass through the model. Reasoning models—o3, Claude with Extended Thinking, Gemini 2.5 Deep Think—use a different paradigm: they generate a chain of internal "thinking" tokens before producing their final answer. This extended internal monologue allows the model to check its work, explore alternative approaches, and correct errors that would otherwise propagate to the final output.
The thinking tokens are real computation with real costs. On o3, a single complex math problem can consume 5,000–15,000 thinking tokens before producing a 100-token answer. At $60 per million tokens (o3's approximate cost in March 2026), a single reasoning query on a hard problem can cost $0.50–$1.20. That's 50–100x more expensive than asking the same question to a standard model.
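The arithmetic above can be sketched as a small helper; the $60/M rate and token counts are the article's rough March 2026 estimates, used here purely for illustration:

```typescript
// Rough per-query cost: thinking tokens are billed like output tokens,
// so total cost scales with (thinking + output) tokens times the rate.
function estimateReasoningCost(
  thinkingTokens: number,
  outputTokens: number,
  pricePerMillionTokens: number
): number {
  return ((thinkingTokens + outputTokens) * pricePerMillionTokens) / 1_000_000;
}

// A hard problem: 15,000 thinking tokens plus a 100-token answer at $60/M
estimateReasoningCost(15_000, 100, 60); // ≈ $0.91
```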
The cost premium is only justified when it buys accuracy that matters. For a routine customer support query, there is no benefit to extended reasoning—the question doesn't require it and you're spending $1 to get the same answer as $0.003. For solving a complex integration problem, verifying a multi-step financial model, or writing code that must be provably correct, the reasoning premium is often justified.
Understanding when the premium is worth paying is the central skill in deploying reasoning models effectively.
Section 2 — Benchmark Results: The Full Picture
Benchmarks for reasoning models require more interpretation than standard model benchmarks because performance varies dramatically by difficulty tier. All three models perform similarly on moderately hard problems (MATH difficulty 4/5) but diverge significantly on the hardest 10% of problems.
MATH Benchmark (Hendrycks MATH, 500-problem test set):
- o3: 97.3% (up from 76.7% for GPT-4 in 2024, a 20.6-point gain)
- Claude Extended Thinking: 94.1%
- Gemini 2.5 Deep Think: 93.8%
- Claude Sonnet 4.6 (no extended thinking): 74.2%
- GPT-5 (no extended reasoning): 71.8%
The gap between reasoning and standard models on MATH is dramatic. The gap between the three top reasoning models is much smaller—less than 4 percentage points separates o3 and Gemini.
GPQA Diamond (graduate-level science questions):
- o3: 87.5%
- Claude Extended Thinking: 86.2%
- Gemini 2.5 Deep Think: 85.9%
- Standard frontier models: 62–67%
SWE-bench Verified (real GitHub issues):
- o3: 71.7%
- Claude Extended Thinking: 68.3%
- Gemini 2.5 Deep Think: 65.1%
AIME 2025 (American Invitational Mathematics Examination):
- o3: 92.4%
- Claude Extended Thinking: 88.1%
- Gemini 2.5 Deep Think: 87.3%
Section 3 — Cost and Latency Comparison
| Model | Best Reasoning Task | Cost per Query (typical hard problem) | Latency (hard problem) | Context Window |
|---|---|---|---|---|
| OpenAI o3 | Mathematical proofs, formal logic, olympiad problems | $0.40–$1.50 | 45–120 seconds | 200K tokens |
| Claude Extended Thinking | Code generation, multi-step analysis, research tasks | $0.30–$1.20 | 35–90 seconds | 200K tokens |
| Gemini 2.5 Deep Think | Scientific reasoning, multimodal + reasoning combo | $0.35–$1.30 | 40–100 seconds | 1M tokens |
| o3-mini (high effort) | Efficient math and coding, cost-conscious reasoning | $0.05–$0.25 | 15–45 seconds | 200K tokens |
| Claude Sonnet (standard) | General tasks, non-reasoning workloads | $0.003–$0.015 | 1–5 seconds | 200K tokens |
Section 4 — When to Use Reasoning Models
The decision to use a reasoning model should be explicit and economically justified. Our framework:
Use reasoning models when:
- The task has a verifiable correct answer and being wrong has significant cost (financial models, code that runs on production, medical calculations)
- The problem requires more than 5 logical steps where errors accumulate
- Standard models show >10% failure rate on your specific task type
- You're processing a small number of high-value queries (not thousands of low-stakes queries)
Do not use reasoning models when:
- The task is classification, extraction, or summarization (standard models are equally good)
- You need sub-5-second response times (reasoning models are slow)
- You're processing high volumes of simple queries (the cost premium is unsustainable)
- The task involves creative writing or open-ended generation (reasoning doesn't help here)
A useful rule of thumb: if you can't tell from the problem statement whether a reasoning model would help, it probably won't. The tasks that benefit from extended reasoning are usually obvious—they're the problems where you'd want a human expert to "show their work."
OpenAI's o3-mini at "high" reasoning effort delivers approximately 85% of o3's benchmark performance at roughly 20% of the cost. For most production use cases, o3-mini is the economically rational choice; full o3 is reserved for the hardest problems where the remaining accuracy gap matters.
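One way to make the o3 vs o3-mini call concrete is an expected-cost comparison. The prices, accuracies, and failure costs below are hypothetical placeholders chosen to illustrate the break-even logic, not measured values:

```typescript
// Expected cost per query = API cost + P(wrong answer) * cost of being wrong.
function expectedCostPerQuery(
  apiCost: number,
  accuracy: number,
  failureCost: number
): number {
  return apiCost + (1 - accuracy) * failureCost;
}

// Hypothetical figures: o3 at $1.00/query and 97% accuracy;
// o3-mini at $0.20/query and 92% accuracy.
const o3 = (failureCost: number) => expectedCostPerQuery(1.0, 0.97, failureCost);
const o3mini = (failureCost: number) => expectedCostPerQuery(0.2, 0.92, failureCost);

o3(50) < o3mini(50); // true: at a $50 failure cost, full o3 is cheaper overall
o3(5) > o3mini(5);   // true: at a $5 failure cost, o3-mini wins
```

The crossover point is entirely driven by the cost of a wrong answer, which is why the framework above keys on whether errors have significant cost.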
Section 5 — Practical Integration Patterns
```typescript
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic();
const openai = new OpenAI();

// Classify query complexity to route to the appropriate model
function classifyQueryComplexity(query: string): "simple" | "medium" | "hard" {
  const hardIndicators = [
    /prove|proof|demonstrate mathematically/i,
    /solve.*equation|integrate|differentiate/i,
    /algorithm.*O\(|complexity|NP-hard/i,
    /verify.*correct|formally verify/i,
  ];
  const mediumIndicators = [
    /analyze|compare.*tradeoff/i,
    /design.*system|architecture/i,
    /debug.*complex|trace.*error/i,
  ];
  if (hardIndicators.some((p) => p.test(query))) return "hard";
  if (mediumIndicators.some((p) => p.test(query))) return "medium";
  return "simple";
}

// Route to the appropriate model based on complexity
async function adaptiveInference(query: string): Promise<{
  response: string;
  modelUsed: string;
  estimatedCost: number;
}> {
  const complexity = classifyQueryComplexity(query);
  switch (complexity) {
    case "hard": {
      // Use Claude Extended Thinking for hard problems
      const response = await anthropic.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 16000,
        thinking: {
          type: "enabled",
          budget_tokens: 10000, // Allow up to 10K thinking tokens
        },
        messages: [{ role: "user", content: query }],
      });
      const textBlock = response.content.find((b) => b.type === "text");
      return {
        response: textBlock?.type === "text" ? textBlock.text : "",
        modelUsed: "claude-extended-thinking",
        // Thinking tokens are billed at the output rate and are already
        // counted in usage.output_tokens ($3/M input, $15/M output).
        estimatedCost:
          (response.usage.input_tokens * 3 +
            response.usage.output_tokens * 15) /
          1_000_000,
      };
    }
    case "medium": {
      // Use standard Claude for medium complexity
      const response = await anthropic.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 4096,
        messages: [{ role: "user", content: query }],
      });
      const textBlock = response.content.find((b) => b.type === "text");
      return {
        response: textBlock?.type === "text" ? textBlock.text : "",
        modelUsed: "claude-sonnet-4-6",
        estimatedCost:
          (response.usage.input_tokens * 3 +
            response.usage.output_tokens * 15) /
          1_000_000,
      };
    }
    default: {
      // Use a cheap model for simple queries
      const response = await anthropic.messages.create({
        model: "claude-haiku-3-5",
        max_tokens: 1024,
        messages: [{ role: "user", content: query }],
      });
      const textBlock = response.content.find((b) => b.type === "text");
      return {
        response: textBlock?.type === "text" ? textBlock.text : "",
        modelUsed: "claude-haiku-3-5",
        estimatedCost:
          (response.usage.input_tokens * 0.25 +
            response.usage.output_tokens * 1.25) /
          1_000_000,
      };
    }
  }
}
```
Section 6 — Limitations and Honest Caveats
Reasoning models have real weaknesses that benchmark scores don't capture:
Latency is a genuine barrier: 45–120 seconds for a hard query is unacceptable in user-facing interactive applications. Reasoning models are appropriate for asynchronous workflows (document review, code analysis in CI) but cannot replace real-time inference for conversational products.
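Where a user-facing path must nevertheless touch a reasoning model, one mitigation is to race the slow call against a deadline and fall back to a fast model. The sketch below is generic; the fallback function stands in for whatever cheap-model call your stack provides:

```typescript
// Race a slow promise against a deadline; on timeout (or failure),
// switch to the caller-supplied fallback.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  fallback: () => Promise<T>
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("deadline exceeded")), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } catch {
    return fallback();
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that the timed-out reasoning call keeps running (and billing) in the background; cancelling the underlying request is a separate, provider-specific concern.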
Longer thinking ≠ always better: We found that beyond a certain thinking token budget, additional reasoning can cause the model to "overthink"—second-guess a correct initial solution and arrive at a wrong answer. Each model has an optimal thinking budget range for different task types that requires empirical calibration.
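That calibration can be as simple as sweeping budgets offline on a held-out set and keeping the best-scoring one. The budget/accuracy pairs below are hypothetical calibration results, not benchmark data:

```typescript
interface CalibrationPoint {
  budgetTokens: number;
  accuracy: number; // measured offline on a held-out set of your task type
}

// Return the thinking-token budget with the highest measured accuracy.
function bestThinkingBudget(points: CalibrationPoint[]): number {
  return points.reduce((best, p) => (p.accuracy > best.accuracy ? p : best))
    .budgetTokens;
}

bestThinkingBudget([
  { budgetTokens: 2_000, accuracy: 0.81 },
  { budgetTokens: 8_000, accuracy: 0.9 },
  { budgetTokens: 32_000, accuracy: 0.86 }, // accuracy drops: "overthinking"
]); // → 8000
```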
Verbosity and format: Reasoning models produce longer, more structured outputs by default. For applications expecting concise outputs, you need explicit instructions to constrain response length—and these instructions are sometimes ignored when the model is in "deep thinking" mode.
No guaranteed reliability: Even o3 at 97.3% MATH accuracy fails 2.7% of the time on problems in that benchmark. For applications where 100% accuracy is required, reasoning models are a complement to human review, not a replacement.
Verdict
Reasoning models represent a genuine step change in AI capability for hard analytical problems. o3 leads on most benchmarks, but Claude Extended Thinking and Gemini 2.5 Deep Think are close enough that cost, latency, and ecosystem factors should drive the final choice. The key discipline is restraint: use reasoning models only for the problems that require them, and you'll pay the premium where it buys real value.
Data as of March 2026.
— iBuidl Research Team