Reasoning Models Are Getting Scary Good. Most Teams Are Using Them Wrong.
Google launched Gemini 3 with deep reasoning. OpenAI has GPT-5 Thinking. Anthropic has extended thinking in Claude. Every major provider now offers a model that "thinks before it answers" by generating an internal chain of reasoning before producing a response.
The results are impressive. On complex math, multi-step logic, and code generation with constraints, reasoning models blow standard models out of the water. They solve problems that stump their non-reasoning counterparts.
They're also 5 to 20 times more expensive per request. And 3 to 10 times slower.
Most teams using reasoning models today are burning money on tasks that don't need them.
How reasoning models work
Standard models generate tokens left to right. Each token is predicted based on everything before it. Fast. Cheap. Great for most tasks.
Reasoning models add a step. Before generating the visible response, they produce a hidden chain of thought. They break the problem into sub-problems. They evaluate multiple approaches. They check their own work. Then they generate the final answer based on that internal deliberation.
This is why they're better at hard problems. They're literally doing more work per request. And that extra work costs real money in compute time and token usage.
When reasoning models earn their cost
Complex code generation. "Write a function that parses this nested XML format, handles three different encoding schemes, validates against this schema, and returns structured errors for each type of failure." Tasks with multiple interacting constraints benefit enormously from step-by-step reasoning. The model plans the approach before writing code. Standard models try to do it all at once and miss edge cases.
Multi-step analysis. "Given this financial data, identify the three accounts with the highest anomaly scores, explain the methodology, and flag potential regulatory issues." Anything that requires holding multiple analytical threads and synthesizing them into a coherent answer.
Mathematical and scientific problems. Proofs, derivations, quantitative modeling. Tasks where the answer depends on a chain of logical steps, each of which must be correct for the final answer to be valid. Reasoning models verify their intermediate steps. Standard models guess and hope.
Strategic planning with constraints. "Design a database migration plan for this schema change that minimizes downtime, preserves backwards compatibility with the v2 API, and can be rolled back in under 5 minutes." Real-world engineering decisions with competing requirements.
When standard models win
Classification. "Is this email spam or not?" "What category does this support ticket belong to?" These are pattern-matching tasks. The model doesn't need to reason through them. A standard model (or even a fine-tuned small model) handles classification faster, cheaper, and just as accurately.
Extraction. "Pull the company name, invoice number, and total from this document." Structured data extraction from unstructured text is mechanical work. Reasoning adds latency without improving results.
Summarization. "Summarize this meeting transcript into bullet points." Unless the transcript contains complex technical discussions that require understanding to summarize correctly, a standard model handles this fine.
Conversational interfaces. Chatbots, customer support, Q&A systems. Users expect fast responses. A 15-second reasoning delay kills the experience. Use a standard model with good retrieval and save reasoning for edge cases that get escalated.
Routing and intent detection. "What does this user want?" This is the first step in most AI pipelines. It needs to be fast and cheap because it runs on every single request. Reasoning models here are like using a sledgehammer to hang a picture frame.
The architecture that works
The best production systems don't pick one model tier and use it everywhere. They route requests to the right model based on the task.
Layer 1: A small, fast model for routing. Classifies the incoming request. Determines complexity. Routes simple tasks to a standard model and complex tasks to a reasoning model. This layer runs on every request, so speed and cost matter most.
Layer 2: A standard model for 80% of tasks. The workhorse. Handles classification, extraction, summarization, and conversational responses. Fast. Affordable. Good enough for the majority of work.
Layer 3: A reasoning model for the hard 20%. Complex analysis, multi-step generation, tasks where accuracy on intricate problems justifies the cost and latency. The reasoning model only runs when the router sends it work.
This architecture gives you the best answers on hard problems and the best economics on easy ones. One team we worked with reduced their AI spend by 60% by adding a routing layer that sent only genuinely complex requests to their reasoning model. Their accuracy on hard tasks stayed the same. Their cost on everything else dropped dramatically.
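The three layers above can be sketched as a simple dispatcher. Everything here is illustrative: the model names, the keyword heuristic in `classify_complexity`, and the two-tier split are assumptions standing in for a real classifier model and real provider clients.

```python
from dataclasses import dataclass

# Hypothetical tier names; production code would wrap actual provider SDKs.
STANDARD_MODEL = "standard-model"    # Layer 2: the 80% workhorse
REASONING_MODEL = "reasoning-model"  # Layer 3: the hard 20%

@dataclass
class Request:
    text: str

def classify_complexity(req: Request) -> str:
    """Layer 1 stand-in: a cheap keyword heuristic instead of a real router.

    In production this would be a small, fast LLM or a fine-tuned classifier
    that labels each incoming request 'simple' or 'complex'.
    """
    complex_markers = ("prove", "migration plan", "derive", "constraints", "multi-step")
    text = req.text.lower()
    return "complex" if any(m in text for m in complex_markers) else "simple"

def route(req: Request) -> str:
    """Return the model tier that should handle this request."""
    if classify_complexity(req) == "complex":
        return REASONING_MODEL
    return STANDARD_MODEL

# Simple tasks stay on the cheap tier; hard ones escalate.
print(route(Request("Is this email spam?")))                        # standard-model
print(route(Request("Design a migration plan under constraints")))  # reasoning-model
```

The key design point is that the router runs on every request, so it must be far cheaper than the models it routes to; a misroute of an easy task costs you one reasoning call, while a misroute of a hard task costs you one wrong answer, and the heuristic should be tuned with that asymmetry in mind.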
The decision checklist
Before routing any task to a reasoning model, ask these questions:
- Does this task require multiple logical steps? If the answer is a single classification or extraction, use a standard model.
- Does accuracy on this specific task justify 5 to 20x higher cost? If a wrong answer means a bad user experience, maybe. If a wrong answer means a regulatory violation, definitely.
- Can users tolerate the latency? Reasoning models take seconds, not milliseconds. If this is a real-time interface, that delay matters.
- Is this a recurring high-volume task? If you're running this thousands of times a day, the cost difference between reasoning and standard compounds fast.
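One way to make the checklist concrete is a back-of-the-envelope break-even calculation: given the 5 to 20x cost multiplier, how much must accuracy improve before the reasoning tier pays for itself at your volume? The function below encodes the four questions; every default value (per-request costs, the accuracy gain) is a placeholder, not a benchmark.

```python
def should_use_reasoning(
    multi_step: bool,          # Does the task require multiple logical steps?
    wrong_answer_cost: float,  # Dollar cost of a wrong answer (bad UX, fine, churn)
    latency_tolerable: bool,   # Can users wait seconds instead of milliseconds?
    requests_per_day: int,     # Volume: cost differences compound at scale
    standard_cost: float = 0.002,   # assumed $/request, standard tier
    reasoning_cost: float = 0.02,   # assumed $/request, ~10x (within the 5-20x range)
    accuracy_gain: float = 0.05,    # assumed extra fraction of tasks reasoning gets right
) -> bool:
    """The article's checklist as a rough gate. All defaults are illustrative."""
    # Questions 1 and 3 are hard filters: a single-step task or a
    # latency-sensitive interface rules out the reasoning tier outright.
    if not multi_step or not latency_tolerable:
        return False
    # Questions 2 and 4: expected daily value of the extra accuracy
    # versus the extra daily spend at this volume.
    daily_benefit = requests_per_day * accuracy_gain * wrong_answer_cost
    daily_extra_cost = requests_per_day * (reasoning_cost - standard_cost)
    return daily_benefit > daily_extra_cost

# A high-stakes, multi-step task clears the bar...
print(should_use_reasoning(True, wrong_answer_cost=5.0,
                           latency_tolerable=True, requests_per_day=1000))    # True
# ...a single-step classification never does, regardless of volume.
print(should_use_reasoning(False, wrong_answer_cost=5.0,
                           latency_tolerable=True, requests_per_day=100000))  # False
```

Note that when both tiers' costs scale with the same volume, the break-even condition reduces to `accuracy_gain * wrong_answer_cost > reasoning_cost - standard_cost` per request, which is why high-stakes tasks justify reasoning even at modest accuracy gains.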
The models keep getting better. Reasoning will get faster and cheaper over time. But right now, the gap between reasoning and standard models is wide enough that using the wrong tier for a workload has real financial consequences.
Pick the right tool for the job. Your most complex 20% of tasks deserve reasoning. The other 80% don't need it and shouldn't pay for it.