Claude Opus 4.7 Just Dropped. Here's What Actually Changed.
Anthropic shipped Claude Opus 4.7 today. On paper, it's an incremental version bump. In practice, it's the biggest coding-focused upgrade since Opus 4.5 introduced extended thinking.
The headline numbers are strong. SWE-bench Pro jumps from 53.4% to 64.3%. SWE-bench Verified climbs from 80.8% to 87.6%. Rakuten's production task evaluation shows it resolving 3x more real-world software engineering problems than its predecessor. Those aren't cherry-picked academic benchmarks. They're the closest proxies the industry has for "can this model actually write code that ships."
But the numbers only tell part of the story. Three changes matter more than the benchmarks.
Vision went from a gimmick to a tool
Previous Claude models could look at images. Opus 4.7 can actually see them.
The resolution ceiling tripled. The model now processes images up to 2,576 pixels on the long edge, over three times what Opus 4.6 handled. Visual acuity jumped from 54.5% to 98.5% on Anthropic's internal benchmark. That's not an improvement. That's a capability that didn't exist before.
What this means in practice: you can hand it a screenshot of a complex UI and get back accurate analysis of layout, spacing, and component hierarchy. You can feed it architectural diagrams, chemical structures, or dense technical schematics and expect it to read them correctly. The gap between "I can describe this image" and "I can understand what this image means" just closed.
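As a concrete sketch of the screenshot workflow: the Messages API accepts base64-encoded images alongside text in a single user turn. The model ID `claude-opus-4-7` below is an assumption (check the models list for the exact identifier); the payload builder just shows the request shape, without making a network call.

```python
import base64

# Pair a UI screenshot with an analysis prompt in one Messages API
# request body. Model ID is assumed; verify against the models list.
def build_screenshot_request(png_bytes: bytes, question: str) -> dict:
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(png_bytes).decode("ascii"),
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    }

payload = build_screenshot_request(
    b"\x89PNG...",  # raw screenshot bytes go here
    "Describe the component hierarchy, spacing, and layout issues.",
)
```

Passing this dict to your API client of choice is all the plumbing the new vision ceiling needs; the resolution handling happens server-side.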
For teams building multimodal workflows, this is the release where vision stops being the weak link.
Instruction following got sharper, and that's a double-edged sword
Opus 4.7 interprets instructions more literally than any previous Claude model. Anthropic flags this directly in the release notes as something that "may require users to adjust existing prompts."
This is a meaningful change for anyone running production systems. If your prompts relied on the model inferring what you meant from vague instructions, 4.7 will do exactly what you said instead of what you intended. Tightly written prompts will perform better. Sloppy ones will perform worse. The model didn't get dumber at understanding context. It got more disciplined about following directions.
The new "xhigh" effort level sits between high and max, giving you finer control over the reasoning-speed tradeoff. Combined with the stricter instruction following, the message is clear: Anthropic wants you to be more precise about what you ask for and how hard the model should think about it.
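One way to use the new tier: route requests to an effort level based on task complexity instead of hardcoding one setting everywhere. The parameter name (`effort`) and the value list below are assumptions inferred from the release notes, not confirmed API fields; treat this as a sketch of the routing idea.

```python
# Assumed effort tiers, low to max, with "xhigh" between high and max.
# Parameter name and values are assumptions; verify against the API docs.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

def request_params(task_complexity: int) -> dict:
    """Map a rough 0-4 complexity score to an effort tier."""
    idx = max(0, min(task_complexity, len(EFFORT_LEVELS) - 1))
    return {"model": "claude-opus-4-7", "effort": EFFORT_LEVELS[idx]}

print(request_params(3))  # complexity 3 -> "xhigh"
```

The payoff of routing like this is that only the hard requests pay the latency and cost of deeper reasoning.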
The competitive picture shifted
Here's where the benchmarks get interesting when you put them side by side.
On SWE-bench Pro, Opus 4.7 at 64.3% leads GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. That's a comfortable margin on the hardest coding benchmark in the industry.
On GPQA Diamond (graduate-level reasoning), all three models are within 0.2% of each other. Opus 4.7 scores 94.2%. GPT-5.4 Pro scores 94.4%. Gemini 3.1 Pro hits 94.3%. Reasoning has plateaued at the frontier.
On MCP-Atlas (scaled tool use), Opus 4.7 leads at 77.3% versus GPT-5.4's 68.1%. For anyone building agentic systems, that gap matters. Tool use reliability is what separates a demo from a deployment.
The pattern: Opus 4.7 dominates on coding and tool use. It trades blows on reasoning. It loses to GPT-5.4 Pro on agentic search (79.3% vs 89.3%). Your model choice depends on what you're building.
What didn't change
Pricing stays at $5 per million input tokens and $25 per million output tokens. Same as Opus 4.6. Same context window at 1 million tokens. Prompt caching still cuts costs up to 90%. Batch processing still gives you 50%.
The tokenizer changed, though. Depending on your content, the same input can now tokenize to between 1.0x and 1.35x as many tokens. That's a hidden price increase if your workload is token-heavy. Run the numbers before assuming the upgrade is free.
Availability is broad from day one. Claude Pro, Max, Team, and Enterprise. The API. Amazon Bedrock. Google Cloud Vertex AI. Microsoft Foundry. No waitlist, no preview period.
The Mythos shadow
Opus 4.7 ships with cyber safeguards that weren't present in previous versions. Anthropic explicitly states the model has "reduced cyber capabilities" compared to Claude Mythos Preview, the unreleased model that finds zero-day exploits in audited operating systems.
This is the first time a Claude release note has defined a model partly by what it can't do relative to an unreleased internal model. The subtext is hard to miss: Anthropic is now shipping models with deliberate capability ceilings in domains where the frontier is too dangerous for general access.
New cyber verification programs let security professionals request elevated access. For everyone else, the guardrails are tighter than 4.6.
Who should upgrade today
If you're running Claude Code, upgrade now. The coding improvements are real and the pricing is identical. Replit reports matching Opus 4.6 quality at lower cost. Vercel calls it "phenomenal on one-shot coding tasks." CodeRabbit saw code review recall improve by over 10% while maintaining precision.
If you're running production API workloads, test first. The stricter instruction following means your existing prompts may behave differently. The tokenizer change means your cost projections may be off. Allocate a day for prompt regression testing before switching traffic.
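That regression day doesn't need heavy tooling. A minimal harness replays your production prompts against the new model and flags any whose output loses an invariant you depend on. `call_model` below is a placeholder for your real API client; the stub stands in for a model that only emits JSON when told to explicitly, which is the failure mode stricter instruction following tends to surface.

```python
from typing import Callable

def run_regression(prompts: list[tuple[str, str]],
                   call_model: Callable[[str], str]) -> list[str]:
    """Return the prompts whose output no longer contains the expected marker."""
    failures = []
    for prompt, must_contain in prompts:
        if must_contain not in call_model(prompt):
            failures.append(prompt)
    return failures

# Stub model for illustration: returns JSON only when asked explicitly.
fake_model = lambda p: '{"ok": true}' if "JSON" in p else "Sure! Here is..."

suite = [
    ("Reply with JSON only: status check", '{"ok"'),
    ("Give me a status check", '{"ok"'),  # vague prompt: likely to regress
]
print(run_regression(suite, fake_model))  # ['Give me a status check']
```

The vague prompt fails the check; the explicit one passes. Fix the failures before you switch traffic, not after.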
If you're building multimodal applications, this is the release that makes Claude's vision competitive. The jump from 54.5% to 98.5% on visual acuity isn't incremental. It's a new capability.
The model is live. The benchmarks look good. The real test is what it does in your codebase tomorrow morning.