Opus 4.7 is slower on my codebase. Here's where the benchmarks lie.

Anthropic's SWE-bench Pro jump is real. But adaptive thinking adds latency that benchmarks absorb and surgical codebases do not.

2026-04-18 · 5 min read

SWE-bench Pro jumped from 53.4% on Claude Opus 4.6 to 64.3% on 4.7. That is a 10.9-point gain in a single version bump, and you should take it at face value. SWE-bench Pro is not an easy benchmark: it uses private test cases on real repositories, not curated toy tasks, and a double-digit improvement in one release cycle is a genuinely strong result. Anthropic also held the price flat at $5 per million input tokens and $25 per million output tokens, which removes the easy out of blaming the gain on a more expensive compute tier. On paper, this is an unambiguous upgrade. I have been running 4.7 daily since it launched on April 16, two days before this post, and on my actual codebase I have been slower for two days straight. Here is the mechanism, not the complaint.

Where benchmarks and real work diverge

SWE-bench Pro tests a specific shape of task: a self-contained bug-fix on a known repository, with a test harness that passes or fails based on the diff. The model reads the issue description, locates the relevant code, produces a patch, and the suite scores whether the tests turn green. That is a genuine capability test, and it is one that adaptive thinking, Opus 4.7's core differentiator, is well-suited for. Hard reasoning tasks, where the right path through the problem is not obvious at the start, genuinely benefit from a model that spends more tokens thinking before committing to an output.

Real day-to-day coding assistance is mostly not that shape. It is a migration where three related files have to change together and stay consistent with each other. It is a refactor where the right answer is not "here is the patch" but "this approach is the wrong direction, do this other thing instead." It is a debugging session where the hard part is locating the problem in a 40-file call chain, not fixing the two wrong lines once you find them. On these tasks, adaptive thinking still spends tokens before arriving at the edit. If the edit is surgical and you already know the answer well enough to direct it, the extra reasoning does not pay for itself in output quality. You get the same diff after more wall-clock time.

The benchmark absorbs this latency gracefully because it is running unattended. When you are sitting at a terminal in Claude Code waiting for a multi-file refactor to land, you feel the pause. Benchmarks measure what the model can do. They do not measure the feedback loop you are working inside.

There is a second gap worth naming. SWE-bench runs isolated tasks on curated repositories. The repositories are well-structured, the issues are scoped, the test harnesses are clean. A mid-sized production codebase, especially one in the middle of a major-version migration, has none of those properties. Files carry layered context from six months of decisions that are not documented anywhere in the file. The right answer to "how should I change this" often depends on three other files that are not in the immediate diff. Adaptive thinking does not automatically surface that cross-file context. It just thinks longer on the input it was given.

What the 1M context actually does

The headline number is real. One million token context means you can load a whole mid-sized codebase as a single context window without reaching for retrieval tricks, chunking, or file-selection heuristics. That is a genuine capability upgrade for a specific class of task: architecture review, codebase-level reasoning, spec writing against an existing system. When the question is "how does this system currently work, and what would break if I changed this contract," more context is the answer, and 4.7 delivers it.

The daily edit-apply loop is a different case. Most file edits in an active codebase do not benefit from loading 200k tokens of context on every turn. The relevant context is local: the file being edited, its immediate imports, the test that covers it. Loading the entire codebase for a three-line config change does not improve the output. It adds context-processing time to every turn, which compounds the latency that adaptive thinking already introduces. The 1M context window is a ceiling, not a default, and if your tooling or prompting loads it fully on every request, the latency tradeoff gets worse with codebase size. This is not a criticism of the feature. It is a description of where the feature applies and where it costs more than it pays.

What I observed on the portfolio rebuild

I stayed on 4.7 through the content-drafting phases of this rebuild, and the tradeoff flipped exactly where you would expect. Writing tasks, where the reasoning is the work, where the model has to hold the voice constraints, the structural requirements, and the source material together at once, showed real improvement. The outputs needed less revision. The structure held without prompting. That is where adaptive thinking earns its overhead.

On the code migration work, moving the site from Next.js 14 to 16 across four or five stacked convention changes, I reached for earlier tooling more than once. The reason was direct: surgical edits on an established codebase, where I already knew what needed to change and the work was execution rather than reasoning, felt faster at the prior capability level. Not catastrophically slower on 4.7, but noticeably so. Enough to matter across a long session of multi-file changes. No instrumentation was run on this, and I will not invent measurements. This is qualitative, and I am stating it as such.

The signal I kept coming back to: the output quality on those surgical edits was not meaningfully different. The diff was the same. The latency was not. That specific gap is the tell for a model spending tokens on reasoning that the task did not require.
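If you want to check for this gap on your own work rather than take a qualitative impression on trust, a harness could look roughly like the sketch below. It is not something I ran. `run_edit` stands for whatever function you would write to send a prompt to a model and return the resulting diff as text; that callable, and the names `time_edit` and `same_diff_slower`, are hypothetical.

```python
import time
from statistics import median
from typing import Callable


def time_edit(
    run_edit: Callable[[str], str], prompt: str, repeats: int = 5
) -> tuple[float, set[str]]:
    """Return the median wall-clock seconds and the set of distinct outputs."""
    durations: list[float] = []
    outputs: set[str] = set()
    for _ in range(repeats):
        start = time.perf_counter()
        outputs.add(run_edit(prompt))
        durations.append(time.perf_counter() - start)
    return median(durations), outputs


def same_diff_slower(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    prompt: str,
    repeats: int = 5,
) -> bool:
    """True when both callables produce the same outputs but the candidate is slower."""
    base_time, base_out = time_edit(baseline, prompt, repeats)
    cand_time, cand_out = time_edit(candidate, prompt, repeats)
    return base_out == cand_out and cand_time > base_time
```

Comparing the median over a few repeats, rather than a single run, is a deliberate choice here: one slow network round trip should not decide the result.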

A narrow recommendation

Use 4.7 on fresh greenfield work where the design space is open. Use it for architecture review, spec writing against a complex existing system, debugging sessions where you genuinely do not know where the problem lives. Use it for content drafting and any task where the reasoning is the bottleneck rather than the execution. The context window upgrade, used deliberately, is a real tool for those cases.

Stay on 4.6, or treat 4.7 carefully, when the work is surgical: established codebases, multi-file refactors where you already know the answer and need it applied cleanly, pair-programming-style rapid iteration where the feedback loop is the point. The pricing is identical across versions, so this is not a cost decision. It is a task-shape decision. The same model serves both cases differently, and the benchmark does not tell you which case you are in.
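The task-shape decision can be written down as a small lookup, which is roughly how I apply it in my head. This is a sketch of the recommendation above and nothing more: the `TaskShape` categories are my own labels, `pick_version` is a hypothetical helper, and the returned strings are plain version labels, not API model identifiers.

```python
from enum import Enum


class TaskShape(Enum):
    GREENFIELD = "greenfield work"
    ARCHITECTURE_REVIEW = "architecture review"
    SPEC_WRITING = "spec writing against an existing system"
    DEBUG_UNKNOWN_LOCATION = "debugging, problem location unknown"
    CONTENT_DRAFTING = "content drafting"
    SURGICAL_EDIT = "surgical edit on an established codebase"
    KNOWN_ANSWER_REFACTOR = "multi-file refactor, answer already known"
    RAPID_ITERATION = "pair-programming-style rapid iteration"


# Shapes where execution, not reasoning, is the bottleneck.
EXECUTION_BOUND = {
    TaskShape.SURGICAL_EDIT,
    TaskShape.KNOWN_ANSWER_REFACTOR,
    TaskShape.RAPID_ITERATION,
}


def pick_version(shape: TaskShape) -> str:
    """Return the version label this post would recommend for a task shape."""
    return "4.6" if shape in EXECUTION_BOUND else "4.7"
```

For example, `pick_version(TaskShape.SURGICAL_EDIT)` returns `"4.6"`, while `pick_version(TaskShape.ARCHITECTURE_REVIEW)` returns `"4.7"`.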

This is not a rant about the model. 4.7 is the right tool for a range of work that matters. The recommendation is specific: match the capability to the task shape rather than defaulting to the latest version everywhere.

Closing

Benchmarks measure what they can measure. For the parts of your job that do not fit the benchmark harness, 4.7 quietly hands the decision back to you: figure out which tool fits the task.

Written by

Faiz Kasman

Software engineer in Kuala Lumpur. Payments, multi-tenant SaaS, and inventory infrastructure.
