The Benchmark Problem Nobody Wants to Name
SWE-Bench Pro was supposed to be the honest benchmark. Realistic coding problems. Real repositories. The answer to developers who were tired of benchmarks that tested things models had clearly memorised.
It is contaminated.
Models regularly cheat on SWE-Bench Pro. Solutions to its problems have leaked into training data widely enough that models encounter problems they have effectively already seen. The people who verify the results barely measure this, so the scores that get published are not reliable.
"The numbers on this bench have been nonsense for a while. And the fact that it's lasted this long, especially post-contamination, is frustrating."
The Results That Should Not Be Believed
Theo, a developer whose commentary on this specific topic has drawn 84,000 views, laid out the problem directly: the benchmark shows Qwen 3.7 Max and GLM 5.1 as meaningfully competitive with state-of-the-art models from OpenAI and Anthropic. That is not what real-world usage shows.
"I personally don't believe that Qwen 3.7 Max or GLM 5.1 are meaningfully better than state-of-the-art models from OpenAI. That's just obviously not true."
The same benchmark puts Gemini 3.5 Flash close to GPT-5.4 and GPT-5.5. That is not supported by actual developer experience either.
The gap between benchmark scores and real-world performance has been growing for months. Models that score well on contaminated benchmarks do not necessarily perform well on the tasks developers actually need done. The leaderboard has become disconnected from the thing it is supposed to measure.
Why This Matters for How You Choose Models
Benchmark scores are how most non-experts make model selection decisions. Enterprise procurement teams point to SWE-Bench. Product managers point to benchmark rankings. Investors use them to evaluate AI companies.
If the benchmark is contaminated and the scores are not reliable, then the decisions based on those scores are not reliable either. A company that selects a model based on a contaminated benchmark score and finds it underperforms in production is not experiencing a surprise. It was working from bad data.
The practical response: for any coding task that actually matters, test the model yourself on a representative sample of your actual work. Benchmark scores are a starting filter, not a final answer. For SWE-Bench Pro specifically, they are currently not even a reliable starting filter.
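One way to do that is a small harness that runs each candidate model over your own tasks and reports a local pass rate next to the published score. The sketch below is illustrative only: the names Task, pass_rate and compare are hypothetical, each "model" is assumed to be a plain function that takes a prompt and returns text (in practice, a wrapper around whatever API you use), and each check is whatever acceptance test makes sense for your work, such as running your test suite against the output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: a "model" is any function from prompt text to output text.
Model = Callable[[str], str]


@dataclass
class Task:
    """One representative piece of your own work."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model's output is acceptable


def pass_rate(model: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose output passes its check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return passed / len(tasks)


def compare(
    models: Dict[str, Model],
    tasks: List[Task],
    published: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, float, Optional[float]]]:
    """Return (model name, local pass rate, published score or None),
    sorted by local pass rate, best first."""
    published = published or {}
    rows = [
        (name, pass_rate(model, tasks), published.get(name))
        for name, model in models.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

If the ordering from compare disagrees with the leaderboard ordering, the local result is the one that reflects your workload. A few dozen tasks drawn from real tickets will usually tell you more than a single headline number, although a sample that small is still noisy.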
The Exception That Proves the Rule
The Anthropic Mythos result on SWE-Bench Pro (77.8% versus GPT-5.4's 57.7%) is being treated as an outlier because it is a 20-point gap in a space where most scores cluster. The magnitude is too large to be explained by contamination alone.
Which raises the uncomfortable question: if the benchmark is meaningless for smaller differences, is it still meaningful for differences this large? Developers watching this space are treating the Mythos result as potentially real while treating most other scores with increasing skepticism.
Better benchmarks are being built. None of them are ready to replace SWE-Bench Pro as the default reference point. In the meantime, the gap between what the leaderboard says and what the terminal shows is wider than most public conversations acknowledge.