The Number Everyone Already Knew
Carnegie Mellon published a study in May 2026. AI agents were given real-world software engineering tasks: debugging, writing tests, implementing features from specifications. Tasks that developers do every day.
The agents got them right 30% of the time.
The r/programming thread hit 11,902 upvotes. Not because engineers were surprised. Because someone had finally put a rigorous number on what they had been observing for months. The pile of failed internal AI projects now had a citation.
Why the 70% Failure Rate Is the Wrong Number to Focus On
The 30% success rate on individual tasks tells you something. The bigger problem is what happens when you chain tasks together, which is how real enterprise deployments actually work.
If an agent succeeds on a single step 30% of the time, and each step succeeds or fails independently of the others, it succeeds on two consecutive steps 9% of the time. Three steps: 2.7%. A five-step workflow: 0.24%.
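The arithmetic above is easy to check. The following is a minimal sketch, assuming every step has the same success rate and that steps succeed or fail independently of one another; the function name `chain_success` is illustrative and does not come from the study.

```python
def chain_success(step_success_rate: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step succeeds independently with the same probability,
    which is a simplification: real workflow steps are often correlated.
    """
    return step_success_rate ** steps


# Reproduce the figures quoted in the text for a 30% per-step success rate.
for n in (1, 2, 3, 5):
    print(f"{n} step(s): {chain_success(0.30, n):.2%}")
# 1 step(s): 30.00%
# 2 step(s): 9.00%
# 3 step(s): 2.70%
# 5 step(s): 0.24%
```

The same function shows why per-step reliability matters so much: under the same assumptions, a five-step chain at 90% per step still completes only about 59% of the time.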
Most enterprise AI deployments are not single-step systems. They are multi-step processes where the agent's output at step three depends on what it got right at steps one and two. The compounding failure rate is why so many projects that looked promising in pilot phase collapsed in production.
What the Successful Deployments Did Differently
The deployments that survived had one thing in common: they did not try to remove humans from the loop. They redesigned the loop.
Human checkpoints at each stage boundary. Instead of a five-step automated workflow, a three-step workflow with human review between steps. The agent handles the step. A human validates the output before the next step begins. The compounding failure problem largely disappears when a human catches the error before it propagates, because each step then starts from a validated input rather than from an unchecked guess.
Narrow scope with clean inputs. The agents that work reliably are the ones with well-defined, narrow tasks and clean, structured inputs. The agents that fail are the ones given broad mandates and messy real-world data. The scope is not a limitation of the technology; it is the design constraint that makes the technology usable.
Error detection before error correction. Successful teams instrumented their AI systems to surface uncertainty before it became a mistake. Confidence scoring on outputs. Flagging of inputs that fall outside the training distribution. The ability to say "I do not know" before acting, rather than acting confidently on an answer that is wrong.
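The checkpoint and confidence-gating ideas above can be sketched together in a few lines. This is an illustrative outline, not a description of any particular team's system. Every name in it (`StepResult`, `Outcome`, `run_with_checkpoints`, the `min_confidence` threshold, and the toy steps) is hypothetical, and it assumes the agent can report a usable confidence score, which real systems may not provide reliably.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    output: str
    confidence: float  # hypothetical self-reported score in [0, 1]


@dataclass
class Outcome:
    status: str  # "done", "escalated", or "rejected"
    step: int    # 1-based index of the last step attempted (0 if none)
    output: str  # last output that passed review


Step = Callable[[str], StepResult]
Reviewer = Callable[[str, StepResult], bool]


def run_with_checkpoints(
    steps: List[Step],
    task_input: str,
    reviewer: Reviewer,
    min_confidence: float = 0.8,  # assumed threshold, tune per deployment
) -> Outcome:
    """Run steps in order, stopping before an error can propagate."""
    current = task_input
    index = 0
    for index, step in enumerate(steps, start=1):
        result = step(current)
        # Error detection first: a low-confidence result is the agent
        # saying "I do not know", so hand off instead of continuing.
        if result.confidence < min_confidence:
            return Outcome("escalated", index, current)
        # Human checkpoint at the stage boundary.
        if not reviewer(current, result):
            return Outcome("rejected", index, current)
        current = result.output
    return Outcome("done", index, current)


# Toy stand-ins for agent steps and a human reviewer.
def confident_step(text: str) -> StepResult:
    return StepResult(text.upper(), 0.95)


def unsure_step(text: str) -> StepResult:
    return StepResult(text + "?", 0.40)


def approve_all(previous: str, result: StepResult) -> bool:
    return True


print(run_with_checkpoints([confident_step], "fix bug", approve_all))
print(run_with_checkpoints([confident_step, unsure_step], "fix bug", approve_all))
```

In the second call the workflow stops at step two and returns the last validated output, so the uncertain result never reaches a later step. In practice the reviewer callback would be a queue or approval UI rather than a function that returns immediately.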
The Pattern Across Rollbacks
The projects that got rolled back shared a different pattern. They were sold on demo performance. They were deployed with broad scope. The error rate in production was higher than in testing because production data is messier than demo data. The failure mode was not obvious until something important broke.
The 30% success rate is not a ceiling on what AI agents can do. It is a baseline measurement taken at a specific capability level, on a specific benchmark, with no human oversight. Real-world deployment is a design challenge on top of a capability constraint.
The teams that treated it as a design challenge mostly succeeded. The teams that treated it as a capability problem, and waited for the technology to improve enough to fix it, mostly did not.