Every Builder Right Now Has This Feeling
Every AI builder right now has the same feeling. There are giant armies of robots doing incredible things around you. You are in the middle, confused about what you should be building, whether you should have fifteen agents running constantly, whether you are already behind.
Ara Khan, who builds AI systems at Clinician, named this directly at a conference: it is a mass psychosis. The feeling is not a reflection of reality. It is a reflection of the pace of announcements, which is not the same as the pace of useful production deployment.
The framework she uses to cut through it: four levels of AI agent maturity. Not complexity, maturity. The question is not how sophisticated your system is. It is how reliably your system solves a real problem.
The Four Levels
Level 1: Reactive. The agent responds to direct prompts. It answers questions, summarises documents, drafts emails when asked. Useful, but entirely dependent on human initiation. No memory, no continuity, no improvement over time.
Level 2: Tool-using. The agent can call external tools: it can search, run code, query databases, and send messages. This is where most production agents live today. The agent can do things in the world, not just generate text. It still requires significant human direction for anything complex.
Level 3: Self-directing. The agent can break down a complex goal into steps, pursue them in sequence, handle errors, and adjust its approach based on what it finds. This is where the interesting failures happen, and where most teams discover the limits of their system design.
Level 4: Self-improving. The agent can identify gaps in its own capabilities, author new tools, and update its own behavior based on what it learns. This exists in production at a small number of companies. The $2M ARR startup example above is one of them.
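One way to read the four levels is as cumulative capabilities, where each level assumes the ones below it. The sketch below is only an illustration under that assumption: the capability flags and the classify function are hypothetical names, not part of Ara's framework, and the code says nothing about reliability, which is what the framework actually measures.

```python
from dataclasses import dataclass
from enum import IntEnum


class Maturity(IntEnum):
    REACTIVE = 1        # answers direct prompts only
    TOOL_USING = 2      # can call external tools
    SELF_DIRECTING = 3  # plans steps, handles errors, adjusts
    SELF_IMPROVING = 4  # authors new tools, updates its own behavior


@dataclass(frozen=True)
class AgentProfile:
    # Hypothetical capability flags, named here for illustration only.
    calls_tools: bool = False
    plans_and_recovers: bool = False
    authors_own_tools: bool = False


def classify(profile: AgentProfile) -> Maturity:
    """Return the highest level reached without skipping a lower one."""
    if not profile.calls_tools:
        return Maturity.REACTIVE
    if not profile.plans_and_recovers:
        return Maturity.TOOL_USING
    if not profile.authors_own_tools:
        return Maturity.SELF_DIRECTING
    return Maturity.SELF_IMPROVING
```

Because the levels are treated as cumulative here, an agent that claims to author its own tools but cannot call tools at all still classifies as Level 1.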
The Rule That Took Ara's Team Years to Learn
The hardest thing Ara's team learned: every single thing you add to an agent risks making it worse.
Large system prompts, extensive edge-case handling, and complex conditional logic all tend to degrade the performance of frontier models. The model receives so many instructions that it cannot work out the right thing to do. It is sensory overload, applied to AI.
The evidence is in the model providers' own codebases. The system prompt for GPT-5.3 is one-third the size of the prompt for GPT-5. The newer, more capable model needed less instruction, not more. When the team realized this, they rewrote their entire client to remove the accumulated junk.
The principle: just get out of the way of the model. If you are adding instructions to handle an edge case, ask whether the model would actually handle that edge case better without your intervention. The answer is often yes.
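The "would the model do better without it" check can be made concrete by running the same evaluation cases with and without the candidate instruction. What follows is a minimal sketch, not Ara's team's method: the agent signature, keep_instruction, and toy_agent are all assumptions made for illustration, and toy_agent is a stand-in for a real model call that pretends to get confused by long prompts.

```python
from typing import Callable, Sequence

# An eval case pairs a user input with a check on the agent's output.
Case = tuple[str, Callable[[str], bool]]
# Hypothetical agent signature: (system_prompt, user_input) -> output text.
Agent = Callable[[str, str], str]


def pass_rate(agent: Agent, system_prompt: str, cases: Sequence[Case]) -> float:
    """Fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for user_input, check in cases
                 if check(agent(system_prompt, user_input)))
    return passed / len(cases)


def keep_instruction(agent: Agent, base_prompt: str, instruction: str,
                     cases: Sequence[Case], min_gain: float = 0.0) -> bool:
    """Keep the instruction only if it improves the pass rate by more than min_gain."""
    without = pass_rate(agent, base_prompt, cases)
    with_it = pass_rate(agent, base_prompt + "\n" + instruction, cases)
    return with_it - without > min_gain


def toy_agent(system_prompt: str, user_input: str) -> str:
    # Stand-in for a real model: long prompts "confuse" it, purely for demonstration.
    if len(system_prompt) > 80:
        return "unsure"
    return user_input.upper()
```

The default is deliberately strict: an instruction that leaves the pass rate unchanged is dropped, because it only adds length. With a real model, each configuration would need to run several times, since outputs vary between runs.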
One Question to Ask Before Adding Anything
If every addition to an agent risks making it worse, then the builder's job is not to make agents more capable. It is to make them more focused and more minimal while still solving the problem.
The question to ask before adding anything: does this addition give the model information it cannot get on its own, or does it try to constrain a behavior the model would handle better without the constraint?
Most additions fail this test.
Most systems are over-engineered relative to what frontier models actually need to perform well.
The agents that work best in production are often the simplest ones, not the most sophisticated.
That is both a relief and a discipline.