A job posting quietly circulating inside enterprise hiring teams captures the industry's identity crisis perfectly. It calls for a "prompt engineer" — then lists requirements that include distributed systems architecture, API design, MLOps pipeline ownership, security engineering, and product management. IBM Technology's recent deep-dive on agent engineering opened with exactly this contradiction, and the punchline lands hard: that's not one job with a vague title. That's five separate engineering disciplines being collapsed into a phrase that sounds approachable. The result is that teams ship agents they don't fully understand, then wonder why production looks nothing like the demo.
The distinction IBM draws is sharper than most industry commentary bothers to make. Prompt engineering is a real skill — but it's a narrow one. Writing clear instructions, structuring context, steering model behavior with well-crafted language: these matter. The problem is that an agent isn't a prompt. An agent is a distributed software system that happens to use a language model as its decision engine. Calling yourself an agent engineer because you can write prompts is like calling yourself a chef because you can follow a recipe. Anyone can follow a recipe. The chef understands ingredients, heat physics, kitchen workflow, food safety protocols, and how to improvise when the third-best option is all that's left at 8 PM on a Saturday. The recipe is the starting point. Running the kitchen is a different job entirely.
IBM's framework doesn't try to mystify agent engineering. It does the opposite — it makes the full stack visible so builders can be honest about where they're actually weak. Most people entering the agent space come from one of two directions: they're machine learning practitioners who understand models but have never built resilient distributed systems, or they're software engineers who understand backend reliability but are still treating the LLM as a black box. Both groups are missing critical pieces. The seven skills below aren't a ladder to climb one at a time — they're a map of the whole terrain.
"The prompt engineer got us here. The agent engineer will take us forward."
IBM Technology
The skill that IBM identifies as the root cause of 90% of production failures is retrieval, Skill 3 in the table below. It's also the one that gets the least attention in the tutorials, the blog posts, and the demo videos. It's unglamorous work: figuring out how to split documents, which embedding model captures the right semantic relationships for your domain, whether a second re-ranking pass is worth the latency cost. But the model's performance is bounded by what you hand it. A state-of-the-art frontier model given garbage context will produce garbage output with high confidence and beautiful prose. The failure mode is invisible until a user catches it, or until something worse happens.
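To make the chunking and re-ranking decisions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not IBM's pipeline: `chunk_words` and `rerank` are hypothetical names, the chunk sizes are arbitrary, and the re-ranker scores by simple word overlap as a stand-in for a real embedding or cross-encoder model.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words
    with the previous window so a sentence cut at a boundary still appears
    whole in at least one chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Toy second pass: order candidates by how many query words they contain.
    Assumption: a production system would use a learned re-ranker here."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
    # ['a b c d', 'd e f g', 'g h i j']
```

Even in this toy form, the trade-off is visible: larger overlap protects against context being cut mid-thought, but it stores and retrieves more redundant text.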
| Skill | What It Is | Where People Fail |
|---|---|---|
| System Design | Architecting LLMs, tools, state stores, and sub-agents into coherent data flow | Treating agents as scripts instead of distributed systems; no failure mode planning |
| Tool & Contract Design | Writing precise, typed schemas that tell the model exactly how to call each tool | Vague schemas that let the model improvise inputs — especially dangerous for write operations |
| Retrieval Engineering | Designing RAG pipelines — chunking, embeddings, re-ranking — that surface relevant context | Naive chunking, wrong embedding model, no re-ranking; irrelevant docs poison every response |
| Reliability Engineering | Building retry logic, timeouts, fallback paths, and circuit breakers for external dependencies | No retry strategy; agents hang or hammer failing APIs; no graceful degradation |
| Security & Safety | Protecting against prompt injection, enforcing least-privilege, validating inputs and outputs | Overprivileged tools; no injection defense; trusting user-supplied text passed to tool calls |
| Evaluation & Observability | Tracing decisions, measuring success rates, running regression tests against known-good cases | Shipping on vibes; no way to diagnose regressions or attribute failures after the fact |
| Product Thinking | Designing UX for probabilistic systems — clarification flows, human escalation, trust mechanics | Optimizing the code while ignoring what the non-determinism means for the person using it |
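
The Reliability Engineering row can be illustrated with a short Python sketch of bounded retries, exponential backoff, and a fallback path. Everything here is an assumption for illustration: `call_with_retry` is a hypothetical helper, the attempt count and delays are arbitrary, and real code would catch only the exception types its dependency actually raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `fn` up to `attempts` times, doubling the wait after each failure.
    If every attempt fails, degrade gracefully by returning `fallback()`."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the dependency's errors
            if attempt < attempts - 1:
                # Back off so a struggling API is not hammered: 0.5s, 1s, 2s, ...
                sleep(base_delay * 2 ** attempt)
    return fallback()
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without waiting. The bounded attempt count is what keeps the agent from hanging on a dependency that never recovers.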
IBM closes with two concrete actions that have an unusually high return on time invested. The first is deceptively simple: read your tool schemas out loud. Pretend you're a new engineer who has never seen this codebase. Would you know exactly what each tool does, what it requires, what format those inputs need to be in? If the answer is no — and for most agents in production, the answer is no — add strict types, required field flags, and concrete examples. IBM calls this the highest-leverage fix available to most agent builders right now, because tool contract quality directly determines whether the LLM uses tools correctly or improvises. And in a system where improvisation means a misformatted API call, a stale record getting overwritten, or a transaction processed with the wrong user ID, improvisation is not a feature.
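As a rough Python sketch of what "strict types, required field flags, and concrete examples" might look like, consider the schema below. The tool name `update_order_status`, its fields, and the `validate_args` helper are all hypothetical; the schema follows the general JSON Schema shape that many tool-calling APIs accept, and the validator is a deliberately small stand-in for a full schema library.

```python
UPDATE_ORDER_TOOL = {
    "name": "update_order_status",  # hypothetical tool, for illustration only
    "description": "Set the status of one existing order. This is a write operation.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, for example 'ORD-10442'.",
            },
            "status": {
                "type": "string",
                "enum": ["pending", "shipped", "cancelled"],
                "description": "New status, for example 'shipped'.",
            },
        },
        "required": ["order_id", "status"],
        "additionalProperties": False,
    },
}

# Only the JSON types this sketch needs.
_PY_TYPES = {"string": str, "integer": int}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of problems with a model-proposed tool call.
    An empty list means the call matches the contract."""
    params = tool["parameters"]
    props = params["properties"]
    errors = []
    for name in params.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors
```

Checking a proposed call before executing it means an improvised input is rejected with a readable error instead of reaching a write operation.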
The second action targets the debugging instinct that trips up nearly every builder moving from prompt engineering into agent work. When something breaks, the reflex is to rewrite the prompt. Add more instructions. Clarify the language. Try different phrasing. IBM's prescription is different: trace backward. Was the right document retrieved? Was the correct tool selected? Was the schema clear enough to guide the selection? Nine times out of ten, the root cause isn't the words in the system prompt — it's the system underneath it. Retrieval sent the wrong context. A vague schema left the model guessing. A missing retry left the agent spinning on a timeout. Fixing the prompt papers over the structural problem and creates a different one. The agent engineer's job is to find the actual failure point — and fix that instead.
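One way to make "trace backward" routine is to record a pass/fail check per stage and report the earliest stage that failed. The Python sketch below assumes a hypothetical four-stage pipeline and a hypothetical `StageCheck` record; the stage names and the trace contents are illustrative, not taken from IBM's material.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed pipeline order: upstream stages are inspected first.
STAGES = ["retrieval", "tool_selection", "tool_arguments", "tool_execution"]


@dataclass
class StageCheck:
    stage: str        # one of STAGES
    ok: bool          # did this stage behave as expected?
    detail: str = ""  # free-text note for whoever is debugging


def first_failure(checks: list[StageCheck]) -> Optional[StageCheck]:
    """Return the earliest failing stage in pipeline order, or None if every
    recorded stage passed. Only then is the prompt itself the next suspect."""
    by_stage = {check.stage: check for check in checks}
    for stage in STAGES:
        check = by_stage.get(stage)
        if check is not None and not check.ok:
            return check
    return None


if __name__ == "__main__":
    trace = [
        StageCheck("tool_execution", ok=False, detail="API returned 400"),
        StageCheck("tool_arguments", ok=False, detail="date sent as free text"),
        StageCheck("tool_selection", ok=True),
        StageCheck("retrieval", ok=True),
    ]
    print(first_failure(trace))
```

In this made-up trace the API error is only a symptom; the earliest failure is the badly formatted argument, which points at the schema and not at the prompt.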