Prompt Chaining for Complex Tasks
Most prompts fail because you're asking one model to do too much at once. Prompt chaining breaks complex tasks into focused, sequential steps — each one feeding the next — to reduce hallucinations, improve auditability, and produce reliably better output.
01 Why Chaining Works
A language model has only so much attention to spend on a single prompt. When you ask it to simultaneously extract facts, verify them, synthesise a narrative, and format the output as structured JSON, you're asking it to switch between four different cognitive modes at once. The result is usually a compromise that does none of them well.
Chaining solves this by giving the model one job at a time. Each prompt is focused, short, and specific. Errors are contained within a single step and can be caught before they propagate. Intermediate outputs become inspectable artifacts you can log, validate, and reuse.
What chaining gives you
- Auditability — log and inspect every intermediate output, not just the final answer
- Error isolation — a failure in step 2 doesn't corrupt step 4's output; you know exactly where it broke
- Per-step tuning — use temperature=0 for extraction and a higher value for writing; use a cheap model for classification and an expensive one for synthesis
- Reusability — a fact-extraction step built for one pipeline can be dropped into another
- Controllability — inject human review or validation between any two steps
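Per-step tuning can be sketched as a small lookup table that each step consults before calling the model. The step names, the `StepConfig` class, and the choice of `gpt-4o` as the "expensive" model are assumptions made for illustration; only `gpt-4o-mini` appears in this article's own code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepConfig:
    model: str
    temperature: float

# Hypothetical per-step settings; swap in whatever models you actually use.
STEP_CONFIGS = {
    "classify": StepConfig(model="gpt-4o-mini", temperature=0.0),
    "extract": StepConfig(model="gpt-4o-mini", temperature=0.0),
    "synthesise": StepConfig(model="gpt-4o", temperature=0.6),
}

def request_kwargs(step: str) -> dict:
    """Return the model and temperature arguments for one named step."""
    cfg = STEP_CONFIGS[step]
    return {"model": cfg.model, "temperature": cfg.temperature}
```

Keeping these settings in one table means a model swap for a single step is a one-line change rather than an edit to the step's code.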
02 When to Chain vs Single Prompt
Chaining adds overhead — multiple API round-trips, more total tokens, more code. Don't reach for it automatically.
| Signal | Single Prompt | Chain |
|---|---|---|
| Task has 1–2 clear steps | ✓ Prefer | Overkill |
| Output format is simple (prose, short answer) | ✓ Prefer | Overkill |
| Latency budget is very tight (<500ms) | ✓ Prefer | Multiple round-trips |
| Task requires 3+ distinct cognitive modes | Struggles | ✓ Prefer |
| You need to audit intermediate results | Not possible | ✓ Prefer |
| Different steps suit different model tiers | All-or-nothing | ✓ Prefer |
| Output fails quality checks inconsistently | Hard to debug | ✓ Prefer |
| Human review needed mid-workflow | Not possible | ✓ Prefer |
03 Sequential Chains
The foundational pattern: each step produces an output that becomes the input for the next. Steps run in series — step N cannot start until step N-1 is complete.
from openai import OpenAI
import json

client = OpenAI()

def llm(system: str, user: str, temperature: float = 0.0) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

def run_chain(source_text: str) -> str:
    # Step 1 — Extract
    raw_claims = llm(
        system=(
            "You are a research assistant. Extract every factual claim from the text. "
            "Return ONLY a JSON array of strings. No commentary."
        ),
        user=source_text,
    )
    claims = json.loads(raw_claims)

    # Step 2 — Score
    scored_raw = llm(
        system=(
            "You are a fact-checker. For each claim, rate confidence 0–10. "
            "Return JSON: [{\"claim\": \"...\", \"confidence\": 8}]"
        ),
        user=json.dumps(claims),
    )
    scored = json.loads(scored_raw)

    # Step 3 — Filter (pure Python — no LLM needed)
    strong = [c for c in scored if c["confidence"] >= 7]
    if not strong:
        return "Insufficient high-confidence claims found."

    # Step 4 — Synthesise
    brief = llm(
        system=(
            "You are a senior analyst. Using ONLY the provided claims, "
            "write a concise 3-paragraph executive brief. Cite claims as [n]."
        ),
        user=json.dumps(strong),
        temperature=0.6,
    )
    return brief
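The synthesis prompt asks for citations as [n], but the filtered claims it receives carry no numbers, so the model has to invent its own numbering. One possible fix, sketched here with a hypothetical `number_claims` helper and made-up sample claims, is to attach a 1-based index before serialising:

```python
import json

def number_claims(claims: list[dict]) -> str:
    """Serialise claims with a 1-based "n" field so "[n]" citations have a referent."""
    numbered = [{"n": i, **claim} for i, claim in enumerate(claims, start=1)]
    return json.dumps(numbered, ensure_ascii=False)

# Illustrative data only; in the chain this would be the `strong` list.
strong = [
    {"claim": "Revenue grew 12% year over year.", "confidence": 9},
    {"claim": "Headcount was flat.", "confidence": 7},
]
payload = number_claims(strong)
```

Passing `payload` as the user message in the synthesis step lets a later check map each [n] in the brief back to a specific claim.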
Use temperature=0 for extraction, verification, classification, and JSON formatting, where you want output that is as repeatable as possible. Use temperature=0.5–0.8 only for creative or synthesis steps where variety improves quality.
04 The JSON Contract Pattern
The single biggest source of chain failures is unpredictable output format. If step 2 produces prose when step 3 expects a JSON array, the chain breaks. The JSON Contract pattern solves this by defining a strict schema for every inter-step handoff before writing any prompts.
Define the schema first
from pydantic import BaseModel

# Step 1 output contract
class ExtractedClaim(BaseModel):
    text: str
    source_quote: str

class ExtractionResult(BaseModel):
    claims: list[ExtractedClaim]

# Step 2 output contract
class ScoredClaim(BaseModel):
    text: str
    source_quote: str
    confidence: int  # 0–10
    needs_source: bool
    note: str = ""

class ScoringResult(BaseModel):
    claims: list[ScoredClaim]

# Step 3 is pure Python — no LLM call
def filter_claims(result: ScoringResult, min_score: int = 7) -> list[ScoredClaim]:
    return [c for c in result.claims if c.confidence >= min_score]
Enforce the schema in the prompt
You are a research assistant. Extract every factual claim from the text below.
Return ONLY valid JSON matching this exact schema. No text before or after.
Schema:
{
  "claims": [
    {
      "text": "The factual claim in one sentence",
      "source_quote": "Exact phrase from source this claim comes from"
    }
  ]
}
If no factual claims are present, return: {"claims": []}
TEXT TO PROCESS:
{source_text}
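A prompt like this mixes literal JSON braces with a `{source_text}` placeholder, so Python's `str.format` would try to interpret the schema's braces as replacement fields. A minimal sketch of a renderer that substitutes only named placeholders follows; the `render` helper, the shortened template, and the sample sentence are all illustrative assumptions.

```python
import re

EXTRACT_PROMPT = """Return ONLY valid JSON matching this exact schema.
Schema:
{
  "claims": [
    {"text": "...", "source_quote": "..."}
  ]
}
TEXT TO PROCESS:
{source_text}"""

def render(template: str, **values: str) -> str:
    """Replace only the named {placeholders}; every other brace stays literal."""
    if not values:
        return template
    pattern = re.compile("|".join(re.escape("{" + name + "}") for name in values))
    # Single pass over the template, so braces inside a value are never re-expanded.
    return pattern.sub(lambda m: values[m.group(0)[1:-1]], template)

prompt = render(EXTRACT_PROMPT, source_text="The bridge opened in 1932.")
```

The alternative is to double every literal brace, as the router prompt later in this article does, which works but makes JSON examples harder to read.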
Parse and validate at every step boundary
import json
from pydantic import BaseModel, ValidationError
from typing import TypeVar, Type

T = TypeVar("T", bound=BaseModel)

def parse_step_output(raw: str, schema: Type[T], step_name: str) -> T:
    # Strip markdown code fences if model wrapped output
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line and the closing fence line
        cleaned = "\n".join(cleaned.split("\n")[1:-1])
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        # Chain the original exception so the traceback keeps the parse position
        raise ValueError(f"[{step_name}] JSON parse error: {e}\nRaw: {raw[:200]}") from e
    try:
        return schema.model_validate(data)
    except ValidationError as e:
        raise ValueError(f"[{step_name}] Schema validation failed: {e}") from e
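Fence stripping covers the most common wrapper, but models also add a sentence before or after the JSON. As a sketch of a more tolerant first pass (the `extract_json` helper is hypothetical and uses only the standard library), you can scan for the first position where a JSON object or array decodes cleanly:

```python
import json

def extract_json(raw: str):
    """Return the first JSON object or array embedded in raw, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(raw):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at `start` and ignores what follows
                value, _end = decoder.raw_decode(raw, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next brace or bracket
    raise ValueError(f"No JSON found in: {raw[:200]!r}")
```

The result would still go through schema validation afterwards; this only makes the parse step less brittle.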
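With the OpenAI API, response_format={"type": "json_schema", ...} constrains the output to the supplied schema at the API level on supported models, so the fence-stripping and retry-on-malformed-JSON logic is rarely needed. You still deserialise the returned string and should still validate business rules. Anthropic's tool_use with a JSON schema offers a similar structured-output path.
05 Parallel Chains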
When a task has multiple independent sub-tasks that don't depend on each other's outputs, run them concurrently. Total wall-clock time drops from the sum of all steps to roughly the duration of the slowest one, provided your rate limits allow the requests to run side by side.
import asyncio, json
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def llm_async(system: str, user: str, temperature: float = 0.0) -> str:
    resp = await aclient.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

async def parallel_extract(document: str) -> dict:
    # All three branches launch simultaneously
    facts_raw, sentiment_raw, actions_raw = await asyncio.gather(
        llm_async(
            'Extract every factual claim. Return JSON: {"claims": ["..."]}',
            document,
        ),
        llm_async(
            'Analyse tone. Return JSON: {"sentiment": "positive|negative|neutral", '
            '"confidence": 0.9, "signals": ["..."]}',
            document,
        ),
        llm_async(
            'List action items. Return JSON: {"actions": [{"item": "...", "owner": "..."}]}',
            document,
        ),
    )
    return {
        "facts": json.loads(facts_raw),
        "sentiment": json.loads(sentiment_raw),
        "actions": json.loads(actions_raw),
    }

async def parallel_chain(document: str) -> str:
    extracted = await parallel_extract(document)  # Parallel
    synthesis = await llm_async(  # Sequential — needs all results
        system="You are a senior analyst. Synthesise the data into a concise briefing.",
        # Pass real JSON rather than a Python repr (str() would emit single quotes)
        user=json.dumps(extracted),
        temperature=0.6,
    )
    return synthesis

# document_text is the input string you want to analyse
result = asyncio.run(parallel_chain(document_text))
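By default, `asyncio.gather()` raises as soon as one branch fails, which discards the branches that succeeded. The sketch below uses `return_exceptions=True` so each branch reports independently, and it times the run to show that wall-clock time tracks the slowest branch. The `fake_step` coroutine is a stand-in for an LLM call, with arbitrary sleep durations.

```python
import asyncio
import time

async def fake_step(name: str, seconds: float, fail: bool = False) -> str:
    """Stand-in for an LLM call: sleeps, then returns or raises."""
    await asyncio.sleep(seconds)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} done"

async def run_branches() -> tuple[dict, float]:
    names = ["facts", "sentiment", "actions"]
    t0 = time.perf_counter()
    results = await asyncio.gather(
        fake_step("facts", 0.2),
        fake_step("sentiment", 0.3, fail=True),
        fake_step("actions", 0.2),
        return_exceptions=True,  # a failed branch comes back as an exception object
    )
    elapsed = time.perf_counter() - t0
    return dict(zip(names, results)), elapsed

outcome, elapsed = asyncio.run(run_branches())
ok = {k: v for k, v in outcome.items() if not isinstance(v, Exception)}
failed = {k: str(v) for k, v in outcome.items() if isinstance(v, Exception)}
```

Here the three sleeps add up to 0.7 seconds, but the run should finish in a little over 0.3 seconds, and the two successful branches remain usable even though the sentiment branch failed.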
asyncio.gather() is one of the highest-leverage optimisations in a multi-step pipeline whenever some of the steps are independent of each other.
06 Conditional Branching
Not every input needs the same processing path. A conditional chain inspects the output of an early step and routes to different downstream steps based on what it finds.
import json

ROUTER_PROMPT = """Classify the user input into one of these categories.
Return ONLY JSON: {{"type": "question" | "task" | "feedback" | "unknown"}}
User input: {message}"""

async def conditional_chain(user_message: str) -> str:
    # Step 1 — Classify
    raw = await llm_async(
        system="You are a router. Classify inputs accurately.",
        user=ROUTER_PROMPT.format(message=user_message),
    )
    input_type = json.loads(raw).get("type", "unknown")

    # Step 2 — Branch
    if input_type == "question":
        return await handle_question(user_message)
    elif input_type == "task":
        return await handle_task(user_message)
    elif input_type == "feedback":
        return await handle_feedback(user_message)
    else:
        return await llm_async(
            system="You are a helpful assistant.",
            user=user_message,
            temperature=0.7,
        )

async def handle_task(message: str) -> str:
    # Decompose first, then execute
    subtasks_raw = await llm_async(
        system='Break this task into numbered subtasks. Return JSON: {"subtasks": ["step 1", "step 2"]}',
        user=message,
    )
    subtasks = json.loads(subtasks_raw)["subtasks"]
    return "Action plan:\n" + "\n".join(f"{i+1}. {t}" for i, t in enumerate(subtasks))
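The router's output is itself a step boundary: if the model returns prose, a JSON array, or a label outside the allowed set, `json.loads(raw).get(...)` either raises or passes an unexpected value to the branch logic. A defensive parse, sketched here with a hypothetical `parse_route` helper, maps every malformed case to "unknown" so the fallback branch handles it:

```python
import json

ALLOWED_TYPES = {"question", "task", "feedback", "unknown"}

def parse_route(raw: str) -> str:
    """Map router output to an allowed type, falling back to "unknown"."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "unknown"  # model replied with prose or broken JSON
    if not isinstance(data, dict):
        return "unknown"  # valid JSON, but not the expected object
    label = data.get("type")
    if isinstance(label, str) and label in ALLOWED_TYPES:
        return label
    return "unknown"  # missing key or a label outside the contract
```

In `conditional_chain`, this could replace the direct `json.loads(raw).get("type", "unknown")` call.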
Routing table pattern
For complex multi-level routing, keep the decision tree in Python — not buried in a single prompt. Use a lookup table that maps (type, sub-type) to handlers:
ROUTING_TABLE = {
    ("question", "factual"): handle_factual_question,
    ("question", "opinion"): handle_opinion_question,
    ("task", "simple"): handle_simple_task,
    ("task", "complex"): handle_complex_task,
}

async def route(message: str) -> str:
    top_type = await classify_type(message)        # "question" | "task"
    sub_type = await classify_complexity(message)  # "factual" | "simple" | etc.
    handler = ROUTING_TABLE.get((top_type, sub_type))
    return await handler(message) if handler else await fallback(message)
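To try the routing table without API calls, the classifiers and handlers can be replaced with stubs. The version below is a sketch under that assumption: every function body is a placeholder heuristic, `classify_subtype` is a hypothetical name, and the two classifiers run concurrently because each reads only the message.

```python
import asyncio

# Stubs standing in for LLM-backed classifiers and handlers (all hypothetical).
async def classify_type(message: str) -> str:
    return "question" if message.rstrip().endswith("?") else "task"

async def classify_subtype(message: str) -> str:
    if message.rstrip().endswith("?"):
        return "opinion" if "should" in message.lower() else "factual"
    return "complex" if " and " in message else "simple"

async def handle_factual_question(message: str) -> str:
    return f"factual:{message}"

async def handle_simple_task(message: str) -> str:
    return f"simple:{message}"

async def fallback(message: str) -> str:
    return f"fallback:{message}"

ROUTING_TABLE = {
    ("question", "factual"): handle_factual_question,
    ("task", "simple"): handle_simple_task,
}

async def route(message: str) -> str:
    # Neither classifier needs the other's output, so run them side by side.
    top_type, sub_type = await asyncio.gather(
        classify_type(message), classify_subtype(message)
    )
    handler = ROUTING_TABLE.get((top_type, sub_type), fallback)
    return await handler(message)
```

Any (type, sub-type) pair missing from the table goes to `fallback`, so adding a new category never breaks existing routes.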
07 Error Handling Between Steps
Chains break at step boundaries. Without explicit error handling, a bad output in step 2 silently corrupts everything downstream.
The three failure types
Format failures
Model returned prose instead of JSON, or JSON with the wrong schema. Catch at parse time. Retry the step with a corrective prompt appended to the user turn.
Quality failures
Valid JSON but wrong content (empty results, hallucinated data, low confidence). Catch at validation time. Check business rules — not just schema conformance.
API / network failures
Rate limits, timeouts, provider outages. Catch at network time. Retry with exponential backoff; fall back to an alternate model if retries are exhausted.
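Quality failures are the ones schema validation cannot see. As a sketch of a business-rule check for the extraction and scoring steps (the specific rules, the `quality_issues` helper, and the sample text are assumptions for illustration), you might confirm that every `source_quote` really appears in the source and that scores fall in range:

```python
def quality_issues(claims: list[dict], source_text: str) -> list[str]:
    """Return human-readable problems; an empty list means the output passed."""
    issues = []
    if not claims:
        issues.append("no claims extracted")
    for i, claim in enumerate(claims):
        quote = claim.get("source_quote", "")
        # A quote that is not a verbatim substring suggests a hallucinated claim
        if not quote or quote not in source_text:
            issues.append(f"claim {i}: source_quote not found in source")
        conf = claim.get("confidence")
        if isinstance(conf, (int, float)) and not 0 <= conf <= 10:
            issues.append(f"claim {i}: confidence {conf} outside 0-10")
    return issues

source = "The bridge opened in 1932 and cost $6 million."
good = [{"text": "The bridge opened in 1932.", "source_quote": "opened in 1932", "confidence": 9}]
bad = [{"text": "The bridge opened in 1923.", "source_quote": "opened in 1923", "confidence": 14}]
```

A non-empty result can feed the same corrective-retry path as a format failure, with the issue list appended to the prompt.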
import asyncio
from pydantic import BaseModel
from typing import Type, TypeVar

T = TypeVar("T", bound=BaseModel)

async def llm_with_retry(
    system: str,
    user: str,
    schema: Type[T],
    step_name: str,
    max_retries: int = 3,
) -> T:
    last_error = None
    current_user = user
    for attempt in range(max_retries):
        try:
            raw = await llm_async(system=system, user=current_user)
            return parse_step_output(raw, schema, step_name)
        except ValueError as e:
            last_error = str(e)
            # Append corrective context for the next attempt
            current_user = (
                f"{user}\n\n"
                f"PREVIOUS ATTEMPT FAILED:\n{last_error}\n\n"
                "Return ONLY valid JSON matching the required schema. No other text."
            )
            if attempt < max_retries - 1:
                await asyncio.sleep(0.5 * (attempt + 1))
        except Exception as e:
            last_error = str(e)
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential back-off for API errors
    raise RuntimeError(
        f"[{step_name}] Failed after {max_retries} attempts. Last: {last_error}"
    )
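The retry helper above covers back-off but not the fall-back to an alternate model mentioned for API failures. One way that could look is sketched below. `TransientError` is a stand-in for whatever rate-limit or timeout exceptions your SDK raises, and the model names "primary" and "backup" are placeholders.

```python
import asyncio
import random

class TransientError(Exception):
    """Stand-in for a rate-limit or timeout error from a provider SDK."""

async def call_with_fallback(call, models, max_retries=3, base_delay=0.5):
    """Try each model in order, retrying transient errors with jittered back-off."""
    last_error = None
    for model in models:
        for attempt in range(max_retries):
            try:
                return await call(model)
            except TransientError as e:
                last_error = e
                if attempt < max_retries - 1:
                    delay = base_delay * (2 ** attempt)
                    # Jitter spreads out retries from concurrent callers
                    await asyncio.sleep(delay + random.uniform(0, delay))
    raise RuntimeError(f"All models failed. Last error: {last_error}")

# Demo with a stub that always fails on the first model
calls = []

async def flaky(model: str) -> str:
    calls.append(model)
    if model == "primary":
        raise TransientError("rate limited")
    return f"ok from {model}"

result = asyncio.run(
    call_with_fallback(flaky, ["primary", "backup"], max_retries=2, base_delay=0.01)
)
```

In the demo, the primary model is tried twice before the call moves on to the backup, which succeeds on its first attempt.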
Graceful degradation
async def run_chain_safely(source_text: str) -> dict:
    result = {"status": "success", "output": None, "warnings": []}
    try:
        claims = await extract_claims(source_text)
    except RuntimeError as e:
        result["status"] = "partial"  # A degraded run should not report "success"
        result["warnings"].append(f"Extraction failed: {e}")
        claims = []  # Continue with empty list rather than crashing
    try:
        scored = await score_claims(claims) if claims else []
    except RuntimeError as e:
        result["status"] = "partial"
        result["warnings"].append(f"Scoring failed: {e}")
        scored = claims  # Fall back to unscored claims
    try:
        result["output"] = await synthesise(scored)
    except RuntimeError as e:
        result["status"] = "partial"
        result["warnings"].append(f"Synthesis failed: {e}")
        result["output"] = "Synthesis unavailable."
        result["raw_claims"] = [c["text"] for c in (scored or [])]
    return result
08 State Management
As chains grow longer, a chain state object carries everything — original input, each step's output, timings, errors — through the pipeline as a single structured artifact. This makes debugging, logging, and resuming failed runs straightforward.
from dataclasses import dataclass, field
import json, time, uuid

@dataclass
class ChainState:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    source_text: str = ""
    started_at: float = field(default_factory=time.time)
    # Step outputs — filled as chain progresses
    claims: list[dict] = field(default_factory=list)
    scored: list[dict] = field(default_factory=list)
    filtered: list[dict] = field(default_factory=list)
    output: str = ""
    # Execution metadata
    step_timings: dict[str, float] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)
    completed: bool = False

    def record_step(self, step_name: str, duration_ms: float):
        self.step_timings[step_name] = duration_ms

    def to_log_record(self) -> dict:
        return {
            "run_id": self.run_id,
            "duration_ms": int((time.time() - self.started_at) * 1000),
            "steps": self.step_timings,
            "n_claims": len(self.claims),
            "errors": self.errors,
            "completed": self.completed,
        }

# Each step mutates state in place
# (llm_async is defined in the parallel-chains section; EXTRACT_SYSTEM is the extraction prompt)
async def step_extract(state: ChainState) -> ChainState:
    t0 = time.time()
    raw = await llm_async(EXTRACT_SYSTEM, state.source_text)
    state.claims = json.loads(raw)["claims"]
    state.record_step("extract", (time.time() - t0) * 1000)
    return state

async def run_pipeline(source_text: str) -> ChainState:
    state = ChainState(source_text=source_text)
    state = await step_extract(state)
    state = await step_score(state)  # step_score follows the same shape as step_extract
    state.filtered = [c for c in state.scored if c["confidence"] >= 7]
    state.completed = True
    return state
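Resuming a failed run depends on two things the class above does not show: persisting the state, and knowing which steps already finished. A minimal sketch follows, using a cut-down `MiniState` with a hypothetical `done_steps` field and synchronous stub steps in place of LLM calls.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class MiniState:
    source_text: str = ""
    claims: list = field(default_factory=list)
    scored: list = field(default_factory=list)
    done_steps: list = field(default_factory=list)  # hypothetical: names of finished steps

def dump_state(state: MiniState) -> str:
    """Serialise the state so it can be written to disk or a database."""
    return json.dumps(asdict(state))

def load_state(blob: str) -> MiniState:
    return MiniState(**json.loads(blob))

def run_remaining(state: MiniState, steps) -> MiniState:
    """Run only the steps that are not already recorded as done."""
    for name, fn in steps:
        if name in state.done_steps:
            continue
        fn(state)
        state.done_steps.append(name)
    return state

# Stub steps that record when they run
ran = []

def extract(state: MiniState) -> None:
    ran.append("extract")
    state.claims = [{"text": "a"}]

def score(state: MiniState) -> None:
    ran.append("score")
    state.scored = [dict(c, confidence=8) for c in state.claims]

STEPS = [("extract", extract), ("score", score)]
blob = dump_state(run_remaining(MiniState(source_text="..."), STEPS[:1]))  # stops after extract
resumed = run_remaining(load_state(blob), STEPS)  # picks up at score
```

Because the completed-step list travels with the state, the resumed run skips extraction and spends tokens only on the steps that remain.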
09 Real Pipeline: Research Material → Article
A complete, production-style pipeline that turns raw research (transcripts, notes, web content) into a structured article draft — the kind of pipeline that powers the content behind Aether Intel.
Step 1 — Extract claims
You are a research analyst. Read the source material and extract every distinct
factual claim, statistic, opinion, or insight.
Return ONLY JSON:
{
  "claims": [
    {
      "text": "The claim in one clear sentence",
      "type": "fact" | "statistic" | "opinion" | "insight",
      "source_quote": "Verbatim phrase from source",
      "importance": 1-5
    }
  ]
}
Source material:
{source_text}
Step 2 — Score and deduplicate (parallel)
You are an editor. Review this list of claims and remove duplicates.
Two claims are duplicates if they express the same idea, even with different words.
Keep the version with the clearest wording.
Return JSON: {"claims": [...same schema, duplicates removed...]}
Claims:
{claims_json}
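Before spending an LLM call on deduplication, a cheap Python pre-pass can remove claims that differ only in case or punctuation, leaving the model to judge the paraphrases. This is a sketch with a deliberately crude, ASCII-only normaliser; the helper names and sample claims are illustrative.

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse anything that is not a-z or 0-9 into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def drop_exact_duplicates(claims: list[dict]) -> list[dict]:
    """Keep the first occurrence of each claim whose normalised text is identical."""
    seen = set()
    unique = []
    for claim in claims:
        key = normalise(claim["text"])
        if key not in seen:
            seen.add(key)
            unique.append(claim)
    return unique

claims = [
    {"text": "Latency fell by 30%."},
    {"text": "latency fell by 30%"},
    {"text": "Costs rose."},
]
deduped = drop_exact_duplicates(claims)
```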
Step 3 — Outline
You are a content strategist. Using the claims below, create an article outline
with 4–6 sections. Each section should have a clear angle supported by 2–4 claims.
Return JSON:
{
  "title": "Article headline (compelling, specific, under 70 chars)",
  "sections": [
    {
      "heading": "Section heading",
      "angle": "What this section argues or explains",
      "claims": [0, 3, 7]
    }
  ]
}
Claims:
{filtered_claims_json}
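The outline refers to claims by index, so a hallucinated index would silently point the draft at nothing. The constraints stated in the prompt (4 to 6 sections, 2 to 4 claims each, a title under 70 characters) can be checked in plain Python. The `outline_problems` helper below is a sketch that assumes the indices are zero-based positions in the claims list.

```python
def outline_problems(outline: dict, n_claims: int) -> list[str]:
    """Check an outline against the prompt's stated constraints."""
    problems = []
    if len(outline.get("title", "")) >= 70:
        problems.append("title is 70 characters or longer")
    sections = outline.get("sections", [])
    if not 4 <= len(sections) <= 6:
        problems.append(f"expected 4-6 sections, got {len(sections)}")
    for section in sections:
        heading = section.get("heading")
        ids = section.get("claims", [])
        if not 2 <= len(ids) <= 4:
            problems.append(f"{heading!r}: expected 2-4 claims, got {len(ids)}")
        for i in ids:
            # Assumes zero-based positions into the claims list
            if not (isinstance(i, int) and 0 <= i < n_claims):
                problems.append(f"{heading!r}: unknown claim index {i!r}")
    return problems

good = {
    "title": "Short title",
    "sections": [{"heading": f"S{i}", "angle": "a", "claims": [0, 1]} for i in range(4)],
}
bad = {
    "title": "Short title",
    "sections": [{"heading": "S0", "angle": "a", "claims": [0, 99]}],
}
```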
Step 4 — Draft
You are a technology journalist writing for an informed audience.
Write the full article following the outline. Use ONLY the provided claims as your
factual foundation — do not add information not in the claims.
Style: clear, direct, no filler. Active voice.
Avoid: "In today's rapidly evolving landscape..." and similar empty openers.
Add inline citations as (claim_id) when using a specific claim.
Outline: {outline_json}
Claims: {claims_json}
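The draft prompt asks for (claim_id) citations, which makes the "use only the provided claims" rule partly checkable. The sketch below assumes claim ids are the integer list positions used in the outline; note that any parenthesised number, such as a year, would also match this pattern.

```python
import re

def unknown_citations(draft: str, n_claims: int) -> list[int]:
    """Return cited ids that do not correspond to any provided claim."""
    cited = {int(m) for m in re.findall(r"\((\d+)\)", draft)}
    return sorted(i for i in cited if i >= n_claims)

sample_draft = "Latency fell after the change (0). Costs then rose (7)."
missing = unknown_citations(sample_draft, n_claims=3)
```

A non-empty result is a signal to retry the draft step or flag the article for human review.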
Step 5 — Polish
You are a copy editor. Improve the article below for:
1. Clarity — remove redundant words, tighten sentences
2. Flow — ensure paragraphs connect naturally
3. Consistency — unified voice throughout
4. Headlines — make title and headings compelling
Do NOT change factual content, add new information, or alter citations.
Return the improved article as plain text.
Article:
{draft}
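The instruction not to alter citations can be verified after the polish step by comparing citation counts before and after. A minimal sketch, assuming the same (claim_id) citation format as the draft step and hypothetical helper names:

```python
import re
from collections import Counter

def citation_counts(text: str) -> Counter:
    """Count how many times each (claim_id) citation appears."""
    return Counter(re.findall(r"\((\d+)\)", text))

def citations_preserved(draft: str, polished: str) -> bool:
    """True if polishing neither added, dropped, nor renumbered any citation."""
    return citation_counts(draft) == citation_counts(polished)
```

The check ignores where citations sit in the text, so sentences can be reordered freely while a dropped or invented citation is still caught.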
10 Common Failure Modes