Prompt Chaining for Complex Tasks
Prompts Intermediate ~30 min to learn

Many prompts fail because they ask a single model call to do too much at once. Prompt chaining breaks a complex task into focused, sequential steps, each one feeding the next, which can reduce hallucinations, improve auditability, and produce more consistent output.

Aether Intel Team May 2025 16 min read

01 Why Chaining Works

A language model has a limited budget of attention for any single response. When you ask it to extract facts, verify them, synthesise a narrative, and format the output as structured JSON all in one prompt, four different kinds of work compete for that budget. The result is usually a compromise that does none of them well.

Chaining solves this by giving the model one job at a time. Each prompt is focused, short, and specific. Errors are contained within a single step and can be caught before they propagate. Intermediate outputs become inspectable artifacts you can log, validate, and reuse.

🧠
The cognitive load principle: A model performing extraction is in a different "mode" than one doing analysis or synthesis. Keeping steps atomic lets you tune each step independently — different temperature, different role, different model tier — without those decisions interfering with each other.

What chaining gives you

  • Auditability — log and inspect every intermediate output, not just the final answer
  • Error isolation — with validation at each boundary, a failure in step 2 is caught before it reaches step 4, and you know exactly where it broke
  • Per-step tuning — use temperature=0 for extraction and a higher value for writing; use a cheap model for classification and an expensive one for synthesis
  • Reusability — a fact-extraction step built for one pipeline can be dropped into another
  • Controllability — inject human review or validation between any two steps
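
Per-step tuning is easier to manage when the settings live in one place as plain data. A minimal sketch, assuming hypothetical step names and using `gpt-4o-mini` and `gpt-4o` as stand-ins for a cheap and an expensive tier (substitute whatever your provider offers):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepConfig:
    """Settings for one chain step. All names and values here are illustrative."""
    model: str
    temperature: float
    role: str  # system-prompt persona for the step


# Hypothetical configuration: cheap, low-temperature calls for mechanical work,
# a larger model and a higher temperature only where the writing happens.
STEP_CONFIGS = {
    "classify":   StepConfig("gpt-4o-mini", 0.0, "You are a router."),
    "extract":    StepConfig("gpt-4o-mini", 0.0, "You are a research assistant."),
    "synthesise": StepConfig("gpt-4o",      0.6, "You are a senior analyst."),
}


def config_for(step_name: str) -> StepConfig:
    """Look up a step's settings, failing loudly on an unknown step name."""
    try:
        return STEP_CONFIGS[step_name]
    except KeyError:
        raise KeyError(f"No configuration defined for step '{step_name}'") from None
```

Because each step reads only its own entry, changing the synthesis model or temperature cannot affect extraction.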

02 When to Chain vs Single Prompt

Chaining adds overhead — multiple API round-trips, more total tokens, more code. Don't reach for it automatically.

Signal | Single Prompt | Chain
Task has 1–2 clear steps | ✓ Prefer | Overkill
Output format is simple (prose, short answer) | ✓ Prefer | Overkill
Latency budget is very tight (<500ms) | ✓ Prefer | Multiple round-trips
Task requires 3+ distinct cognitive modes | Struggles | ✓ Prefer
You need to audit intermediate results | Not possible | ✓ Prefer
Different steps suit different model tiers | All-or-nothing | ✓ Prefer
Output fails quality checks inconsistently | Hard to debug | ✓ Prefer
Human review needed mid-workflow | Not possible | ✓ Prefer
⚠️
Don't chain for the sake of it. If a single well-structured prompt reliably produces the output you need, use it. Add chaining only when single-prompt approaches are demonstrably failing or when auditability / cost-routing is a real requirement.
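
The signals in the table can be written down as a small decision helper. This is only a sketch: `TaskProfile` and its field names are hypothetical, and giving the latency budget priority over every other signal is an assumption you may want to reverse.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task; field names are illustrative."""
    distinct_modes: int            # e.g. extract, verify, synthesise, format
    latency_budget_ms: int
    needs_audit_trail: bool = False
    needs_human_review: bool = False
    mixed_model_tiers: bool = False
    single_prompt_is_flaky: bool = False


def should_chain(task: TaskProfile) -> bool:
    """Return True when the signals in the table favour a chain."""
    # Assumption: a very tight latency budget rules out multiple round-trips.
    if task.latency_budget_ms < 500:
        return False
    # Requirements a single prompt cannot meet at all.
    if task.needs_audit_trail or task.needs_human_review:
        return True
    return (
        task.distinct_modes >= 3
        or task.mixed_model_tiers
        or task.single_prompt_is_flaky
    )
```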

03 Sequential Chains

The foundational pattern: each step produces an output that becomes the input for the next. Steps run in series — step N cannot start until step N-1 is complete.

Step 1: Extract → JSON list
Step 2: Verify → scored JSON
Step 3: Filter → clean list
Step 4: Synthesise → final prose
Python — sequential chain
from openai import OpenAI
import json

client = OpenAI()

def llm(system: str, user: str, temperature: float = 0.0) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": user},
        ],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()


def run_chain(source_text: str) -> str:
    # Step 1 — Extract
    raw_claims = llm(
        system=(
            "You are a research assistant. Extract every factual claim from the text. "
            "Return ONLY a JSON array of strings. No commentary."
        ),
        user=source_text,
    )
    claims = json.loads(raw_claims)

    # Step 2 — Score
    scored_raw = llm(
        system=(
            "You are a fact-checker. For each claim, rate confidence 0–10. "
            "Return JSON: [{\"claim\": \"...\", \"confidence\": 8}]"
        ),
        user=json.dumps(claims),
    )
    scored = json.loads(scored_raw)

    # Step 3 — Filter (pure Python — no LLM needed)
    strong = [c for c in scored if c["confidence"] >= 7]
    if not strong:
        return "Insufficient high-confidence claims found."

    # Step 4 — Synthesise
    brief = llm(
        system=(
            "You are a senior analyst. Using ONLY the provided claims, "
            "write a concise 3-paragraph executive brief. Cite claims as [n]."
        ),
        user=json.dumps(strong),
        temperature=0.6,
    )
    return brief
🌡️
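Temperature by step role: Use temperature=0 for extraction, verification, classification, and JSON formatting, where you want output that is as repeatable as possible (temperature 0 reduces run-to-run variation but does not guarantee identical output). Use temperature=0.5–0.8 only for creative or synthesis steps where variety improves quality.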
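
The sequential pattern itself is independent of any provider. A minimal sketch of a generic runner that feeds each output into the next step and keeps every intermediate result for inspection; `run_steps` and the stand-in lambda steps are hypothetical and replace the LLM calls so the example runs without an API key, and the input sentence is made up:

```python
from typing import Any, Callable

Step = tuple[str, Callable[[Any], Any]]


def run_steps(steps: list[Step], initial: Any) -> tuple[Any, list[dict]]:
    """Run steps in order, feeding each output to the next and logging each one."""
    trace: list[dict] = []
    value = initial
    for name, fn in steps:
        value = fn(value)
        trace.append({"step": name, "output": value})
    return value, trace


# Stand-in steps: split into sentences, drop very short ones, number the rest.
demo_steps: list[Step] = [
    ("extract", lambda text: [s.strip() for s in text.split(".") if s.strip()]),
    ("filter", lambda claims: [c for c in claims if len(c.split()) >= 3]),
    ("synthesise", lambda claims: " ".join(f"[{i + 1}] {c}." for i, c in enumerate(claims))),
]

final, trace = run_steps(demo_steps, "Revenue grew 12% in Q3. Great. Churn fell to 2%.")
print(final)  # [1] Revenue grew 12% in Q3. [2] Churn fell to 2%.
```

The `trace` list is the auditability benefit in concrete form: every intermediate output is available for logging or validation, not just the final string.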

04 The JSON Contract Pattern

The single biggest source of chain failures is unpredictable output format. If step 2 produces prose when step 3 expects a JSON array, the chain breaks. The JSON Contract pattern solves this by defining a strict schema for every inter-step handoff before writing any prompts.

Define the schema first

Python — Pydantic schemas
from pydantic import BaseModel, Field

# Step 1 output contract
class ExtractedClaim(BaseModel):
    text:         str
    source_quote: str

class ExtractionResult(BaseModel):
    claims: list[ExtractedClaim]

# Step 2 output contract
class ScoredClaim(BaseModel):
    text:         str
    source_quote: str
    confidence:   int = Field(ge=0, le=10)   # enforce the 0–10 range, not just document it
    needs_source: bool
    note:         str = ""

class ScoringResult(BaseModel):
    claims: list[ScoredClaim]

# Step 3 is pure Python — no LLM call
def filter_claims(result: ScoringResult, min_score: int = 7) -> list[ScoredClaim]:
    return [c for c in result.claims if c.confidence >= min_score]

Enforce the schema in the prompt

Prompt Template
You are a research assistant. Extract every factual claim from the text below.

Return ONLY valid JSON matching this exact schema. No text before or after.

Schema:
{
  "claims": [
    {
      "text":         "The factual claim in one sentence",
      "source_quote": "Exact phrase from source this claim comes from"
    }
  ]
}

If no factual claims are present, return: {"claims": []}

TEXT TO PROCESS:
{source_text}
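
One practical wrinkle: the template mixes literal JSON braces with the `{source_text}` placeholder, so calling `str.format` on it raises an error unless every schema brace is doubled. A sketch of one way around that, substituting placeholders by exact match; `EXTRACT_TEMPLATE` and `render_prompt` are hypothetical names and the sample sentence is made up:

```python
EXTRACT_TEMPLATE = """You are a research assistant. Extract every factual claim from the text below.

Return ONLY valid JSON matching this exact schema. No text before or after.

Schema:
{
  "claims": [
    {
      "text":         "The factual claim in one sentence",
      "source_quote": "Exact phrase from source this claim comes from"
    }
  ]
}

If no factual claims are present, return: {"claims": []}

TEXT TO PROCESS:
{source_text}"""


def render_prompt(template: str, **values: str) -> str:
    """Substitute {name} placeholders by exact match, leaving other braces alone."""
    rendered = template
    for name, value in values.items():
        placeholder = "{" + name + "}"
        if placeholder not in rendered:
            raise KeyError(f"Template has no placeholder {placeholder}")
        rendered = rendered.replace(placeholder, value)
    return rendered


prompt = render_prompt(EXTRACT_TEMPLATE, source_text="The bridge opened in 1932.")
```

With several placeholders, a substituted value that happens to contain another placeholder's name would itself be replaced on a later pass, so this simple approach suits templates with one or two slots.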

Parse and validate at every step boundary

Python — parse-and-validate wrapper
import json
from pydantic import BaseModel, ValidationError
from typing import TypeVar, Type

T = TypeVar("T", bound=BaseModel)

def parse_step_output(raw: str, schema: Type[T], step_name: str) -> T:
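    # Strip a markdown code fence if the model wrapped its output in one
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        lines = cleaned.split("\n")[1:]              # drop the opening fence line
        if lines and lines[-1].strip().startswith("```"):
            lines = lines[:-1]                       # drop the closing fence only if present
        cleaned = "\n".join(lines)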

    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        raise ValueError(f"[{step_name}] JSON parse error: {e}\nRaw: {raw[:200]}")

    try:
        return schema.model_validate(data)
    except ValidationError as e:
        raise ValueError(f"[{step_name}] Schema validation failed: {e}")
💡
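OpenAI Structured Outputs: using response_format={"type": "json_schema", ...} with strict mode constrains the reply to your JSON schema at the API level on supported models, so format failures largely disappear. You still parse the returned string, and you should still handle refusals and truncated replies. Anthropic's tool use with a JSON input schema serves a similar purpose; check the provider's current documentation before relying on strict conformance.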
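
For reference, a sketch of what the `response_format` payload for the extraction contract might look like. It assumes the Chat Completions shape with `name`, `strict`, and `schema` keys, and that strict mode expects every property listed in `required` plus `additionalProperties` set to false; verify both against the current API documentation. `build_response_format` and `EXTRACTION_SCHEMA` are hypothetical names, and nothing here calls the API.

```python
def build_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON Schema in the response_format payload shape (assumed, see lead-in)."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


# JSON Schema equivalent of the ExtractionResult contract.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "claims": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "source_quote": {"type": "string"},
                },
                "required": ["text", "source_quote"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["claims"],
    "additionalProperties": False,
}

response_format = build_response_format("extraction_result", EXTRACTION_SCHEMA)
# Passed as: client.chat.completions.create(..., response_format=response_format)
# The reply content is still a JSON string, so parse and validate it as usual.
```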

05 Parallel Chains

When a task has multiple independent sub-tasks that don't depend on each other's outputs, run them simultaneously. Total wall-clock time drops from the sum of all steps to the duration of the longest single step.

Input
  Branch A: Extract facts → factual claims as JSON
  Branch B: Extract sentiment → tone + emotional signals
  Branch C: Extract action items → next steps as structured list
Merge & Synthesise
Python — asyncio parallel branches
import asyncio, json
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def llm_async(system: str, user: str, temperature: float = 0.0) -> str:
    resp = await aclient.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": user},
        ],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()


async def parallel_extract(document: str) -> dict:
    # All three branches launch simultaneously
    facts_raw, sentiment_raw, actions_raw = await asyncio.gather(
        llm_async(
            'Extract every factual claim. Return JSON: {"claims": ["..."]}',
            document,
        ),
        llm_async(
            'Analyse tone. Return JSON: {"sentiment": "positive|negative|neutral", '
            '"confidence": 0.9, "signals": ["..."]}',
            document,
        ),
        llm_async(
            'List action items. Return JSON: {"actions": [{"item": "...", "owner": "..."}]}',
            document,
        ),
    )
    return {
        "facts":     json.loads(facts_raw),
        "sentiment": json.loads(sentiment_raw),
        "actions":   json.loads(actions_raw),
    }


async def parallel_chain(document: str) -> str:
    extracted = await parallel_extract(document)           # Parallel

    synthesis = await llm_async(                           # Sequential — needs all results
        system="You are a senior analyst. Synthesise the data into a concise briefing.",
        user=json.dumps(extracted),                        # JSON, not Python's dict repr
        temperature=0.6,
    )
    return synthesis


document_text = "Paste the document to analyse here."   # placeholder input
result = asyncio.run(parallel_chain(document_text))
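Parallel latency wins: Three independent calls at 1.5s each take about 4.5s in sequence and about 1.5s in parallel, because the wall-clock cost is that of the slowest branch. Synthesis time is added on top in both cases. asyncio.gather() is one of the highest-leverage optimisations in any multi-step pipeline, provided your rate limits allow the concurrent requests.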
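
Two refinements are worth considering once branches multiply: capping how many requests are in flight, and keeping the results of healthy branches when one fails. A sketch under those assumptions, with `gather_bounded` as a hypothetical helper, `fake_branch` standing in for the LLM call so it runs offline, and a fourth branch name invented for the demo:

```python
import asyncio


async def gather_bounded(coro_factories, limit: int):
    """Run the coroutines concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    # return_exceptions=True keeps one failed branch from discarding the others'
    # results; failures come back as exception objects in the result list.
    return await asyncio.gather(
        *(run_one(f) for f in coro_factories), return_exceptions=True
    )


# Offline stand-in for an LLM call; it also tracks peak concurrency.
in_flight = 0
peak = 0


async def fake_branch(name: str) -> str:
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    if name == "sentiment":
        raise RuntimeError("simulated provider error")
    return f"{name}: ok"


branches = ["facts", "sentiment", "actions", "entities"]
results = asyncio.run(
    gather_bounded([lambda n=n: fake_branch(n) for n in branches], limit=2)
)
```

The caller then checks each entry with `isinstance(result, Exception)` and decides whether the merge step can proceed without that branch.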

06 Conditional Branching

Not every input needs the same processing path. A conditional chain inspects the output of an early step and routes to different downstream steps based on what it finds.

Step 1: Classify Input Type
  type = "question" → Step 2A: RAG retrieval → ground answer in docs → return cited response
  type = "task" → Step 2B: Decompose into subtasks → execute each → return action plan
Python — conditional branching
import json

ROUTER_PROMPT = """Classify the user input into one of these categories.
Return ONLY JSON: {{"type": "question" | "task" | "feedback" | "unknown"}}

User input: {message}"""

async def conditional_chain(user_message: str) -> str:
    # Step 1 — Classify
    raw = await llm_async(
        system="You are a router. Classify inputs accurately.",
        user=ROUTER_PROMPT.format(message=user_message),
    )
    input_type = json.loads(raw).get("type", "unknown")

    # Step 2 — Branch
    if input_type == "question":
        return await handle_question(user_message)
    elif input_type == "task":
        return await handle_task(user_message)
    elif input_type == "feedback":
        return await handle_feedback(user_message)
    else:
        return await llm_async(
            system="You are a helpful assistant.",
            user=user_message,
            temperature=0.7,
        )


async def handle_task(message: str) -> str:
    # Decompose first, then execute
    subtasks_raw = await llm_async(
        system='Break this task into numbered subtasks. Return JSON: {"subtasks": ["step 1", "step 2"]}',
        user=message,
    )
    subtasks = json.loads(subtasks_raw)["subtasks"]
    return "Action plan:\n" + "\n".join(f"{i+1}. {t}" for i, t in enumerate(subtasks))
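
The router's reply is itself model output, so it deserves the same suspicion as any other step. A minimal sketch of a defensive parser that maps anything unexpected to "unknown" instead of raising; `parse_route` and `ALLOWED_TYPES` are hypothetical names:

```python
import json

ALLOWED_TYPES = {"question", "task", "feedback", "unknown"}


def parse_route(raw: str) -> str:
    """Return a valid route label, falling back to 'unknown' on any bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "unknown"          # prose or malformed JSON
    if not isinstance(data, dict):
        return "unknown"          # valid JSON, wrong shape
    label = data.get("type")
    if isinstance(label, str) and label in ALLOWED_TYPES:
        return label
    return "unknown"              # label missing or outside the allowed set
```

Routing a bad classification to the general-purpose fallback is usually cheaper than failing the whole request.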

Routing table pattern

For complex multi-level routing, keep the decision tree in Python — not buried in a single prompt. Use a lookup table that maps (type, sub-type) to handlers:

Python — routing table
ROUTING_TABLE = {
    ("question", "factual"):  handle_factual_question,
    ("question", "opinion"):  handle_opinion_question,
    ("task",     "simple"):   handle_simple_task,
    ("task",     "complex"):  handle_complex_task,
}

async def route(message: str) -> str:
    top_type = await classify_type(message)       # "question" | "task"
    sub_type = await classify_complexity(message) # "factual" | "simple" | etc.

    handler = ROUTING_TABLE.get((top_type, sub_type))
    return await handler(message) if handler else await fallback(message)

07 Error Handling Between Steps

Chains break at step boundaries. Without explicit error handling, a bad output in step 2 silently corrupts everything downstream.

The three failure types

Type 1: Format failures
Model returned prose instead of JSON, or JSON with the wrong schema. Catch at parse time. Retry the step with a corrective prompt appended to the user turn.

Type 2: Quality failures
Valid JSON but wrong content (empty results, hallucinated data, low confidence). Catch at validation time. Check business rules — not just schema conformance.

Type 3: API / network failures
Rate limits, timeouts, provider outages. Catch at network time. Retry with exponential backoff; fall back to an alternate model if retries are exhausted.
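
Quality failures need checks that a schema cannot express. As a sketch, the function below applies two illustrative business rules to extracted claims: there must be at least one claim, and each `source_quote` must appear verbatim in the source. `check_claims` is a hypothetical helper and the rules are assumptions; real pipelines will have their own.

```python
def check_claims(claims: list[dict], source_text: str, min_claims: int = 1) -> list[str]:
    """Return a list of business-rule violations; an empty list means the step passed."""
    problems: list[str] = []
    if len(claims) < min_claims:
        problems.append(f"expected at least {min_claims} claim(s), got {len(claims)}")
    for i, claim in enumerate(claims):
        quote = claim.get("source_quote", "")
        # A quote that is absent from the source suggests a hallucinated claim.
        if not quote or quote not in source_text:
            problems.append(f"claim {i}: source_quote not found in source text")
        if not claim.get("text", "").strip():
            problems.append(f"claim {i}: empty claim text")
    return problems
```

An exact substring match is strict: a model that normalises whitespace or quotation marks will trip it, so some pipelines compare after normalising both sides.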

Python — retry with corrective prompt
import asyncio
from pydantic import BaseModel
from typing import Type, TypeVar

T = TypeVar("T", bound=BaseModel)

async def llm_with_retry(
    system: str,
    user: str,
    schema: Type[T],
    step_name: str,
    max_retries: int = 3,
) -> T:
    last_error = None
    current_user = user

    for attempt in range(max_retries):
        try:
            raw = await llm_async(system=system, user=current_user)
            return parse_step_output(raw, schema, step_name)

        except ValueError as e:
            last_error = str(e)
            # Append corrective context for the next attempt
            current_user = (
                f"{user}\n\n"
                f"PREVIOUS ATTEMPT FAILED:\n{last_error}\n\n"
                "Return ONLY valid JSON matching the required schema. No other text."
            )
            if attempt < max_retries - 1:
                await asyncio.sleep(0.5 * (attempt + 1))

        except Exception as e:
            last_error = str(e)
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)   # Exponential back-off for API errors

    raise RuntimeError(
        f"[{step_name}] Failed after {max_retries} attempts. Last: {last_error}"
    )
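
The retry wrapper above stops at one model. Falling back to an alternate model once retries are exhausted can be sketched as follows, assuming each model is wrapped in an async callable that takes a prompt; `call_with_fallback` and the two stand-in callers are hypothetical and run offline:

```python
import asyncio
from typing import Awaitable, Callable

LLMCall = Callable[[str], Awaitable[str]]


async def call_with_fallback(
    prompt: str,
    callers: list[tuple[str, LLMCall]],
    retries_per_model: int = 2,
    base_delay: float = 0.01,
) -> tuple[str, str]:
    """Try each (name, caller) in order; return (model_name, reply) on first success."""
    errors: list[str] = []
    for name, call in callers:
        for attempt in range(retries_per_model):
            try:
                return name, await call(prompt)
            except Exception as e:  # broad on purpose in this sketch; narrow it in real code
                errors.append(f"{name} attempt {attempt + 1}: {e}")
                if attempt < retries_per_model - 1:
                    await asyncio.sleep(base_delay * 2 ** attempt)  # exponential back-off
    raise RuntimeError("All models failed: " + "; ".join(errors))


# Offline stand-ins: the primary always times out, the fallback answers.
async def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated timeout")


async def steady_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


model_used, reply = asyncio.run(
    call_with_fallback("ping", [("primary", flaky_primary), ("fallback", steady_fallback)])
)
```

Returning the model name alongside the reply lets the chain log which tier actually produced each step's output.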

Graceful degradation

Python — partial result instead of crash
async def run_chain_safely(source_text: str) -> dict:
    result = {"status": "success", "output": None, "warnings": []}

    try:
        claims = await extract_claims(source_text)
    except RuntimeError as e:
        result["warnings"].append(f"Extraction failed: {e}")
        claims = []   # Continue with empty list rather than crashing

    try:
        scored = await score_claims(claims) if claims else []
    except RuntimeError as e:
        result["warnings"].append(f"Scoring failed: {e}")
        scored = claims   # Fall back to unscored claims

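    try:
        result["output"] = await synthesise(scored)
    except RuntimeError as e:
        result["warnings"].append(f"Synthesis failed: {e}")
        result["output"] = "Synthesis unavailable."
        result["raw_claims"] = [c["text"] for c in (scored or [])]

    # Any recorded warning means the run degraded, so do not report plain success
    if result["warnings"]:
        result["status"] = "partial"

    return result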

08 State Management

As chains grow longer, use a single chain state object to carry everything through the pipeline: the original input, each step's output, timings, and errors. Keeping it all in one structured artifact makes debugging, logging, and resuming failed runs straightforward.

Python — chain state dataclass
from dataclasses import dataclass, field
import json
import time
import uuid

@dataclass
class ChainState:
    run_id:       str   = field(default_factory=lambda: str(uuid.uuid4()))
    source_text:  str   = ""
    started_at:   float = field(default_factory=time.time)

    # Step outputs — filled as chain progresses
    claims:       list[dict] = field(default_factory=list)
    scored:       list[dict] = field(default_factory=list)
    filtered:     list[dict] = field(default_factory=list)
    output:       str        = ""

    # Execution metadata
    step_timings: dict[str, float] = field(default_factory=dict)
    errors:       list[str]        = field(default_factory=list)
    completed:    bool             = False

    def record_step(self, step_name: str, duration_ms: float):
        self.step_timings[step_name] = duration_ms

    def to_log_record(self) -> dict:
        return {
            "run_id":    self.run_id,
            "duration":  int((time.time() - self.started_at) * 1000),
            "steps":     self.step_timings,
            "n_claims":  len(self.claims),
            "errors":    self.errors,
            "completed": self.completed,
        }


# Each step mutates state in place
async def step_extract(state: ChainState) -> ChainState:
    t0 = time.time()
    raw = await llm_async(EXTRACT_SYSTEM, state.source_text)
    state.claims = json.loads(raw)["claims"]
    state.record_step("extract", (time.time() - t0) * 1000)
    return state

async def run_pipeline(source_text: str) -> ChainState:
    state = ChainState(source_text=source_text)
    state = await step_extract(state)
    state = await step_score(state)
    state.filtered = [c for c in state.scored if c["confidence"] >= 7]
    state.completed = True
    return state
💾
Persist state for long chains. For chains running over many minutes or with human-in-the-loop steps, serialise state to a database after each step. This lets you resume a failed run from the last successful checkpoint rather than re-running the whole pipeline.
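
A sketch of what resume-from-checkpoint can look like. To stay self-contained it uses JSON files in a temporary directory where a real pipeline would use a database, and a cut-down `MiniState` in place of `ChainState`; every name here is hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class MiniState:
    """Cut-down stand-in for ChainState, just enough to show checkpointing."""
    run_id: str
    source_text: str = ""
    claims: list = field(default_factory=list)
    done_steps: list = field(default_factory=list)


def save_checkpoint(state: MiniState, directory: Path) -> Path:
    """Write the state as JSON, one file per run, replacing the previous checkpoint."""
    path = directory / f"{state.run_id}.json"
    path.write_text(json.dumps(asdict(state)), encoding="utf-8")
    return path


def load_checkpoint(run_id: str, directory: Path) -> MiniState:
    data = json.loads((directory / f"{run_id}.json").read_text(encoding="utf-8"))
    return MiniState(**data)


def run_with_resume(state: MiniState, steps, directory: Path) -> MiniState:
    """Run (name, fn) steps in order, skipping any already recorded in done_steps."""
    for name, fn in steps:
        if name in state.done_steps:
            continue
        fn(state)
        state.done_steps.append(name)
        save_checkpoint(state, directory)   # checkpoint only after the step succeeds
    return state


calls = []

def extract(state):
    calls.append("extract")
    state.claims = ["claim A"]

def score(state):
    calls.append("score")

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    run_with_resume(MiniState(run_id="demo"), [("extract", extract)], folder)
    resumed = load_checkpoint("demo", folder)      # simulate a fresh process
    run_with_resume(resumed, [("extract", extract), ("score", score)], folder)
```

On the second run `extract` is skipped because the checkpoint already lists it, so only `score` executes.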

09 Real Pipeline: Research Material → Article

A complete, production-style pipeline that turns raw research (transcripts, notes, web content) into a structured article draft — the kind of pipeline that powers the content behind Aether Intel.

Step 1: Extract → claims JSON
Step 2: Score + Dedupe (parallel)
Step 3: Outline → section plan
Step 4: Draft → prose article
Step 5: Polish → final article

Step 1 — Extract claims

Prompt
You are a research analyst. Read the source material and extract every distinct
factual claim, statistic, opinion, or insight.

Return ONLY JSON:
{
  "claims": [
    {
      "text":         "The claim in one clear sentence",
      "type":         "fact" | "statistic" | "opinion" | "insight",
      "source_quote": "Verbatim phrase from source",
      "importance":   1-5
    }
  ]
}

Source material:
{source_text}

Step 2 — Score and deduplicate (parallel)

Prompt — deduplication branch
You are an editor. Review this list of claims and remove duplicates.
Two claims are duplicates if they express the same idea, even with different words.
Keep the version with the clearest wording.

Return JSON: {"claims": [...same schema, duplicates removed...]}

Claims:
{claims_json}

Step 3 — Outline

Prompt
You are a content strategist. Using the claims below, create an article outline
with 4–6 sections. Each section should have a clear angle supported by 2–4 claims.

Return JSON:
{
  "title": "Article headline (compelling, specific, under 70 chars)",
  "sections": [
    {
      "heading": "Section heading",
      "angle":   "What this section argues or explains",
      "claims":  [0, 3, 7]
    }
  ]
}

Claims:
{filtered_claims_json}
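
The outline prompt states several rules (4–6 sections, 2–4 claims each, a title under 70 characters) that the model may not follow. A sketch of a check to run before drafting, assuming the `claims` arrays hold zero-based indices into the filtered claims list; `validate_outline` is a hypothetical helper:

```python
def validate_outline(outline: dict, n_claims: int) -> list[str]:
    """Check an outline against the rules stated in the Step 3 prompt."""
    problems: list[str] = []
    title = outline.get("title", "")
    if not title or len(title) >= 70:
        problems.append("title must be non-empty and under 70 characters")
    sections = outline.get("sections", [])
    if not 4 <= len(sections) <= 6:
        problems.append(f"expected 4-6 sections, got {len(sections)}")
    for i, section in enumerate(sections):
        ids = section.get("claims", [])
        if not 2 <= len(ids) <= 4:
            problems.append(f"section {i}: expected 2-4 claims, got {len(ids)}")
        # Assumption: indices are zero-based positions in the filtered claims list.
        bad = [c for c in ids if not isinstance(c, int) or not 0 <= c < n_claims]
        if bad:
            problems.append(f"section {i}: claim indices out of range: {bad}")
    return problems
```

A non-empty result is a Type 2 failure: the JSON parsed, but drafting from it would cite claims that do not exist.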

Step 4 — Draft

Prompt
You are a technology journalist writing for an informed audience.
Write the full article following the outline. Use ONLY the provided claims as your
factual foundation — do not add information not in the claims.

Style: clear, direct, no filler. Active voice.
Avoid: "In today's rapidly evolving landscape..." and similar empty openers.

Add inline citations as (n) when using a specific claim, where n is that claim's index in the claims list (the same indices the outline uses).

Outline: {outline_json}
Claims:  {claims_json}

Step 5 — Polish

Prompt
You are a copy editor. Improve the article below for:
1. Clarity — remove redundant words, tighten sentences
2. Flow — ensure paragraphs connect naturally
3. Consistency — unified voice throughout
4. Headlines — make title and headings compelling

Do NOT change factual content, add new information, or alter citations.
Return the improved article as plain text.

Article:
{draft}

10 Common Failure Modes

No format enforcement — each step returns whatever the model feels like, causing downstream parse failures
✅ Define a JSON schema for every inter-step handoff. Include it verbatim in the system prompt. Parse and validate before passing to the next step.
Error propagation — a hallucinated fact or empty result in step 2 flows silently to the final output, which looks plausible but is wrong
✅ Validate business rules at every boundary, not just JSON schema. Fail fast: if step 2 returns empty claims, stop and report rather than running 3 more steps on nothing.
Over-chaining — every task gets decomposed into 8 steps; most add no value, inflating latency and cost
✅ Add a step only when it has a clear responsibility that conflicts with adjacent steps. If two consecutive steps could be merged without quality loss, merge them.
Bloated late-step prompts — by step 5, the model sees the source text, claims, scored claims, outline, and draft all at once — exceeding the useful context window
✅ Pass only what each step needs. Step 5 (polish) should receive the draft only — not the full upstream chain state. Prune aggressively between steps.
No state persistence — a chain fails at step 6 of 8; the whole run restarts from step 1, wasting time and money
✅ Serialise chain state to a DB after each successful step. Implement resume-from-checkpoint for long-running or human-in-loop pipelines.
Uniform temperature across all steps — a single setting cannot suit every step: 0.7 everywhere makes extractions "creative", while 0 everywhere makes prose flat and repetitive
✅ Set temperature per step role: 0.0 for extraction/verification/classification, 0.2–0.4 for analysis, 0.5–0.8 for synthesis and writing.
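
The advice to pass each step only what it needs can be made mechanical with a small allow-list. A sketch, assuming chain state is held as a dict and using hypothetical step and field names loosely based on the research pipeline:

```python
# Hypothetical mapping from step name to the state fields that step may see.
STEP_INPUTS = {
    "outline": ["filtered"],
    "draft":   ["outline", "filtered"],
    "polish":  ["draft"],
}


def inputs_for(step_name: str, state: dict) -> dict:
    """Return only the fields the named step needs, failing if one is missing."""
    needed = STEP_INPUTS[step_name]
    missing = [key for key in needed if key not in state]
    if missing:
        raise KeyError(f"{step_name} is missing inputs: {missing}")
    return {key: state[key] for key in needed}


state = {
    "source_text": "long source document",
    "claims": ["c0", "c1", "c2"],
    "filtered": ["c0", "c2"],
    "outline": {"title": "T", "sections": []},
    "draft": "Draft text.",
}

polish_inputs = inputs_for("polish", state)   # only the draft reaches the polish step
```

Anything not on a step's list, such as the full source text, never reaches that step's prompt.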
