Chain-of-Thought Prompting: The Technique That Forces AI to Actually Think

Why AI Skips Reasoning by Default

Ask an AI a maths problem and it will often give you an answer instantly. Ask it to diagnose a system failure and it jumps straight to a solution. Ask it to evaluate a business strategy and it presents a polished conclusion with no visible working.

This looks like intelligence. It is not reasoning.

What's happening is token prediction. The model has learned, from billions of examples in its training data, which tokens statistically follow which other tokens. For a question that looks like a common type of problem, it has learned what a confident, well-formed answer looks like — and it produces that, fast, without needing to generate any intermediate steps. The output sounds authoritative precisely because it has learned to mimic authoritative outputs.

The problem is that this process fails silently. When a problem genuinely requires working through intermediate steps — multi-step maths, logic chains, diagnostic trees — the model can produce a confident wrong answer that has all the surface markers of a correct one. You have no way to detect the error unless you verify independently.

⚠

The confidence trap

LLMs trained on human-generated text have learned that confident writing sounds a certain way. A model producing a wrong answer will often produce it with exactly the same tone and structure as a correct one. Fluency is not a signal of accuracy.

This is not a flaw to be patched. It is the fundamental architecture of the technology. The model is a next-token predictor, not a symbolic reasoner. But — and this is the key insight — you can prompt it in a way that forces it to generate reasoning steps, and those steps change what it produces next.

Three signs your prompt is getting direct-prediction answers

Instant conclusions with no visible logic: The model states an answer but doesn't show any intermediate steps, even for problems where intermediate steps exist.
Over-confident tone on uncertain topics: The answer reads decisively on a topic where the correct answer requires weighing several uncertain factors.
Errors that are surprisingly basic: Multi-step arithmetic errors, logic reversals, or ignored constraints — the kind of mistakes that would not survive a single step of explicit working-out.

What Chain-of-Thought Actually Does

Chain-of-thought (CoT) prompting is the practice of instructing a model to work through its reasoning explicitly before arriving at a final answer. In its simplest form, it means appending a phrase like "Let's think through this step by step" to your prompt.

That's it. That phrase — or any variation of it — is sufficient to trigger a measurable improvement in reasoning quality across a wide range of tasks.

The original CoT paper (Wei et al., 2022, Google Brain) demonstrated the effect using few-shot worked examples across arithmetic, commonsense reasoning, and symbolic reasoning benchmarks. Kojima et al. (2022) then showed that a trigger phrase alone, with no examples, also improves reasoning. On challenging multi-step maths problems, sufficiently large models with chain-of-thought prompting substantially outperformed the same models using direct-answer prompting. The improvement wasn't from a better model; it came from asking the same model to show its work.

✓

The key insight

Chain-of-thought doesn't make the model smarter. It forces it to generate tokens that look like reasoning, and those tokens condition subsequent tokens toward more accurate conclusions. The improvement follows from how autoregressive generation works, not from magic.

Standard prompting vs. chain-of-thought: a direct comparison

Dimension	Standard Prompt	With Chain-of-Thought
Multi-step maths	High error rate on problems with 3+ steps	Significantly lower error rate; working visible
Logic problems	Plausible-sounding but often wrong	Explicit steps catch logical reversals
Decision analysis	Jumps to conclusion, ignores trade-offs	Surfaces assumptions and competing factors
Factual recall	Fast, usually accurate for known facts	No benefit; adds unnecessary tokens
Creative writing	Natural, fluid output	Can make output feel mechanical
Error auditability	Errors invisible without external verification	Reasoning steps expose where logic went wrong

The secondary benefit — the one that matters at least as much as accuracy — is auditability. When the model shows its reasoning, you can read the steps and identify exactly where a mistake occurred. With standard prompting, a wrong answer is a black box. With CoT, it's a traceable log.

The Token Prediction Mechanics

To understand why chain-of-thought works, you need to understand one thing about how LLMs generate text: each token is predicted based on everything that came before it in the context.

When a model generates a reasoning step — "First, I need to calculate X" — that step becomes part of the context for the next token. The next token is now conditioned on the content of that reasoning step. The model is not retrieving a pre-computed answer; it is generating each piece of text in light of everything it has already generated.

This means that intermediate reasoning steps act as scaffolding. A correct intermediate conclusion steers subsequent tokens toward further correct conclusions. An explicit "checking my assumptions" step forces the model into a mode of output where it generates text that looks like uncertainty-checking, which in turn surfaces assumptions it would otherwise paper over.
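The conditioning described above can be sketched as a plain loop. This is a minimal illustration, not a real model: `generate`, `scripted_model`, and the scripted tokens are hypothetical stand-ins, chosen only to show that each prediction call receives the prompt plus every token generated so far.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int) -> List[str]:
    """Autoregressive loop: each new token is chosen from the full context so far."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == "<eos>":
            break
        # The generated token becomes part of the input for the next prediction.
        context.append(token)
    return context

# Hypothetical stand-in for a model: it replays a fixed script and records
# the context it was shown at each step. A real LLM would score candidate
# tokens against that context instead.
seen_contexts: List[List[str]] = []
script = iter(["First,", "compute", "X.", "<eos>"])

def scripted_model(context: List[str]) -> str:
    seen_contexts.append(list(context))
    return next(script)

output = generate(scripted_model, ["Q:", "..."], max_new_tokens=10)
```

By the third call, the stand-in model is already being shown its own first two reasoning tokens, which is the sense in which earlier steps act as scaffolding for later ones.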

◆

Why "step by step" is enough

The phrase "Let's think through this step by step" works because the model has seen vast amounts of text that follows this exact pattern: worked examples, tutorials, explanations. When you invoke this phrase, you make step-by-step reasoning of the kind found in that training data the most probable continuation.

The self-consistency principle

Because token generation is probabilistic, the same prompt can produce different reasoning paths on different runs. Self-consistency CoT exploits this: run the same CoT prompt multiple times at a moderate temperature (0.7–1.0), then take the majority answer across all runs.

This approach treats the model like a panel of experts, each reasoning from scratch. When multiple separately sampled reasoning chains converge on the same answer, that convergence is evidence the answer is correct, though not proof: every run comes from the same model and shares its blind spots. The technique was introduced in Wang et al., 2022, and consistently outperformed single-run CoT on arithmetic and commonsense reasoning benchmarks.

In practice, three independent runs are usually sufficient. Five give you more confidence on high-stakes decisions. Beyond five, the marginal benefit of each additional run diminishes rapidly.
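The voting step can be sketched in a few lines. This assumes you have already extracted a final answer string from each run; `majority_vote`, `self_consistent_answer`, and the `sample_answer` callable are hypothetical names, and the callable stands in for whatever client you use to run the prompt once.

```python
from collections import Counter
from typing import Callable, List, Tuple

def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of runs that agreed with it."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # Light normalisation so "£7.60" and " £7.60 " count as the same vote.
    normalised = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalised).most_common(1)[0]
    return winner, votes / len(normalised)

def self_consistent_answer(sample_answer: Callable[[], str],
                           runs: int = 5) -> Tuple[str, float]:
    """Call the sampler `runs` times and vote across the results.

    `sample_answer` is an assumption: it should run the CoT prompt once at a
    non-zero temperature and return only the extracted final answer.
    """
    return majority_vote([sample_answer() for _ in range(runs)])
```

The agreement share is worth keeping alongside the answer: a 3-of-5 result deserves more scrutiny than a 5-of-5 one.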

The Five CoT Templates

There isn't one chain-of-thought technique. This guide covers five distinct patterns, each suited to a different type of task. The downloadable template at the bottom of this guide includes ready-to-use versions of all five. Here's how each works and when to reach for it.

Zero-Shot CoT

Append "Let's think through this step by step." to any prompt. No examples required. Best for quick tasks where you just need the model to slow down and show its working.

Few-Shot CoT

Provide 2–3 worked examples of the reasoning pattern before your actual question. More reliable than zero-shot, especially for domain-specific problems where the reasoning style matters.

Structured Reasoning

Give the model an explicit numbered framework: Restate, Identify, Reason, Check, Conclude. Best for analysis tasks where you need to surface assumptions and trade-offs.

Self-Consistency

Run the same CoT prompt 3–5 times at temperature 0.7–1.0 and take the majority answer. Highest accuracy on maths and factual tasks where a single wrong reasoning path is costly.

Decision CoT

Structured around options, pros/cons, assumptions, and conditions. Best for consequential decisions where you need to understand what would need to be true for each option to be correct.

Template 1: Zero-Shot CoT

The simplest form. Add one phrase to the end of any prompt:

Zero-Shot CoT — Minimal

[Your question or task here]

Let's think through this step by step.

That's the entire technique. Variations that work equally well: "Work through this carefully before giving your final answer." / "Think aloud before concluding." / "Show your reasoning." All of these invoke the same pattern. The exact phrasing matters less than the presence of an explicit instruction to generate reasoning before the answer.
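If you build prompts in code, the template above reduces to a one-line helper. The names `COT_TRIGGER` and `zero_shot_cot` are hypothetical; the only thing taken from the template is the trigger phrase and its position at the end of the prompt.

```python
COT_TRIGGER = "Let's think through this step by step."

def zero_shot_cot(task: str, trigger: str = COT_TRIGGER) -> str:
    """Append a reasoning trigger to a task, separated by a blank line."""
    return f"{task.strip()}\n\n{trigger}"

prompt = zero_shot_cot("A train leaves at 14:10 and arrives at 16:45. How long is the journey?")
```

Passing a different `trigger` lets you swap in any of the variations listed above without touching the task text.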

Template 2: Few-Shot CoT

More reliable for domain-specific or unusual reasoning patterns. You provide examples that show the model what correct step-by-step reasoning looks like for your type of problem, then ask your real question in the same format:

Few-Shot CoT — Structure

Q: A store sells apples for £1.20 each and oranges for £0.80 each.
   If I buy 3 apples and 5 oranges, how much do I spend?

A: Let me work through this:
   - 3 apples × £1.20 = £3.60
   - 5 oranges × £0.80 = £4.00
   - Total = £3.60 + £4.00 = £7.60
   Answer: £7.60

---

Q: [YOUR ACTUAL QUESTION]

A: Let me work through this:

The critical detail: the example answer must genuinely demonstrate the reasoning style you need. If you copy-paste examples that happen to have a conclusion but no real intermediate steps, the model will mirror that shallow structure.
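One way to keep few-shot examples honest is to assemble the prompt from structured data and refuse examples that have no intermediate steps. This is a sketch under that assumption; `few_shot_cot` and `ANSWER_PREFIX` are hypothetical names, and the layout mirrors the template above.

```python
from typing import List, Tuple

ANSWER_PREFIX = "A: Let me work through this:"

# Each example is (question, intermediate steps, final answer).
Example = Tuple[str, List[str], str]

def few_shot_cot(examples: List[Example], question: str) -> str:
    """Build a few-shot CoT prompt; every example must show real working."""
    blocks = []
    for example_question, steps, answer in examples:
        if not steps:
            raise ValueError("each example needs at least one intermediate step")
        lines = [f"Q: {example_question}", "", ANSWER_PREFIX]
        lines += [f"   - {step}" for step in steps]
        lines.append(f"   Answer: {answer}")
        blocks.append("\n".join(lines))
    # End on the open answer prefix so the model continues in the same style.
    blocks.append(f"Q: {question}\n\n{ANSWER_PREFIX}")
    return "\n\n---\n\n".join(blocks)

prompt = few_shot_cot(
    [("If I buy 3 apples at £1.20 and 5 oranges at £0.80, how much do I spend?",
      ["3 apples × £1.20 = £3.60", "5 oranges × £0.80 = £4.00",
       "Total = £3.60 + £4.00 = £7.60"],
      "£7.60")],
    "If I buy 4 pears at £0.90 and 2 melons at £2.50, how much do I spend?",
)
```

The empty-steps check only catches the most blatant case; whether the steps are genuine working still needs a human read.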

Template 3: Structured Reasoning

Best for analysis, evaluation, and any task where surfacing assumptions matters as much as reaching a conclusion:

Structured Reasoning — Five-Step Framework

You are an expert analyst. Before giving your final answer, work
through the following steps explicitly:

1. RESTATE:  Restate the core question in your own words.
2. IDENTIFY: List the key factors or variables involved.
3. REASON:   Work through the logic, considering each factor.
4. CHECK:    Identify any assumptions you made and flag uncertainties.
5. CONCLUDE: State your final answer clearly.

---

Question: [YOUR QUESTION]

Step 4 — the CHECK step — is the most valuable one and is almost always omitted from standard prompts. Forcing the model to enumerate its assumptions makes them visible to you, giving you the chance to correct any that are wrong before the conclusion is used.
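If you consume structured responses programmatically, it can help to confirm that all five steps actually came back, CHECK in particular. The sketch below assumes the model labels each section with the step name followed by a colon, as the template requests; `split_sections` and `missing_steps` are hypothetical helpers.

```python
import re
from typing import Dict, List

STEPS = ["RESTATE", "IDENTIFY", "REASON", "CHECK", "CONCLUDE"]

# Matches a step label at the start of a line, with an optional "4. " prefix.
_LABEL = re.compile(
    r"^[ \t]*(?:\d+\.[ \t]*)?(" + "|".join(STEPS) + r"):",
    re.MULTILINE,
)

def split_sections(response: str) -> Dict[str, str]:
    """Map each step label found in the response to the text that follows it."""
    matches = list(_LABEL.finditer(response))
    sections: Dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        sections[match.group(1)] = response[match.end():end].strip()
    return sections

def missing_steps(response: str) -> List[str]:
    """List the steps that are absent or empty, in framework order."""
    found = split_sections(response)
    return [step for step in STEPS if not found.get(step)]
```

A non-empty result from `missing_steps` is a reasonable trigger for re-prompting rather than accepting the conclusion as-is.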

Template 4: Self-Consistency CoT

For maximum accuracy on high-stakes decisions, maths, or factual questions where a wrong answer has real consequences:

Self-Consistency — Majority Vote

[Your CoT prompt — any of the templates above]

Note: Generate 3 independent solutions to this problem, each reasoning
from scratch. At the end, identify which answer appears most often
across all three solutions and state that as your final answer.

◆

Temperature matters here

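For self-consistency to work, each run needs to take a slightly different reasoning path. At temperature 0, all runs will produce essentially the same output. Set temperature to at least 0.7. If your API client doesn't expose this, run the prompt in separate sessions. Separate runs are also the stronger form of the technique: solutions generated inside a single response are not fully independent, because each later solution is conditioned on the ones before it.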
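To see why temperature 0 removes the variation that self-consistency depends on, here is a minimal sketch of temperature-scaled softmax over a toy set of token scores. The logits are made up for illustration, and `token_probabilities` is a hypothetical helper; treating temperature 0 as "always pick the top token" is the usual convention, not a claim about any specific provider's implementation.

```python
import math
from typing import List

def token_probabilities(logits: List[float], temperature: float) -> List[float]:
    """Convert raw token scores to probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass goes to the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # illustrative scores for three candidate tokens
greedy = token_probabilities(toy_logits, 0.0)
moderate = token_probabilities(toy_logits, 0.7)
flatter = token_probabilities(toy_logits, 1.0)
```

At temperature 0 the alternatives have zero probability, so every run follows the same path; raising the temperature gives the lower-ranked tokens a real chance of being sampled.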

Template 5: Decision CoT

Structured around the anatomy of a good decision — options, trade-offs, assumptions, and conditions. Use when the answer is "it depends" and you need to know what it depends on:

Decision CoT — Full Framework

I need to make a decision about: [DECISION]

Context: [RELEVANT BACKGROUND]

Please reason through this decision using the following structure:

1. OPTIONS:     What are the main options available?
2. PROS/CONS:   What are the key advantages and disadvantages of each?
3. ASSUMPTIONS: What assumptions are we making?
4. CONDITIONS:  What would need to be true for each option to be correct?
5. RECOMMEND:   Given all of the above, what is the recommended decision
                and why?

Step-by-Step Walkthrough: Debugging a Production Issue

Seeing templates in isolation is less useful than watching them work on a real problem. Here's a full walkthrough using structured reasoning CoT on a concrete technical task: diagnosing why a scheduled job is intermittently failing in production.

The problem

A nightly data export job runs at 02:00 UTC. It succeeds on most nights but fails silently roughly once per week. No error is thrown — the job just produces no output file. There's no obvious pattern in the failures.

Standard prompt (what most people send)

"Why is my scheduled job intermittently producing no output with no error?"

Result: A generic list of possibilities — race conditions, resource limits, external API timeouts, disk space — with no prioritisation and no way to narrow down the cause. Technically correct. Practically useless.

CoT prompt (structured reasoning version)

Applying the five-step framework

Same question, but now preceded by: "Before diagnosing this, explicitly: (1) restate what we know and don't know, (2) identify which category of cause is most likely given the intermittent pattern, (3) reason through the top three candidates, (4) state what diagnostic data would rule each one out, (5) recommend the first investigation step."

The CoT version produces a fundamentally different kind of output. Because it's forced to identify what we know versus don't know (step 1), it immediately surfaces that "no error thrown" is itself diagnostic: it makes an unhandled application-level exception unlikely and points toward either a silent external dependency failure or a job not running at all. Because it's forced to reason through candidates (step 3), it distinguishes between causes that would produce consistent failures (misconfiguration) and those that produce intermittent ones (time-sensitive dependencies, resource contention). And because it must state what would rule each one out (step 4), it gives you an actual investigation plan rather than a list of possibilities.

✓

The auditability advantage

When the CoT reasoning is wrong, you can see exactly where it went wrong. If the model's step 2 incorrectly categorises the failure pattern, you can correct that classification and ask it to re-reason from there. This is the real operational advantage of chain-of-thought: not just more accurate answers, but recoverable, correctable reasoning.

What the output looks like in practice

Example CoT Output — Diagnostic Reasoning

RESTATE: A nightly job runs at 02:00 UTC and roughly once/week produces
no output and no error. The failure pattern is intermittent rather than
consistent, with no obvious pattern reported so far.

IDENTIFY: Key factors:
  - "No error" rules out unhandled application exceptions (a swallowed
    exception is still possible)
  - Intermittent (not consistent) rules out static misconfiguration
  - Fixed 02:00 UTC schedule suggests dependency on something that varies
    at that time of night (load, external APIs, infrastructure maintenance)
  - Weekly frequency could correlate with weekly maintenance windows

REASON:
  Candidate 1 — External API rate limit or maintenance window (HIGH)
    Most external services have weekly maintenance windows or rate resets.
    If the job calls an API that is down or rate-limited at 02:00 UTC once
    per week, it would silently return empty data rather than an error.

  Candidate 2 — Database replication lag / lock contention (MEDIUM)
    If the job reads from a replica that experiences lag during a weekly
    batch process, it could read an empty or stale dataset and write
    nothing, with no error.

  Candidate 3 — Infrastructure auto-scaling cold start (LOW-MEDIUM)
    If the job runs on a node that was scaled down overnight, the first
    run after scale-up may time out before the container is warm. This
    would typically show up in infrastructure logs.

CHECK:
  Assumptions: (a) "no output" means zero bytes, not missing file —
  clarify this. (b) "no error" means no logged exception — confirm
  whether external API calls log timeouts separately.

CONCLUDE: First investigation step: check the external API's status
  page for maintenance windows and compare timestamps with your recorded
  failure dates. If they correlate, you have your answer in 10 minutes.

Notice what the CoT output gives you that a direct answer doesn't: a ranked candidate list, explicit assumptions to verify, and a concrete first step. The model didn't know the answer before it reasoned. The reasoning itself produced the answer.

When CoT Helps — and When It Doesn't

Chain-of-thought isn't universally beneficial. Applying it indiscriminately adds tokens (cost and latency) with no accuracy gain on tasks that don't involve multi-step reasoning. Knowing when to use it and when to skip it is part of using it well.

Task Type	Use CoT?	Reason
Multi-step maths / calculation	Always	Step generation directly prevents arithmetic errors
Logic and deductive reasoning	Always	Explicit steps catch reversals and invalid inferences
Root-cause / diagnostic analysis	Always	Surfaces assumptions; produces auditable, recoverable reasoning
Consequential decisions with trade-offs	Always	Exposes what conditions make each option correct
Code debugging	Usually	Forces the model to trace execution before guessing a fix
Simple factual recall	Skip	No benefit; adds tokens and latency
Creative writing / brainstorming	Skip	Analytical framing reduces fluency and originality
Translation or reformatting	Skip	Task doesn't involve reasoning; CoT adds nothing
Summarisation	Sometimes	Helps if the summary requires ranking importance; skip for routine compression

A useful mental model: ask yourself whether the task has a right answer that requires working out. If yes, CoT helps. If the task is about retrieving known facts, producing creative content, or transforming text in a well-defined way, CoT is overhead.
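If prompts are assembled in code, the table above can be encoded as a small lookup so the trigger phrase is only added where it is expected to help. This is a sketch: the task-type keys, `COT_GUIDANCE`, and `apply_cot_policy` are hypothetical names, and the recommendations are transcribed from the table rather than measured.

```python
COT_TRIGGER = "Let's think through this step by step."

# Recommendations transcribed from the table above; the keys are illustrative labels.
COT_GUIDANCE = {
    "maths": "always",
    "logic": "always",
    "diagnosis": "always",
    "decision": "always",
    "code_debugging": "usually",
    "summarisation": "sometimes",
    "factual_recall": "skip",
    "creative_writing": "skip",
    "translation": "skip",
}

def apply_cot_policy(task_type: str, prompt: str) -> str:
    """Append the CoT trigger only for task types marked always or usually.

    Unknown and "sometimes" task types are left unchanged so the caller
    can decide case by case.
    """
    guidance = COT_GUIDANCE.get(task_type, "sometimes")
    if guidance in ("always", "usually"):
        return f"{prompt}\n\n{COT_TRIGGER}"
    return prompt
```

Leaving the "sometimes" cases untouched is a deliberate choice here: those are the ones where a human judgement about the specific task matters most.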

Common Mistakes

CoT is simple in concept but easy to implement poorly. These are the mistakes that reduce its effectiveness:

1. Using CoT on tasks that don't need it

The most common mistake. Appending "Let's think through this step by step" to a simple factual question or a reformatting task doesn't improve accuracy and adds tokens you'll pay for. Learn the cases where CoT helps (reasoning, analysis, calculation) and keep your other prompts clean.

2. Using poor few-shot examples

If you use few-shot CoT, the examples you provide set the template for the model's reasoning style. Examples with shallow "reasoning" — where the answer is just restated in slightly different words before the conclusion — will produce shallow reasoning on your actual question. Use examples where the intermediate steps are genuine and necessary.

⚠

The pseudo-reasoning trap

"I need to find the answer to this question, which is [answer]." — This is not reasoning. It is reformatting. If your examples model this pattern, the CoT output will mirror it: confident-sounding steps that contain no actual working.

3. Ignoring temperature for self-consistency

Self-consistency relies on different reasoning paths on different runs. If your temperature is 0, every run produces essentially the same tokens. You need temperature ≥ 0.7 for self-consistency to provide any benefit over a single run. Many API implementations default to temperature 1.0, which is fine. The risk is users who have explicitly set temperature to 0 for "determinism" and then try to run self-consistency.

4. Accepting wrong reasoning steps without correction

The whole point of CoT is that the reasoning is visible. If you read the reasoning steps and one of them is wrong, correct it. Don't just note the error and accept the conclusion anyway. Because each step conditions the next, a wrong intermediate step can contaminate an otherwise sound chain of reasoning. Fix the step and ask the model to continue from the corrected position.
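One way to "fix the step and continue" is to rebuild the prompt from the steps you accept, swap in the correction, and drop everything that came after the faulty step. The sketch below assumes you have already split the model's reasoning into a list of step strings; `continue_from_correction` and the prompt wording are hypothetical.

```python
from typing import List

def continue_from_correction(question: str,
                             steps: List[str],
                             wrong_index: int,
                             corrected_step: str) -> str:
    """Build a follow-up prompt that resumes reasoning after a corrected step.

    Steps after `wrong_index` are discarded, because they were conditioned
    on the faulty step and may carry its error forward.
    """
    if not 0 <= wrong_index < len(steps):
        raise IndexError("wrong_index must point at an existing step")
    kept = steps[:wrong_index] + [corrected_step]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(kept, start=1))
    return (
        f"{question}\n\n"
        f"Reasoning so far (step {wrong_index + 1} has been corrected):\n"
        f"{numbered}\n\n"
        "Continue reasoning from the last step above, then state your final answer."
    )

follow_up = continue_from_correction(
    "Why does the nightly export intermittently produce no file?",
    ["The job runs at 02:00 UTC.",
     "The failures are consistent, so this is misconfiguration.",
     "Therefore check the cron definition."],
    wrong_index=1,
    corrected_step="The failures are intermittent, which rules out static misconfiguration.",
)
```

Discarding the later steps, rather than editing them, keeps the model from anchoring on conclusions that depended on the mistake.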

5. Skipping the CHECK/ASSUMPTIONS step

In the structured reasoning framework, step 4 (CHECK) is the one most people drop when building abbreviated versions of the template. It's also the step that most often surfaces the single assumption that changes the entire answer. Keep it.

6. Treating CoT output as a black box

The value of CoT is not just the final answer — it's the reasoning trace. If you only read the CONCLUDE section and ignore the rest, you're using CoT inefficiently. The intermediate steps are where you validate the logic, catch incorrect assumptions, and learn whether the model's reasoning matches your actual situation.

Free CoT Template Pack (.md)

All five chain-of-thought templates — zero-shot, few-shot, structured reasoning, self-consistency, and decision CoT — in one ready-to-paste reference file. Includes when-to-use guidance and example outputs.