What Few-Shot Examples Actually Do
The standard explanation is: "You give the model examples so it knows what you want." That's technically correct but misses the mechanism, and missing the mechanism means you'll write bad examples.
Here's what's actually happening. Language models learn from patterns in their training data. When you include examples in a prompt, you are placing the model into a local context where those examples are the most recent, most relevant patterns it has seen. The model doesn't "study" your examples the way a human would — it uses them as immediate context that conditions every subsequent token it generates.
This is called in-context learning: the model learns the task pattern from the examples in its active context window, without updating any weights. The examples define the statistical neighbourhood the model is operating in. They answer the question "given this kind of input, what kind of output is expected here?" before you've even asked your actual question.
The model isn't reasoning about your examples — it's pattern-matching to them. This means the format, length, tone, and structure of your examples are at least as important as their content. An example with the right answer but the wrong format will teach the wrong format.
Few-shot examples communicate several things simultaneously that you can't easily express in words:
- Output format: Exactly how the response should be structured — whether that's JSON, bullet points, a specific sentence pattern, or a table.
- Output length: The model will calibrate the length of its response to match your examples. Long examples produce long outputs; short examples produce short ones.
- Tone and register: Formal, casual, terse, discursive — the model infers the appropriate register from how your examples are written.
- Edge-case handling: If you include an example that demonstrates what to do when data is missing or ambiguous, the model learns to apply that handling to similar cases.
- Reasoning depth: If your examples show surface-level answers, the model will produce surface-level answers. If they show reasoning steps, the model will reason.
None of these can be fully conveyed by instruction alone. You can write "respond in JSON" and get JSON, but the exact schema, nesting structure, field names, and null handling will all be guessed. A single example removes most of that guesswork.
Zero-Shot, One-Shot, Few-Shot: The Spectrum
These three terms describe how many examples you include in your prompt. Understanding where each sits on the spectrum — and what each is buying you — helps you choose the right approach for each task.
| Approach | Examples | What You Get | Best For |
|---|---|---|---|
| Zero-Shot | 0 | Model relies entirely on its training for format and style | Simple, well-understood tasks; fast iteration |
| One-Shot | 1 | Format anchored; tone and edge cases still inferred | When format matters but content is predictable |
| Few-Shot (2–5) | 2–5 | Format, tone, edge cases, and reasoning style all anchored | Classification, extraction, formatting, scoring tasks |
| Many-Shot (6+) | 6+ | Strong in-context conditioning (no weights are updated); diminishing returns past ~8 | Highly domain-specific outputs; unusual formats |
The jump from zero-shot to one-shot is usually the largest quality improvement per token spent. A single well-chosen example removes most format ambiguity. The jump from one-shot to three-shot further anchors tone and handles variation. Beyond five examples, you're usually in diminishing-returns territory, unless your task has high variance and you're explicitly trying to demonstrate many different cases.
Start with zero-shot. If the output format is wrong, add one example. If the tone or handling of edge cases is wrong, add a second or third. Stop when the output consistently matches what you need. Each example adds tokens to every subsequent call — keep the minimum that achieves the quality you need.
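The escalation loop described above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: `generate` stands in for whatever model call you use and `is_acceptable` for whatever check you run on the output. Both are hypothetical names supplied by the caller, and the `---` separator is just one reasonable choice.

```python
from typing import Callable, Sequence

SEPARATOR = "\n---\n"  # assumed separator between instruction, examples, and input


def smallest_passing_shot_count(
    instruction: str,
    examples: Sequence[str],
    test_inputs: Sequence[str],
    generate: Callable[[str], str],        # hypothetical: prompt in, model text out
    is_acceptable: Callable[[str], bool],  # hypothetical: your own output check
) -> int:
    """Return the fewest examples for which every test input passes, or -1."""
    for count in range(len(examples) + 1):  # 0 = zero-shot, then 1, 2, ...
        shots = list(examples[:count])
        if all(
            is_acceptable(generate(SEPARATOR.join([instruction, *shots, text])))
            for text in test_inputs
        ):
            return count
    return -1
```

The function tries the examples in the order given, so put the format-anchoring example first and the edge-case examples after it.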
How Many Examples to Use
There's no universal right number, but there are clear guiding principles.
- One example locks in output structure and eliminates schema ambiguity. Sufficient for most formatting tasks.
- Two or three examples establish tone and show the model how to handle variation. The sweet spot for most classification and extraction tasks.
- Use a 4th or 5th example specifically to demonstrate edge-case handling: missing data, ambiguous inputs, or boundary conditions.
- Past 5–6 examples, gains are small unless your task has genuinely high output variance. Watch context cost.
The ceiling depends on your model. For frontier models (Claude Sonnet, GPT-4o, Gemini 1.5 Pro), 3 well-crafted examples typically deliver near-peak quality. Smaller or older models may need more. If you're on a budget-tier model and outputs are still inconsistent after 5 examples, the issue is usually example quality — not quantity.
Context window cost
Every example in your prompt is repeated in full on every API call. If you're processing 10,000 records, 4 examples of 200 tokens each adds 800 tokens per call — 8 million extra tokens across the batch, which shows up on your bill. This is why "minimum viable examples" is the right target, not "as many as possible."
A single 500-token few-shot example run against 50,000 records at $3/M input tokens costs $75 across the batch (500 × 50,000 = 25 million input tokens), just for that one example. Keep examples as concise as they can be while still demonstrating what matters.
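The arithmetic behind both figures is easy to reproduce for your own batch sizes. The sketch below uses hypothetical function names and assumes a flat per-million-token input price with no caching or batch discount.

```python
def extra_example_tokens(num_examples: int, tokens_per_example: int, num_calls: int) -> int:
    """Total input tokens spent on few-shot examples across a batch of calls."""
    return num_examples * tokens_per_example * num_calls


def extra_example_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of those tokens at a flat input price (assumes no caching or discounts)."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Four 200-token examples across 10,000 records.
print(extra_example_tokens(4, 200, 10_000))  # 8000000

# One 500-token example across 50,000 records at $3 per million input tokens.
print(extra_example_cost(extra_example_tokens(1, 500, 50_000), 3.0))  # 75.0
```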
The Anatomy of a Good Example
Most people treat examples as demonstrations of the right answer. They are — but an example has several distinct components, each doing a specific job. Understanding them lets you deliberately construct examples that teach exactly what you need.
The four components of a useful example
A realistic input
The input in your example should look like the actual inputs the model will see in production. If you're classifying customer support tickets, use a realistic ticket: messy, potentially ambiguous, written like a real person wrote it. A cleaned-up, idealised input teaches the model to handle idealised inputs well, and it will then fail on the messy real ones.
The exact output format
The output in your example should use exactly the format you want: same field names, same capitalisation, same nesting structure, same null representation. The model will mirror this closely. If you use "N/A" for missing values in your example and write null in your instructions, you'll get inconsistent outputs. Make the two agree; where they don't, expect the model to follow the example.
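One way to catch this kind of drift before it reaches the model is to lint the example outputs themselves. The sketch below is illustrative only: `lint_example_outputs` and the `PLACEHOLDERS` set are hypothetical, and it assumes every example output is a flat JSON object where missing values should be `null`.

```python
import json

# Assumed list of strings that usually mean "missing" and should be null instead.
PLACEHOLDERS = {"n/a", "none", "unknown", ""}


def lint_example_outputs(example_outputs: list[str]) -> list[str]:
    """Report examples whose keys or null representation differ from the first."""
    problems: list[str] = []
    parsed = [json.loads(raw) for raw in example_outputs]
    if not parsed:
        return problems
    expected_keys = set(parsed[0])
    for index, obj in enumerate(parsed):
        if set(obj) != expected_keys:
            problems.append(f"example {index}: keys differ from example 0")
        for key, value in obj.items():
            if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
                problems.append(f"example {index}: '{key}' uses {value!r} instead of null")
    return problems
```

An empty list means every example shares the same fields and none uses a placeholder string where `null` was intended.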
Representative content (not just the easy case)
Your first example is often an easy, clear case. That's fine. But subsequent examples should cover variation: an input that's ambiguous, one with missing fields, one at the boundary of two categories. Each edge-case example is worth 2–3 straightforward ones in terms of robustness.
A consistent separator
Between your examples and between the final example and your actual question, use a consistent separator — ---, ###, or a blank line with a label. This tells the model where one example ends and the next begins, and crucially where the real task starts. Without clear separation, the model may treat your actual question as another example input.
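Assembling the prompt in code makes the separator consistent by construction. This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and the `Input`/`Output` labels and the `---` separator are assumptions you can swap for your own.

```python
SEPARATOR = "\n---\n"  # assumed separator; use the same one everywhere


def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],  # (input, expected output) pairs
    live_input: str,
    input_label: str = "Input",
    output_label: str = "Output",
) -> str:
    """Join instruction, examples, and the live input with one separator."""
    blocks = [instruction]
    for example_input, example_output in examples:
        blocks.append(f"{input_label}: {example_input}\n{output_label}: {example_output}")
    # The live input ends with an empty output label for the model to complete.
    blocks.append(f"{input_label}: {live_input}\n{output_label}:")
    return SEPARATOR.join(blocks)
```

Because the live input is formatted with the same labels as the examples and always comes last, the model sees an unambiguous final slot to fill.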
Good vs. Bad Examples Side by Side
The difference between a few-shot prompt that works and one that doesn't is almost always in the quality of the examples. Here are direct comparisons across three common task types.
Task: Sentiment classification
Bad version:
Input: "This is great!"
Output: Positive
Input: "Terrible experience."
Output: Negative
Input: [YOUR TEXT]
Output:

Good version:
Input: "Honestly wasn't expecting much but the support team sorted it in 20 mins. Will come back."
Output: {"sentiment": "positive", "confidence": "high", "hedge": false}
Input: "Works fine I guess, does what it says."
Output: {"sentiment": "neutral", "confidence": "medium", "hedge": true}
Input: [YOUR TEXT]
Output:
The bad version produces consistent results for clear-cut cases but fails on hedged, sarcastic, or mixed-sentiment inputs because its examples cover only the extremes. The good version shows the exact JSON schema, includes a confidence field, and explicitly demonstrates how to handle hedged language, which covers the 30% of real inputs that sit in the middle.
Task: Data extraction
Bad version:
Text: "John Smith joined Acme Corp as VP of Engineering in March 2024."
Output:
Name: John Smith
Company: Acme Corp
Role: VP of Engineering
Start Date: March 2024

Good version:
Text: "John Smith joined Acme Corp as VP of Engineering in March 2024."
Output: {"name":"John Smith","company":"Acme Corp","role":"VP of Engineering","start_date":"2024-03","end_date":null}
---
Text: "Sarah Lee left Google last year after five years as a PM."
Output: {"name":"Sarah Lee","company":"Google","role":"PM","start_date":null,"end_date":"approx 2025"}
The bad version produces a key-value format that will be inconsistent across runs (sometimes with colons, sometimes without, sometimes with quotes). It also has no null handling — when a field is absent, the model will either omit the field, write "unknown", or make something up. The good version locks in the JSON schema including nulls, and the second example explicitly demonstrates how to handle missing and approximate data.
Task: Rewriting for tone
Bad version:
Original: "Your account has been suspended due to non-payment."
Rewrite: "We've paused your account. To restore access, please update your payment details."

Good version:
Original: "Your account has been suspended due to non-payment."
Rewrite: "We've paused your account. Updating your payment details takes less than two minutes and restores access immediately."
---
Original: "This feature is not available on your current plan."
Rewrite: "This one's on a higher plan. Here's a quick look at what's included if you'd like to upgrade."
The bad version shows one rewrite but leaves tone poorly defined — is this "warm"? "Clear"? "Empathetic"? The model will guess. The good version shows two rewrites that both demonstrate the same tone: direct, human, action-oriented, no corporate hedging. Two examples of the same tone in different contexts is enough for the model to reliably apply it to new inputs.
Step-by-Step Walkthrough: Building a Review Triage Prompt
Here's a complete example of constructing a few-shot prompt from scratch for a real task: triaging customer reviews into actionable categories for a product team.
The task
You receive 200+ customer reviews per week. You need to classify each one into: bug, feature_request, praise, billing, or other — plus extract the specific product area mentioned, and flag whether the review needs an urgent response.
Step 1 — Try zero-shot first
Classify the following customer review. Return JSON with fields: category, product_area, urgent.
Review: "The export button stopped working after the last update. I've tried on Chrome and Safari, same issue. I need this for a client presentation tomorrow."
The model will return something — but the category values will be inconsistent across runs ("Bug" vs "bug" vs "technical issue"), the urgent field might be true/false or "yes/no" or "high/low", and product_area could be "export", "Export feature", or "file export button". Downstream code will break.
Step 2 — Add a format-anchoring example
Classify customer reviews. Return JSON exactly as shown below.
Review: "Love the new dashboard. Way faster than before and the filters actually work the way I expect them to."
JSON: { "category": "praise", "product_area": "dashboard", "urgent": false }
---
Review: [PASTE REVIEW]
JSON:

Now the schema is locked, but the behaviour of the urgent flag on edge cases is still ambiguous. The next step adds examples that demonstrate genuinely urgent cases and a trickier category.
Step 3 — Add coverage examples
Classify customer reviews.
Use these exact category values: bug | feature_request | praise | billing | other
Return JSON as shown. urgent = true if customer mentions a deadline, financial loss, or blocked workflow.
Review: "Love the new dashboard. Way faster than before and the filters actually work the way I expect them to."
JSON: {"category":"praise","product_area":"dashboard","urgent":false}
---
Review: "The export button stopped working after the last update. Tried Chrome and Safari. Need this for a client presentation tomorrow."
JSON: {"category":"bug","product_area":"export","urgent":true}
---
Review: "Why was I charged twice this month? No one has replied to my support ticket from 4 days ago."
JSON: {"category":"billing","product_area":"payments","urgent":true}
---
Review: [PASTE REVIEW]
JSON:
Three examples. The first anchors format for a clean case. The second demonstrates bug + urgent. The third demonstrates billing and shows that an unanswered support ticket maps to urgent: true, which the model would otherwise have to guess. The instruction at the top of the prompt lists the allowed category values so the model is far less likely to invent new ones.
In practice, this prompt should produce consistent, parseable JSON across review types, apply the urgent flag predictably, and keep the category set closed. Three examples were sufficient; adding a fourth is unlikely to change output quality meaningfully.
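Even with a well-anchored prompt, downstream code is safer if it validates each response before using it. The sketch below is one hedged way to do that for the schema in this walkthrough; `parse_triage` is a hypothetical helper, and the field names and category values are taken from the prompt above.

```python
import json

ALLOWED_CATEGORIES = {"bug", "feature_request", "praise", "billing", "other"}
EXPECTED_FIELDS = {"category", "product_area", "urgent"}


def parse_triage(raw: str) -> dict:
    """Parse one model response and raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise ValueError(f"expected exactly the fields {sorted(EXPECTED_FIELDS)}")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    if not isinstance(data["product_area"], str):
        raise ValueError("product_area must be a string")
    if not isinstance(data["urgent"], bool):
        raise ValueError("urgent must be true or false")
    return data
```

Rejected responses can be retried or logged, and a rising rejection rate is a useful signal that the examples need another look.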
When to Prefer Few-Shot Over Zero-Shot
Few-shot isn't always better than zero-shot — it's better in specific situations. Using it when you don't need it wastes tokens and adds prompt-management overhead. Here's the decision framework:
| Situation | Recommended Approach | Reason |
|---|---|---|
| Output format must be exact (JSON schema, specific fields) | Few-shot (1–2 examples) | Instructions describe format; examples enforce it |
| Tone is unusual or highly specific (brand voice, niche register) | Few-shot (2–3 examples) | Tone is almost impossible to specify; examples show it |
| Task involves edge cases or ambiguous inputs | Few-shot (3–5 examples) | Include examples that explicitly handle the edge cases |
| Classification with a closed label set | Few-shot (2–3 examples) | Examples constrain the output space alongside instructions |
| Simple factual question or lookup | Zero-shot | Model already knows the format; examples add cost only |
| Open-ended generation (drafting, brainstorming) | Zero-shot | Examples constrain creative range; often counterproductive |
| Reasoning task | Zero-shot + CoT first | Try "think step by step" before adding examples; see CoT guide |
| Model is consistently producing wrong structure despite instructions | Few-shot | Examples override persistent format misunderstandings |
The heuristic is simple: if you could fully describe the output you want using only words, start with zero-shot. If there's something about the output that is easier to show than to say — tone, exact schema, edge-case behaviour, reasoning style — add examples.
Common Mistakes
1. Using only easy, clean examples
The most common mistake. You test your prompt on a textbook-clear input, write an example for that, and the prompt works perfectly — until it hits a real-world input that's messy, ambiguous, or partially empty. Include at least one example that demonstrates how to handle variation. Edge-case examples are worth more than additional easy ones.
2. Examples that contradict the instructions
If your system prompt says "always output lowercase field names" but your example has "Category" with a capital C, the model has to decide which to follow. It will usually follow the example, not the instruction. Make sure your examples are consistent with every constraint in your prompt.
In a conflict between a written instruction and an example, the example usually wins. This is useful (you can override global instructions for specific tasks) but dangerous (accidental inconsistencies in your examples silently override the rules you thought were enforced).
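Accidental contradictions like the capitalised field name are easy to check for mechanically. The sketch below covers only the lowercase-field-name rule mentioned above; `non_lowercase_fields` is a hypothetical helper and assumes each example output is a flat JSON object.

```python
import json


def non_lowercase_fields(example_outputs: list[str]) -> list[str]:
    """Return field names in the example outputs that are not all lowercase."""
    offenders: list[str] = []
    for raw in example_outputs:
        # Assumes each example output parses to a JSON object (a dict of fields).
        for key in json.loads(raw):
            if key != key.lower():
                offenders.append(key)
    return offenders
```

The same pattern extends to any rule you state in the instruction: write the rule as a check and run it over the examples before shipping the prompt.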
3. Examples that are too long
If your example output is 400 words, the model will produce roughly 400-word outputs. If your actual target is a 2-sentence summary, your example has taught the wrong length. Match the example output length to your actual target length. This applies to both short-form (where padding is wasteful) and long-form (where truncated examples teach truncated outputs).
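A quick length check on the example outputs catches this before the prompt goes live. This is a rough sketch: `length_mismatches` is a hypothetical helper, word count is a crude proxy for tokens, and the 50% tolerance is an arbitrary assumption.

```python
def length_mismatches(
    example_outputs: list[str],
    target_words: int,
    tolerance: float = 0.5,  # assumed: allow outputs within +/-50% of the target
) -> list[tuple[int, int]]:
    """Return (index, word_count) for examples far from the target length."""
    low = target_words * (1 - tolerance)
    high = target_words * (1 + tolerance)
    flagged = []
    for index, output in enumerate(example_outputs):
        word_count = len(output.split())
        if not low <= word_count <= high:
            flagged.append((index, word_count))
    return flagged
```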
4. No separator between examples and the real question
Without a clear separator, the model can misread your actual question as another example input and generate a fictional answer rather than answering your real question. Use ---, three blank lines, or a labelled divider between the last example and the live question. Be consistent.
5. Using biased examples for classification
If all your few-shot examples show positive examples of a class (this is a "bug", this is a "bug", this is a "bug") without showing non-examples, the model develops a bias toward that class. For binary classification, always include at least one positive and one negative example. For multi-class, aim for rough balance across the classes you care about.
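Label balance is simple to measure from the example set. The sketch below is illustrative: `label_balance` is a hypothetical helper that reports which allowed labels have no example and how dominant the most common label is.

```python
from collections import Counter


def label_balance(example_labels: list[str], allowed_labels: list[str]) -> dict:
    """Summarise label coverage across a set of few-shot examples."""
    counts = Counter(example_labels)
    total = sum(counts.values())
    return {
        # Allowed labels that no example demonstrates.
        "missing": [label for label in allowed_labels if counts[label] == 0],
        # Share of examples taken by the most common label (1.0 = all one class).
        "max_share": max(counts.values()) / total if total else 0.0,
    }
```

A `max_share` of 1.0 is the all-"bug" situation described above. A long `missing` list is not automatically a problem, but it shows which classes the model is learning from the instruction alone.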
6. Treating few-shot as a substitute for a clear task description
Examples clarify a task; they don't define it. If your prompt has no instruction and only examples, the model is entirely pattern-matching with no grounding in what the task actually is. A clear task description plus 2 examples will usually outperform 5 examples with no description. Always lead with the instruction; use examples to refine it, not to replace it.