Cost-Optimised Model Routing
You don't need GPT-4o to answer a yes/no question. Smart routing classifies every incoming request, then sends it to the cheapest model that can handle it well. When most traffic is routine, this can cut inference costs by 60–80% with little or no loss in output quality.
01 Why Model Routing Matters
Most AI applications send every request to the same model — usually the most capable one available. That's the path of least resistance, but it's expensive at scale.
Consider a customer support bot handling 50,000 queries per day. Most of those queries are routine: "What's my order status?", "What are your hours?", "How do I reset my password?" These don't need GPT-4o or Claude Sonnet. GPT-4o-mini or Gemini Flash could handle them just as well at a fraction of the cost.
When You Don't Need It
Routing adds complexity. If your application is:
- A one-person tool with low request volume (fewer than ~500 requests/day)
- A use case where every task genuinely requires frontier-level reasoning
- A prototype where latency/cost aren't yet constraints
… then start simple. Add a router when cost becomes a real concern or when task types clearly vary in complexity.
02 The Four-Tier Model Landscape
Think of models as four tiers, each one roughly an order of magnitude more expensive than the tier below. The router's job is to identify the lowest tier that can handle a given request reliably.
- Tier 1, Cheap & Quick: Gemini 2.0 Flash, Llama 3.1 8B, Claude Haiku 3.5
- Tier 2, Mid-Range: GPT-4o, Gemini 2.5 Pro, Mistral Medium
- Tier 3, High Capability: GPT-4.5, Gemini 2.5 Pro Exp, Llama 3.3 70B
- Tier 4, Max Reasoning: o3, Gemini 2.5 Pro (full), GPT-4.1
| Task Type | Tier | Example Models | Why |
|---|---|---|---|
| Yes/no questions, fact lookup, simple classification | Tier 1 | GPT-4o-mini, Haiku, Flash | Deterministic, no reasoning chain needed |
| Summarisation, data extraction, template filling | Tier 1–2 | Haiku, GPT-4o-mini, Mistral Small | Pattern-based, minimal judgment |
| Writing, analysis, structured output with nuance | Tier 2 | Sonnet, GPT-4o, Gemini 2.5 Pro | Requires coherence, style, contextual reasoning |
| Complex code review, legal analysis, multi-step logic | Tier 3 | Sonnet 4.5, GPT-4.5 | Needs deep reasoning, attention to detail |
| Research synthesis, autonomous agents, scientific reasoning | Tier 4 | Opus, o3 | Multi-step, novel problem solving, high stakes |
03 Request Classification
The router needs to decide which tier a request belongs to. There are three main approaches — each with different accuracy/cost profiles.
Approach 1: LLM-based Classifier
Use a cheap, fast model (Tier 1) as the router itself. This is typically the most accurate of the three approaches and a common default, at the price of one extra model call per request.
Classify the following user request by complexity.
Output ONLY one word: SIMPLE, MEDIUM, COMPLEX, or FRONTIER.
SIMPLE — Yes/no, single fact lookup, basic classification, greeting
MEDIUM — Summarise, extract, fill template, basic writing, FAQ answer
COMPLEX — Multi-step reasoning, code generation, long-form writing, analysis
FRONTIER — Deep research, autonomous task, scientific reasoning, adversarial
User request: {user_message}
Classification:
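This prompt is designed to run against gpt-4o-mini or claude-haiku. At ~150 input tokens and gpt-4o-mini's $0.15 per million input tokens, it costs ~$0.00002 per classification (roughly $0.0001 on Claude Haiku 3.5 at $0.80 per million), which is essentially free even at high volume.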
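Even with "Output ONLY one word", a classifier model may return lowercase text, trailing punctuation, or a short preamble. The sketch below shows one way to normalise that output before using it as a tier. `parse_tier` and `VALID_TIERS` are hypothetical names introduced here, and falling back to MEDIUM is an assumption that mirrors the default used elsewhere in this guide.

```python
import re
from typing import Optional

# Hypothetical helper; labels match the classifier prompt above.
VALID_TIERS = ("SIMPLE", "MEDIUM", "COMPLEX", "FRONTIER")
_TIER_RE = re.compile(r"\b(" + "|".join(VALID_TIERS) + r")\b")


def parse_tier(raw: Optional[str], default: str = "MEDIUM") -> str:
    """Return the first known tier label found in `raw`, else `default`."""
    if not raw:
        return default
    match = _TIER_RE.search(raw.upper())
    return match.group(1) if match else default
```

Taking the first label found is a deliberate simplification: if the model rambles and mentions two tiers, a stricter version might reject the output and fall back to the default.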
Approach 2: Rule-based (Zero-cost)
Use regex and heuristics. No model call needed. Lower accuracy but zero latency and zero cost.
import re

SIMPLE_PATTERNS = [
    r"\b(what is|what are|who is|when did|yes or no|true or false)\b",
    r"\b(define|meaning of|how do you spell)\b",
]
FRONTIER_PATTERNS = [
    r"\b(research|synthesise|autonomous|multi-step|write a comprehensive)\b",
    r"\b(analyze the entire|compare all|explain in depth)\b",
]

def rule_classify(text: str) -> str:
    t = text.lower()
    if any(re.search(p, t) for p in FRONTIER_PATTERNS):
        return "FRONTIER"
    if len(text.split()) < 15 and any(re.search(p, t) for p in SIMPLE_PATTERNS):
        return "SIMPLE"
    if len(text.split()) > 200:
        return "COMPLEX"
    return "MEDIUM"  # default
Approach 3: Embedding Similarity (Calibrated)
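Pre-label a set of example prompts by tier and average each tier's embeddings into a centroid. At runtime, embed the new request and pick the tier whose centroid is most similar. This is fast after one-time indexing and handles novel phrasings better than rules, though each request still costs one embedding call.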
from openai import OpenAI
import numpy as np

client = OpenAI()

# Pre-computed: one centroid vector per tier.
# load_tier_centroids() is your own loader (not shown here).
TIER_CENTROIDS = load_tier_centroids()  # dict[str, np.ndarray]

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return np.array(resp.data[0].embedding)

def embedding_classify(text: str) -> str:
    vec = embed(text)
    # Cosine similarity between the request and each tier centroid
    scores = {
        tier: float(np.dot(vec, centroid) /
                    (np.linalg.norm(vec) * np.linalg.norm(centroid)))
        for tier, centroid in TIER_CENTROIDS.items()
    }
    return max(scores, key=scores.get)
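The snippet above assumes the centroids already exist. As a rough sketch of the one-time indexing step, the code below averages the embeddings of labelled example prompts into one centroid per tier. `build_tier_centroids`, `cosine`, and `nearest_tier` are hypothetical names, and the embedding function is passed in as a parameter so the logic can be tried without an API call. It uses plain Python lists; wrapping each centroid in `np.array` would make it match the `dict[str, np.ndarray]` shape used above.

```python
import math
from typing import Callable, Sequence


def build_tier_centroids(
    labelled: dict[str, Sequence[str]],
    embed_fn: Callable[[str], Sequence[float]],
) -> dict[str, list[float]]:
    """Average each tier's example embeddings into a single centroid."""
    centroids: dict[str, list[float]] = {}
    for tier, prompts in labelled.items():
        vectors = [list(embed_fn(p)) for p in prompts]
        if not vectors:
            continue  # skip tiers with no labelled examples
        dim = len(vectors[0])
        centroids[tier] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return centroids


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_tier(vec: Sequence[float], centroids: dict[str, list[float]]) -> str:
    """Pick the tier whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda tier: cosine(vec, centroids[tier]))
```

A few dozen examples per tier is a reasonable starting assumption; the right number depends on how varied your traffic is.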
Hybrid: Rules first, LLM fallback
The pragmatic approach — use rules for obvious cases (no latency cost), fall back to LLM classification for ambiguous requests.
def classify_request(text: str) -> str:
    # Fast path: rule-based
    rule_result = rule_classify(text)
    if rule_result in ("SIMPLE", "FRONTIER"):
        # High-confidence extremes: trust the rule
        return rule_result
    # Ambiguous middle ground → LLM classifier.
    # llm_classify is the Approach 1 classifier (e.g. ModelRouter.classify
    # in section 5), running on gpt-4o-mini.
    return llm_classify(text)
04 Router Architecture
The router sits between your application and the LLM APIs. Every request passes through it before reaching a model.
The router has four responsibilities:
Classify
Determine the complexity tier for the incoming request using your chosen classification strategy (rules, LLM, embeddings, or hybrid). This happens before the generation call.
Select
Map tier → model using a configurable TIER_MODELS dict. This keeps model selection decoupled from business logic — swap models without touching routing code.
Generate & Log
Call the selected model via OpenRouter (or direct API). Capture tier, model, prompt_tokens, completion_tokens, and cost_usd for every request.
Validate (optional)
Check the response before returning it. If validation fails (malformed JSON, too short, policy violation), escalate to the next tier. This is the cascade pattern — covered in section 6.
05 Implementation: Python Router
Here's a complete reference router class built on OpenRouter. Treat it as a starting point for production: it has no retries, timeouts, or error handling yet. OpenRouter gives access to multiple model providers through a single OpenAI-compatible API, and it can return per-request cost in the response body.
import logging
import os
import time
from dataclasses import dataclass
from typing import Optional

from openai import OpenAI

logger = logging.getLogger(__name__)

# ── Model tier configuration ──────────────────────────────
# Model slugs change over time; verify them against OpenRouter's model list.
TIER_MODELS: dict[str, str] = {
    "SIMPLE": "openai/gpt-4o-mini",
    "MEDIUM": "anthropic/claude-3.5-haiku",
    "COMPLEX": "anthropic/claude-sonnet-4",
    "FRONTIER": "anthropic/claude-opus-4",
}
CLASSIFIER_MODEL = "openai/gpt-4o-mini"
CLASSIFIER_PROMPT = """Classify the following user request.
Output ONLY one word: SIMPLE, MEDIUM, COMPLEX, or FRONTIER.
SIMPLE: Yes/no, single fact, basic classification, greeting
MEDIUM: Summarise, extract, fill template, basic writing
COMPLEX: Multi-step reasoning, code, long-form writing, analysis
FRONTIER: Deep research, autonomous tasks, scientific reasoning
User request: {message}
Classification:"""
@dataclass
class RouterResult:
    content: str
    tier: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: int = 0
    escalated: bool = False
class ModelRouter:
    def __init__(self, api_key: Optional[str] = None):
        self.client = OpenAI(
            api_key=api_key or os.environ["OPENROUTER_API_KEY"],
            base_url="https://openrouter.ai/api/v1",
        )

    # ── Classification ─────────────────────────────────────
    def classify(self, message: str) -> str:
        # Truncate long inputs so classification stays cheap
        prompt = CLASSIFIER_PROMPT.format(message=message[:800])
        resp = self.client.chat.completions.create(
            model=CLASSIFIER_MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0,
        )
        # content can be None; treat that as an unknown label
        tier = (resp.choices[0].message.content or "").strip().upper()
        # Unknown labels fall back to MEDIUM
        return tier if tier in TIER_MODELS else "MEDIUM"
    # ── Core request ───────────────────────────────────────
    def _call_model(
        self,
        model: str,
        messages: list[dict],
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> tuple[str, dict]:
        start = time.perf_counter()  # monotonic clock for latency timing
        resp = self.client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = int((time.perf_counter() - start) * 1000)
        usage = resp.usage
        # OpenRouter returns cost in usage.cost (USD); default to 0.0 if absent
        cost = getattr(usage, "cost", 0.0) or 0.0
        return resp.choices[0].message.content or "", {
            "input_tokens": usage.prompt_tokens,
            "output_tokens": usage.completion_tokens,
            "cost_usd": float(cost),
            "latency_ms": latency,
        }
    # ── Public route() method ──────────────────────────────
    def route(
        self,
        messages: list[dict],
        system_prompt: str = "",
        temperature: float = 0.7,
        force_tier: Optional[str] = None,
    ) -> RouterResult:
        user_message = next(
            (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
        )
        tier = force_tier or self.classify(user_message)
        model = TIER_MODELS[tier]
        full_messages = messages
        if system_prompt:
            full_messages = [{"role": "system", "content": system_prompt}] + messages
        content, meta = self._call_model(model, full_messages, temperature)
        logger.info(
            "routed",
            extra={"tier": tier, "model": model, "cost_usd": meta["cost_usd"]},
        )
        return RouterResult(content=content, tier=tier, model=model, **meta)
Usage
router = ModelRouter()
result = router.route(
    messages=[{"role": "user", "content": "What's the capital of France?"}],
    system_prompt="You are a helpful assistant.",
)
print(result.content) # "Paris"
print(result.tier) # "SIMPLE"
print(result.model) # "openai/gpt-4o-mini"
print(f"${result.cost_usd:.6f}") # "$0.000012"
06 Cascade & Fallback Patterns
A cascade router tries the cheapest model first. If the response fails a quality check, it automatically escalates to the next tier. This is the most cost-effective pattern when you can define a quality signal.
When to escalate
- Response doesn't parse as expected format (e.g., expected JSON but got prose)
- Response is too short (below a minimum character threshold)
- Response contains refusal or uncertainty markers ("I'm not sure", "I cannot")
- Response fails a downstream validation rule
- Timeout exceeded (model too slow for the SLA)
import json

ESCALATION_PATH = ["SIMPLE", "MEDIUM", "COMPLEX", "FRONTIER"]
UNCERTAINTY_PHRASES = [
    "i'm not sure", "i don't know", "i cannot", "i'm unable",
    "i don't have enough", "as an ai",
]

def response_is_uncertain(text: str) -> bool:
    lower = text.lower()
    return any(phrase in lower for phrase in UNCERTAINTY_PHRASES)

def requires_json(task_type: str) -> bool:
    return task_type in ("extraction", "classification", "structured_output")

class CascadeRouter(ModelRouter):
    def cascade(
        self,
        messages: list[dict],
        task_type: str = "general",
        start_tier: str = "SIMPLE",
        min_length: int = 20,
    ) -> RouterResult:
        path = ESCALATION_PATH[ESCALATION_PATH.index(start_tier):]
        last_result = None
        for tier in path:
            result = self.route(messages, force_tier=tier)
            result.escalated = (tier != start_tier)
            last_result = result  # remember the most recent attempt
            # Quality checks
            if len(result.content.strip()) < min_length:
                continue  # Too short → escalate
            if response_is_uncertain(result.content):
                continue  # Uncertain → escalate
            if requires_json(task_type):
                try:
                    json.loads(result.content)
                except json.JSONDecodeError:
                    continue  # Bad JSON → escalate
            return result  # Passed all checks
        # Every tier failed a check: return the highest tier's attempt.
        # Note: cost_usd covers only that attempt, not the earlier failed ones.
        return last_result
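Whether a cascade actually saves money depends on how often the cheap tiers pass the quality checks, because every failed attempt is still billed. The sketch below estimates the expected cost per request. `expected_cascade_cost` is a hypothetical helper, and it assumes each tier's pass rate is independent of the others, which real traffic only approximates.

```python
from typing import Sequence


def expected_cascade_cost(
    tier_costs: Sequence[float],
    pass_rates: Sequence[float],
) -> float:
    """Expected cost per request when tiers are tried in order.

    tier_costs[i] is the cost of one call at tier i.
    pass_rates[i] is the probability that tier i's response passes the
    quality checks. The final tier is always accepted, so pass_rates has
    one fewer entry than tier_costs.
    """
    if len(pass_rates) != len(tier_costs) - 1:
        raise ValueError("pass_rates must have len(tier_costs) - 1 entries")
    expected = 0.0
    reach_probability = 1.0  # chance the cascade gets as far as this tier
    for i, cost in enumerate(tier_costs):
        expected += reach_probability * cost
        if i < len(pass_rates):
            reach_probability *= 1.0 - pass_rates[i]
    return expected
```

For instance, with a cheap tier costing 1 unit that passes 90% of the time and a fallback costing 10 units, the expected cost is 1 + 0.1 × 10 = 2 units, against 10 for always using the stronger model. If the pass rate drops to 10%, the cascade costs 1 + 0.9 × 10 = 10 units, which matches always using the stronger model and adds latency on top.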
Parallel Routing (Confidence-Based)
An alternative pattern: run a cheap tier and a stronger tier in parallel (the code below uses SIMPLE and COMPLEX). If they agree on the answer (measured by embedding similarity), return the cheaper response. This eliminates cascade latency at the cost of always paying for two model calls, plus two embedding calls for the comparison.
import asyncio
import numpy as np

async def parallel_route(router, messages, similarity_threshold=0.92):
    # Run both tiers concurrently (route() is blocking, so use worker threads)
    simple_task = asyncio.create_task(
        asyncio.to_thread(router.route, messages, force_tier="SIMPLE")
    )
    complex_task = asyncio.create_task(
        asyncio.to_thread(router.route, messages, force_tier="COMPLEX")
    )
    simple_result, complex_result = await asyncio.gather(simple_task, complex_task)
    # Compare semantic similarity of responses.
    # embed() is the helper from section 3; it is also blocking.
    emb_simple = embed(simple_result.content)
    emb_complex = embed(complex_result.content)
    similarity = float(np.dot(emb_simple, emb_complex) /
                       (np.linalg.norm(emb_simple) * np.linalg.norm(emb_complex)))
    if similarity >= similarity_threshold:
        return simple_result  # Agree → use cheaper
    else:
        return complex_result  # Disagree → trust the better model
07 Latency vs Cost Trade-offs
Lower tiers aren't just cheaper; they're usually faster too. Smaller models have fewer parameters to run inference through, which generally means lower time-to-first-token (TTFT) and higher generation throughput, although provider load and infrastructure also affect both.
| Model | Tier | TTFT (typical) | Tokens/sec | Input cost (USD / 1M tokens) | Output cost (USD / 1M tokens) |
|---|---|---|---|---|---|
| GPT-4o-mini | 1 | ~0.3s | ~110 | $0.15 | $0.60 |
| Claude Haiku 3.5 | 1 | ~0.4s | ~120 | $0.80 | $4.00 |
| Claude Sonnet 4 | 2 | ~0.8s | ~85 | $3.00 | $15.00 |
| GPT-4o | 2 | ~0.8s | ~80 | $2.50 | $10.00 |
| Gemini 2.5 Pro | 3 | ~1.2s | ~60 | $1.25 | $10.00 |
| Claude Opus 4 | 4 | ~2s | ~40 | $15.00 | $75.00 |
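To turn the per-million prices above into per-request numbers, the following sketch computes the cost of one request and the average cost of a traffic mix. The prices are copied from the table; `request_cost` and `blended_cost` are hypothetical helper names, and the token counts in the usage note are illustrative assumptions, not measurements.

```python
# USD per million tokens (input, output), copied from the table above.
PRICES_PER_M: dict[str, tuple[float, float]] = {
    "GPT-4o-mini": (0.15, 0.60),
    "Claude Haiku 3.5": (0.80, 4.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude Opus 4": (15.00, 75.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed prices."""
    in_price, out_price = PRICES_PER_M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def blended_cost(mix: dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Average cost per request for a traffic mix {model: share of requests}."""
    return sum(
        share * request_cost(model, input_tokens, output_tokens)
        for model, share in mix.items()
    )
```

Assuming 500 input and 200 output tokens per request, sending everything to GPT-4o costs about $0.00325 per request, while routing 80% to GPT-4o-mini and 20% to GPT-4o costs about $0.000806, roughly a 75% reduction before classifier overhead. Real savings depend on your actual traffic mix and token counts.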
Streaming Latency
Streaming response tokens as they're generated dramatically improves perceived latency, even when the model is slow. For user-facing apps, always stream. The user sees the first token after roughly the model's TTFT (about 0.3–2s for the models above, plus any classification time) regardless of total response length.
def stream_routed(router, messages, system_prompt=""):
    # Assumes the last message is the user's latest turn
    tier = router.classify(messages[-1]["content"])
    model = TIER_MODELS[tier]
    full_messages = messages
    if system_prompt:
        full_messages = [{"role": "system", "content": system_prompt}] + messages
    stream = router.client.chat.completions.create(
        model=model,
        messages=full_messages,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks (e.g. usage-only) carry no choices
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content  # Stream tokens to caller
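To check what latency users actually experience, you can wrap any token stream and record when the first chunk arrives. This is a minimal sketch: `timed_stream` is a hypothetical helper that works on any iterable of strings, including the generator returned by `stream_routed`. Because a generator's body only starts running when it is first advanced, the measured TTFT includes the classification call when the wrapped stream is `stream_routed`.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], stats: dict) -> Iterator[str]:
    """Yield chunks unchanged while recording timing into `stats`.

    After the stream is fully consumed, `stats` holds:
      ttft_s  - seconds until the first chunk arrived (absent if none arrived)
      total_s - seconds until the stream ended
      chunks  - number of chunks yielded
    """
    start = time.perf_counter()
    count = 0
    for chunk in chunks:
        if count == 0:
            stats["ttft_s"] = time.perf_counter() - start
        count += 1
        yield chunk
    stats["total_s"] = time.perf_counter() - start
    stats["chunks"] = count
```

The caller passes in an empty dict and reads it once the stream is exhausted; logging `ttft_s` alongside the tier gives the per-tier latency percentiles discussed in section 9.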
08 Prompt Adaptation by Tier
A prompt optimised for GPT-4o may not work as well for a smaller model. Tier 1 models benefit from simpler, more directive prompts. Tier 4 models can handle nuanced, layered instructions that would confuse smaller models.
Prompt tiers in practice
SYSTEM_PROMPTS: dict[str, str] = {
    "SIMPLE": (
        "You are a helpful assistant. Answer questions directly and briefly. "
        "For yes/no questions, answer yes or no first, then optionally explain."
    ),
    "MEDIUM": (
        "You are a helpful assistant. Provide clear, well-structured answers. "
        "Use bullet points for lists. Keep responses focused and concise."
    ),
    "COMPLEX": (
        "You are an expert assistant. Analyse questions carefully before answering. "
        "Structure your response with reasoning, then conclusion. "
        "If there are trade-offs, enumerate them clearly."
    ),
    "FRONTIER": (
        "You are an expert research assistant with deep domain knowledge. "
        "Approach problems methodically: clarify assumptions, break down the problem, "
        "reason step by step, consider edge cases, and synthesise a nuanced answer. "
        "Cite your reasoning at each step."
    ),
}

def route_with_adapted_prompt(router, user_message: str) -> RouterResult:
    tier = router.classify(user_message)
    system_prompt = SYSTEM_PROMPTS[tier]
    return router.route(
        messages=[{"role": "user", "content": user_message}],
        system_prompt=system_prompt,
        force_tier=tier,
    )
Token budget by tier
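Set an appropriate max_tokens per tier. max_tokens is a cap, not a charge: you pay only for tokens actually generated. A cap that is too generous lets a rambling response run up cost and latency; a cap that is too tight truncates responses.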
MAX_TOKENS_BY_TIER: dict[str, int] = {
    "SIMPLE": 256,      # Short factual answers
    "MEDIUM": 1024,     # Summaries, extractions
    "COMPLEX": 4096,    # Analysis, long-form writing
    "FRONTIER": 16384,  # Deep research, complex agents
}
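One way to tell whether a budget is too tight is the `finish_reason` field on a chat completion choice, which is `"length"` when generation stopped at the token limit. The sketch below suggests a larger budget for a retry in that case. `retry_budget` is a hypothetical helper, and doubling up to a ceiling is an assumed policy, not a recommendation from any provider.

```python
from typing import Optional

# Same budgets as the mapping above, repeated so this sketch runs on its own.
MAX_TOKENS_BY_TIER: dict[str, int] = {
    "SIMPLE": 256,
    "MEDIUM": 1024,
    "COMPLEX": 4096,
    "FRONTIER": 16384,
}


def retry_budget(
    tier: str,
    finish_reason: Optional[str],
    ceiling: int = 16384,
) -> Optional[int]:
    """Return a larger max_tokens to retry with, or None if no retry is needed.

    A retry is suggested only when the response was cut off at the token
    limit (finish_reason == "length") and the tier's budget is below `ceiling`.
    """
    budget = MAX_TOKENS_BY_TIER[tier]
    if finish_reason != "length" or budget >= ceiling:
        return None
    return min(budget * 2, ceiling)
```

If a tier triggers this often, raising its default budget is usually cheaper than paying for a truncated response and then a retry.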
09 Tracking & Observability
Without logging, you're flying blind. You won't know which tiers are being used, whether your classifier is accurate, or how much each feature costs.
What to log on every request
import time, json, hashlib

def log_request(result: RouterResult, user_message: str, feature: str):
    record = {
        "ts": int(time.time()),
        "feature": feature,  # e.g. "support_bot", "code_review"
        "msg_hash": hashlib.sha256(user_message.encode()).hexdigest()[:12],
        "tier": result.tier,
        "model": result.model,
        "input_tokens": result.input_tokens,
        "output_tokens": result.output_tokens,
        "cost_usd": round(result.cost_usd, 8),
        "latency_ms": result.latency_ms,
        "escalated": result.escalated,
    }
    print(json.dumps(record))  # Pipe to your log aggregator (Datadog, Grafana, etc.)
Key metrics to track
- Cost by feature: Which endpoint is responsible for the most spend?
- Tier distribution: What % of requests hit each tier? (Should be >50% Tier 1 for most support/Q&A apps)
- Escalation rate: What % of Tier 1 requests escalate? (<10% is good; >20% = classifier problem)
- Latency p50/p95 by tier: Is your SLA being met?
- Cost per feature per day: Trend over time as usage grows
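As an illustration of how these metrics fall out of the log records, the sketch below aggregates a list of records shaped like the ones `log_request` emits. `summarise` and `percentile` are hypothetical helpers. The escalation rate here is the share of all requests that were escalated, since each record stores only the final tier, and percentiles use the simple nearest-rank method.

```python
import math
from collections import Counter, defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (no interpolation) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(records: list[dict]) -> dict:
    """Aggregate request records into cost, tier, escalation and latency stats."""
    if not records:
        return {}
    cost_by_feature: dict[str, float] = defaultdict(float)
    tier_counts: Counter = Counter()
    latency_by_tier: dict[str, list[float]] = defaultdict(list)
    escalated = 0
    for r in records:
        cost_by_feature[r["feature"]] += r["cost_usd"]
        tier_counts[r["tier"]] += 1
        latency_by_tier[r["tier"]].append(r["latency_ms"])
        escalated += 1 if r["escalated"] else 0
    total = len(records)
    return {
        "cost_by_feature": dict(cost_by_feature),
        "tier_share": {t: n / total for t, n in tier_counts.items()},
        "escalation_rate": escalated / total,
        "latency_ms": {
            t: {"p50": percentile(v, 50), "p95": percentile(v, 95)}
            for t, v in latency_by_tier.items()
        },
    }
```

In practice a log aggregator would compute these with its own query language; the point of the sketch is only to pin down what each metric means.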
OpenRouter cost tracking
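OpenRouter can return the actual billed cost in the response, so no estimation is needed. Depending on the API version and request options, you may need to enable usage accounting on the request for the cost field to be populated, which is why the code below treats it as optional.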
resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=messages,
)
# OpenRouter includes cost in the usage object
usage = resp.usage
cost_usd = getattr(usage, "cost", None)
if cost_usd is not None:
    print(f"This request cost: ${float(cost_usd):.8f}")
    # e.g. "This request cost: $0.00002340"
# Alternatively, use the generation stats endpoint
# GET https://openrouter.ai/api/v1/generation?id={generation_id}
# Returns detailed cost breakdown, model used, and token counts
10 Common Failure Modes
Keep model names in the TIER_MODELS dict. The tier is stable; the model name is not. Update the mapping in one place when model availability changes.