Tools Advanced ~1 hour to implement

Cost-Optimised Model Routing

You don't need GPT-4o to answer a yes/no question. Smart routing classifies every incoming request, then sends it to the cheapest model that can handle it well. In workloads dominated by routine requests, that can cut inference costs by 60–80% without a noticeable drop in output quality.

Aether Intel Team May 2025 18 min read

01 Why Model Routing Matters

Most AI applications send every request to the same model — usually the most capable one available. That's the path of least resistance, but it's expensive at scale.

Consider a customer support bot handling 50,000 queries per day. Most of those queries are routine: "What's my order status?", "What are your hours?", "How do I reset my password?" These don't need GPT-4o or Claude Sonnet. GPT-4o-mini or Gemini Flash could handle them just as well at a fraction of the cost.

💡

The key insight: Done well, model routing doesn't have to compromise quality. It matches task requirements to model capabilities, so you only pay for the extra capability when you actually need it.

When You Don't Need It

Routing adds complexity. If your application is:

A one-person tool with low request volume (fewer than ~500 requests/day)
A use case where every task genuinely requires frontier-level reasoning
A prototype where latency/cost aren't yet constraints

… then start simple. Add a router when cost becomes a real concern or when task types clearly vary in complexity.

02 The Four-Tier Model Landscape

Think of models as four tiers, each costing several times more per token than the tier below (see the indicative price range under each tier). The router's job is to identify the lowest tier that can handle a given request reliably.

TIER 1 — FAST

Cheap & Quick

GPT-4o-mini
Gemini 2.0 Flash
Llama 3.1 8B
Claude Haiku 3.5

~$0.10–0.40 / M tokens

TIER 2 — BALANCED

Mid-Range

Claude Sonnet 4
GPT-4o
Gemini 2.5 Pro
Mistral Medium

~$2–8 / M tokens

TIER 3 — SMART

High Capability

Claude Sonnet 4.5
GPT-4.5
Gemini 2.5 Pro Exp
Llama 3.3 70B

~$10–20 / M tokens

TIER 4 — FRONTIER

Max Reasoning

Claude Opus 4
o3
Gemini 2.5 Pro (full)
GPT-4.1

~$30–150 / M tokens

Task Type	Tier	Example Models	Why
Yes/no questions, fact lookup, simple classification	Tier 1	GPT-4o-mini, Haiku, Flash	Deterministic, no reasoning chain needed
Summarisation, data extraction, template filling	Tier 1–2	Haiku, GPT-4o-mini, Mistral Small	Pattern-based, minimal judgment
Writing, analysis, structured output with nuance	Tier 2	Sonnet, GPT-4o, Gemini 2.5 Pro	Requires coherence, style, contextual reasoning
Complex code review, legal analysis, multi-step logic	Tier 3	Sonnet 4.5, GPT-4.5	Needs deep reasoning, attention to detail
Research synthesis, autonomous agents, scientific reasoning	Tier 4	Opus, o3	Multi-step, novel problem solving, high stakes

⚠️

Model availability changes fast. Specific model names go in and out of availability and pricing. The tier structure stays stable — the exact models in each tier will shift. Parameterise your tier-to-model mapping so you can update it without rewriting business logic.
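One way to parameterise the mapping, as a minimal sketch: keep defaults in code and let a JSON override replace individual tiers at startup. The `ROUTER_TIER_MODELS` environment variable, the `load_tier_models` helper, and the model strings are illustrative assumptions, not part of any library.

```python
import json
import os
from typing import Optional

# Default tier-to-model mapping. The model strings are placeholders;
# substitute whatever identifiers your provider expects.
DEFAULT_TIER_MODELS = {
    "FAST": "gpt-4o-mini",
    "BALANCED": "claude-sonnet-4",
    "SMART": "claude-sonnet-4.5",
    "FRONTIER": "claude-opus-4",
}


def load_tier_models(raw: Optional[str] = None) -> dict:
    """Merge a JSON override over the defaults.

    If raw is None, read the (hypothetical) ROUTER_TIER_MODELS env var.
    """
    if raw is None:
        raw = os.environ.get("ROUTER_TIER_MODELS", "")
    overrides = json.loads(raw) if raw else {}
    unknown = set(overrides) - set(DEFAULT_TIER_MODELS)
    if unknown:
        # Fail fast on typos instead of silently ignoring them
        raise ValueError(f"Unknown tiers in override: {sorted(unknown)}")
    return {**DEFAULT_TIER_MODELS, **overrides}
```

With this shape, swapping the Tier 1 model is a config change such as `ROUTER_TIER_MODELS='{"FAST": "gemini-2.0-flash"}'`, and the routing code never sees a model name directly.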

03 Request Classification

The router needs to decide which tier a request belongs to. There are three main approaches — each with different accuracy/cost profiles.

Approach 1: LLM-based Classifier

Use a cheap, fast model (Tier 1) as the router itself. This is usually the most accurate of the three approaches, at the price of one extra model call per request.

Classifier Prompt

Classify the following user request by complexity.
Output ONLY one word: SIMPLE, MEDIUM, COMPLEX, or FRONTIER.

SIMPLE  — Yes/no, single fact lookup, basic classification, greeting
MEDIUM  — Summarise, extract, fill template, basic writing, FAQ answer
COMPLEX — Multi-step reasoning, code generation, long-form writing, analysis
FRONTIER — Deep research, autonomous task, scientific reasoning, adversarial

User request: {user_message}

Classification:

This prompt is designed to run against gpt-4o-mini or claude-haiku. At ~150 input tokens, it costs ~$0.00002 per classification — essentially free even at high volume.
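The arithmetic behind that estimate can be checked with a short sketch. The token counts and the gpt-4o-mini prices ($0.15 input, $0.60 output per million tokens) are assumptions based on the figures quoted in this guide; check current pricing before relying on them.

```python
def classification_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one classifier call, given prices in USD per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Assumed: ~150 prompt tokens, ~2 output tokens (a single label)
per_call = classification_cost_usd(150, 2, 0.15, 0.60)
# 50,000 queries/day, as in the support-bot example
per_day = per_call * 50_000

print(f"per call: ${per_call:.7f}")
print(f"per day:  ${per_day:.2f}")
```

Under these assumptions a classification costs roughly $0.00002, which comes to a little over a dollar a day at 50,000 requests.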

Approach 2: Rule-based (Zero-cost)

Use regex and heuristics. No model call is needed, so accuracy is lower but added latency is negligible and there is no per-request cost.

Python

import re

SIMPLE_PATTERNS = [
    r"\b(what is|what are|who is|when did|yes or no|true or false)\b",
    r"\b(define|meaning of|how do you spell)\b",
]
FRONTIER_PATTERNS = [
    r"\b(research|synthesise|autonomous|multi-step|write a comprehensive)\b",
    r"\b(analyze the entire|compare all|explain in depth)\b",
]

def rule_classify(text: str) -> str:
    t = text.lower()
    if any(re.search(p, t) for p in FRONTIER_PATTERNS):
        return "FRONTIER"
    if len(text.split()) < 15 and any(re.search(p, t) for p in SIMPLE_PATTERNS):
        return "SIMPLE"
    if len(text.split()) > 200:
        return "COMPLEX"
    return "MEDIUM"  # default

Approach 3: Embedding Similarity (Calibrated)

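Pre-label a set of example prompts by tier and average each tier's embeddings into a centroid. At runtime, embed the new request and pick the tier whose centroid is most similar. This is fast after one-time indexing (one embedding call per request) and handles novel phrasings better than rules.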

Python

from openai import OpenAI
import numpy as np

client = OpenAI()

# Pre-computed mapping: {tier_label: centroid_vector}
# load_tier_centroids() is a placeholder for your own loader (not shown here)
TIER_CENTROIDS = load_tier_centroids()  # dict[str, np.ndarray]

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return np.array(resp.data[0].embedding)

def embedding_classify(text: str) -> str:
    vec = embed(text)
    scores = {
        tier: float(np.dot(vec, centroid) /
               (np.linalg.norm(vec) * np.linalg.norm(centroid)))
        for tier, centroid in TIER_CENTROIDS.items()
    }
    return max(scores, key=scores.get)
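The snippet above assumes the centroids already exist. A minimal sketch of how they might be built from labelled examples follows; `build_tier_centroids` and `nearest_tier` are hypothetical helpers, and the embedding function is passed in so the logic can be tried without an API call. Plain Python lists are used to keep the sketch dependency-free.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def build_tier_centroids(
    examples: dict,
    embed_fn: Callable[[str], Vector],
) -> dict:
    """Average each tier's example embeddings into a single centroid.

    examples maps tier label -> list of example prompts for that tier.
    """
    centroids = {}
    for tier, prompts in examples.items():
        if not prompts:
            raise ValueError(f"No examples for tier {tier!r}")
        vectors = [embed_fn(p) for p in prompts]
        # Component-wise mean across all example vectors of this tier
        centroids[tier] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centroids


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_tier(vec: Vector, centroids: dict) -> str:
    """Return the tier whose centroid is most similar to vec."""
    return max(centroids, key=lambda tier: cosine(vec, centroids[tier]))
```

In practice you would compute the centroids once, offline, with the same embedding model used at runtime, and store them alongside the tier configuration.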

Hybrid: Rules first, LLM fallback

The pragmatic approach — use rules for obvious cases (no latency cost), fall back to LLM classification for ambiguous requests.

Python

def classify_request(text: str) -> str:
    # Fast path: rule-based
    rule_result = rule_classify(text)
    if rule_result in ("SIMPLE", "FRONTIER"):
        # High-confidence extremes — trust the rule
        return rule_result

    # Ambiguous middle ground → LLM classifier
    # llm_classify is not defined here; ModelRouter.classify in section 5 fills this role
    return llm_classify(text)  # uses gpt-4o-mini

🎯

Calibration matters. Run your classifier against a labelled test set of at least 200 real requests from your application. Aim for >90% accuracy on SIMPLE and FRONTIER tiers — those are the high-value routing decisions. MEDIUM/COMPLEX misrouting is less costly.
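A calibration run can be as simple as the sketch below, which reports accuracy separately for each true tier so that the SIMPLE and FRONTIER figures can be checked against the 90% target. `per_tier_accuracy` is a hypothetical helper; the classifier is passed in as a function, so any of the approaches above can be evaluated the same way.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def per_tier_accuracy(
    labelled: Iterable[Tuple[str, str]],
    classify_fn: Callable[[str], str],
) -> dict:
    """Fraction of requests of each true tier that the classifier gets right.

    labelled yields (request_text, true_tier) pairs.
    """
    totals, correct = Counter(), Counter()
    for text, true_tier in labelled:
        totals[true_tier] += 1
        if classify_fn(text) == true_tier:
            correct[true_tier] += 1
    return {tier: correct[tier] / totals[tier] for tier in totals}
```

Tiers that never appear in the labelled set are absent from the result, which is a useful reminder to collect examples for every tier before trusting the numbers.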

04 Router Architecture

The router sits between your application and the LLM APIs. Every request passes through it before reaching a model.

User Request (raw prompt) → Classify Complexity → Select Model (tier → model) → Generate (+ log cost) → Validate (quality check)

Core routing path: User Request through Generate. Optional: the Validate step, a quality gate before escalation.

The router has four responsibilities:

Classify

Determine the complexity tier for the incoming request using your chosen classification strategy (rules, embeddings, LLM, or hybrid). This happens before any generation call.

Select

Map tier → model using a configurable TIER_MODELS dict. This keeps model selection decoupled from business logic — swap models without touching routing code.

Generate & Log

Call the selected model via OpenRouter (or direct API). Capture tier, model, prompt_tokens, completion_tokens, and cost_usd for every request.

Validate (optional)

Check the response before returning it. If validation fails (malformed JSON, too short, policy violation), escalate to the next tier. This is the cascade pattern — covered in section 6.

05 Implementation: Python Router

Here's a complete working router class built on OpenRouter, which gives access to multiple model providers through a single OpenAI-compatible API and can report per-request cost in the response body. Treat it as a starting point: it has no retries or error handling yet (see sections 6 and 10).

Python — router.py

import logging
import os
import time
from dataclasses import dataclass
from typing import Optional

from openai import OpenAI

logger = logging.getLogger(__name__)

# ── Model tier configuration ──────────────────────────────
TIER_MODELS: dict[str, str] = {
    "SIMPLE":   "openai/gpt-4o-mini",
    "MEDIUM":   "anthropic/claude-3.5-haiku",  # OpenRouter slug for Claude Haiku 3.5
    "COMPLEX":  "anthropic/claude-sonnet-4",
    "FRONTIER": "anthropic/claude-opus-4",
}

CLASSIFIER_MODEL = "openai/gpt-4o-mini"

CLASSIFIER_PROMPT = """Classify the following user request.
Output ONLY one word: SIMPLE, MEDIUM, COMPLEX, or FRONTIER.

SIMPLE   — Yes/no, single fact, basic classification, greeting
MEDIUM   — Summarise, extract, fill template, basic writing
COMPLEX  — Multi-step reasoning, code, long-form writing, analysis
FRONTIER — Deep research, autonomous tasks, scientific reasoning

User request: {message}

Classification:"""


@dataclass
class RouterResult:
    content:   str
    tier:      str
    model:     str
    input_tokens:  int  = 0
    output_tokens: int  = 0
    cost_usd:      float = 0.0
    latency_ms:    int  = 0
    escalated:     bool = False


class ModelRouter:
    def __init__(self, api_key: Optional[str] = None):
        self.client = OpenAI(
            api_key=api_key or os.environ["OPENROUTER_API_KEY"],
            base_url="https://openrouter.ai/api/v1",
        )

    # ── Classification ─────────────────────────────────────
    def classify(self, message: str) -> str:
        prompt = CLASSIFIER_PROMPT.format(message=message[:800])
        resp = self.client.chat.completions.create(
            model=CLASSIFIER_MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0,
        )
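        # content can be None; also strip stray punctuation such as "SIMPLE."
        raw = (resp.choices[0].message.content or "").strip().upper()
        tier = raw.strip(".:\"' ")
        # Unrecognised output falls back to the MEDIUM tier
        return tier if tier in TIER_MODELS else "MEDIUM"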

    # ── Core request ───────────────────────────────────────
    def _call_model(
        self,
        model: str,
        messages: list[dict],
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> tuple[str, dict]:
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        resp = self.client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        latency = int((time.perf_counter() - start) * 1000)
        usage = resp.usage  # may be None if the provider omits usage data
        # OpenRouter can report cost in usage.cost (USD); default to 0.0 if absent
        cost = getattr(usage, "cost", 0.0) or 0.0
        return resp.choices[0].message.content or "", {
            "input_tokens":  getattr(usage, "prompt_tokens", 0) or 0,
            "output_tokens": getattr(usage, "completion_tokens", 0) or 0,
            "cost_usd":      float(cost),
            "latency_ms":    latency,
        }

    # ── Public route() method ──────────────────────────────
    def route(
        self,
        messages: list[dict],
        system_prompt: str = "",
        temperature: float = 0.7,
        force_tier: Optional[str] = None,
    ) -> RouterResult:
        user_message = next(
            (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
        )
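        tier = force_tier or self.classify(user_message)
        if tier not in TIER_MODELS:
            # Guard against a misspelled force_tier rather than raising a bare KeyError
            raise ValueError(f"Unknown tier: {tier!r}")
        model = TIER_MODELS[tier]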

        full_messages = messages
        if system_prompt:
            full_messages = [{"role": "system", "content": system_prompt}] + messages

        content, meta = self._call_model(model, full_messages, temperature)

        logger.info(
            "routed",
            extra={"tier": tier, "model": model, "cost_usd": meta["cost_usd"]}
        )

        return RouterResult(content=content, tier=tier, model=model, **meta)

Usage

Python

router = ModelRouter()

result = router.route(
    messages=[{"role": "user", "content": "What's the capital of France?"}],
    system_prompt="You are a helpful assistant."
)

print(result.content)    # e.g. "Paris" (exact wording varies by model)
print(result.tier)       # e.g. "SIMPLE"
print(result.model)      # e.g. "openai/gpt-4o-mini"
print(f"${result.cost_usd:.6f}")  # e.g. "$0.000012"

06 Cascade & Fallback Patterns

A cascade router tries the cheapest model first. If the response fails a quality check, it automatically escalates to the next tier. This is the most cost-effective pattern when you can define a quality signal.

When to escalate

Response doesn't parse as expected format (e.g., expected JSON but got prose)
Response is too short (below a minimum character threshold)
Response contains refusal or uncertainty markers ("I'm not sure", "I cannot")
Response fails a downstream validation rule
Timeout exceeded (model too slow for the SLA)

Python — cascade_router.py

import json

ESCALATION_PATH = ["SIMPLE", "MEDIUM", "COMPLEX", "FRONTIER"]

UNCERTAINTY_PHRASES = [
    "i'm not sure", "i don't know", "i cannot", "i'm unable",
    "i don't have enough", "as an ai",
]

def response_is_uncertain(text: str) -> bool:
    lower = text.lower()
    return any(phrase in lower for phrase in UNCERTAINTY_PHRASES)

def requires_json(task_type: str) -> bool:
    return task_type in ("extraction", "classification", "structured_output")


class CascadeRouter(ModelRouter):
    def cascade(
        self,
        messages: list[dict],
        task_type: str = "general",
        start_tier: str = "SIMPLE",
        min_length: int = 20,
    ) -> RouterResult:
        if start_tier not in ESCALATION_PATH:
            raise ValueError(f"Unknown start tier: {start_tier!r}")
        path = ESCALATION_PATH[ESCALATION_PATH.index(start_tier):]
        last_result = None

        for tier in path:
            result = self.route(messages, force_tier=tier)
            result.escalated = (tier != start_tier)
            # Note: cost_usd covers this attempt only, not earlier failed tiers
            last_result = result

            # Quality checks
            if len(result.content.strip()) < min_length:
                continue  # Too short → escalate

            if response_is_uncertain(result.content):
                continue  # Uncertain → escalate

            if requires_json(task_type):
                try:
                    json.loads(result.content)
                except json.JSONDecodeError:
                    continue  # Bad JSON → escalate

            return result  # Passed all checks

        # Every tier failed a check: return the highest-tier attempt
        return last_result

⚠️

Cascade cost accounting: Every escalation step incurs additional cost. Track your escalation rate per tier — if more than 15% of SIMPLE requests escalate, your classifier needs retuning, not a cheaper starting tier.

Parallel Routing (Confidence-Based)

An alternative pattern: run Tier 1 and Tier 2 in parallel. If they agree on the answer (measured by embedding similarity), return the Tier 1 response. This eliminates cascade latency at the cost of always paying for two model calls.

Python

import asyncio
import numpy as np

async def parallel_route(router, messages, similarity_threshold=0.92):
    # Run both tiers concurrently; router.route blocks, so use worker threads
    simple_result, complex_result = await asyncio.gather(
        asyncio.to_thread(router.route, messages, force_tier="SIMPLE"),
        asyncio.to_thread(router.route, messages, force_tier="COMPLEX"),
    )

    # Compare semantic similarity of responses.
    # embed() is the helper from Approach 3; it also blocks, so run it in threads
    emb_simple, emb_complex = await asyncio.gather(
        asyncio.to_thread(embed, simple_result.content),
        asyncio.to_thread(embed, complex_result.content),
    )
    similarity = float(np.dot(emb_simple, emb_complex) /
                       (np.linalg.norm(emb_simple) * np.linalg.norm(emb_complex)))

    if similarity >= similarity_threshold:
        return simple_result   # Agree → use cheaper
    return complex_result      # Disagree → trust the better model

07 Latency vs Cost Trade-offs

Lower tiers aren't just cheaper — they're also faster. Smaller models have fewer parameters to run inference through, which translates directly to lower time-to-first-token (TTFT) and faster generation throughput.

Model	Tier	TTFT (typical)	Tokens/sec	Input cost / M	Output cost / M
GPT-4o-mini	1	~0.3s	~110	$0.15	$0.60
Claude Haiku 3.5	1	~0.4s	~120	$0.80	$4.00
Claude Sonnet 4	2	~0.8s	~85	$3.00	$15.00
GPT-4o	2	~0.8s	~80	$2.50	$10.00
Gemini 2.5 Pro	2	~1.2s	~60	$1.25	$10.00
Claude Opus 4	4	~2s	~40	$15.00	$75.00
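To see what these prices imply for the support-bot example, here is a rough worked estimate. The prices come from the table above; the traffic mix (80% routable to GPT-4o-mini) and the token counts per request are assumptions chosen for illustration, and classifier overhead is ignored.

```python
# USD per million tokens, from the table above
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request to the given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def daily_cost(mix: dict, requests_per_day: int,
               input_tokens: int, output_tokens: int) -> float:
    """mix maps model name -> share of traffic (shares should sum to 1)."""
    return requests_per_day * sum(
        share * cost_per_request(model, input_tokens, output_tokens)
        for model, share in mix.items()
    )


# Assumed workload: 50,000 requests/day, 500 input and 200 output tokens each
baseline = daily_cost({"gpt-4o": 1.0}, 50_000, 500, 200)
routed = daily_cost({"gpt-4o-mini": 0.8, "gpt-4o": 0.2}, 50_000, 500, 200)
savings = 1 - routed / baseline

print(f"baseline ${baseline:.2f}/day, routed ${routed:.2f}/day, savings {savings:.0%}")
```

Under these assumptions the routed setup costs about $40 a day against about $162 unrouted, a reduction of roughly 75%. The result is sensitive to the traffic mix, so plug in your own tier distribution before quoting a figure.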

⏱️

SLA-driven routing: For user-facing applications with a latency SLA (e.g., respond in under 2 seconds), use latency as a hard filter. If the only model that can handle a task is too slow for your SLA, consider streaming the response to the user while it generates.
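Using latency as a hard filter might look like the sketch below. The TTFT values are the typical figures from the table above and will differ in practice; `TIER_CANDIDATES` and `pick_model` are hypothetical names, and the candidate order within a tier is an assumed preference.

```python
# Typical time-to-first-token in seconds, copied from the table above
TYPICAL_TTFT_S = {
    "GPT-4o-mini": 0.3,
    "Claude Haiku 3.5": 0.4,
    "Claude Sonnet 4": 0.8,
    "GPT-4o": 0.8,
    "Claude Opus 4": 2.0,
}

# Candidate models per tier, most preferred first (assumed ordering)
TIER_CANDIDATES = {
    1: ["GPT-4o-mini", "Claude Haiku 3.5"],
    2: ["Claude Sonnet 4", "GPT-4o"],
    4: ["Claude Opus 4"],
}


def pick_model(tier: int, max_ttft_s: float) -> tuple:
    """Return (model, within_sla).

    Picks the first candidate in the tier whose typical TTFT fits the budget.
    If none fits, returns the fastest candidate with within_sla=False so the
    caller can decide to stream the response instead.
    """
    candidates = TIER_CANDIDATES[tier]
    for model in candidates:
        if TYPICAL_TTFT_S[model] <= max_ttft_s:
            return model, True
    fastest = min(candidates, key=TYPICAL_TTFT_S.__getitem__)
    return fastest, False
```

In a real deployment the TTFT figures would come from your own latency logs (for example the p95 per model) rather than a static table.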

Streaming Latency

Streaming response tokens as they're generated dramatically improves perceived latency, even when the model is slow. For user-facing apps, stream by default. The user sees the first token after roughly the model's TTFT (about 0.3–2s in the table above, depending on tier), regardless of total response length.

Python — streaming

def stream_routed(router, messages, system_prompt=""):
    tier = router.classify(messages[-1]["content"])
    model = TIER_MODELS[tier]

    full_messages = messages
    if system_prompt:
        full_messages = [{"role": "system", "content": system_prompt}] + messages

    stream = router.client.chat.completions.create(
        model=model,
        messages=full_messages,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content  # Stream tokens to caller

08 Prompt Adaptation by Tier

A prompt optimised for GPT-4o may not work as well for a smaller model. Tier 1 models benefit from simpler, more directive prompts. Tier 4 models can handle nuanced, layered instructions that would confuse smaller models.

Prompt tiers in practice

Python — tiered system prompts

SYSTEM_PROMPTS: dict[str, str] = {
    "SIMPLE": (
        "You are a helpful assistant. Answer questions directly and briefly. "
        "For yes/no questions, answer yes or no first, then optionally explain."
    ),
    "MEDIUM": (
        "You are a helpful assistant. Provide clear, well-structured answers. "
        "Use bullet points for lists. Keep responses focused and concise."
    ),
    "COMPLEX": (
        "You are an expert assistant. Analyse questions carefully before answering. "
        "Structure your response with reasoning, then conclusion. "
        "If there are trade-offs, enumerate them clearly."
    ),
    "FRONTIER": (
        "You are an expert research assistant with deep domain knowledge. "
        "Approach problems methodically: clarify assumptions, break down the problem, "
        "reason step by step, consider edge cases, and synthesise a nuanced answer. "
        "Cite your reasoning at each step."
    ),
}

def route_with_adapted_prompt(router, user_message: str) -> RouterResult:
    tier = router.classify(user_message)
    system_prompt = SYSTEM_PROMPTS[tier]
    return router.route(
        messages=[{"role": "user", "content": user_message}],
        system_prompt=system_prompt,
        force_tier=tier,
    )

🎯

Chain-of-thought for Tier 3+. Adding "Think step by step" to your system prompt often improves Tier 3/4 output quality on reasoning tasks, though models with built-in reasoning (such as o3) may not need it. Don't add it to Tier 1/2 by default: it costs extra output tokens and can confuse smaller models.

Token budget by tier

Set appropriate max_tokens per tier. Over-allocating tokens wastes money; under-allocating truncates responses.

Python

MAX_TOKENS_BY_TIER: dict[str, int] = {
    "SIMPLE":   256,    # Short factual answers
    "MEDIUM":   1024,   # Summaries, extractions
    "COMPLEX":  4096,   # Analysis, long-form writing
    "FRONTIER": 16384,  # Deep research, complex agents
}
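The budget only helps if it reaches the API call and if truncation is noticed. A minimal sketch of both pieces follows; `generation_kwargs`, `was_truncated`, and the 1024-token fallback are assumptions for illustration. The check relies on the `finish_reason` value `"length"`, which OpenAI-compatible chat APIs use when a response stops at the token limit.

```python
MAX_TOKENS_BY_TIER = {
    "SIMPLE": 256,
    "MEDIUM": 1024,
    "COMPLEX": 4096,
    "FRONTIER": 16384,
}
DEFAULT_MAX_TOKENS = 1024  # assumed fallback for an unknown tier


def generation_kwargs(tier: str, temperature: float = 0.7) -> dict:
    """Keyword arguments to pass to chat.completions.create for this tier."""
    return {
        "max_tokens": MAX_TOKENS_BY_TIER.get(tier, DEFAULT_MAX_TOKENS),
        "temperature": temperature,
    }


def was_truncated(finish_reason: str) -> bool:
    """True if the response stopped because it hit the max_tokens limit."""
    return finish_reason == "length"
```

A router could log `was_truncated(resp.choices[0].finish_reason)` per tier; a high truncation rate on one tier suggests its budget is too small.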

09 Tracking & Observability

Without logging, you're flying blind. You won't know which tiers are being used, whether your classifier is accurate, or how much each feature costs.

What to log on every request

Python — structured logging

import time, json, hashlib

def log_request(result: RouterResult, user_message: str, feature: str):
    record = {
        "ts":             int(time.time()),
        "feature":        feature,           # e.g. "support_bot", "code_review"
        "msg_hash":       hashlib.sha256(user_message.encode()).hexdigest()[:12],
        "tier":           result.tier,
        "model":          result.model,
        "input_tokens":   result.input_tokens,
        "output_tokens":  result.output_tokens,
        "cost_usd":       round(result.cost_usd, 8),
        "latency_ms":     result.latency_ms,
        "escalated":      result.escalated,
    }
    print(json.dumps(record))   # Pipe to your log aggregator (Datadog, Grafana, etc.)

Key metrics to track

Cost by feature: Which endpoint is responsible for the most spend?
Tier distribution: What % of requests hit each tier? (Should be >50% Tier 1 for most support/Q&A apps)
Escalation rate: What % of Tier 1 requests escalate? (<10% is good; >20% = classifier problem)
Latency p50/p95 by tier: Is your SLA being met?
Cost per feature per day: Trend over time as usage grows
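These metrics can be derived directly from the structured log records. The sketch below assumes records shaped like the output of `log_request` above; `summarise` and `percentile` are hypothetical helpers. Because the records store only the final tier, the escalation rate here is the share of all requests that escalated, not the rate per starting tier.

```python
import math
from collections import Counter, defaultdict


def percentile(values: list, pct: float):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(records: list) -> dict:
    """Aggregate log records into the key routing metrics."""
    if not records:
        raise ValueError("No records to summarise")
    total = len(records)
    tiers = Counter(r["tier"] for r in records)
    cost_by_feature = defaultdict(float)
    latency_by_tier = defaultdict(list)
    for r in records:
        cost_by_feature[r["feature"]] += r["cost_usd"]
        latency_by_tier[r["tier"]].append(r["latency_ms"])
    return {
        "tier_share": {t: n / total for t, n in tiers.items()},
        "escalation_rate": sum(1 for r in records if r["escalated"]) / total,
        "cost_by_feature": dict(cost_by_feature),
        "latency_p50_p95": {
            t: (percentile(v, 50), percentile(v, 95))
            for t, v in latency_by_tier.items()
        },
    }
```

To measure escalation per starting tier, one option is to add a `start_tier` field to each log record.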

OpenRouter cost tracking

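OpenRouter can return the cost of each request in the response, so you don't need to estimate it from token counts. Depending on your request settings the field may be absent, which is why the code below reads it defensively with getattr.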

Python — OpenRouter cost extraction

resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=messages,
)

# OpenRouter includes cost in the usage object
usage = resp.usage
cost_usd = getattr(usage, "cost", None)

if cost_usd is not None:
    print(f"This request cost: ${float(cost_usd):.8f}")
    # e.g. "This request cost: $0.00002340"

# Alternatively, use the generation stats endpoint
# GET https://openrouter.ai/api/v1/generation?id={generation_id}
# Returns detailed cost breakdown, model used, and token counts

📊

Set up a cost alert. Use OpenRouter's dashboard to set a daily spend limit. This prevents a classification bug or runaway loop from sending every request to Claude Opus and burning through your budget overnight.

10 Common Failure Modes

❌ Over-routing to expensive models — classifier is too conservative, sends most requests to Tier 3/4

✅ Audit tier distribution weekly. If <40% of requests hit Tier 1, re-calibrate your classifier with more SIMPLE examples from your actual request logs.

❌ Under-routing (quality degradation) — Tier 1 handles tasks that genuinely need Tier 3, producing poor outputs

✅ Add a quality signal (output length, format validation, user thumbs-down) and track it by tier. Escalate systematically rather than one-off exceptions.

❌ Classification cost exceeds savings — LLM classifier uses a larger model than it needs to, or classifies every single request including trivial ones

✅ Use rules for the obvious extremes. Only invoke LLM classification for ambiguous middle-ground requests. Use gpt-4o-mini or Haiku — never Sonnet/GPT-4o as the classifier.

❌ Model name hardcoding — model strings scattered across codebase; when a model is deprecated you need to find and update dozens of references

✅ Always route through the TIER_MODELS dict. The tier is stable; the model name is not. Update the mapping in one place when model availability changes.

❌ No fallback on API error — model API returns 503 or rate-limit error; application crashes instead of trying alternate tier

✅ Wrap every model call in a try/except. On API error, escalate to the next tier (or a backup provider). Never surface raw API errors to users.

❌ Inconsistent classifier across services — different microservices implement different classification logic, leading to unpredictable routing behaviour

✅ Deploy the router as a shared library or internal microservice. One classifier, one TIER_MODELS dict, one log stream. Centralising the router also makes cost attribution easier.
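The API-error fallback described above could be sketched as follows. `call_with_fallback` is a hypothetical helper: the model call is passed in as a function of the tier, and the set of retryable exception types is a parameter. With the OpenAI SDK you would likely pass its error classes (for example `openai.APIError`), but that choice is left to the caller here.

```python
import time
from typing import Callable, Sequence


def call_with_fallback(
    call_fn: Callable[[str], str],
    tiers: Sequence[str],
    retryable: tuple = (Exception,),
    retries_per_tier: int = 1,
    backoff_s: float = 0.5,
) -> tuple:
    """Try each tier in order and return (tier_used, response).

    Each tier is retried retries_per_tier times with exponential backoff
    before moving on. Raises RuntimeError if every tier fails.
    """
    last_error = None
    for tier in tiers:
        for attempt in range(retries_per_tier + 1):
            try:
                return tier, call_fn(tier)
            except retryable as exc:
                last_error = exc
                if attempt < retries_per_tier:
                    time.sleep(backoff_s * (2 ** attempt))
    # Surface one clean error to the caller instead of a raw API error
    raise RuntimeError("All tiers failed") from last_error
```

The caller catches the final `RuntimeError` and shows a friendly message, so raw provider errors never reach users.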

