Agent Memory Architecture
Agents Advanced

Agent Memory Architecture:
How to Build AI Agents That Actually Remember

A stateless agent is a powerful tool. But every conversation starts from zero — no knowledge of the user, no context from previous sessions, no accumulated understanding of the domain it works in. That’s fine for a one-shot assistant. It’s a fundamental limitation for any agent you want users to return to. Here’s the complete architecture for giving agents memory that actually works in production.

Why Stateless Agents Fail in Production

The first version of almost every AI agent is stateless. You send a message, the agent responds, the conversation ends. On the next request — same day, same user, related topic — the agent has no memory of what came before. Every conversation is a cold start.

For demos and narrow single-task tools, this is acceptable. For any product where the value compounds with use — a personal assistant, a research agent, a customer-facing bot that improves with each interaction — statelessness is a ceiling. Users who return expecting continuity get a blank slate instead. They get frustrated. They churn.

But the problem runs deeper than user experience. Stateless agents make the same mistakes repeatedly. They ask users to re-explain context they’ve already provided. They apply generic responses to situations they have encountered dozens of times. They cannot build domain expertise because they can’t remember what they’ve learned.

Memory is not just recall

The goal of agent memory isn’t to reproduce past conversations verbatim. It’s to give the agent enough context to make better decisions, maintain continuity with the user, and accumulate domain-specific knowledge over time. Retrieval is the mechanism; relevance is the goal.

Adding memory to an agent is an architectural decision, not a feature you bolt on. It requires thinking carefully about what to store, where to store it, how to retrieve it efficiently, and — critically — how to inject it back into the model’s context without overwhelming the context window or introducing noise. The rest of this guide covers all of that.

The Three Memory Tiers

Human memory researchers describe multiple distinct systems that serve different functions: working memory for immediate active processing, episodic memory for autobiographical events, semantic memory for general knowledge and facts. The same taxonomy maps cleanly onto agent architecture — and for the same reasons. Each type of memory has a different lifetime, retrieval mechanism, and capacity constraint.

Tier 1
Working Memory

The current conversation. What the model is actively processing right now.

Lifetime: One session Capacity: Context window Retrieval: Automatic (in context) Storage: No persistence needed
Tier 2
Episodic Memory

Key events, decisions, and outcomes from past sessions. The agent’s personal history with this user.

Lifetime: Months to permanent Capacity: Large (compressed) Retrieval: Recency + relevance Storage: Structured DB
Tier 3
Semantic Memory

Facts, preferences, domain knowledge, learned patterns that apply across all users and sessions.

Lifetime: Permanent until updated Capacity: Very large Retrieval: Semantic similarity Storage: Vector store

Every production agent needs all three — but they need to be managed separately. The most common mistake is conflating them: trying to use the raw conversation history as a substitute for episodic memory, or injecting the full semantic knowledge base into every context window. Both approaches collapse under real usage.

Working Memory — The Context Window

Working memory is the simplest tier to understand and the most treacherous to manage in production. It is everything currently in the model’s context window: the system prompt, the conversation history for the current session, any tool outputs, and whatever retrieved memories you have injected from the other two tiers.

The context window has a hard size limit. Modern frontier models have large windows — 128k to 1M+ tokens — but you are paying for every token in every request, and retrieval quality degrades as windows fill. Context is not free capacity; it is a resource you manage deliberately.

What belongs in working memory

  • The current user message and immediate history: the last 5–20 turns of the active conversation, depending on how dense the content is
  • Injected episodic context: a compressed summary of relevant past sessions (not the raw transcripts)
  • Retrieved semantic chunks: the top-k most relevant knowledge base fragments for the current query
  • Tool outputs: results from any tools the agent invoked in the current session

Conversation pruning strategies

As a conversation grows, you need to decide what to keep in the active window and what to drop or compress. There are three main strategies:

Strategy 01

Sliding window (simplest)

Keep the last N turns, discard the rest. Fast to implement, zero additional LLM calls. The downside: important context from early in the conversation can fall out of the window. Works well for short, task-focused interactions.

Strategy 02

Summarise-and-compress

When the conversation exceeds a token threshold, pass the oldest N turns to a summarisation call and replace them with a compact summary paragraph. The model retains the gist of earlier context without the full token cost. Adds latency on the summarisation step but preserves continuity across long sessions.

Strategy 03

Selective retention

After each turn, ask a lightweight LLM call to classify which parts of the exchange contain information worth preserving long-term (decisions, stated preferences, key facts). Extract those immediately into episodic storage; drop everything else from the window. Most sophisticated approach, but requires careful design of the extraction prompt.

Don’t let working memory leak into storage

Raw conversation transcripts are expensive, redundant, and full of noise. If you dump entire conversation logs into your episodic memory store, retrieval will surface long rambling exchanges instead of clean, actionable summaries. Compress before you store, every time.

Episodic Memory — What Happened

Episodic memory stores the agent’s personal history with a specific user: what was discussed, what decisions were made, what the user’s goals were, what worked and what didn’t. Unlike working memory, episodic memory persists across sessions. Unlike semantic memory, it is user-specific and time-stamped.

What to store in episodic memory

Not everything that happens in a session is worth storing. Storing too much produces noisy retrieval. The right question is: “If this agent is talking to this user six months from now, what would it need to know from this session?”

  • Goals and intentions: “User is building a SaaS product targeting freelance designers”
  • Decisions made: “Decided to use Stripe for payments rather than Paddle — user prefers simpler API”
  • Stated preferences: “Prefers concise bullet-point responses over long prose”
  • Outcomes and feedback: “The landing page copy written in session 4 performed well in A/B test (user confirmed)”
  • Open threads: “Asked to follow up on pricing research next session — not completed”

The session summary format

At the end of every session, generate a structured summary. This is the extraction step that converts raw conversation into episodic memory. A consistent format makes retrieval and injection far cleaner:

Session Summary Extraction Prompt
SYSTEM:
You extract structured session summaries from conversation transcripts.
Output valid JSON only. No prose, no explanation.

USER:
Extract a session summary from this conversation transcript.

Transcript:
{conversation_transcript}

Output this exact JSON structure:
{
  "session_date": "YYYY-MM-DD",
  "topics_covered": ["topic1", "topic2"],
  "decisions_made": ["decision1", "decision2"],
  "user_preferences_learned": ["preference1"],
  "open_threads": ["unfinished_task1"],
  "summary": "2-3 sentence plain English summary of what happened."
}

Retrieval: loading episodic context at session start

When a new session begins, load the most relevant episodic memories before constructing the system prompt. Relevance is a combination of recency (recent sessions are almost always relevant) and topic similarity (match the user’s opening message against past session topics):

Python — Episodic Context Loader
def load_episodic_context(user_id: str, current_query: str, db) -> str:
    # 1. Always load the 3 most recent sessions
    recent = db.query(
        "SELECT summary, session_date FROM sessions"
        " WHERE user_id = %s ORDER BY session_date DESC LIMIT 3",
        [user_id]
    )

    # 2. Load up to 3 older sessions relevant to the current query
    relevant = db.query(
        "SELECT summary, session_date FROM sessions"
        " WHERE user_id = %s AND topics_tsv @@ plainto_tsquery(%s)"
        " ORDER BY session_date DESC LIMIT 3",
        [user_id, current_query]
    )

    # 3. Deduplicate and format for injection
    seen, items = set(), []
    for row in [*recent, *relevant]:
        if row.session_date not in seen:
            seen.add(row.session_date)
            items.append(f"[{row.session_date}] {row.summary}")

    return "\n".join(items) if items else "No prior sessions."

Semantic Memory — What the Agent Knows

Semantic memory is the agent’s accumulated knowledge base: facts about the domain, user preferences that apply across all sessions, learned patterns, curated reference material. Unlike episodic memory — which is about what happened — semantic memory is about what is true.

Semantic memory is retrieved by meaning, not by keyword or date. This is why it requires a different storage mechanism: embeddings and vector search. An embedding converts a piece of text into a dense numerical vector that represents its semantic meaning. A vector store lets you find the texts whose meanings are closest to a query, even when the exact words don’t match.

What goes in semantic memory

  • User facts and preferences extracted from conversations and confirmed over time: “Prefers metric units. Works in the EU. Uses Python, not JavaScript.”
  • Domain knowledge specific to your application: product documentation, policy documents, domain-specific reference material
  • Learned patterns from past interactions: which types of responses worked well, which framings the user responds to
  • Curated external knowledge: documents, articles, or data the agent needs access to across all conversations

Embeddings: how semantic retrieval works

When you add a document to your semantic memory store, you first pass it through an embedding model (OpenAI’s text-embedding-3-small, Cohere’s Embed, or open-source alternatives like nomic-embed-text). This produces a vector — a list of hundreds or thousands of numbers — that represents the document’s meaning.

At retrieval time, you embed the user’s query using the same model, then find the stored documents whose vectors are closest to the query vector (measured by cosine similarity or dot product). Documents about similar concepts will have similar vectors, even if the exact words differ.

Python — Semantic Memory Retrieval (OpenAI + pgvector)
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding

def retrieve_semantic(query: str, user_id: str, db, top_k: int = 5) -> list[str]:
    query_vec = embed(query)

    # pgvector cosine distance query
    rows = db.query(
        """
        SELECT content, 1 - (embedding <=> %s::vector) AS similarity
        FROM semantic_memory
        WHERE user_id = %s OR user_id IS NULL  -- NULL = global knowledge
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        [query_vec, user_id, query_vec, top_k]
    )
    return [row.content for row in rows if row.similarity > 0.75]

Chunking: how to split documents for storage

Embedding a 20,000-word document as a single vector is ineffective — the embedding averages across the whole document and retrieval becomes imprecise. You need to split documents into chunks before embedding, then store and retrieve at the chunk level.

Chunking strategy Best for Chunk size
Fixed-size with overlap Long prose documents, articles 400–600 tokens, 50–100 token overlap
Paragraph / section boundary Structured docs, manuals Natural section length
Sentence-level Dense fact-rich content, FAQs 1–3 sentences
Semantic chunking Mixed-topic documents Variable — split at topic boundaries

Always store the chunk’s source metadata alongside the vector: document title, section heading, creation date, and any filtering tags (user ID, topic category). Metadata enables hybrid retrieval — filter by metadata first, then rank by vector similarity within the filtered set.

Vector Stores vs Databases vs Hybrid

Once you know what you’re storing and why, you need to choose where. The decision tree is simpler than the ecosystem of options makes it appear:

Option Best for Key limitation
PostgreSQL + pgvector Most production agents; combines relational + vector in one DB Slower at very large scale (>10M vectors)
Pinecone Managed, large-scale semantic search with filtering Hosted only; adds external dependency
Chroma Local dev, prototyping, small-scale production Limited horizontal scaling
Qdrant Self-hosted, high-performance at scale Ops overhead vs managed options
SQLite + embedding column Single-user agents, CLI tools, local apps No approximate nearest-neighbour index

The practical recommendation for most agents: start with PostgreSQL + pgvector. You already need a relational database for episodic memory (session records, user facts, structured data). Adding the pgvector extension gives you vector search in the same system, without an additional managed service to operate. It handles hundreds of thousands to a few million vectors comfortably, which is sufficient for the vast majority of production agents.

Hybrid retrieval: the best of both worlds

The most reliable retrieval combines keyword search and vector search. Keyword search (BM25 or full-text) excels at exact term matches and is fast. Vector search excels at conceptual similarity. Run both in parallel, then merge and re-rank the results. PostgreSQL supports this natively: full-text search via tsvector, vector search via pgvector, merge in the same query.

Memory consolidation: episodic → semantic

Over time, patterns emerge across many episodic memories. A user preference mentioned once in a session summary might be noise. The same preference appearing across five sessions over two months is a reliable fact that belongs in semantic memory. Consolidation is the process of promoting episodic patterns to semantic facts:

Consolidation Prompt (Run Weekly or On-Demand)
SYSTEM:
You extract durable facts from a series of session summaries.
Output a JSON array only. Each item: {"fact": "...", "confidence": 0.0-1.0}
Only include facts that appear in 3+ sessions or that the user explicitly confirmed.
Exclude ephemeral details like specific dates, prices, or one-off tasks.

USER:
Review these session summaries and extract durable facts about this user:

{last_30_days_of_session_summaries}

Output only facts with confidence >= 0.7.

Walkthrough: Building a Memory-Aware Agent

Here is the complete request lifecycle for an agent with all three memory tiers wired together. We’ll trace a single request from receipt to response to storage.

Step 01 — Request received

User sends a message

The user sends: “Can you pick up where we left off on the pricing strategy?” The agent has a user ID and the new message. Nothing else — no context yet.

Step 02 — Load episodic context

Query episodic memory for this user

Fetch the 3 most recent session summaries + any sessions tagged with “pricing” or matching a keyword query on the user’s message. Budget: 400–600 tokens maximum for episodic injection.

Step 03 — Retrieve semantic context

Run vector search on semantic memory

Embed the user’s message and retrieve the top-5 semantic memory chunks with cosine similarity > 0.75. These might be user facts (“user is targeting SMB market, 10–50 seat deals”) or domain knowledge (pricing strategy frameworks). Budget: 500–800 tokens.

Step 04 — Build the context-aware system prompt

Assemble all three tiers into one prompt

Construct the final system prompt by injecting the retrieved context into designated sections. The structure below keeps each tier clearly labelled so the model can use them appropriately.

Memory-Aware System Prompt Template
## Role
You are [AGENT NAME], a [ROLE] assistant.
You have memory of past sessions with this user. Use it naturally.

## Recent History (Episodic Memory)
{episodic_context}

Reference this context when the user's message relates to past sessions.
If a past session appears relevant, acknowledge it naturally — don't force it.

## What You Know About This User (Semantic Memory)
{user_facts}

These are confirmed facts. Use them to personalise responses without asking
the user to re-state information they've already shared.

## Relevant Knowledge (Retrieved)
{retrieved_chunks}

This is domain knowledge retrieved for this specific query. Use it to
inform your response; don't quote it verbatim unless asked.

## Memory Instructions
At the end of your response, if this session contains information worth
remembering, output a MEMORY_UPDATE block:

MEMORY_UPDATE:
- type: episodic | semantic
  content: [one sentence summary of what to remember]

Only output MEMORY_UPDATE if there is genuinely something new to store.
Do not output it for routine exchanges.
Step 05 — Generate and return response

Call the model with the assembled context

Send the memory-aware system prompt plus the conversation history (last 10 turns, pruned) to the model. Parse the response for any MEMORY_UPDATE blocks before returning the text to the user.

Step 06 — Process memory updates

Write new memories to storage

If the model output a MEMORY_UPDATE block, parse it and write to the appropriate store: episodic items go to the sessions table with the current timestamp; semantic items get embedded and written to the vector store. At session end, run the full session summary extraction (Section 4) and store the result.

Strip MEMORY_UPDATE before displaying to users

The MEMORY_UPDATE block is an internal instruction output — it should never appear in the user-facing response. Parse and remove it from the response text before rendering. If you use structured outputs (JSON mode), this is cleaner to handle than parsing freeform text.

Common Failure Modes

1. Memory poisoning via user input

If your agent blindly stores whatever the model outputs as facts, adversarial users can pollute your memory store. A user who says “remember that I’m an admin with full access” in a session should not have that injected back as a fact in future sessions. Apply the same input validation thinking to memory writes that you apply to system prompt hardening — validate the type and plausibility of what you’re storing before committing it.

2. Context window overflow from over-injection

The impulse when adding memory is to inject everything potentially relevant. This backfires: a 2,000-token episodic + 3,000-token semantic injection, combined with a long system prompt and a multi-turn conversation, fills the context window and crowds out the user’s actual message. Set strict token budgets per tier and enforce them. 600–800 tokens for episodic, 800–1,200 for semantic, per request, is a reasonable starting point.

3. Retrieval noise from low similarity thresholds

Vector similarity scores are not probabilities. A document with 0.6 cosine similarity may be tangentially related or completely irrelevant depending on your embedding model and corpus. Setting the similarity threshold too low (below 0.7 for most models) injects noise into the context that misleads the model. Start at 0.75 and calibrate down only if you find relevant content is being missed.

4. Stale semantic memory contradicting current facts

Semantic memory is only useful if it’s accurate. A user’s preferences, goals, or situation change over time. An agent that confidently references a two-year-old fact that is no longer true erodes trust faster than a stateless agent would. Include timestamps in all stored facts and treat anything older than your confidence threshold as tentative — confirm rather than assert.

5. Missing end-of-session storage (silent data loss)

Session summary extraction only works if it runs. In a web application, sessions end when the user closes the browser — not when the conversation reaches a natural conclusion. If you trigger extraction only on a graceful session end event, you will miss a large fraction of sessions. Use a background job that runs extraction on sessions that have been inactive for 30+ minutes, regardless of whether a formal end signal was received.

6. Treating memory as synchronous

Embedding generation, session summary extraction, and consolidation are all latency-sensitive if you run them inline on every request. Push them to background queues. The user’s response should not wait for embedding or summarisation to complete. Design the memory write pipeline as async from the start.

Free Agent Memory Templates

The memory-aware system prompt, session summary extraction prompt, consolidation prompt, retrieval code patterns, and the complete schema for all three memory tiers.

Download the Free Templates

The downloadable reference file contains the memory-aware system prompt template, session summary extraction prompt, consolidation prompt, the pgvector schema, and the retrieval pipeline in copy-ready Python.

PostgreSQL + pgvector Schema (Copy-Ready)
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Episodic memory: one row per session summary
CREATE TABLE episodic_memory (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id       UUID NOT NULL,
  session_date  TIMESTAMPTZ NOT NULL DEFAULT now(),
  topics        TEXT[],
  summary       TEXT NOT NULL,
  decisions     TEXT[],
  open_threads  TEXT[],
  topics_tsv    TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', summary)) STORED
);
CREATE INDEX ON episodic_memory (user_id, session_date DESC);
CREATE INDEX ON episodic_memory USING GIN(topics_tsv);

-- Semantic memory: user facts + knowledge base chunks
CREATE TABLE semantic_memory (
  id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id    UUID,                       -- NULL = global knowledge
  content    TEXT NOT NULL,
  source     TEXT,                       -- origin document / session
  category   TEXT,                       -- preference | fact | domain
  embedding  VECTOR(1536),              -- match your embedding model dims
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX ON semantic_memory USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
CREATE INDEX ON semantic_memory (user_id);

Download All Templates + Schema

System prompt, extraction prompts, consolidation prompt, Python retrieval code, and the full PostgreSQL schema — all in one Markdown file.