
Agent Memory and Context Management

Learn how to give AI agents reliable memory across runs. Covers short-term context management, long-term persistent storage, shared blackboard memory for multi-agent systems, and the save_memory / load_memory tool pattern used in The AI University's 15-agent production system.

Last updated: 2026-03-02


An agent without memory is not an agent. It is a very expensive stateless function.

Every time it runs, it starts from zero. It does not know which prospects it already emailed. It does not remember that a particular topic drove three times the average engagement last month. It cannot learn from mistakes because it has no record of having made them. Each run is isolated, and the knowledge built up through every prior run evaporates the moment the context window closes.

This is the default state of most agent implementations. People build agents that work well in demos — single-run scenarios with clean inputs — and are then confused when those agents do not compound value over time. Memory is what makes the difference between a system that gets smarter and a system that merely executes.

This guide covers three kinds of memory, the implementation patterns that work in production, and how to manage context windows before they work against you.


Why Memory Fails Without Intention

Before looking at solutions, it helps to understand exactly where memoryless agents break down.

They repeat mistakes. An agent that tried a specific outreach angle and got zero responses will try the same angle next week. It has no record of the failure. You end up in a loop of the same low-quality outputs, with no feedback mechanism to break the cycle.

They waste tokens. Without long-term storage, agents re-derive context from scratch. Every run re-analyzes the same competitor landscape, re-reads the same documentation, re-establishes the same baseline understanding. You pay in tokens and latency for work that was already done.

They cannot coordinate. In a multi-agent system, agents run independently; without shared memory, each operates in its own isolated world. The lead scoring agent does not know what the outreach agent already sent. The content agent does not know what the SEO agent already published. Work gets duplicated, and worse, contradictory decisions get made.

They break trust. Users and downstream systems expect continuity. An agent that forgets it already handled a request, or cannot recall why it made a previous decision, is not a system you can rely on.

The solution is to be deliberate about what your agents remember, where that memory lives, and how it flows across runs and across agents.


The Three Types of Memory

A clear taxonomy helps before choosing an implementation. There are three kinds of memory, each with different properties and different storage needs.

Short-Term Memory

Short-term memory is everything inside a single agent run — the conversation history, tool call results, intermediate reasoning, and accumulated context within one context window.

This is managed for you by the LLM runtime. You pass messages in, the model reasons over them, and responses come back. The challenge is not implementing it but managing it: context windows are finite, and as runs get longer, you either hit limits or start paying for tokens that dilute rather than help.

Long-Term Memory

Long-term memory persists across runs. It lives in a storage system — a file, a database, a vector store — and is explicitly loaded at the start of a run and explicitly saved at the end.

This is the memory you have to build. Nothing gives it to you automatically. An agent without save_memory and load_memory tools has no long-term memory by definition, no matter how sophisticated its reasoning.

Shared Memory

Shared memory is long-term memory that multiple agents can read from and write to. It is the blackboard pattern applied to memory. Agent A saves a finding, Agent B reads it, builds on it, and saves its own additions. No direct coordination between agents — they communicate through the store.

This is what enables emergent coordination in a multi-agent system. It is also the source of the hardest bugs if you do not manage write conflicts and staleness carefully.


Memory Types by Use Case

Beyond the three categories above, it helps to think about what kind of knowledge a memory entry represents. There are three classical types from cognitive science that map cleanly to agent systems.

Episodic memory is memory of events. What happened, when it happened, and what the outcome was. "We emailed this prospect on March 1st. They opened the email but did not click. We followed up on March 5th. No response." Episodic memory is the basis for learning — you can look back at a history of events and extract patterns.

Semantic memory is factual knowledge. Things your agent knows to be true about the world and about your domain. "Competitor X recently launched a feature targeting mid-market teams. Topic Y consistently outperforms Topic Z in our audience." Semantic memory is context that informs decisions without being tied to a specific event.

Procedural memory is knowledge of how to do things. Successful prompt templates, tool call sequences that work, heuristics learned from experience. "When sending outreach to technical founders, lead with a specific integration use case rather than a general pitch. Conversion rate is significantly higher." Procedural memory is the highest-value form of agent learning because it directly improves future performance.

In practice, most agent memory systems store a mix of all three. The important thing is to tag memories by type so they can be retrieved and weighted appropriately.
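As a sketch, here is what a small mixed store might look like with each entry tagged by type. The entry shape, keys, and values here are illustrative, not a production schema:

```typescript
// Illustrative only: a minimal entry shape with a type discriminator and tags.
type MemoryType = "episodic" | "semantic" | "procedural";

interface TaggedMemory {
  type: MemoryType;
  key: string;
  value: string;
  tags: string[];
}

const memories: TaggedMemory[] = [
  {
    type: "episodic",
    key: "leads/john-doe/contact-history",
    value: "Emailed March 1st; opened, no click. Followed up March 5th; no response.",
    tags: ["prospect", "outreach"],
  },
  {
    type: "semantic",
    key: "competitors/x/positioning",
    value: "Competitor X launched a feature targeting mid-market teams.",
    tags: ["competitor"],
  },
  {
    type: "procedural",
    key: "outreach/technical-founders/angle",
    value: "Lead with a specific integration use case, not a general pitch.",
    tags: ["outreach", "heuristic"],
  },
];

// Typed retrieval: surface procedural heuristics first when planning work.
const heuristics = memories.filter((m) => m.type === "procedural");
```

Tagging by type costs nothing at save time and makes weighted retrieval a one-line filter later.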


Short-Term Memory: Context Window Strategies

Managing the context window is the most immediate memory challenge. Here are the three strategies that matter.

Sliding Window

Keep only the N most recent messages. When the window fills, drop the oldest. Simple to implement, appropriate when recent context is what matters and older history is genuinely stale.

The failure mode is losing important early context — a constraint established at the start of a conversation that is no longer visible when a later decision is made. Use sliding windows only when you are confident that early messages will not be referenced later.
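A sliding window is only a few lines. The sketch below assumes a generic Message shape and pins the first message (often where early constraints are established) so it survives the window, which partially mitigates that failure mode:

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Keep the first message plus the most recent (maxMessages - 1),
// dropping everything in the middle.
function slidingWindow(messages: Message[], maxMessages: number): Message[] {
  if (messages.length <= maxMessages) return messages;
  const head = messages[0];
  const tail = messages.slice(messages.length - (maxMessages - 1));
  return [head, ...tail];
}
```

Pinning the head message is a cheap hedge, not a full solution: constraints established mid-conversation can still fall out of the window.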

Summarization

When the context window is approaching capacity, have the agent summarize everything so far into a compact representation, replace the detailed history with the summary, and continue. This preserves the essential information while reclaiming tokens.

Summarization is more expensive than sliding windows — you pay for an extra LLM call — but the output is much richer than simply dropping old messages. Important decisions and their rationale survive. Summarization works well for long-running research or analysis tasks.

import Anthropic from "@anthropic-ai/sdk";

interface Message {
  role: "user" | "assistant";
  content: string;
}

async function summarizeContext(
  messages: Message[],
  client: Anthropic
): Promise<string> {
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    system: `You are summarizing an agent's work session for context compression.
Preserve: key decisions made, important findings, current state of work, open questions.
Be concise but complete. The summary will replace the full conversation history.`,
    messages: [
      {
        role: "user",
        content: `Summarize this conversation:\n\n${messages.map((m) => `${m.role}: ${m.content}`).join("\n\n")}`,
      },
    ],
  });

  return extractText(response);
}

// Pull the text out of the response's content blocks.
function extractText(response: Anthropic.Message): string {
  return response.content
    .filter((block): block is Anthropic.TextBlock => block.type === "text")
    .map((block) => block.text)
    .join("");
}

// Rough token estimate (~4 characters per token); swap in a real
// tokenizer if you need accuracy near the limit.
function estimateTokens(messages: Message[]): number {
  return Math.ceil(
    messages.reduce((sum, m) => sum + m.content.length, 0) / 4
  );
}

async function manageContext(
  messages: Message[],
  tokenLimit: number,
  client: Anthropic
): Promise<Message[]> {
  const estimatedTokens = estimateTokens(messages);

  if (estimatedTokens < tokenLimit * 0.8) {
    return messages; // Still within safe range
  }

  const summary = await summarizeContext(messages, client);

  // Replace full history with the summary as a system-level context injection
  return [
    {
      role: "user",
      content: `[Previous session summary]\n${summary}\n\n[Continuing from here]`,
    },
    { role: "assistant", content: "Understood. Continuing with that context." },
  ];
}

Importance-Based Pruning

Not all messages are equally valuable. Instead of dropping the oldest or summarizing everything, score each message for relevance to the current task and drop the lowest-scoring entries first.

This requires more upfront work but produces the best results for long-running agents. Messages that established key constraints, resolved hard problems, or contain data you know you will need later are kept. Filler — clarifying questions, polite acknowledgments, redundant confirmations — is dropped.
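A minimal sketch of importance-based pruning. The scoring function here is a deliberately crude keyword heuristic for illustration; a production system would typically score relevance against the current task, for example with embedding similarity:

```typescript
interface ScoredMessage {
  role: "user" | "assistant";
  content: string;
}

// Illustrative scoring heuristic: reward constraint-bearing and data-bearing
// messages, penalize filler acknowledgments.
function scoreMessage(m: ScoredMessage): number {
  let score = 0;
  if (/constraint|must|require|decided/i.test(m.content)) score += 5;
  if (m.content.length > 200) score += 2; // likely data-bearing
  if (/thanks|got it|ok(ay)?\.?$/i.test(m.content.trim())) score -= 3; // filler
  return score;
}

function pruneByImportance(
  messages: ScoredMessage[],
  keepCount: number
): ScoredMessage[] {
  if (messages.length <= keepCount) return messages;
  // Rank by score, keep the top keepCount, restore original order.
  return messages
    .map((m, i) => ({ m, i, s: scoreMessage(m) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, keepCount)
    .sort((a, b) => a.i - b.i)
    .map((r) => r.m);
}
```

Restoring the original order after ranking matters: the model reasons better over a chronologically coherent history than over messages sorted by score.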


Long-Term Memory: The save_memory / load_memory Pattern

The pattern we use at The AI University is simple and explicit: agents are given two tools, save_memory and load_memory. Every agent run starts with a load_memory call to retrieve relevant prior context. Every run ends with a save_memory call to persist what was learned.

This makes memory visible and auditable. You can read the memory store and understand exactly what the agent knows. You can delete bad memories. You can seed memories manually. It is the opposite of an opaque fine-tuned model — memory is data you control.

Here is the TypeScript implementation of both tools.

// memory-tools.ts
import fs from "fs";
import path from "path";

interface MemoryEntry {
  id: string;
  agentName: string;
  type: "episodic" | "semantic" | "procedural";
  key: string;
  value: string;
  tags: string[];
  createdAt: string;
  lastAccessedAt: string;
  accessCount: number;
  importance: number; // 1-10, used for pruning
}

interface MemoryStore {
  entries: MemoryEntry[];
  version: number;
  lastUpdated: string;
}

const MEMORY_DIR = "./data/agent-memory";

function getMemoryPath(agentName: string): string {
  return path.join(MEMORY_DIR, `${agentName}.json`);
}

function loadStore(agentName: string): MemoryStore {
  const filePath = getMemoryPath(agentName);
  if (!fs.existsSync(filePath)) {
    return { entries: [], version: 1, lastUpdated: new Date().toISOString() };
  }
  return JSON.parse(fs.readFileSync(filePath, "utf-8"));
}

function saveStore(agentName: string, store: MemoryStore): void {
  fs.mkdirSync(MEMORY_DIR, { recursive: true });
  store.lastUpdated = new Date().toISOString();
  fs.writeFileSync(getMemoryPath(agentName), JSON.stringify(store, null, 2));
}

// The save_memory tool definition
export const saveMemoryTool = {
  name: "save_memory",
  description:
    "Save a memory for future runs. Use this to persist findings, decisions, learnings, and facts that should inform future work.",
  input_schema: {
    type: "object",
    properties: {
      key: {
        type: "string",
        description: "A unique identifier for this memory (e.g., 'prospect_john_doe_status')",
      },
      value: {
        type: "string",
        description: "The memory content to store",
      },
      type: {
        type: "string",
        enum: ["episodic", "semantic", "procedural"],
        description:
          "episodic=event that happened, semantic=fact about the world, procedural=how to do something",
      },
      tags: {
        type: "array",
        items: { type: "string" },
        description: "Tags for filtering and retrieval (e.g., ['prospect', 'outreach', 'cold'])",
      },
      importance: {
        type: "number",
        description: "Importance score 1-10. Higher scores survive pruning longer.",
      },
    },
    required: ["key", "value", "type"],
  },
};

// The load_memory tool definition
export const loadMemoryTool = {
  name: "load_memory",
  description:
    "Load memories from previous runs. Filter by tags or search by key to retrieve relevant context.",
  input_schema: {
    type: "object",
    properties: {
      tags: {
        type: "array",
        items: { type: "string" },
        description: "Filter memories by tags. Returns all entries matching any tag.",
      },
      key: {
        type: "string",
        description: "Retrieve a specific memory by its exact key.",
      },
      limit: {
        type: "number",
        description: "Maximum number of memories to return. Defaults to 20.",
      },
    },
  },
};

// Tool execution handlers
export function executeSaveMemory(
  agentName: string,
  params: {
    key: string;
    value: string;
    type: "episodic" | "semantic" | "procedural";
    tags?: string[];
    importance?: number;
  }
): { success: boolean; message: string } {
  const store = loadStore(agentName);

  const existingIndex = store.entries.findIndex((e) => e.key === params.key);
  const entry: MemoryEntry = {
    id: `${agentName}-${params.key}-${Date.now()}`,
    agentName,
    type: params.type,
    key: params.key,
    value: params.value,
    tags: params.tags ?? [],
    createdAt:
      existingIndex >= 0
        ? store.entries[existingIndex].createdAt
        : new Date().toISOString(),
    lastAccessedAt: new Date().toISOString(),
    accessCount:
      existingIndex >= 0 ? store.entries[existingIndex].accessCount : 0,
    importance: params.importance ?? 5,
  };

  if (existingIndex >= 0) {
    store.entries[existingIndex] = entry;
  } else {
    store.entries.push(entry);
  }

  saveStore(agentName, store);

  return {
    success: true,
    message: `Memory saved: ${params.key}`,
  };
}

export function executeLoadMemory(
  agentName: string,
  params: { tags?: string[]; key?: string; limit?: number }
): MemoryEntry[] {
  const store = loadStore(agentName);
  const limit = params.limit ?? 20;

  let entries = store.entries;

  if (params.key) {
    entries = entries.filter((e) => e.key === params.key);
  }

  if (params.tags && params.tags.length > 0) {
    entries = entries.filter((e) =>
      params.tags!.some((tag) => e.tags.includes(tag))
    );
  }

  // Sort by importance descending, then by recency
  entries = entries
    .sort((a, b) => {
      if (b.importance !== a.importance) return b.importance - a.importance;
      return (
        new Date(b.lastAccessedAt).getTime() -
        new Date(a.lastAccessedAt).getTime()
      );
    })
    .slice(0, limit);

  // Update access metadata
  const accessedKeys = new Set(entries.map((e) => e.key));
  store.entries = store.entries.map((e) =>
    accessedKeys.has(e.key)
      ? {
          ...e,
          lastAccessedAt: new Date().toISOString(),
          accessCount: e.accessCount + 1,
        }
      : e
  );

  saveStore(agentName, store);

  return entries;
}

The agent uses these tools naturally inside its reasoning loop. At the start of a run, it calls load_memory with relevant tags to pull in prior context. As it works, it calls save_memory to persist findings. When the run ends, it saves any final conclusions.
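To make the loop concrete, here is a self-contained sketch of the dispatch step. The block types are simplified stand-ins for the SDK's tool_use and tool_result blocks, and the handlers are stubbed with an in-memory Map; in production you would call executeSaveMemory and executeLoadMemory from memory-tools.ts instead:

```typescript
// Simplified stand-ins for the SDK's tool_use / tool_result block shapes.
interface ToolUseBlock {
  id: string;
  name: "save_memory" | "load_memory";
  input: Record<string, unknown>;
}

interface ToolResultBlock {
  tool_use_id: string;
  content: string;
}

// Stub store so the sketch is runnable without the file-backed handlers.
const memoryStub = new Map<string, string>();

function handleToolCall(block: ToolUseBlock): ToolResultBlock {
  if (block.name === "save_memory") {
    memoryStub.set(String(block.input.key), String(block.input.value));
    return { tool_use_id: block.id, content: `Memory saved: ${block.input.key}` };
  }
  const value = memoryStub.get(String(block.input.key)) ?? "(no memory found)";
  return { tool_use_id: block.id, content: value };
}

// In the real loop: while the model's stop_reason is "tool_use", map each
// tool_use block through handleToolCall, send the results back as the next
// user message, and call the model again until it produces a final answer.
```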


Shared Memory: The Blackboard in Practice

When multiple agents need to read each other's outputs, they all point to the same memory store and use agreed-upon key conventions. There is no direct agent-to-agent communication — just reads and writes against shared storage.

At The AI University, our agents use a three-part key convention: {domain}/{entity}/{attribute}. The outreach agent writes to leads/john-doe-acme/last-contact. The lead scoring agent reads from leads/john-doe-acme/last-contact to factor recency into its score. Neither agent knows the other exists. They coordinate through the store.

// Shared memory with namespaced keys
// Assumes the MemoryEntry and MemoryStore types from memory-tools.ts.
import fs from "fs";
import path from "path";

const SHARED_MEMORY_PATH = "./data/shared-memory.json";

function loadSharedStore(): MemoryStore {
  if (!fs.existsSync(SHARED_MEMORY_PATH)) {
    return { entries: [], version: 1, lastUpdated: new Date().toISOString() };
  }
  return JSON.parse(fs.readFileSync(SHARED_MEMORY_PATH, "utf-8"));
}

function saveSharedStore(store: MemoryStore): void {
  fs.mkdirSync(path.dirname(SHARED_MEMORY_PATH), { recursive: true });
  store.lastUpdated = new Date().toISOString();
  fs.writeFileSync(SHARED_MEMORY_PATH, JSON.stringify(store, null, 2));
}

export function readShared(key: string): MemoryEntry | null {
  const store = loadSharedStore();
  const entry = store.entries.find((e) => e.key === key);
  return entry ?? null;
}

export function writeShared(
  agentName: string,
  key: string,
  value: string,
  tags: string[]
): void {
  const store = loadSharedStore();
  const existingIndex = store.entries.findIndex((e) => e.key === key);

  const entry: MemoryEntry = {
    id: `shared-${key}-${Date.now()}`,
    agentName,
    type: "episodic",
    key,
    value,
    tags,
    createdAt:
      existingIndex >= 0
        ? store.entries[existingIndex].createdAt
        : new Date().toISOString(),
    lastAccessedAt: new Date().toISOString(),
    accessCount:
      existingIndex >= 0 ? store.entries[existingIndex].accessCount : 0,
    importance: 5,
  };

  if (existingIndex >= 0) {
    store.entries[existingIndex] = entry;
  } else {
    store.entries.push(entry);
  }

  saveSharedStore(store);
}

Write conflicts are the main risk. If two agents write to the same key simultaneously, the last write wins and one agent's output is silently overwritten. In low-volume systems this is rarely a problem. At scale, use a key naming scheme that includes the writing agent's name (outreach-agent/leads/john-doe/last-contact) and have a dedicated reconciliation step if two agents need to write to semantically related keys.
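A small helper makes the namespaced convention hard to get wrong. This is a sketch of the scheme just described, prefixing the writing agent's name onto the {domain}/{entity}/{attribute} key:

```typescript
// Build and parse agent-namespaced shared-memory keys, e.g.
// "outreach-agent/leads/john-doe/last-contact".
function buildSharedKey(
  agentName: string,
  domain: string,
  entity: string,
  attribute: string
): string {
  const parts = [agentName, domain, entity, attribute];
  for (const p of parts) {
    if (p.includes("/")) throw new Error(`Key segment may not contain '/': ${p}`);
  }
  return parts.join("/");
}

function parseSharedKey(key: string): {
  agentName: string;
  domain: string;
  entity: string;
  attribute: string;
} | null {
  const parts = key.split("/");
  if (parts.length !== 4) return null;
  const [agentName, domain, entity, attribute] = parts;
  return { agentName, domain, entity, attribute };
}
```

Validating segments at build time catches malformed keys at the writer rather than leaving readers to discover them.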


Vector Similarity Search for Relevant Memories

Key-based retrieval works when you know exactly what you are looking for. But often an agent needs to find memories that are relevant to its current task without knowing their exact keys.

Vector similarity search solves this. You embed each memory as a vector when it is saved, and at retrieval time you embed the current query and find the most similar memories in the vector space. Semantically related memories surface even if they share no keywords with the query.

For a production agent system, you have three practical options:

  1. Embed on save, search on load using a vector database. Services like Pinecone, Weaviate, or Qdrant handle this entirely. Best for large memory stores where full-scan similarity search would be too slow.

  2. Embed on save, store vectors in SQLite with a cosine similarity query. Works well up to tens of thousands of memories without external dependencies.

  3. Use an embedding API at query time with brute-force similarity. Load all memories, embed the query, compute similarity scores in-memory. Simple to implement, fine for stores under a few thousand entries.

The overhead of embedding is worth it when your agents have accumulated thousands of memories across domains. Without semantic search, agents will miss relevant memories because they searched with the wrong keywords.
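Option 3 is small enough to show in full, minus the embedding call itself. Assuming each memory already has an embedding vector stored alongside it (produced by whatever embedding provider you use, at save time), brute-force retrieval is cosine similarity plus a sort:

```typescript
interface EmbeddedMemory {
  key: string;
  value: string;
  embedding: number[]; // produced by your embedding provider at save time
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k retrieval: fine for stores under a few thousand entries.
function searchMemories(
  queryEmbedding: number[],
  memories: EmbeddedMemory[],
  topK: number = 5
): EmbeddedMemory[] {
  return memories
    .map((m) => ({ m, score: cosineSimilarity(queryEmbedding, m.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((r) => r.m);
}
```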


Time Decay and Memory Pruning

Not all memories stay relevant. A competitor analysis from eight months ago may be actively misleading. A prospect status updated in January is stale if it is now March. Memory stores that grow without pruning become noise — the agent retrieves old, contradictory information and has to reason about whether it is still accurate.

Two mechanisms manage this.

Time decay reduces the effective importance score of memories as they age. A memory saved three months ago with an importance of 7 might be treated as a 4 by the time a new retrieval happens. This pushes older memories lower in ranking without deleting them immediately, allowing recent memories to dominate retrieval results.

function applyTimeDecay(entry: MemoryEntry, halfLifeDays: number = 30): number {
  const ageMs =
    Date.now() - new Date(entry.lastAccessedAt).getTime();
  const ageDays = ageMs / (1000 * 60 * 60 * 24);
  const decayFactor = Math.pow(0.5, ageDays / halfLifeDays);
  return entry.importance * decayFactor;
}

Memory pruning deletes entries that have decayed below a threshold or that have been explicitly superseded. Run a pruning pass periodically — daily or weekly depending on your memory volume. Keep entries above the threshold, delete the rest.

function pruneMemories(
  agentName: string,
  options: { minImportance: number; halfLifeDays: number }
): { pruned: number; kept: number } {
  const store = loadStore(agentName);

  const before = store.entries.length;
  store.entries = store.entries.filter((entry) => {
    const decayedImportance = applyTimeDecay(entry, options.halfLifeDays);
    return decayedImportance >= options.minImportance;
  });

  const after = store.entries.length;
  saveStore(agentName, store);

  return { pruned: before - after, kept: after };
}

For procedural memories — the high-value learnings about how to do things effectively — use a much longer half-life or exempt them from decay entirely. A heuristic that took 50 runs to learn should not disappear in thirty days.
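One way to implement that exemption is to pick the half-life per memory type. This sketch mirrors the decay function shown earlier; the specific half-life values are illustrative:

```typescript
type MemoryType = "episodic" | "semantic" | "procedural";

// Illustrative half-lives: events go stale quickly, facts decay more
// slowly, and hard-won heuristics should persist.
const HALF_LIFE_DAYS: Record<MemoryType, number> = {
  episodic: 30,
  semantic: 90,
  procedural: 365,
};

function decayedImportance(
  type: MemoryType,
  importance: number,
  lastAccessedAt: string
): number {
  const ageDays =
    (Date.now() - new Date(lastAccessedAt).getTime()) / (1000 * 60 * 60 * 24);
  return importance * Math.pow(0.5, ageDays / HALF_LIFE_DAYS[type]);
}
```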


Memory Storage: Choosing Your Backend

The right storage backend depends on your volume, query patterns, and operational complexity tolerance. Here is a direct comparison.

JSON files. Best for small stores, a single agent, and dev/prototyping. Queries: key lookup and full scan. Scales to roughly 10K entries. Operational overhead: none (plain files).

SQLite. Best for medium stores, structured queries, and a single machine. Queries: SQL with indexes and full-text search. Scales to roughly 1M entries. Operational overhead: minimal (a single file).

Vector DB (Pinecone, Weaviate). Best for large stores, semantic search, and multi-agent systems. Queries: similarity search and filtered retrieval. Scales to millions of entries. Operational overhead: an external service to manage.

Redis. Best for high-throughput, shared access, and real-time use. Queries: key lookup, pub/sub, and TTL-based expiry. Scales very high. Operational overhead: requires a Redis instance.

At The AI University, we use JSON files per agent for individual agent memory (simple, inspectable, no dependencies) and a shared SQLite database for cross-agent memory that needs structured queries. Vector search is layered on top using embeddings stored as columns in SQLite — not as sophisticated as a dedicated vector DB, but sufficient for our volume and avoids adding another managed service.

Start with JSON files. Move to SQLite when you need structured queries or your file sizes exceed a few megabytes. Add a vector database only when semantic search is a real requirement and your store is too large for in-memory similarity computation.
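When you do make the move to SQLite, the schema can mirror the MemoryEntry shape directly. Here is a sketch of the DDL as a migration string; the table and index names are illustrative:

```typescript
// DDL mirroring the MemoryEntry interface from memory-tools.ts; run it
// wherever you manage migrations. Names are illustrative, not prescribed.
const MEMORY_SCHEMA = `
CREATE TABLE IF NOT EXISTS memories (
  id               TEXT PRIMARY KEY,
  agent_name       TEXT NOT NULL,
  type             TEXT NOT NULL CHECK (type IN ('episodic','semantic','procedural')),
  key              TEXT NOT NULL,
  value            TEXT NOT NULL,
  tags             TEXT NOT NULL DEFAULT '[]', -- JSON array of strings
  created_at       TEXT NOT NULL,
  last_accessed_at TEXT NOT NULL,
  access_count     INTEGER NOT NULL DEFAULT 0,
  importance       INTEGER NOT NULL DEFAULT 5
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_memories_agent_key
  ON memories (agent_name, key);
CREATE INDEX IF NOT EXISTS idx_memories_importance
  ON memories (importance DESC, last_accessed_at DESC);
`;
```

The unique index on (agent_name, key) reproduces the upsert-by-key behavior of the JSON implementation, and the importance index keeps the ranked retrieval query fast.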


How AI University's Agents Use Memory

Here is how memory flows through three of our production agents.

The outreach sequencing agent runs whenever a new lead is scored above threshold. It calls load_memory with tags ['outreach', 'prospect'] and the prospect's ID key. If there is a prior contact record, it reads what angle was used, what the response (or non-response) was, and how long ago the last touch was. It then writes a sequence that avoids repeating failed approaches. After sending, it saves a new episodic memory: leads/{prospect-id}/contact-history with the date, message sent, and channel.

The content strategy agent runs weekly. It loads semantic memories tagged ['topic-performance', 'audience'] to see which subjects have driven the highest engagement historically. It loads episodic memories tagged ['published', 'content'] to see what has gone out recently and avoid repetition. It saves its recommendations as procedural memory: content/strategy/working-formats with notes on what structures (listicles vs. deep dives vs. case studies) perform best for our audience.

The competitor tracking agent runs twice weekly. It loads semantic memories tagged ['competitor'] to understand the current landscape it has already mapped. It searches for new developments, then writes updates to keys like competitors/{name}/latest-moves. Other agents — particularly the content strategy agent and the outreach agent — load these competitor memories to stay current without re-doing the research themselves.

This is what makes the system compound. Every agent run produces memory. Every future agent run is informed by past memory. The system gets more accurate and less redundant with every cycle.


Key Takeaways

Memory is not a feature — it is the foundation of agent value over time. A memoryless agent is a tool. A memory-enabled agent is a system that learns.

Use explicit tools for memory, not implicit state. The save_memory / load_memory pattern makes memory visible, auditable, and controllable. You can inspect it, correct it, and seed it manually. Opaque memory in fine-tuned models or hidden state gives you none of that control.

Manage context windows deliberately. Do not let context overflow catch you by surprise. Build summarization and pruning into your agent from the start. Know your model's limits and design around them.

Short-term, long-term, and shared memory serve different purposes. Do not conflate them. Short-term memory is managed by the runtime. Long-term memory requires persistence tools. Shared memory requires a common store with agreed-upon key conventions.

Tag and type your memories. An unstructured memory store becomes unusable at scale. Episodic, semantic, and procedural types — plus domain tags — let you retrieve exactly what is relevant without loading everything.

Prune actively. A growing memory store that is never cleaned is a liability. Stale information degrades agent decisions. Build decay and pruning into your memory system before you need it, not after.

Start simple. JSON files per agent, explicit load and save calls, manual key naming. That handles the first several months of a production system. Add SQLite, vector search, and Redis when you have a specific problem that requires them — not before.

The patterns on this page are what we run in production. They are not theoretical. Start with the save/load tool pattern, instrument it so you can see what is being remembered, and build from there.