
Token Optimization: Reduce Agent Costs by 70%

Token costs spiral fast in agent loops because every iteration adds context. This guide covers seven strategies — prompt compression, context window management, model routing, caching, batch processing, output constraints, and knowing when to stop — with concrete numbers showing what a 70% cost reduction looks like in practice.

Last updated: 2026-03-02


Most teams building AI agents do not have a spending problem in week one. They have a scaling problem. A single agent run costs a few cents. That feels fine. Then you run it a hundred times a day, loop it across fifty users, add three more agents to the pipeline, and suddenly your monthly bill is five figures and climbing.

The cause is almost always token waste. Not fraud, not model overuse — just inefficiency that compounds across every iteration of every loop. This guide covers seven strategies for eliminating that waste. Implementing all seven together is what produces the 70% reduction in the headline. Most teams get there by applying three or four of them systematically.


Why Token Costs Spiral in Agent Loops

A single agent turn is not what costs you money. The loop is.

Here is what happens inside a typical agentic run. The agent receives a task. It calls a tool. The tool result comes back. That result gets appended to the conversation. The agent thinks, calls another tool. Another result appended. After ten tool calls, the context window contains the original system prompt, the full conversation history, and ten complete tool results — even the ones from step two that have no bearing on what the agent is doing in step ten.

Each model call prices the full context, not just the new content. Input tokens are re-read and re-billed on every turn. A 2,000-token system prompt sent across 15 agent turns costs 30,000 input tokens before the agent has done a single thing of substance. Add a 500-token tool result that stays in context for 12 more turns after it is relevant, and you have added 6,000 tokens of pure waste.
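The compounding is easy to see in a small sketch. This is illustrative arithmetic only, assuming one tool result is appended per turn:

```typescript
// Sketch: how an append-only context compounds billed input tokens.
// All numbers are illustrative, not real pricing.
function cumulativeInputTokens(
  systemPromptTokens: number,
  perTurnAddedTokens: number,
  turns: number
): number {
  let context = systemPromptTokens;
  let billed = 0;
  for (let t = 0; t < turns; t++) {
    billed += context; // every call re-reads and re-bills the full context
    context += perTurnAddedTokens; // tool result appended after the call
  }
  return billed;
}

// A 2,000-token system prompt alone accounts for 2,000 x 15 = 30,000
// billed input tokens across 15 turns, before any tool results are added.
```

Run with a 500-token tool result appended each turn and the billed total grows quadratically with turn count, which is why long loops hurt far more than long prompts.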

This is the hidden cost of agent loops: context is append-only by default, and every iteration makes the problem worse. The strategies below attack this at every layer.


Understanding Token Pricing

Before optimizing, understand what you are optimizing.

Anthropic prices input tokens and output tokens separately. Input tokens — everything the model reads, including the system prompt, conversation history, and tool results — are cheaper per token than output tokens. But input tokens accumulate across turns in a way output tokens do not, which is why context management matters more than most people expect.

Model tier has an outsized effect on cost. At the time of writing, the rough pricing relationship between model tiers works like this: Claude Haiku is approximately 20x cheaper than Claude Sonnet per million tokens, and Sonnet is approximately 5x cheaper than Claude Opus. Running everything through Opus when Haiku can handle it is not a quality decision — it is a billing decision disguised as one.

Caching complicates the picture in a good way. Anthropic supports prompt caching, which means repeated prefixes (like a long system prompt) can be cached and served at a fraction of the normal input token cost. This single feature can cut costs 40-60% for agents with stable system prompts across many runs.

The strategies below work at different layers of this pricing model. Some reduce the raw number of tokens. Some shift tokens to cheaper cache reads. Some route calls to cheaper model tiers. All of them compound.


Strategy 1: Prompt Compression

The system prompt is the first place to look because it is billed on every single turn.

Most system prompts are written once and never revisited. They accumulate instructions, examples, formatting rules, and caveats — often with significant redundancy. A first pass through a production system prompt can commonly remove 30-40% of its tokens without changing agent behavior at all.

Practical steps:

Remove filler phrases. "Please make sure that you always remember to" is roughly nine tokens. "Always" is one.

Eliminate redundancy. If you have told the agent to respond in JSON three times across different sections of the prompt, one time is enough.

Use references instead of inline content. Instead of embedding a 400-line product catalog into the system prompt, store it in a database and give the agent a tool to query it. The catalog gets loaded only when needed, not on every turn.

Compress examples. Few-shot examples are valuable but expensive. If you have five examples in the system prompt, test whether three work equally well. Often they do.

Separate stable from dynamic content. Put the stable portion (role, rules, output format) in the cacheable prefix. Put the dynamic portion (current date, user context) at the end. This maximizes cache hit rates for the expensive part.
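The last step can be sketched with the content-block form of the system parameter, which is what Anthropic's prompt caching (Strategy 4) operates on. The STABLE_RULES text and the user-context string here are illustrative:

```typescript
// Sketch: split the system prompt into a cacheable stable prefix and a
// dynamic suffix. The stable block carries the cache_control marker.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Illustrative stable content: role, rules, output format.
const STABLE_RULES =
  "You are a support agent. Respond in JSON with fields `answer` and `sources`.";

function buildSystem(dynamicContext: string): SystemBlock[] {
  return [
    // Stable prefix: identical across runs, so it caches well.
    { type: "text", text: STABLE_RULES, cache_control: { type: "ephemeral" } },
    // Dynamic suffix: changes per request, placed after the cache breakpoint.
    { type: "text", text: dynamicContext },
  ];
}

// Pass the result as the `system` parameter of messages.create().
const system = buildSystem("Current date: 2026-03-02. User tier: pro.");
```

Ordering matters: anything above the cache breakpoint must be byte-identical between requests for the cache to hit, which is exactly why the dynamic material goes last.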


Strategy 2: Context Window Management

The context window fills up. Your job is to actively manage what stays in it.

The default behavior of most agent frameworks is to keep the entire conversation history in context until the window overflows, at which point everything breaks. This is the wrong default. You want to be selective about what stays in context long before you hit limits.

Summarization: When a conversation reaches a certain length — say, 8,000 tokens — generate a summary of what has happened so far and replace the raw history with that summary. A well-written 500-token summary of a 6,000-token conversation retains the decision-relevant information while cutting 90% of the tokens.

Tool result pruning: Tool results are the fastest-growing part of most agent contexts. After a tool result has been used to inform the next action, it often has no further value. Mark results as ephemeral when you call the tool, and trim them from context after two turns. Only keep results that will be directly referenced later.

Sliding window: For long-running agents, implement a sliding window that keeps only the last N turns in full fidelity, with earlier turns summarized. This bounds your context cost linearly regardless of run length.

What to always keep: The system prompt. The current task. The most recent tool results. Any decisions or conclusions the agent has explicitly surfaced. Everything else is a candidate for pruning or summarization.
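The tool-result pruning rule above can be sketched as a small filter over a simplified message history. The `Turn` shape and the two-turn default are illustrative:

```typescript
// Sketch of tool-result pruning: drop tool results marked ephemeral once
// they are more than `maxAge` turns old. The message shape is simplified.
interface Turn {
  turn: number; // which agent turn produced this entry
  role: "assistant" | "tool";
  content: string;
  ephemeral?: boolean; // set when the tool call was made
}

function pruneToolResults(
  history: Turn[],
  currentTurn: number,
  maxAge = 2
): Turn[] {
  return history.filter(
    (t) => !(t.role === "tool" && t.ephemeral && currentTurn - t.turn > maxAge)
  );
}
```

Results not marked ephemeral — the ones the agent will reference later — survive pruning indefinitely, so the marking decision at call time is what makes this safe.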


Strategy 3: Model Routing

Not every task in an agent pipeline requires your best model. Most tasks do not even come close.

Model routing means directing each sub-task to the cheapest model capable of handling it reliably. In practice, this breaks down roughly as follows:

  • Classification and routing decisions: Does this message need a human? Which tool should I call next? Is this question in scope? These are Haiku tasks. They require pattern matching, not reasoning. At 20x cheaper than Sonnet, routing these correctly pays for itself within days.
  • Summarization and extraction: Pull key facts from a document. Summarize this conversation. Extract structured data from this text. These are Sonnet tasks in most cases — they require comprehension and coherence but not deep reasoning.
  • Complex reasoning and synthesis: Evaluate competing options with real tradeoffs. Generate a nuanced recommendation based on conflicting signals. Write code that requires architectural judgment. These are Opus tasks. They warrant the cost because cheaper models produce worse outcomes.

The TypeScript implementation of this pattern is straightforward:

import Anthropic from "@anthropic-ai/sdk";

type TaskComplexity = "low" | "medium" | "high";

interface RoutingConfig {
  low: string;
  medium: string;
  high: string;
}

const MODEL_ROUTING: RoutingConfig = {
  low: "claude-haiku-4-5",
  medium: "claude-sonnet-4-5",
  high: "claude-opus-4-5",
};

function classifyTaskComplexity(task: string): TaskComplexity {
  // Simple classification — you can make this as sophisticated as needed.
  // In production, run this classification itself through Haiku.
  const lowPatterns = [
    /^(yes|no|classify|route|is this|does this)/i,
    /^summarize this in one sentence/i,
    /^extract the (name|date|price|email)/i,
  ];

  const highPatterns = [
    /architect/i,
    /evaluate.*tradeoff/i,
    /compare.*options.*recommend/i,
    /debug.*production/i,
  ];

  if (lowPatterns.some((p) => p.test(task))) return "low";
  if (highPatterns.some((p) => p.test(task))) return "high";
  return "medium";
}

async function routedCompletion(
  task: string,
  systemPrompt: string,
  anthropic: Anthropic
): Promise<string> {
  const complexity = classifyTaskComplexity(task);
  const model = MODEL_ROUTING[complexity];

  const response = await anthropic.messages.create({
    model,
    max_tokens: complexity === "low" ? 256 : complexity === "medium" ? 1024 : 4096,
    system: systemPrompt,
    messages: [{ role: "user", content: task }],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}

The key insight is that the classification step itself should be cheap. Running a Haiku call to decide whether the next call should be Haiku or Opus costs almost nothing and pays for itself immediately.


Strategy 4: Caching

Anthropic's prompt caching feature is one of the highest-leverage optimizations available if your agents have stable system prompts.

When you mark a portion of the prompt for caching (via a cache_control field on a content block), Anthropic stores the processed KV cache for that prefix. Subsequent requests that share the same prefix pay a cache read price rather than a full input token price. Cache reads cost roughly 10% of normal input token prices. For a 2,000-token system prompt sent across 100 agent turns, this reduces the system prompt cost from the equivalent of 200,000 input tokens to 2,000 tokens plus 198,000 cache read tokens — a reduction of roughly 90% on that portion of the bill.
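The arithmetic above works out as follows. The 10% read price is the figure from the text; the 1.25x first-write premium is an assumption about cache-write pricing:

```typescript
// Worked version of the numbers above. The 10% read price is from the text;
// the 1.25x premium on the initial cache write is an assumption.
const PROMPT_TOKENS = 2_000;
const TURNS = 100;
const CACHE_READ_PRICE = 0.1; // fraction of the normal input token price
const CACHE_WRITE_PRICE = 1.25; // assumed premium on the first write

// Uncached: the full prompt is billed at normal input price every turn.
const uncached = PROMPT_TOKENS * TURNS; // 200,000 token-equivalents

// Cached: one write on the first turn, then 99 cheap cache reads.
const cached =
  PROMPT_TOKENS * CACHE_WRITE_PRICE +
  PROMPT_TOKENS * (TURNS - 1) * CACHE_READ_PRICE;

const reduction = 1 - cached / uncached; // roughly 0.89
```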

Beyond prompt caching, cache tool results that do not change often. If your agent fetches a configuration file, a product list, or a knowledge base article at the start of every run, store the result in a fast cache (Redis, in-memory, or even a local file with a TTL) and skip the tool call entirely on subsequent runs. Tool calls that hit external APIs add latency and, if those APIs have costs, add more direct spend as well.
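A minimal in-memory version of that get-or-fetch pattern looks like this; a Redis-backed version would follow the same shape, and all names here are illustrative:

```typescript
// Minimal in-memory cache with TTL for slow-changing tool results.
const cache = new Map<string, { value: unknown; expiresAt: number }>();

async function cachedToolCall<T>(
  key: string,
  ttlMs: number,
  fetchFn: () => Promise<T>
): Promise<T> {
  const hit = cache.get(key);
  // Serve from cache while the entry is still fresh.
  if (hit && hit.expiresAt > Date.now()) return hit.value as T;
  // Otherwise do the real tool call and remember the result.
  const value = await fetchFn();
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}
```

Pick TTLs per data source: a product catalog might tolerate an hour, a configuration file a few minutes. A stale read is the cost of every skipped tool call.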


Strategy 5: Batch Processing

Making one API call per item is almost always the wrong approach when you have multiple items to process.

If your agent needs to classify 50 customer messages, the naive implementation makes 50 separate API calls. Each call carries the full overhead of the system prompt, the model loading time, and the HTTP round trip. Batching those 50 messages into a single call with a structured output format pays that overhead once instead of 50 times and typically produces faster wall-clock time as well.

Batch processing works for any task where the items are independent: classification, extraction, summarization, scoring, validation. It does not work well for tasks where the output of one item is input for the next.

When you batch, be explicit in the prompt about the structure. Ask for a JSON array where each element corresponds to an input item by index. Set max_tokens based on the expected output per item multiplied by the number of items, not on a single-item estimate.
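A sketch of both pieces — the index-keyed batch prompt and the scaled max_tokens — assuming a sentiment-classification task and an illustrative per-item token budget:

```typescript
// Sketch: batch independent items into one prompt, keyed by index.
function buildBatchPrompt(items: string[]): string {
  const numbered = items.map((item, i) => `${i}: ${item}`).join("\n");
  return [
    "Classify the sentiment of each numbered message as positive, neutral, or negative.",
    'Respond with only a JSON array; element i corresponds to message i, e.g. ["positive", ...].',
    "",
    numbered,
  ].join("\n");
}

// Scale max_tokens with the batch, not with a single-item estimate.
const TOKENS_PER_ITEM = 8; // rough budget for one JSON array element
function batchMaxTokens(itemCount: number, overhead = 32): number {
  return itemCount * TOKENS_PER_ITEM + overhead;
}
```

The index correspondence is what makes the result parseable: if the model drops or reorders an item, a length check against the input array catches it immediately.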


Strategy 6: Output Constraints

Most agents produce more output than necessary because nothing tells them to stop.

Set max_tokens deliberately. If the agent is classifying sentiment as positive, neutral, or negative, the maximum useful output is one word. If the agent is writing a summary, you know the maximum length before you call it. Use that knowledge. An unconstrained agent that produces 800 tokens when 150 would have served the purpose is billing you for 650 tokens of waste on every call.

Use structured output to reduce verbosity. When you ask an agent to produce free-form prose, it will include preamble, qualifications, and sign-off. When you ask it for JSON with specific fields, it produces exactly those fields. Structured output is not just easier to parse downstream — it is also shorter, which means cheaper.

When you need prose, give the agent a word budget. "Respond in 100 words or fewer" is a valid instruction. Models follow it reasonably well, and the token savings are worth the occasional small overrun.
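One way to make max_tokens deliberate rather than a blanket default is a per-task budget table. The task names and budget values here are illustrative assumptions:

```typescript
// Sketch: choose max_tokens from the task type, not a single global default.
// Task names and budgets are illustrative.
const OUTPUT_BUDGETS: Record<string, number> = {
  sentiment: 4, // one word: positive | neutral | negative
  summary: 200, // ~150-word summary
  report: 1024, // structured report with a known upper bound
};

function maxTokensFor(taskType: string, fallback = 512): number {
  return OUTPUT_BUDGETS[taskType] ?? fallback;
}
```

The fallback keeps unknown task types from failing, but every task that reaches it is a candidate for an explicit entry in the table.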


Strategy 7: Stop Early

Agents that do not know when to stop waste tokens on over-research.

The default behavior of an agent given a research task is to keep researching until the context window fills or the tool call limit is hit. This is a problem because the tenth source rarely adds information that changes the answer, but it adds thousands of tokens to the context.

Teach agents to recognize sufficiency. Include a stopping condition in the system prompt: "Stop gathering information when you have enough to answer the question with high confidence. Do not seek additional sources unless your current information is contradictory or clearly incomplete." This single instruction, when followed, routinely cuts research agent runs by 30-40% in length.

You can also implement a programmatic check. After each tool call, have the agent output a confidence score alongside its findings. If confidence exceeds a threshold (say, 0.85), skip the next planned tool call and proceed to synthesis. This trades a small amount of thoroughness for a meaningful reduction in cost and latency.
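The programmatic check can be sketched as a small gate between tool calls. The 0.85 threshold comes from the text; the `Finding` shape and the tool-call cap are illustrative:

```typescript
// Sketch of the sufficiency gate: after each tool call the agent reports a
// confidence score, and research stops once it clears the threshold.
interface Finding {
  summary: string;
  confidence: number; // 0..1, reported by the agent alongside its findings
}

function shouldContinueResearch(
  findings: Finding[],
  threshold = 0.85,
  maxToolCalls = 10
): boolean {
  // Hard cap: never exceed the tool-call budget regardless of confidence.
  if (findings.length >= maxToolCalls) return false;
  const latest = findings[findings.length - 1];
  // No findings yet, or confidence still below threshold: keep researching.
  return latest === undefined || latest.confidence < threshold;
}
```

In the agent loop, call this before each planned tool call; when it returns false, skip straight to synthesis with whatever findings are in hand.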


Before vs. After: Real Cost Comparison

The following table shows a typical research-and-summarize agent run before and after applying the strategies above. The agent reads several sources, synthesizes findings, and produces a structured report.

Cost Component | Before | After | Reduction
System prompt tokens (per turn, 15 turns) | 3,000 tokens x 15 = 45,000 | 1,200 cached tokens x 15 = 18,000 reads | -89% on this component
Tool result context accumulation | 12,000 tokens (all results kept) | 3,000 tokens (pruned after use) | -75%
Model tier | Sonnet for all tasks | Haiku for routing, Sonnet for synthesis | ~60% blended savings
Number of API calls | 18 (one per source) | 6 (batched + early stop) | -67%
Output tokens | 1,800 (unconstrained) | 900 (structured, max_tokens set) | -50%
Total estimated cost per run | ~$0.14 | ~$0.04 | -71%

These numbers use approximate mid-2025 pricing and a representative workload. Your actual numbers will differ, but the ratios are consistent with what teams report after systematic optimization.


Claude Max Subscription: Eliminating Per-Token Anxiety

One operational insight worth naming explicitly: for teams running high-volume agent workloads, per-token pricing creates a specific kind of organizational dysfunction. Engineers optimize for fewer tokens not because it produces better agents, but because the bill makes everyone nervous. Features get cut. Experiments do not get run. The agent does less than it could.

Claude Max is Anthropic's flat-rate subscription that removes per-token billing for interactive use via Claude Code and the Claude web interface. For teams using claude -p to power agent runs on a Max subscription, the per-token anxiety goes away entirely. The optimization strategies in this guide still matter — fewer tokens means faster runs and lower latency — but the financial pressure that drives bad architectural decisions disappears.

If your organization is running more than a few hundred agent operations per day, run the math on flat-rate access versus per-call API pricing for your workload. The crossover point is lower than most people expect.


Key Takeaways

Token costs in agent systems are a compounding problem. Each loop iteration adds to context. Each model call prices the full context. The waste multiplies across every run, every user, every day.

The seven strategies in this guide attack that waste at different layers:

  1. Prompt compression reduces the base cost of every single API call by shrinking the system prompt.
  2. Context window management prevents historical context from accumulating indefinitely by summarizing and pruning.
  3. Model routing matches task complexity to model capability, keeping expensive models for work that actually requires them.
  4. Caching turns repeated system prompt reads into cheap cache hits and skips tool calls for data that has not changed.
  5. Batch processing eliminates per-call overhead by grouping independent items into a single API call.
  6. Output constraints stop agents from generating more tokens than the task requires.
  7. Stopping early prevents over-research by teaching agents to recognize when they have enough information.

None of these strategies requires rewriting your agents from scratch. Most can be applied incrementally in a single afternoon of focused work. Start with model routing and prompt compression — they tend to produce the largest individual gains and require the least invasive changes to existing code. Add context management and caching next. The remaining strategies are refinements that push you toward the upper end of the optimization range.

A 70% cost reduction is achievable. The teams that hit it are not the ones with the best engineers — they are the ones who made cost a first-class concern alongside correctness and reliability.