Prompt Engineering for AI Agents
How to write prompts that make AI agents work reliably. Covers system prompt architecture, tool description design, ReAct patterns, chain-of-thought reasoning, and prompt caching for cost optimization.
Prompt engineering for chatbots and prompt engineering for agents are two different disciplines. Most people learn prompting by writing instructions for a single-turn interaction: the user says something, the model responds, the conversation is over or the user follows up manually. That mental model breaks down the moment you build an agent.
An agent does not produce one response and stop. It enters a loop. It reasons about a task, selects a tool, interprets the result, decides what to do next, and repeats until the task is complete or a stopping condition is met. That loop might run for two iterations or two hundred. The prompt you write does not control a single output — it controls the decision-making logic of a system that runs autonomously for minutes or hours.
This distinction matters because the failure modes are completely different. A bad chatbot prompt gives you a mediocre answer. A bad agent prompt gives you an agent that calls the wrong tools, loops forever, ignores constraints, or confidently takes destructive actions it was never supposed to take. At AI University, we run 15 agents in production. Every lesson in this guide comes from something that broke.
Why Agent Prompts Are Different
A chatbot prompt optimizes for a single response. You want the model to be helpful, accurate, and well-formatted. The scope of what can go wrong is limited to the quality of one output.
An agent prompt optimizes for a decision-making loop. You are not writing instructions for one answer — you are writing the operating manual for an autonomous system. The prompt must define:
- Identity: What is this agent? What is its role and scope?
- Tools: What capabilities does it have? When should it use each one?
- Constraints: What must it never do? What requires confirmation?
- Decision heuristics: How should it choose between competing actions?
- Stopping conditions: When is it done? How does it know?
- Error handling: What should it do when a tool fails or returns unexpected data?
A chatbot prompt can afford to be loose. "You are a helpful assistant" works fine for a chat interface because a human is there to correct course on every turn. An agent prompt cannot afford that luxury. There is no human in the loop between tool calls. If the prompt does not specify a constraint, the agent will not invent one. If the prompt leaves a decision ambiguous, the agent will resolve the ambiguity in whatever direction the model's weights happen to push — and that direction will not always be the one you wanted.
The practical consequence is that agent prompts are longer, more structured, and more prescriptive than chatbot prompts. A good agent system prompt typically runs 800 to 2,000 tokens. This is not bloat. It is the minimum viable specification for a system that needs to operate without supervision.
System Prompt Architecture
A production agent system prompt has a consistent internal structure. The order matters because models attend more strongly to the beginning and end of the prompt. Put the most critical instructions — identity, hard constraints, and output format — at the beginning.
Here is the anatomy of a well-structured agent prompt:
const systemPrompt = `
## Role
You are the lead scoring agent for AI University's growth system.
Your job is to evaluate inbound leads and assign a score from 0-100
based on their likelihood to convert to a paying customer.
## Tools Available
You have access to the following tools:
- get_crm_data: Retrieves a lead's profile and history from the CRM.
Call this first for every lead evaluation.
- get_engagement_history: Returns page views, email opens, and content
downloads for a lead. Use this to assess behavioral signals.
- get_firmographic_data: Returns company size, industry, and revenue
for the lead's organization. Use this for B2B leads only.
- update_lead_score: Writes the final score back to the CRM. Call this
exactly once per lead, after you have completed your analysis.
## Constraints
- Never assign a score above 90 without both behavioral AND firmographic signals.
- Never call update_lead_score more than once per lead.
- If get_crm_data returns no results, stop immediately and report
"Lead not found" — do not guess or fabricate data.
- Do not access tools unrelated to lead scoring.
## Process
1. Retrieve the lead's CRM profile.
2. Retrieve engagement history.
3. If the lead is B2B (has a company domain), retrieve firmographic data.
4. Analyze all signals and compute a score.
5. Write the score to the CRM with your reasoning.
## Output Format
After writing the score, return a JSON summary:
{
  "lead_id": "string",
  "score": number,
  "signals": ["string"],
  "reasoning": "string",
  "confidence": "high" | "medium" | "low"
}
`;
Each section serves a specific purpose:
Role tells the model who it is and what its job boundaries are. This is not decorative. Without a clear role definition, the agent will drift into adjacent tasks when user input is ambiguous.
Tools Available gives the model a map of its capabilities with usage guidance. This section is the single highest-leverage part of the prompt. We cover it in detail in the next section.
Constraints defines the hard boundaries. These are the things the agent must never do regardless of what the input says. Constraints should be stated as explicit prohibitions, not soft suggestions.
Process gives the agent a default execution plan. This does not prevent the agent from adapting, but it gives it a strong prior for the common case. Without a process section, agents often take unnecessary detours.
Output Format specifies what the final output should look like. Structured formats (JSON, typed objects) are vastly preferable to free-form text because they are parseable by downstream systems and less prone to model improvisation.
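Because downstream systems consume this output, it pays to validate it at the boundary rather than trusting the model to comply. Here is a minimal sketch of that validation: the field names mirror the lead-scoring Output Format above, and the 0-100 range check is an assumption taken from the agent's role description.

```typescript
// Minimal runtime validation for the agent's JSON summary before it reaches
// downstream systems. Schema mirrors the Output Format example above; the
// 0-100 range check is an assumption from the lead-scoring role.
interface LeadScoreSummary {
  lead_id: string;
  score: number;
  signals: string[];
  reasoning: string;
  confidence: "high" | "medium" | "low";
}

function parseLeadScoreSummary(raw: string): LeadScoreSummary {
  const obj = JSON.parse(raw);
  if (typeof obj.lead_id !== "string") throw new Error("lead_id must be a string");
  if (typeof obj.score !== "number" || obj.score < 0 || obj.score > 100) {
    throw new Error("score must be a number between 0 and 100");
  }
  if (!Array.isArray(obj.signals) || !obj.signals.every((s: unknown) => typeof s === "string")) {
    throw new Error("signals must be an array of strings");
  }
  if (typeof obj.reasoning !== "string") throw new Error("reasoning must be a string");
  if (!["high", "medium", "low"].includes(obj.confidence)) {
    throw new Error("confidence must be high, medium, or low");
  }
  return obj as LeadScoreSummary;
}
```

Rejecting malformed output at parse time turns silent model improvisation into a loud, debuggable failure.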
Tool Description Design
The model never reads your source code. It reads the tool name and the description string you expose through MCP or whatever tool-calling interface you use. Everything the model needs to know about when to call a tool, what to pass it, and what to expect back must live in that description.
This makes tool descriptions load-bearing infrastructure. A vague description does not just make the model's job harder — it makes it wrong. The model will call tools based on what the description says, not what the tool actually does.
Here is a side-by-side comparison:
// Bad: The model has no idea when to use this vs. other search tools
{
  name: "search",
  description: "Searches for information.",
  parameters: { query: { type: "string" } }
}

// Good: The model knows exactly when this tool is appropriate
{
  name: "search_knowledge_base",
  description: `Search the AI University knowledge base for articles,
    tutorials, and documentation. Use this when the user asks about
    our courses, pricing, enrollment process, or platform features.
    Do NOT use this for general web searches or questions unrelated
    to AI University. Returns up to 10 matching articles ranked by
    relevance. Each result includes title, URL, and a 200-word excerpt.`,
  parameters: {
    query: {
      type: "string",
      description: "Natural language search query. Be specific — 'enrollment deadline spring 2026' works better than 'enrollment'."
    },
    max_results: {
      type: "number",
      description: "Maximum results to return. Default 5. Use 10 only if the initial results are insufficient."
    }
  }
}
A good tool description answers three questions: What does this tool do? When should I use it (and when should I not)? What will it return?
The "when should I not" part is frequently overlooked and frequently the cause of misrouted tool calls. If you have five tools and the model is unsure which one to call, it will pick the one whose description most loosely matches. Explicit exclusions in descriptions reduce this ambiguity.
In MCP tool definitions, the description field is the primary lever you have for controlling agent behavior. Invest in it accordingly. A description that saves you one misrouted tool call per hundred runs is worth the extra fifty tokens it costs.
For a deeper treatment of parameter design, return values, and error handling in tools, see Designing Tools for AI Agents.
Chain-of-Thought for Agents
Chain-of-thought prompting — instructing the model to reason step by step before producing an answer — is well-established for single-turn tasks. For agents, it is even more important because the cost of a wrong action is higher than the cost of a wrong word.
When an agent calls a tool, that call might send an email, update a database record, or charge a credit card. You want the model to think before it acts, not after. The instruction is simple and remarkably effective:
const systemPrompt = `
## Decision Process
Before calling any tool, explain your reasoning in a <thinking> block:
1. What information do I currently have?
2. What information do I still need?
3. Which tool will get me that information?
4. What could go wrong with this tool call?
Only proceed with the tool call after completing this reasoning.
`;
In our production agents, adding explicit chain-of-thought instructions reduced tool call errors by roughly 40%. The model makes fewer wrong selections because it is forced to articulate why it is making each selection before executing it.
There is a cost tradeoff. Chain-of-thought reasoning generates additional output tokens — typically 50 to 150 per reasoning step. Across a 15-step agent run, that is 750 to 2,250 extra tokens. Whether this is worth it depends on what a wrong tool call costs you. If a wrong tool call means a bad email goes to a prospect, 2,000 tokens of reasoning is cheap insurance.
You can also use chain-of-thought selectively. Instruct the agent to reason explicitly only before high-stakes actions (writes, deletes, external API calls) and skip reasoning for low-stakes reads. This preserves the safety benefit where it matters while reducing token overhead on routine operations.
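One way to implement the selective version is to classify tools by risk in your agent loop and only inject the reasoning requirement when a high-stakes tool is in play. A sketch, using hypothetical tool names; classify your own tools when adopting this:

```typescript
// Gate the chain-of-thought requirement on tool risk. The tool names below
// are hypothetical examples, not a real API.
const HIGH_STAKES_TOOLS = new Set([
  "update_lead_score", // writes to the CRM
  "send_email",        // externally visible side effect
  "delete_record",     // destructive
]);

function requiresReasoning(toolName: string): boolean {
  return HIGH_STAKES_TOOLS.has(toolName);
}

// Build the per-turn instruction: demand a <thinking> block only when the
// agent's available tools include a high-stakes one.
function buildReasoningInstruction(candidateTools: string[]): string {
  return candidateTools.some(requiresReasoning)
    ? "Before calling any write or external tool, reason step by step in a <thinking> block."
    : "You may call read-only tools directly without an explicit reasoning block.";
}
```

The classification lives in code rather than in the prompt, so adding a new destructive tool automatically brings the reasoning requirement with it.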
The ReAct Pattern
ReAct (Reasoning + Acting) is a specific framework for structuring agent behavior as an alternating loop of thinking and doing. The agent observes information, reasons about it, takes an action, observes the result, and repeats.
The loop looks like this:
Observe --> Think --> Act --> Observe --> Think --> Act --> ... --> Final Answer
This is not just a conceptual model. You can implement it explicitly in your agent loop, and doing so makes agent behavior significantly more predictable and debuggable.
Here is a TypeScript implementation:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// executeToolCall dispatches a tool call to your own tool implementations.
// It is assumed to be defined elsewhere; declared here so the sketch compiles.
declare function executeToolCall(
  name: string,
  input: Record<string, unknown>
): Promise<unknown>;

interface ReActStep {
  observation: string;
  thought: string;
  action: string | null;
  actionInput: Record<string, unknown> | null;
}

async function runReActAgent(
  task: string,
  tools: Anthropic.Tool[],
  maxSteps: number = 10
): Promise<{ answer: string; steps: ReActStep[] }> {
  const steps: ReActStep[] = [];
  const messages: Anthropic.MessageParam[] = [];

  const systemPrompt = `You are an AI agent that solves tasks using the ReAct framework.

Before every tool call, write out your reasoning as text in this format:

Observation: [What you currently know or what the last tool returned]
Thought: [Your reasoning about what to do next]

Then make the tool call itself using the tools provided.

When you have enough information to answer the task, do NOT call another tool.
Instead, reply in plain text:

Thought: I now have enough information to answer.
Final Answer: [your final answer here]

Important rules:
- Always reason before acting.
- Never call a tool without explaining why in your Thought.
- Stop as soon as you have a confident answer. Do not over-research.`;

  messages.push({ role: "user", content: `Task: ${task}` });

  for (let step = 0; step < maxSteps; step++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 2048,
      system: systemPrompt,
      messages,
      tools,
    });

    // Check if the model wants to use a tool
    const toolUseBlock = response.content.find(
      (block) => block.type === "tool_use"
    );
    const textBlock = response.content.find(
      (block) => block.type === "text"
    );
    const thought = textBlock?.type === "text" ? textBlock.text : "";

    if (!toolUseBlock) {
      // Agent is done — extract the final answer from reasoning
      return { answer: thought, steps };
    }

    // Execute the tool call
    const toolResult = await executeToolCall(
      toolUseBlock.name,
      toolUseBlock.input as Record<string, unknown>
    );

    steps.push({
      observation: typeof toolResult === "string"
        ? toolResult
        : JSON.stringify(toolResult),
      thought,
      action: toolUseBlock.name,
      actionInput: toolUseBlock.input as Record<string, unknown>,
    });

    // Feed results back into the conversation
    messages.push({
      role: "assistant",
      content: response.content,
    });
    messages.push({
      role: "user",
      content: [
        {
          type: "tool_result",
          tool_use_id: toolUseBlock.id,
          content: JSON.stringify(toolResult),
        },
      ],
    });
  }

  return {
    answer: "Max steps reached without resolution.",
    steps,
  };
}
The ReAct pattern gives you two things that a bare tool-calling loop does not. First, every action is preceded by an explicit reasoning trace, which makes debugging straightforward — you can read the agent's thinking and identify exactly where its logic went wrong. Second, the alternating structure naturally prevents the model from making multiple tool calls based on stale assumptions. It must observe the result of each action before deciding the next one.
The tradeoff is latency. Each step requires a full model roundtrip. For tasks that benefit from parallel tool calls, ReAct is slower than a model that fires off three tool calls at once. Use ReAct for tasks where correctness matters more than speed — lead qualification, financial analysis, anything with real-world consequences.
Prompt Caching
Anthropic's prompt caching is one of the most impactful optimizations for agent workloads. It saves 75-90% on input tokens for the static portions of your prompt.
Here is how it works. When you mark a block of the prompt with a `cache_control` field, Anthropic caches the computed key-value attention state for that prefix. On subsequent requests with the same prefix, the cached state is loaded instead of recomputed. You pay a cache read price (roughly 10% of the normal input token cost) instead of the full input token price.
For agents, this matters enormously because the system prompt is sent on every single turn of the loop. A 1,500-token system prompt sent across 12 agent turns normally costs the equivalent of 18,000 input tokens. With caching, the first turn pays the cache write price (a 25% premium over normal input at current pricing, about 1,875 equivalent tokens) and the remaining 11 turns pay the cache read price (about 1,650 equivalent tokens), bringing the total from 18,000 to roughly 3,525, an 80% reduction on the system prompt alone.
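The arithmetic above is easy to reproduce. A sketch, assuming a 1.25x multiplier for a cache write and 0.1x for a cache read; these match Anthropic's published pricing at the time of writing, but check current pricing before relying on them:

```typescript
// Back-of-envelope cost of a cached vs uncached system prompt across an
// agent run. Multipliers are assumptions based on published pricing.
const WRITE_MULTIPLIER = 1.25; // first turn writes the cache at a premium
const READ_MULTIPLIER = 0.1;   // later turns read it at a discount

function equivalentTokens(promptTokens: number, turns: number, cached: boolean): number {
  if (!cached) return promptTokens * turns;
  return (
    promptTokens * WRITE_MULTIPLIER +            // turn 1: cache write
    promptTokens * READ_MULTIPLIER * (turns - 1) // turns 2..n: cache reads
  );
}

const uncached = equivalentTokens(1500, 12, false); // 18,000 equivalent tokens
const cachedCost = equivalentTokens(1500, 12, true); // 1,875 + 1,650 = 3,525
const savings = 1 - cachedCost / uncached;           // roughly 0.80
```

Run this with your own prompt size and typical loop length before deciding how much prompt real estate to spend on examples.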
To maximize cache hit rates, structure your prompt as a static prefix followed by a dynamic suffix:
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: STATIC_SYSTEM_PROMPT, // Role, tools, constraints, examples
      cache_control: { type: "ephemeral" },
    },
    {
      type: "text",
      text: buildDynamicContext(), // Current date, user info, session state
    },
  ],
  messages: conversationHistory,
});
The static prefix contains everything that does not change between turns: role definition, tool descriptions, constraints, examples. The dynamic suffix contains everything that does: current timestamp, user-specific context, session state.
The rule is simple: anything that is the same across multiple turns goes in the cached prefix. Anything that changes goes after it. Do not interleave static and dynamic content — that breaks the prefix match and invalidates the cache.
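A small helper makes the separation hard to get wrong by construction: the static prefix is a single frozen string, and anything dynamic can only land in the trailing block. A sketch; the block shapes follow the Anthropic SDK's system-content format, so verify against the current SDK types before adopting:

```typescript
// Keep static and dynamic prompt content strictly separated so the static
// prefix stays byte-identical across turns. Any edit to the static text
// between turns silently invalidates the cache.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

function buildSystemBlocks(
  staticPrompt: string,
  dynamic: { isoDate: string; userId: string }
): SystemBlock[] {
  return [
    // Static prefix: role, tools, constraints, examples. Cached.
    { type: "text", text: staticPrompt, cache_control: { type: "ephemeral" } },
    // Dynamic suffix: per-turn context. Never cached, never reordered.
    { type: "text", text: `Current date: ${dynamic.isoDate}\nUser: ${dynamic.userId}` },
  ];
}
```

If every call site goes through this function, no one can accidentally interpolate a timestamp into the cached prefix.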
For a comprehensive treatment of token cost optimization including caching, model routing, and context window management, see Token Optimization.
Few-Shot Examples in Agent Prompts
Few-shot examples show the agent what correct behavior looks like. For a chatbot, this means showing good answers. For an agent, this means showing good tool usage — the right tool selected for the right reason with the right parameters.
const systemPrompt = `
## Examples of Correct Tool Usage
Example 1: User asks "What courses do we offer on prompt engineering?"
Correct action: Call search_knowledge_base with query "prompt engineering courses"
Why: This is a question about AI University content. The knowledge base
contains our course catalog. Do not use web_search for internal questions.
Example 2: User asks "How does GPT-4 compare to Claude?"
Correct action: Call web_search with query "GPT-4 vs Claude comparison 2026"
Why: This is a general industry question, not about AI University
specifically. The knowledge base would not have this information.
Example 3: User asks "Cancel my subscription"
Correct action: Do NOT call any tool. Respond with:
"I can't modify subscriptions directly. Please contact support@aiuniversity.com
or visit your account settings page."
Why: The agent does not have a subscription management tool. Attempting to
use an unrelated tool to approximate this action is wrong.
`;
Notice that Example 3 shows the agent what not to do. Negative examples are just as valuable as positive ones, sometimes more so. Without Example 3, an agent with access to a CRM tool might try to "cancel" the subscription by updating a CRM field — a creative but destructive interpretation.
The cost tradeoff with few-shot examples is real. Three detailed examples add 300 to 500 tokens to the system prompt, which is billed on every turn. For a 15-turn agent run, that is 4,500 to 7,500 additional input tokens. Prompt caching mitigates this significantly (examples are static and cacheable), but the cost is not zero.
The practical approach: start with 2-3 examples covering your most common failure modes. Measure whether they improve tool selection accuracy. If they do, keep them. If accuracy is already high without them, remove them and save the tokens.
Common Prompt Engineering Mistakes
These are the mistakes we see most often in production agent prompts. Every one of them has caused a real incident in a real system.
| Mistake | What Happens | Fix |
|---|---|---|
| Vague tool descriptions | Agent calls the wrong tool or calls the right tool with wrong parameters | Write descriptions that answer: what, when, when NOT, and what it returns |
| No stopping condition | Agent loops indefinitely, burning tokens and hitting rate limits | Add explicit stopping criteria: "Stop when you have a confident answer or after 5 tool calls" |
| No constraints section | Agent takes destructive actions (deletes, sends emails) without guardrails | List hard prohibitions explicitly: "Never delete records. Never send emails without confirmation." |
| Soft language in constraints | "Try to avoid" and "preferably" get ignored under pressure | Use absolute language: "Never", "Must not", "Always". Models respect hard rules better than suggestions. |
| Too many tools loaded at once | Agent gets confused choosing between 30+ similar tools | Group tools by task. Load only the tools relevant to the current agent's scope. 8-12 tools is a practical ceiling. |
| No error handling instructions | Agent retries failed tool calls infinitely or halts without explanation | Add: "If a tool returns an error, log the error, attempt one retry, and if it fails again, report the failure and move on." |
| Missing output format spec | Agent returns prose when downstream systems expect JSON | Specify the exact output schema. Include a sample output in the prompt. |
| Ignoring token cost of examples | System prompt grows to 4,000+ tokens with examples, costing more per turn than the tool calls themselves | Audit example count. Cache the prompt. Remove examples that do not measurably improve accuracy. |
The pattern across all of these is the same: agent prompts fail when they leave important decisions implicit. A chatbot can rely on a human to redirect when something goes wrong. An agent cannot. Everything you want the agent to do — and everything you want it to not do — must be stated explicitly in the prompt.
Testing Your Prompts
Writing an agent prompt without testing it is like writing a function without tests. It might work for the demo input, but you have no idea how it handles edge cases.
Eval-driven prompt development follows a simple loop:
1. Define your evaluation cases before writing the prompt. What inputs should this agent handle? What outputs are correct? What failure modes do you need to guard against? Write these down as test cases with expected outcomes.
2. Write the initial prompt. Do not try to be perfect. Get something reasonable and move on.
3. Run the evaluation suite. Feed each test case to the agent and compare its behavior to expected behavior. Track tool selection accuracy, output format compliance, constraint adherence, and final answer correctness.
4. Analyze failures. Most failures cluster into categories: wrong tool selected, correct tool but wrong parameters, constraint violated, output format wrong. Each category has a different fix.
5. Iterate on the prompt and re-run. Change one thing at a time. If you change three things and accuracy improves, you do not know which change mattered.
// Supporting types and helpers, assumed to be defined elsewhere in your codebase:
type AgentFunction = (input: string) => Promise<AgentTrace>;

interface AgentTrace {
  toolCalls: { name: string }[];
}

declare function arraysMatch(a: string[], b: string[]): boolean;
declare function checkConstraint(trace: AgentTrace, constraint: string): boolean;

interface EvalResult {
  name: string;
  passed: boolean;
  toolCallsCorrect: boolean;
  constraintsPassed: boolean;
  trace: AgentTrace;
}

interface EvalReport {
  total: number;
  passed: number;
  failed: EvalResult[];
  results: EvalResult[];
}

interface PromptEval {
  name: string;
  input: string;
  expectedToolCalls: string[];
  expectedOutput: Record<string, unknown>;
  constraints: string[];
}

const evals: PromptEval[] = [
  {
    name: "basic_lead_scoring",
    input: "Score lead with ID lead_12345",
    expectedToolCalls: [
      "get_crm_data",
      "get_engagement_history",
      "update_lead_score",
    ],
    expectedOutput: { score: "number between 0-100", confidence: "string" },
    constraints: ["Must not call update_lead_score more than once"],
  },
  {
    name: "missing_lead",
    input: "Score lead with ID lead_nonexistent",
    expectedToolCalls: ["get_crm_data"],
    expectedOutput: { error: "Lead not found" },
    constraints: [
      "Must not call update_lead_score",
      "Must not fabricate lead data",
    ],
  },
  {
    name: "b2b_lead_with_firmographics",
    input: "Score lead with ID lead_67890 (company: Acme Corp)",
    expectedToolCalls: [
      "get_crm_data",
      "get_engagement_history",
      "get_firmographic_data",
      "update_lead_score",
    ],
    expectedOutput: { score: "number between 0-100" },
    constraints: ["Must call get_firmographic_data for B2B leads"],
  },
];

async function runEvalSuite(
  agent: AgentFunction,
  evals: PromptEval[]
): Promise<EvalReport> {
  const results: EvalResult[] = [];

  for (const evalCase of evals) {
    const trace = await agent(evalCase.input);

    const toolCallsCorrect = arraysMatch(
      trace.toolCalls.map((t) => t.name),
      evalCase.expectedToolCalls
    );
    const constraintsPassed = evalCase.constraints.every((constraint) =>
      checkConstraint(trace, constraint)
    );

    results.push({
      name: evalCase.name,
      passed: toolCallsCorrect && constraintsPassed,
      toolCallsCorrect,
      constraintsPassed,
      trace,
    });
  }

  return {
    total: results.length,
    passed: results.filter((r) => r.passed).length,
    failed: results.filter((r) => !r.passed),
    results,
  };
}
The discipline of writing evals first changes how you write prompts. Instead of asking "does this prompt sound good?" you ask "does this prompt pass the test cases?" That shift from subjective quality to objective correctness is the difference between prompt engineering as art and prompt engineering as engineering.
In production, run your eval suite on every prompt change before deploying. A prompt that improves accuracy on new test cases but regresses on existing ones is not ready to ship. Treat prompt changes with the same rigor you treat code changes: test, review, deploy, monitor.
Putting It All Together
Agent prompt engineering is not about clever phrasing. It is about building a reliable specification for an autonomous system. The system prompt is the operating manual. The tool descriptions are the interface contracts. The constraints are the safety rails. The examples are the training data. The evals are the test suite.
When you approach it that way — as engineering, not as writing — the quality of your agents improves dramatically. Models are remarkably good at following well-structured instructions. Most agent failures trace back not to model capability but to the prompt leaving something ambiguous that should have been explicit.
Start with the system prompt architecture from this guide. Add chain-of-thought reasoning for high-stakes decisions. Use the ReAct pattern when you need interpretable, debuggable agent behavior. Cache your prompts aggressively. Test against eval cases before you deploy. And when something breaks in production — and it will — read the agent's reasoning trace before you blame the model. Nine times out of ten, the fix is in the prompt.