Monitoring and Debugging AI Agents
Agents fail in ways that are hard to see unless you built observability in from the start. This guide covers structured logging, the agent loop debugger, common failure patterns, and the 20-point debugging checklist.
Traditional software fails loudly. An exception is thrown, a stack trace appears, and you know exactly what line caused the problem. Agents fail quietly. The agent completes its run, returns a response, and everything looks fine — until you realize it called the wrong tool three times, hallucinated a parameter value, and the task was never actually done.
Building observable agents is not optional. It is the difference between a system you can operate in production and one that silently accumulates technical debt in the form of failed tasks you do not know about yet.
Log Every Tool Call and Decision
The minimum viable logging surface for an agent system captures two things: every tool call (with inputs and outputs) and every model decision (which tool to call and why). Without this, debugging is guesswork.
Structured Log Format
Use structured JSON logs, not free-text strings. Structured logs can be queried, aggregated, and alerted on. Free-text logs can only be read.
```typescript
interface AgentLogEntry {
  timestamp: string;   // ISO 8601
  session_id: string;  // unique per agent run
  agent_id: string;    // which agent in your system
  event_type: "tool_call" | "tool_response" | "model_decision" | "error" | "completion";
  step: number;        // which step in the agent loop
  data: Record<string, unknown>;
}
```
```json
// Tool call log entry
{
  "timestamp": "2026-03-02T14:23:11.441Z",
  "session_id": "sess_abc123",
  "agent_id": "lead-scoring-agent",
  "event_type": "tool_call",
  "step": 3,
  "data": {
    "tool_name": "get_contact_by_id",
    "parameters": { "contact_id": "c_xyz789" },
    "duration_ms": null // filled in on response
  }
}

// Tool response log entry
{
  "timestamp": "2026-03-02T14:23:11.698Z",
  "session_id": "sess_abc123",
  "agent_id": "lead-scoring-agent",
  "event_type": "tool_response",
  "step": 3,
  "data": {
    "tool_name": "get_contact_by_id",
    "success": true,
    "duration_ms": 257,
    "response_size_bytes": 843
  }
}
```
Notice the response log does not include the full tool output. Logging full responses bloats your storage quickly. Log the shape and metadata. Log the full content only when there is an error.
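That policy can be enforced in one place with a small helper. A minimal sketch (the function name and field names are illustrative, not part of any real API):

```typescript
// Sketch: build the loggable view of a tool response. Metadata only on
// success; the full payload is kept only when something went wrong.
function summarizeForLog(
  response: unknown,
  success: boolean
): Record<string, unknown> {
  const serialized = JSON.stringify(response);
  if (success) {
    return {
      response_size_bytes: serialized.length,
      response_type: Array.isArray(response) ? "array" : typeof response
    };
  }
  // Errors are rare and need full context, so the storage cost is acceptable.
  return { response_size_bytes: serialized.length, full_response: response };
}
```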
Logging Wrapper
Wrap your tool execution in a logger so every tool call is captured without modifying the tool itself.
```typescript
async function withLogging<T>(
  sessionId: string,
  agentId: string,
  step: number,
  toolName: string,
  parameters: Record<string, unknown>,
  fn: () => Promise<T>
): Promise<T> {
  const startTime = Date.now();
  log({
    timestamp: new Date().toISOString(),
    session_id: sessionId,
    agent_id: agentId,
    event_type: "tool_call",
    step,
    data: { tool_name: toolName, parameters }
  });
  try {
    const result = await fn();
    const duration = Date.now() - startTime;
    log({
      timestamp: new Date().toISOString(),
      session_id: sessionId,
      agent_id: agentId,
      event_type: "tool_response",
      step,
      data: {
        tool_name: toolName,
        success: true,
        duration_ms: duration,
        response_size_bytes: JSON.stringify(result).length
      }
    });
    return result;
  } catch (err) {
    const duration = Date.now() - startTime;
    log({
      timestamp: new Date().toISOString(),
      session_id: sessionId,
      agent_id: agentId,
      event_type: "error",
      step,
      data: {
        tool_name: toolName,
        success: false,
        duration_ms: duration,
        error: err instanceof Error ? err.message : String(err),
        stack: err instanceof Error ? err.stack : undefined
      }
    });
    throw err;
  }
}
```
Use this wrapper at the orchestrator level so it applies to every agent in your system uniformly.
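One way to apply it uniformly is to instrument the whole tool registry once at startup rather than at each call site. A sketch, where `wrap` stands in for a closure over `withLogging` with the session context already bound:

```typescript
type Tool = (params: Record<string, unknown>) => Promise<unknown>;

// Sketch: wrap every registered tool exactly once so logging cannot be
// forgotten at individual call sites. `wrap` is expected to close over the
// session context and delegate to a logger such as withLogging.
function instrumentRegistry(
  tools: Record<string, Tool>,
  wrap: (name: string, fn: Tool) => Tool
): Record<string, Tool> {
  const instrumented: Record<string, Tool> = {};
  for (const [name, fn] of Object.entries(tools)) {
    instrumented[name] = wrap(name, fn);
  }
  return instrumented;
}
```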
Debugging the Agent Loop
When an agent does something unexpected, you need to answer two questions: why did it call that tool, and why did it stop?
Why Did It Call That Tool?
The model chooses tools based on the system prompt, the conversation history, and the tool descriptions available. When the wrong tool is called:
- Check the tool description for ambiguity. Two tools with overlapping descriptions will cause the model to pick randomly between them.
- Check the conversation history at the point of the decision. What information did the model have? Was it missing context that would have changed the decision?
- Check the system prompt. Is there an instruction that would logically lead to this behavior?
Log the model's reasoning when it is available. When using extended thinking or chain-of-thought, capture that output:
```json
{
  "event_type": "model_decision",
  "step": 4,
  "data": {
    "chosen_tool": "send_email",
    "reasoning_summary": "User requested notification be sent. Email is the appropriate channel based on contact preferences.",
    "alternatives_considered": ["notify_slack", "send_sms"],
    "confidence": "high"
  }
}
```
Why Did It Stop?
An agent loop terminates when the model decides the task is done, when it hits a step limit, or when an error is not handled. Premature termination is one of the most common production issues.
Log the completion event with a reason:
```json
{
  "event_type": "completion",
  "data": {
    "reason": "model_stop", // "model_stop" | "step_limit" | "error" | "guardrail_triggered"
    "total_steps": 7,
    "total_tool_calls": 5,
    "total_duration_ms": 4823,
    "final_output_length": 312
  }
}
```
If reason is step_limit, the agent ran out of steps before finishing. Either your step limit is too low for this task, or the agent is stuck in a loop. Both are worth investigating.
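These completion records are easy to triage mechanically. A minimal sketch that groups abnormal terminations by reason (field names follow the log format above):

```typescript
interface CompletionRecord {
  session_id: string;
  reason: "model_stop" | "step_limit" | "error" | "guardrail_triggered";
  total_steps: number;
}

// Sketch: collect sessions that terminated abnormally, grouped by reason,
// so step-limit and error terminations can be investigated separately.
function abnormalCompletions(
  records: CompletionRecord[]
): Record<string, string[]> {
  const grouped: Record<string, string[]> = {};
  for (const r of records) {
    if (r.reason === "model_stop") continue; // normal termination
    (grouped[r.reason] ??= []).push(r.session_id);
  }
  return grouped;
}
```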
Common Failure Patterns
These are the failure modes that appear most frequently in production agent systems and what to look for in your logs.
Infinite Tool Loops
The agent calls the same tool repeatedly with the same parameters, never progressing. This happens when the tool returns an error or unexpected output and the agent does not know how to handle it — so it retries the same call.
Detection: in your logs, flag any session where the same tool_name + parameters combination appears three or more times in a row.
```typescript
function detectLoop(entries: AgentLogEntry[]): boolean {
  const toolCalls = entries
    .filter(e => e.event_type === "tool_call")
    .map(e => `${e.data.tool_name}:${JSON.stringify(e.data.parameters)}`);
  for (let i = 0; i < toolCalls.length - 2; i++) {
    if (toolCalls[i] === toolCalls[i + 1] && toolCalls[i + 1] === toolCalls[i + 2]) {
      return true;
    }
  }
  return false;
}
```
Fix: add a retry budget to your tool executor. After 3 identical calls, return a hard error that the model cannot retry past.
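The retry budget can be sketched as a wrapper around the tool executor, keyed on tool name plus serialized parameters (the function and error names here are illustrative):

```typescript
type Executor = (tool: string, params: Record<string, unknown>) => Promise<unknown>;

// Sketch: after `limit` identical calls within one session, the executor
// throws a terminal error the model cannot retry past.
function withRetryBudget(execute: Executor, limit = 3): Executor {
  const counts = new Map<string, number>();
  return async (tool, params) => {
    const key = `${tool}:${JSON.stringify(params)}`;
    const seen = (counts.get(key) ?? 0) + 1;
    counts.set(key, seen);
    if (seen > limit) {
      throw new Error(
        `RETRY_BUDGET_EXCEEDED: ${tool} called ${seen} times with identical parameters`
      );
    }
    return execute(tool, params);
  };
}
```

One executor instance per session keeps the counts scoped to a single run.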
Wrong Tool Selection
The agent consistently picks the wrong tool for a task type. This is a tool design problem, not a model problem. Common cause: two tools have similar names or overlapping descriptions.
Detection: look at sessions where event_type: "error" appears immediately after a tool call. The model called a tool it should not have, and the tool rejected the input.
Fix: audit the descriptions of all tools in the category and make the distinction explicit. Add a "Do NOT use this for X — use Y instead" line to the description.
Hallucinated Parameters
The model constructs a parameter value that looks syntactically correct but does not exist in your system. Example: generating a contact_id that follows the right format but belongs to no real record.
Detection: log all tool responses where success: false and error: "NOT_FOUND". A high rate of NOT_FOUND errors for ID parameters indicates the model is generating IDs rather than retrieving them.
Fix: make sure IDs always come from a previous tool call. In your system prompt, instruct the agent to never construct IDs — only use IDs returned by previous tool results. Consider adding ID format validation that returns an error before hitting the database.
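Both checks can live in one guard. A sketch, where the c_ prefix format and the error strings are assumptions for illustration:

```typescript
// Sketch: validate an ID before it reaches the database. `seenIds` holds
// every ID returned by earlier tool calls in this session; the "c_" format
// is an illustrative assumption.
function validateContactId(id: string, seenIds: Set<string>): string | null {
  if (!/^c_[a-z0-9]+$/.test(id)) {
    return "INVALID_FORMAT: contact_id must look like c_xyz789";
  }
  if (!seenIds.has(id)) {
    return "UNKNOWN_ID: contact_id was not returned by a previous tool call";
  }
  return null; // valid
}
```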
Context Window Overflow
On long agent runs, the conversation history grows until it exceeds the model's context window. The model starts dropping earlier context, which causes it to forget decisions it made earlier in the session.
Detection: track total_steps and total_tool_calls in your completion logs. If your longest sessions are also your highest error-rate sessions, context overflow is likely the cause.
Fix: implement a summarization step. After every N tool calls, summarize the session state into a compact representation and truncate the conversation history.
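The summarization step might look like the following sketch, where `summarize` stands in for a model call that compacts the dropped messages:

```typescript
interface Message { role: string; content: string; }

// Sketch: every `every` tool calls, collapse all but the last `keep`
// messages into one summary message. `summarize` is a stand-in for a model
// call that produces the compact session-state representation.
function maybeCompact(
  history: Message[],
  toolCallCount: number,
  every: number,
  keep: number,
  summarize: (dropped: Message[]) => string
): Message[] {
  if (toolCallCount === 0 || toolCallCount % every !== 0) return history;
  if (history.length <= keep) return history;
  const dropped = history.slice(0, history.length - keep);
  const kept = history.slice(history.length - keep);
  return [{ role: "system", content: summarize(dropped) }, ...kept];
}
```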
Tool Timeout Cascades
One slow tool holds up the entire agent loop. If your orchestrator has a global timeout, a single slow database query can kill a 10-step agent run at step 2.
Detection: log duration_ms on every tool call. Build an alert for any single tool call exceeding your p95 baseline by more than 3x.
Fix: set per-tool timeouts independent of the session timeout. Let slow tools fail fast rather than dragging down the whole run.
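A per-tool timeout can be sketched with Promise.race; the timer is cleared once the tool settles so it does not linger:

```typescript
// Sketch: race the tool against a per-tool deadline, independent of any
// session-level timeout. The error name is illustrative.
function withToolTimeout<T>(
  fn: () => Promise<T>,
  ms: number,
  toolName: string
): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`TOOL_TIMEOUT: ${toolName} exceeded ${ms}ms`)),
      ms
    );
  });
  return Promise.race([fn(), deadline]).finally(() => clearTimeout(timer));
}
```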
Activity Logs
Beyond per-session debugging logs, run a persistent activity log that tracks aggregate behavior across all agent runs. This gives you the operational view.
```typescript
interface ActivityLogEntry {
  date: string; // YYYY-MM-DD
  agent_id: string;
  total_runs: number;
  successful_runs: number;
  failed_runs: number;
  avg_steps: number;
  avg_duration_ms: number;
  most_called_tools: Array<{ tool: string; count: number }>;
  error_breakdown: Record<string, number>;
}
```
Query this daily. If failed_runs spikes for a specific agent, you have a problem to investigate before it compounds.
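The daily query can be as simple as a failure-rate scan over that day's entries. A sketch (the 10% default threshold is an assumption):

```typescript
// Sketch: scan one day's activity entries and flag agents whose failure
// rate exceeds the threshold.
function flagFailingAgents(
  entries: Array<{ agent_id: string; total_runs: number; failed_runs: number }>,
  threshold = 0.1
): string[] {
  return entries
    .filter(e => e.total_runs > 0 && e.failed_runs / e.total_runs > threshold)
    .map(e => e.agent_id);
}
```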
Alerting on Failures
Not every failure needs an alert. Alert on failures that are actionable, persistent, or high-impact.
Alert immediately when:
- Any agent session ends with reason: "error" after a guardrail is triggered
- The error rate for any agent exceeds 10% over a rolling 1-hour window
- Any tool call returns a 5xx error (infrastructure failure, not user error)
- A session exceeds the step limit — this always means something is wrong
Do not alert on:
- Individual tool validation errors (these are expected and handled by the agent)
- Single failed sessions below your error rate threshold
- NOT_FOUND errors below your established baseline rate
Build your alerts around session-level outcomes, not individual tool call failures. One bad tool call in a successful session is noise. Three bad tool calls that end in a failed session is signal.
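A session-level rolling-window check might look like this sketch (field names are illustrative):

```typescript
interface SessionOutcome { ended_at_ms: number; failed: boolean; }

// Sketch: alert when the failure rate over the trailing window exceeds the
// threshold. Evaluated on session outcomes, not individual tool calls.
function shouldAlert(
  sessions: SessionOutcome[],
  nowMs: number,
  windowMs = 60 * 60 * 1000,
  threshold = 0.1
): boolean {
  const recent = sessions.filter(s => nowMs - s.ended_at_ms <= windowMs);
  if (recent.length === 0) return false;
  const failed = recent.filter(s => s.failed).length;
  return failed / recent.length > threshold;
}
```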
The 20-Point Debugging Checklist
When an agent is misbehaving in production, work through this list in order:
Tool Design
1. Is the failing tool's description specific enough to distinguish it from similar tools?
2. Do all parameters have descriptions with format examples?
3. Are constrained values using enums?
4. Do error responses include the error code, field name, and received value?
5. Are success and failure return shapes consistent?

Agent Loop

6. What step number did the failure occur on?
7. What tool was called immediately before the failure?
8. Is the same tool being called repeatedly with the same parameters?
9. Did the model have the correct context at the point of the bad decision?
10. Did the session hit the step limit?

Context and Prompts

11. Does the system prompt include instructions that could cause this behavior?
12. Is the conversation history at the failure point longer than expected?
13. Were there any NOT_FOUND errors in the session that could have confused the model?
14. Are there any ambiguous instructions in the prompt that could be interpreted multiple ways?
15. Is the task complexity within the expected range for this agent?

Infrastructure

16. Did any tool call exceed 5 seconds?
17. Are there any 5xx errors in the tool response logs?
18. Is the model API responding within normal latency?
19. Did the session run during a period of elevated traffic?
20. Is this failure pattern appearing across multiple agents or isolated to one?
Working through this list systematically will resolve the majority of production agent issues without needing to read raw model outputs.
Next Steps
Monitoring tells you what is happening. Guardrails tell you what is allowed to happen. See Guardrails and Safety for how to build hard limits around your agents so that when monitoring catches a problem, the blast radius is already contained.