Deployment Checklist: From Prototype to Production

Most agents that work in development never make it to production. Of the ones that do, most are quietly shut down within three months. The failure rate is not a secret — analysts consistently put it at 80% or higher for AI projects generally, and autonomous agents have an even harder time because they amplify every weakness in the system beneath them.

The gap between a prototype that works in a notebook and an agent that runs reliably for six months is not model quality. The model is usually fine. The gap is everything around the model: error handling, cost controls, observability, security. The things that feel like overhead when you are moving fast are exactly what determines whether you ship something that lasts.

This checklist is what we work through before deploying any agent at The AI University. We have broken things in every way covered here. Use it.

Why Most Agents Fail in Production

Agents fail in production for a predictable set of reasons.

Tool calls are unreliable. APIs go down. Rate limits get hit. Responses come back malformed. In any real workload, expect 3-15% of tool calls to fail in some way. An agent with no retry logic will fail on the first bad response it gets.

Token costs spiral without limits. An agent that works fine on a test case can blow through your monthly budget in an afternoon if it hits a pathological input or loops. Without explicit token budgets, there is nothing to stop it.

Errors are invisible. If an agent does not log its tool calls, decisions, and errors, you have no way to know what went wrong. An agent that fails silently is worse than one that never ran.

State disappears between runs. Production agents run on schedules and restart constantly. Without explicit memory persistence, every run starts from scratch.

Security is an afterthought. Prompt injection, leaked API keys, unrestricted tool access — these are not theoretical. They are what gets agent deployments shut down.

No one defined what "stuck" looks like. Without a clear escalation definition, a looping agent will keep trying until it exhausts its token budget.

Every item in the checklist below addresses one of these failure modes directly.

The Pre-Deployment Checklist

Work through this in order. Each item is a requirement, not a suggestion.

1. Error Handling and Retry Logic

Tool calls fail. This is not an edge case — it is the baseline. Network timeouts, rate limit responses, malformed JSON, API outages: plan for all of them.

Your agent needs at minimum:

A retry wrapper around every external tool call with exponential backoff
A maximum retry count (3 attempts is standard; more than that usually means a systemic problem, not transient noise)
Error classification: distinguish between retryable errors (rate limit, timeout, 5xx) and terminal errors (invalid credentials, 4xx client errors, schema violations)
A fallback path when a tool is permanently unavailable for the current run
Logged error context: which tool, what input, what error, how many retries

Here is a production-grade error handling wrapper we use for all tool calls:

// lib/tool-runner.ts

interface ToolCallOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  retryableStatusCodes?: number[];
}

interface ToolCallResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  attempts: number;
  durationMs: number;
}

async function runToolCall<T>(
  toolName: string,
  fn: () => Promise<T>,
  options: ToolCallOptions = {}
): Promise<ToolCallResult<T>> {
  const {
    maxRetries = 3,
    baseDelayMs = 500,
    retryableStatusCodes = [429, 500, 502, 503, 504],
  } = options;

  const startTime = Date.now();
  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await fn();
      const durationMs = Date.now() - startTime;

      console.log(JSON.stringify({
        event: "tool_call_success",
        tool: toolName,
        attempt,
        durationMs,
      }));

      return { success: true, data, attempts: attempt, durationMs };
    } catch (error) {
      lastError = error as Error;
      const statusCode = (error as any)?.status ?? (error as any)?.statusCode;
      const isRetryable = retryableStatusCodes.includes(statusCode) ||
        (error as any)?.code === "ECONNRESET" ||
        (error as any)?.code === "ETIMEDOUT";

      console.log(JSON.stringify({
        event: "tool_call_error",
        tool: toolName,
        attempt,
        statusCode,
        isRetryable,
        error: lastError.message,
      }));

      if (!isRetryable || attempt === maxRetries) break;

      const delay = baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  return {
    success: false,
    error: lastError?.message ?? "Unknown error",
    attempts: maxRetries,
    durationMs: Date.now() - startTime,
  };
}

Wrap every tool call with this. No exceptions. A tool call without retry logic is a reliability risk waiting to surface.

2. Token Budget Management

Set max_tokens on every API call. Without it, a single runaway agent can consume your entire monthly allocation in one session.

Set max_tokens on every messages.create call. Start conservative — 4,096 for most agents, 8,192 for long-form generation.
Track cumulative tokens across all API calls in a single run. If the run approaches a threshold (we use 100,000 tokens as a soft ceiling), log a warning and short-circuit gracefully.
Log input and output token counts for every call to catch agents stuffing too much context into prompts.
Review per-agent token usage weekly in the first month. Actual usage almost always differs from estimates.

// Track token usage per run
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  calls: number;
}

const runUsage: TokenUsage = { inputTokens: 0, outputTokens: 0, calls: 0 };
const MAX_RUN_INPUT_TOKENS = 100_000;

function trackUsage(usage: { input_tokens: number; output_tokens: number }) {
  runUsage.inputTokens += usage.input_tokens;
  runUsage.outputTokens += usage.output_tokens;
  runUsage.calls += 1;

  if (runUsage.inputTokens > MAX_RUN_INPUT_TOKENS) {
    throw new Error(
      `Token budget exceeded: ${runUsage.inputTokens} input tokens consumed`
    );
  }
}

3. Rate Limiting

You will hit rate limits. Plan before it happens.

Beyond the API-level limits, impose self-limits per agent: maximum API calls per run, maximum concurrent runs if you parallelize, minimum delay between runs of the same agent to prevent cascading re-runs on failure, and a circuit breaker that pauses an agent after three consecutive failures before retrying.

Self-imposed rate limits are not just defensive — they are how you make cost predictable.

4. Guardrails

Guardrails are the controls that prevent your agent from doing things it should not do. There are three layers:

Input validation. Validate every input before the agent sees it. Check required fields, expected value ranges, and obvious injection patterns. Reject malformed inputs early.

Output validation. Validate every structured output before acting on it. If the agent returns JSON matching a schema, verify that schema before passing the output downstream. Do not discover the format is wrong at the point of a database write.

Tool access control. Agents should only have access to the tools they need. A content agent does not need database write access. A reporting agent does not need to send emails. Scope tool access tightly — the smaller the blast radius of a misbehaving agent, the better.

These guardrails are covered in depth in the Guardrails and Safety guide.

5. Logging and Observability

If it is not logged, it did not happen — at least not in any form you can reason about later.

Every agent in production needs to log: every tool call (name, input, output or error, duration, attempt count), every LLM call (model, token counts, stop reason), every decision point, every error with full context, and run start/end with total token usage and cost.

Log as structured JSON, not prose. Structured logs can be queried.

function logEvent(event: Record<string, unknown>) {
  console.log(JSON.stringify({
    ...event,
    timestamp: new Date().toISOString(),
    agentId: process.env.AGENT_ID,
    runId: process.env.RUN_ID,
  }));
}

Ship logs somewhere you can query them. A local file is fine for the first week. After that you need a real log store — Axiom, Datadog, CloudWatch, or similar. You cannot debug what you cannot search.

6. Memory Persistence

Where does your agent's state live between runs?

In a prototype, the answer is usually memory or the context window. Neither works in production — both vanish when the process ends. Before deploying, answer four questions: What state does this agent need to persist? Where does it live (Postgres, Redis, SQLite, Upstash)? How is it loaded at run start? How is it written at run end?

State updates should be committed atomically at the end of a successful run, not incrementally during execution. Partial writes on a failed run are how state corruption happens.

An agent without persistent memory is a stateless function with an expensive prompt. Real agent value comes from accumulation over time.

7. Graceful Degradation

What happens when a tool the agent depends on is unavailable?

The wrong answer: the agent crashes, the run fails, and you find out when someone notices the output is missing.

The right behavior depends on the tool:

Enrichment data (external API, third-party service): log the unavailability, continue with what you have, flag the output as incomplete.
Required write target (database, email): queue the write for retry rather than failing the whole run.
Core capability (no alternative): fail gracefully, log the cause, skip to the next item in the queue.

Define the degradation behavior for each tool before you deploy. "What does this agent do if this tool is down?" is a design-time question, not an incident-time one.

8. Cost Controls

Token costs and third-party API costs both need hard limits.

Set daily budget caps per agent and alert when you are approaching them. The mechanism does not need to be sophisticated: a simple check at the start of each run against a daily usage counter in your database is enough.

async function checkDailyBudget(agentId: string, estimatedCost: number) {
  const today = new Date().toISOString().split("T")[0];
  const dailySpend = await getDailySpend(agentId, today);
  const DAILY_CAP_USD = parseFloat(process.env.AGENT_DAILY_CAP_USD ?? "5.00");

  if (dailySpend + estimatedCost > DAILY_CAP_USD) {
    await sendAlert({
      type: "budget_cap_reached",
      agentId,
      dailySpend,
      estimatedCost,
      cap: DAILY_CAP_USD,
    });
    throw new Error(`Daily budget cap reached for agent ${agentId}`);
  }
}

Set alert thresholds at 70% and 90% of your caps so you can react before the cap is hit. A cost spike that you catch at 70% usage is a warning. One you catch after the cap is already exceeded is an incident.

9. Security

Security for production agents has four distinct concerns:

API key management. Never hardcode keys. Never log them. Load from environment variables. Rotate on a schedule. Have a runbook for emergency rotation.

Prompt injection defense. Any user-controlled input that enters an agent's context is an injection surface. Sanitize inputs before they hit the prompt. Clearly delimit user content from instructions. Validate that outputs conform to expected patterns rather than trusting them blindly.

Tool sandboxing. Agents that execute code or shell commands must do so in isolated environments. Containerize code execution. Restrict filesystem access. Limit network access to explicitly allowed domains. The agent should never reach your production database from a code execution tool.

Output sanitization. Treat agent outputs with the same skepticism as user-generated content. An agent writing to a web page without sanitization is an XSS vector.

10. Human Escalation

Every agent needs a defined escalation path: the conditions under which it stops, surfaces the situation to a human, and waits.

Define these conditions before deployment: task complexity exceeds the agent's scope; confidence is below a defined threshold; the task involves an irreversible action (deleting records, bulk emails, purchases) above a risk threshold; the agent has retried the same step more than N times; the input contains sensitive categories outside the agent's authorization.

Escalation is not a failure mode — it is a feature. Implement it as a first-class tool the agent can call, not a catch-all error handler:

const escalateToHumanTool = {
  name: "escalate_to_human",
  description: "Stop the current task and escalate to a human operator. Use when the task is outside your scope, confidence is low, or an irreversible action requires human approval.",
  input_schema: {
    type: "object",
    properties: {
      reason: { type: "string", description: "Why escalation is needed" },
      context: { type: "string", description: "Full context for the human reviewer" },
      urgency: { type: "string", enum: ["low", "medium", "high"] },
    },
    required: ["reason", "context", "urgency"],
  },
};

Infrastructure Requirements

Beyond the agent code itself, production deployment requires:

Server or VPS. A small VPS (2 vCPU, 4GB RAM) handles most agent workloads. We run our 12-agent system on a single DigitalOcean droplet. Containerize with Docker so agents are portable and isolated from each other.

Cron scheduling. Use system cron or a dedicated job scheduler (Inngest, Trigger.dev) to trigger runs. Do not rely on a web framework's background job system — it restarts unexpectedly and jobs get lost.

Environment variables. Every secret and configurable value should be an environment variable. Use .env locally and your hosting provider's secrets management in production. Never commit .env to version control.

Monitoring. At minimum: an alert channel where run failures are posted, a dashboard showing success/failure rates over time, and a watchdog alert for when a scheduled agent has not reported success within its expected window.

Deployment pipeline. Even a shell script that pulls latest code, runs tests, and restarts the process is better than manual deploys. Automate so deployments are repeatable and auditable.

Deployment Readiness Scoring

Use this table to score your agent before you ship it. Each item is worth one point. A score of 8 or below means the agent is not ready for production.

Requirement	Not Done (0)	Partial (0.5)	Complete (1)
Error handling and retry logic	No retry logic	Some tool calls wrapped	All tool calls wrapped with backoff
Token budget management	No max_tokens set	max_tokens set, no tracking	max_tokens set, usage tracked per run
Rate limiting	No limits	API limits only	API limits plus self-imposed per-agent limits
Guardrails	None	Input validation only	Input + output validation + tool access control
Logging and observability	No logging	Some logging	Structured logs for all tool calls, decisions, errors
Memory persistence	In-memory only	Files on disk	Database with atomic writes
Graceful degradation	Crashes on tool failure	Logs failure, still crashes	Defined fallback behavior per tool
Cost controls	No limits	Informal monitoring	Hard daily caps with alerts
Security	Keys in code	Keys in env vars	Keys in env vars + prompt injection defense + tool sandboxing
Human escalation	No escalation path	Error-handler escalation	First-class escalation tool with defined triggers

Score 10: Ship it. Stay close for the first 48 hours.

Score 8-9: Acceptable for low-stakes workloads. Do not use for anything with write access to critical systems.

Score 6-7: Not production-ready. The gaps you have are the ones that will cause incidents.

Score 5 or below: This is a prototype. Do not call it a production deployment.

Common Deployment Failures and Fixes

Failure	Symptom	Root Cause	Fix
Agent stops mid-task silently	Run completes but output is missing	Tool call failed with no retry or logging	Wrap all tool calls with the retry wrapper; add structured logging
Cost spike	Monthly bill 5-10x expected	Agent stuck in loop or hitting pathological input	Add per-run token budget with hard ceiling; add loop detection
Rate limit cascade	Multiple agents fail within minutes of each other	Agents share API quota and trigger simultaneously	Stagger cron schedules; add jitter to retry logic
State corruption between runs	Agent references data from two runs ago or wrong entity	Partial state write on prior failed run	Use atomic state writes; load state from DB at run start
Prompt injection from user data	Agent takes actions that were not in your system prompt	User-controlled data entered the prompt unvalidated	Sanitize and delimit all user inputs before they enter context
Runaway agent	Run exceeds 30 minutes, still going	No maximum iteration count or step limit	Add step counter; kill run after maximum steps regardless of completion
No alert on failure	Agent has not run in 3 days; nobody noticed	No monitoring on scheduled run success	Set up watchdog alert if agent has not reported success within expected window
Wrong tool access	Agent modified data it should only read	Tools were not scoped to agent's actual needs	Audit tool list; remove anything the agent does not explicitly need

Key Takeaways

Ship nothing that scores below 8 on the readiness table. The items you skip are exactly the ones that will cause the incident that gets the project cancelled.

The 80% failure rate for AI agent projects is not inevitable. It is the predictable result of treating infrastructure as an afterthought. Every item on this checklist has a real failure mode behind it. The checklist exists because we encountered most of them.

Error handling is not optional. Tool calls fail 3-15% of the time in normal operation. An agent with no retry logic has a meaningful probability of failing on any given run.

Cost controls are not optional. Token costs compound. A daily cap that feels conservative will feel essential the first time an agent runs pathologically.

Escalation is a feature. The agents you trust are the ones that know when to stop. Define escalation conditions before you ship, not after your first incident.

Start with one agent. Get it to 10 on the readiness table. Run it for two weeks. Then add the next one. Each agent you add before the first is solid makes the whole system harder to debug.

For operational guidance after shipping, see the Monitoring and Debugging guide.