AI University/Docs

Deployment Checklist: From Prototype to Production

80% of AI agent projects never make it to production. This guide covers the ten non-negotiable requirements for deploying agents that actually survive contact with the real world: error handling, token budgets, rate limiting, guardrails, logging, memory persistence, graceful degradation, cost controls, security, and human escalation. Includes TypeScript code, a readiness scoring table, and a failures reference.

Last updated: 2026-03-02

Deployment Checklist: From Prototype to Production

Most agents that work in development never make it to production. Of the ones that do, most are quietly shut down within three months. The failure rate is not a secret — analysts consistently put it at 80% or higher for AI projects generally, and autonomous agents have an even harder time because they amplify every weakness in the system beneath them.

The gap between a prototype that works in a notebook and an agent that runs reliably for six months is not model quality. The model is usually fine. The gap is everything around the model: error handling, cost controls, observability, security. The things that feel like overhead when you are moving fast are exactly what determines whether you ship something that lasts.

This checklist is what we work through before deploying any agent at The AI University. We have broken things in every way covered here. Use it.


Why Most Agents Fail in Production

Agents fail in production for a predictable set of reasons.

Tool calls are unreliable. APIs go down. Rate limits get hit. Responses come back malformed. In any real workload, expect 3-15% of tool calls to fail in some way. An agent with no retry logic will fail on the first bad response it gets.

Token costs spiral without limits. An agent that works fine on a test case can blow through your monthly budget in an afternoon if it hits a pathological input or loops. Without explicit token budgets, there is nothing to stop it.

Errors are invisible. If an agent does not log its tool calls, decisions, and errors, you have no way to know what went wrong. An agent that fails silently is worse than one that never ran.

State disappears between runs. Production agents run on schedules and restart constantly. Without explicit memory persistence, every run starts from scratch.

Security is an afterthought. Prompt injection, leaked API keys, unrestricted tool access — these are not theoretical. They are what gets agent deployments shut down.

No one defined what "stuck" looks like. Without a clear escalation definition, a looping agent will keep trying until it exhausts its token budget.

Every item in the checklist below addresses one of these failure modes directly.


The Pre-Deployment Checklist

Work through this in order. Each item is a requirement, not a suggestion.


1. Error Handling and Retry Logic

Tool calls fail. This is not an edge case — it is the baseline. Network timeouts, rate limit responses, malformed JSON, API outages: plan for all of them.

Your agent needs at minimum:

  • A retry wrapper around every external tool call with exponential backoff
  • A maximum retry count (3 attempts is standard; more than that usually means a systemic problem, not transient noise)
  • Error classification: distinguish between retryable errors (rate limit, timeout, 5xx) and terminal errors (invalid credentials, 4xx client errors, schema violations)
  • A fallback path when a tool is permanently unavailable for the current run
  • Logged error context: which tool, what input, what error, how many retries

Here is a production-grade error handling wrapper we use for all tool calls:

// lib/tool-runner.ts

interface ToolCallOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  retryableStatusCodes?: number[];
}

interface ToolCallResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  attempts: number;
  durationMs: number;
}

async function runToolCall<T>(
  toolName: string,
  fn: () => Promise<T>,
  options: ToolCallOptions = {}
): Promise<ToolCallResult<T>> {
  const {
    maxRetries = 3,
    baseDelayMs = 500,
    retryableStatusCodes = [429, 500, 502, 503, 504],
  } = options;

  const startTime = Date.now();
  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await fn();
      const durationMs = Date.now() - startTime;

      console.log(JSON.stringify({
        event: "tool_call_success",
        tool: toolName,
        attempt,
        durationMs,
      }));

      return { success: true, data, attempts: attempt, durationMs };
    } catch (error) {
      lastError = error as Error;
      const statusCode = (error as any)?.status ?? (error as any)?.statusCode;
      const isRetryable = retryableStatusCodes.includes(statusCode) ||
        (error as any)?.code === "ECONNRESET" ||
        (error as any)?.code === "ETIMEDOUT";

      console.log(JSON.stringify({
        event: "tool_call_error",
        tool: toolName,
        attempt,
        statusCode,
        isRetryable,
        error: lastError.message,
      }));

      if (!isRetryable || attempt === maxRetries) break;

      const delay = baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  return {
    success: false,
    error: lastError?.message ?? "Unknown error",
    attempts: maxRetries,
    durationMs: Date.now() - startTime,
  };
}

Wrap every tool call with this. No exceptions. A tool call without retry logic is a reliability risk waiting to surface.


2. Token Budget Management

Set max_tokens on every API call. Without it, a single runaway agent can consume your entire monthly allocation in one session.

  • Set max_tokens on every messages.create call. Start conservative — 4,096 for most agents, 8,192 for long-form generation.
  • Track cumulative tokens across all API calls in a single run. If the run approaches a threshold (we use 100,000 tokens as a soft ceiling), log a warning and short-circuit gracefully.
  • Log input and output token counts for every call to catch agents stuffing too much context into prompts.
  • Review per-agent token usage weekly in the first month. Actual usage almost always differs from estimates.
// Track token usage per run
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  calls: number;
}

const runUsage: TokenUsage = { inputTokens: 0, outputTokens: 0, calls: 0 };
const MAX_RUN_INPUT_TOKENS = 100_000;

function trackUsage(usage: { input_tokens: number; output_tokens: number }) {
  runUsage.inputTokens += usage.input_tokens;
  runUsage.outputTokens += usage.output_tokens;
  runUsage.calls += 1;

  if (runUsage.inputTokens > MAX_RUN_INPUT_TOKENS) {
    throw new Error(
      `Token budget exceeded: ${runUsage.inputTokens} input tokens consumed`
    );
  }
}

3. Rate Limiting

You will hit rate limits. Plan before it happens.

Beyond the API-level limits, impose self-limits per agent: maximum API calls per run, maximum concurrent runs if you parallelize, minimum delay between runs of the same agent to prevent cascading re-runs on failure, and a circuit breaker that pauses an agent after three consecutive failures before retrying.

Self-imposed rate limits are not just defensive — they are how you make cost predictable.


4. Guardrails

Guardrails are the controls that prevent your agent from doing things it should not do. There are three layers:

Input validation. Validate every input before the agent sees it. Check required fields, expected value ranges, and obvious injection patterns. Reject malformed inputs early.

Output validation. Validate every structured output before acting on it. If the agent returns JSON matching a schema, verify that schema before passing the output downstream. Do not discover the format is wrong at the point of a database write.

Tool access control. Agents should only have access to the tools they need. A content agent does not need database write access. A reporting agent does not need to send emails. Scope tool access tightly — the smaller the blast radius of a misbehaving agent, the better.

These guardrails are covered in depth in the Guardrails and Safety guide.


5. Logging and Observability

If it is not logged, it did not happen — at least not in any form you can reason about later.

Every agent in production needs to log: every tool call (name, input, output or error, duration, attempt count), every LLM call (model, token counts, stop reason), every decision point, every error with full context, and run start/end with total token usage and cost.

Log as structured JSON, not prose. Structured logs can be queried.

function logEvent(event: Record<string, unknown>) {
  console.log(JSON.stringify({
    ...event,
    timestamp: new Date().toISOString(),
    agentId: process.env.AGENT_ID,
    runId: process.env.RUN_ID,
  }));
}

Ship logs somewhere you can query them. A local file is fine for the first week. After that you need a real log store — Axiom, Datadog, CloudWatch, or similar. You cannot debug what you cannot search.


6. Memory Persistence

Where does your agent's state live between runs?

In a prototype, the answer is usually memory or the context window. Neither works in production — both vanish when the process ends. Before deploying, answer four questions: What state does this agent need to persist? Where does it live (Postgres, Redis, SQLite, Upstash)? How is it loaded at run start? How is it written at run end?

State updates should be committed atomically at the end of a successful run, not incrementally during execution. Partial writes on a failed run are how state corruption happens.

An agent without persistent memory is a stateless function with an expensive prompt. Real agent value comes from accumulation over time.


7. Graceful Degradation

What happens when a tool the agent depends on is unavailable?

The wrong answer: the agent crashes, the run fails, and you find out when someone notices the output is missing.

The right behavior depends on the tool:

  • Enrichment data (external API, third-party service): log the unavailability, continue with what you have, flag the output as incomplete.
  • Required write target (database, email): queue the write for retry rather than failing the whole run.
  • Core capability (no alternative): fail gracefully, log the cause, skip to the next item in the queue.

Define the degradation behavior for each tool before you deploy. "What does this agent do if this tool is down?" is a design-time question, not an incident-time one.


8. Cost Controls

Token costs and third-party API costs both need hard limits.

Set daily budget caps per agent and alert when you are approaching them. The mechanism does not need to be sophisticated: a simple check at the start of each run against a daily usage counter in your database is enough.

async function checkDailyBudget(agentId: string, estimatedCost: number) {
  const today = new Date().toISOString().split("T")[0];
  const dailySpend = await getDailySpend(agentId, today);
  const DAILY_CAP_USD = parseFloat(process.env.AGENT_DAILY_CAP_USD ?? "5.00");

  if (dailySpend + estimatedCost > DAILY_CAP_USD) {
    await sendAlert({
      type: "budget_cap_reached",
      agentId,
      dailySpend,
      estimatedCost,
      cap: DAILY_CAP_USD,
    });
    throw new Error(`Daily budget cap reached for agent ${agentId}`);
  }
}

Set alert thresholds at 70% and 90% of your caps so you can react before the cap is hit. A cost spike that you catch at 70% usage is a warning. One you catch after the cap is already exceeded is an incident.


9. Security

Security for production agents has four distinct concerns:

API key management. Never hardcode keys. Never log them. Load from environment variables. Rotate on a schedule. Have a runbook for emergency rotation.

Prompt injection defense. Any user-controlled input that enters an agent's context is an injection surface. Sanitize inputs before they hit the prompt. Clearly delimit user content from instructions. Validate that outputs conform to expected patterns rather than trusting them blindly.

Tool sandboxing. Agents that execute code or shell commands must do so in isolated environments. Containerize code execution. Restrict filesystem access. Limit network access to explicitly allowed domains. The agent should never reach your production database from a code execution tool.

Output sanitization. Treat agent outputs with the same skepticism as user-generated content. An agent writing to a web page without sanitization is an XSS vector.


10. Human Escalation

Every agent needs a defined escalation path: the conditions under which it stops, surfaces the situation to a human, and waits.

Define these conditions before deployment: task complexity exceeds the agent's scope; confidence is below a defined threshold; the task involves an irreversible action (deleting records, bulk emails, purchases) above a risk threshold; the agent has retried the same step more than N times; the input contains sensitive categories outside the agent's authorization.

Escalation is not a failure mode — it is a feature. Implement it as a first-class tool the agent can call, not a catch-all error handler:

const escalateToHumanTool = {
  name: "escalate_to_human",
  description: "Stop the current task and escalate to a human operator. Use when the task is outside your scope, confidence is low, or an irreversible action requires human approval.",
  input_schema: {
    type: "object",
    properties: {
      reason: { type: "string", description: "Why escalation is needed" },
      context: { type: "string", description: "Full context for the human reviewer" },
      urgency: { type: "string", enum: ["low", "medium", "high"] },
    },
    required: ["reason", "context", "urgency"],
  },
};

Infrastructure Requirements

Beyond the agent code itself, production deployment requires:

Server or VPS. A small VPS (2 vCPU, 4GB RAM) handles most agent workloads. We run our 12-agent system on a single DigitalOcean droplet. Containerize with Docker so agents are portable and isolated from each other.

Cron scheduling. Use system cron or a dedicated job scheduler (Inngest, Trigger.dev) to trigger runs. Do not rely on a web framework's background job system — it restarts unexpectedly and jobs get lost.

Environment variables. Every secret and configurable value should be an environment variable. Use .env locally and your hosting provider's secrets management in production. Never commit .env to version control.

Monitoring. At minimum: an alert channel where run failures are posted, a dashboard showing success/failure rates over time, and a watchdog alert for when a scheduled agent has not reported success within its expected window.

Deployment pipeline. Even a shell script that pulls latest code, runs tests, and restarts the process is better than manual deploys. Automate so deployments are repeatable and auditable.


Deployment Readiness Scoring

Use this table to score your agent before you ship it. Each item is worth one point. A score of 8 or below means the agent is not ready for production.

RequirementNot Done (0)Partial (0.5)Complete (1)
Error handling and retry logicNo retry logicSome tool calls wrappedAll tool calls wrapped with backoff
Token budget managementNo max_tokens setmax_tokens set, no trackingmax_tokens set, usage tracked per run
Rate limitingNo limitsAPI limits onlyAPI limits plus self-imposed per-agent limits
GuardrailsNoneInput validation onlyInput + output validation + tool access control
Logging and observabilityNo loggingSome loggingStructured logs for all tool calls, decisions, errors
Memory persistenceIn-memory onlyFiles on diskDatabase with atomic writes
Graceful degradationCrashes on tool failureLogs failure, still crashesDefined fallback behavior per tool
Cost controlsNo limitsInformal monitoringHard daily caps with alerts
SecurityKeys in codeKeys in env varsKeys in env vars + prompt injection defense + tool sandboxing
Human escalationNo escalation pathError-handler escalationFirst-class escalation tool with defined triggers

Score 10: Ship it. Stay close for the first 48 hours.

Score 8-9: Acceptable for low-stakes workloads. Do not use for anything with write access to critical systems.

Score 6-7: Not production-ready. The gaps you have are the ones that will cause incidents.

Score 5 or below: This is a prototype. Do not call it a production deployment.


Common Deployment Failures and Fixes

FailureSymptomRoot CauseFix
Agent stops mid-task silentlyRun completes but output is missingTool call failed with no retry or loggingWrap all tool calls with the retry wrapper; add structured logging
Cost spikeMonthly bill 5-10x expectedAgent stuck in loop or hitting pathological inputAdd per-run token budget with hard ceiling; add loop detection
Rate limit cascadeMultiple agents fail within minutes of each otherAgents share API quota and trigger simultaneouslyStagger cron schedules; add jitter to retry logic
State corruption between runsAgent references data from two runs ago or wrong entityPartial state write on prior failed runUse atomic state writes; load state from DB at run start
Prompt injection from user dataAgent takes actions that were not in your system promptUser-controlled data entered the prompt unvalidatedSanitize and delimit all user inputs before they enter context
Runaway agentRun exceeds 30 minutes, still goingNo maximum iteration count or step limitAdd step counter; kill run after maximum steps regardless of completion
No alert on failureAgent has not run in 3 days; nobody noticedNo monitoring on scheduled run successSet up watchdog alert if agent has not reported success within expected window
Wrong tool accessAgent modified data it should only readTools were not scoped to agent's actual needsAudit tool list; remove anything the agent does not explicitly need

Key Takeaways

Ship nothing that scores below 8 on the readiness table. The items you skip are exactly the ones that will cause the incident that gets the project cancelled.

The 80% failure rate for AI agent projects is not inevitable. It is the predictable result of treating infrastructure as an afterthought. Every item on this checklist has a real failure mode behind it. The checklist exists because we encountered most of them.

Error handling is not optional. Tool calls fail 3-15% of the time in normal operation. An agent with no retry logic has a meaningful probability of failing on any given run.

Cost controls are not optional. Token costs compound. A daily cap that feels conservative will feel essential the first time an agent runs pathologically.

Escalation is a feature. The agents you trust are the ones that know when to stop. Define escalation conditions before you ship, not after your first incident.

Start with one agent. Get it to 10 on the readiness table. Run it for two weeks. Then add the next one. Each agent you add before the first is solid makes the whole system harder to debug.

For operational guidance after shipping, see the Monitoring and Debugging guide.