The Agent Builder's Stack: Every Layer You Need to Master
A comprehensive map of every technology layer in the modern AI agent stack — LLMs, orchestration frameworks, tools and MCP, memory systems, RAG, vector databases, deployment, and observability.
AI engineers command a 56% wage premium over their non-AI counterparts and average $206K in total compensation in 2026. Those numbers attract a lot of people. They also mislead a lot of people into thinking that the job is "talk to a model and ship it." The job is building systems. Systems that reason, act, remember, learn, and survive production for longer than a week.
Building a production AI agent is not a single skill. It is a stack of interlocking disciplines, each one load-bearing. Miss the foundation models layer and your agent reasons poorly. Miss the tools layer and it cannot act. Miss memory and it forgets everything between runs. Miss deployment and it never leaves your laptop.
This page maps the complete agent builder's stack as it exists in 2026 — seven layers, from the LLM at the bottom to observability at the top. Each layer is a chapter you need to understand. Together they form the curriculum for becoming the kind of engineer that the 88% of enterprises now implementing AI are trying to hire.
Less than 10% of those enterprises have successfully scaled AI agents to production. The gap is not model quality. It is everything around the model. This is the guide to everything around the model.
The Stack at a Glance
+-------------------------------------------------------+
| Layer 7: Deployment & Observability                   |
| Docker, K8s, monitoring, logging, cost tracking       |
+-------------------------------------------------------+
| Layer 6: Computational Skills                         |
| Python scripts, lead scoring, A/B testing, forecasts  |
+-------------------------------------------------------+
| Layer 5: RAG & Knowledge                              |
| Chunking, embedding, retrieval, knowledge graphs      |
+-------------------------------------------------------+
| Layer 4: Memory Systems                               |
| Short-term, long-term, semantic, shared multi-agent   |
+-------------------------------------------------------+
| Layer 3: Tools & MCP                                  |
| Tool definitions, MCP protocol, API access, compute   |
+-------------------------------------------------------+
| Layer 2: Orchestration Frameworks                     |
| LangChain, LangGraph, CrewAI, Anthropic SDK, custom   |
+-------------------------------------------------------+
| Layer 1: Foundation Models (LLMs)                     |
| Claude, GPT-4, Gemini, Llama, Mistral                 |
+-------------------------------------------------------+
Every layer depends on the ones below it. You cannot build reliable tools without understanding the model that calls them. You cannot build memory systems without understanding the tools that read and write them. You cannot deploy what you cannot observe.
Learn the stack bottom-up. Build it top-down. That is the paradox: you need to understand the foundation first, but in practice you start from the deployment target and work your way down to the model.
Layer 1: Foundation Models (LLMs)
The foundation model is the reasoning engine at the core of every agent. It reads instructions, reasons over context, decides which tools to call, interprets results, and produces outputs. Everything else in the stack serves this layer or depends on it.
The Model Landscape in 2026
The market has consolidated around four serious players for agent workloads, plus a growing open-source tier that handles specific use cases well.
| Provider | Models | Strengths | Context Window | Best For |
|---|---|---|---|---|
| Anthropic | Claude Opus, Sonnet, Haiku | Instruction following, safety, tool use, long context | Up to 200K tokens | Production agents, complex reasoning, agentic workflows |
| OpenAI | GPT-4o, GPT-4 Turbo, o1/o3 | Broad capability, ecosystem, vision | Up to 128K tokens | General-purpose agents, multimodal tasks |
| Google | Gemini 2.0 Pro, Flash | Multimodal, massive context, speed | Up to 2M tokens | Long-document processing, multimodal workflows |
| Meta | Llama 3.3 (70B, 405B) | Open weights, self-hostable, no API costs | Up to 128K tokens | Cost-sensitive workloads, on-premise requirements |
| Mistral | Mistral Large, Medium | European hosting, competitive reasoning | Up to 128K tokens | EU data residency, cost-efficient classification |
The Claude Model Tiers
At The AI University, we run our entire 15-agent system on Claude. Understanding the three tiers is essential to building cost-effective agents.
Claude Opus is the reasoning heavyweight. Use it when the task requires multi-step analysis, synthesis of conflicting information, nuanced judgment, or when errors are expensive. Our growth orchestrator and research agents run on Opus because the decisions they make cascade through the entire system. A bad routing decision from the orchestrator corrupts every downstream agent's work.
Claude Sonnet is the workhorse. It handles the majority of agent tasks — structured output generation, tool-calling sequences, moderate reasoning, content drafting — at significantly lower cost and latency. Most of our 15 agents run on Sonnet for their primary workloads. It is the right default choice until you hit a task it cannot handle.
Claude Haiku is the classifier. Fast, cheap, and effective for tasks that require pattern matching rather than deep reasoning: routing decisions, sentiment classification, entity extraction, data validation, and schema conformance checks. Anywhere you need a quick binary or categorical answer, Haiku is the right choice. Our router layer uses Haiku to classify incoming signals before dispatching to more capable agents.
Model Selection Is Architecture
The choice of model is not a configuration flag you set once. It is an architectural decision that affects cost, latency, reliability, and output quality at every layer above.
A 15-agent system where every agent runs on Opus will produce excellent output and cost ten times more than it needs to. A system where every agent runs on Haiku will be fast and cheap and produce unreliable output on anything requiring real reasoning. The engineering skill is matching model capability to task complexity.
// Model routing based on task complexity
interface AgentTask {
  complexity: "high" | "medium" | "low";
  requiresReasoning: boolean;
}

function selectModel(task: AgentTask): string {
  // High-stakes decisions, multi-source synthesis, strategic planning
  if (task.complexity === "high" || task.requiresReasoning) {
    return "claude-opus-4-6";
  }
  // Standard agent work: tool calling, content generation, analysis
  if (task.complexity === "medium") {
    return "claude-sonnet-4-20250514";
  }
  // Classification, routing, extraction, validation
  return "claude-haiku-4-20250514";
}
For a deeper dive into model routing and the cost math behind these decisions, see our Model Selection Guide.
How We Run Models
Our agents run via the claude -p CLI with a Claude Max subscription. This is a critical architectural decision: Max subscription means no per-token API costs. The agents can run as frequently as needed without the cost anxiety that comes with per-call pricing. This changes the economics of agent design fundamentally — you can afford to have agents reason more thoroughly, retry more aggressively, and run more often.
If you are building on the API directly, every design decision has a cost dimension. With Max, the constraint shifts from cost to throughput and rate limits. Design accordingly.
Layer 2: Orchestration Frameworks
The orchestration layer is the control plane that coordinates agents, manages execution flow, handles errors, and stitches individual agent capabilities into a coherent system.
The Framework Landscape
| Framework | Approach | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| LangChain | Chain-based composition | Massive ecosystem, many integrations | Abstraction-heavy, can obscure what happens | Rapid prototyping, integration-heavy workflows |
| LangGraph | Graph-based state machines | Explicit state management, cycles, branching | Steeper learning curve than LangChain | Complex multi-step workflows with conditional logic |
| CrewAI | Role-based multi-agent | Intuitive mental model, agent personas | Less control over execution details | Team-based agent collaboration, delegation |
| AutoGen (Microsoft) | Conversational multi-agent | Agents converse to solve problems | Can be unpredictable, hard to constrain | Research, open-ended problem solving |
| Anthropic SDK (direct) | Raw API with custom orchestration | Full control, no abstraction overhead | You build everything yourself | Production systems where you need to own every layer |
When to Use a Framework vs. Going Direct
Use a framework when you are prototyping, exploring integrations, or building a system where the framework's abstractions match your workflow pattern. LangGraph is excellent for state machines. CrewAI is excellent for role-based delegation. Both save you significant time if your problem matches their model.
Go direct when you need full control, when framework abstractions get in the way, or when you are building something the framework was not designed for. The cost is writing more code. The benefit is understanding every line of your system.
What We Use
At The AI University, we use the Anthropic SDK directly with a custom orchestrator. Our orchestrator lives at src/lib/agent-sdk/orchestrator.ts and handles agent dispatch, tool routing, memory management, error handling, and cross-agent coordination.
We chose this approach because we needed precise control over how agents interact, what tools they can access, how memory flows between them, and how errors propagate. No existing framework gave us that level of control without fighting its abstractions.
// Simplified orchestrator dispatch
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface AgentConfig {
  id: string;
  model: string;
  systemPrompt: string;
  allowedTools: string[];
  maxTokens: number;
}

async function dispatchAgent(
  config: AgentConfig,
  input: string
): Promise<string> {
  const tools = getToolsForAgent(config.allowedTools);
  const response = await client.messages.create({
    model: config.model,
    max_tokens: config.maxTokens,
    system: config.systemPrompt,
    messages: [{ role: "user", content: input }],
    tools,
  });
  return handleAgentResponse(response, config);
}
The important lesson: you do not need a framework to build production agents. You need to understand the patterns (supervisor, router, handoff, pipeline, blackboard) and implement the one that fits. See Architecture Patterns for the complete pattern catalog with code.
Layer 3: Tools & MCP
Tools are what separate an LLM from an agent. Without tools, a model can only produce text. With tools, it can search the web, send emails, query databases, write files, call APIs, and take action in the world. Tools are the hands of the agent.
What Is MCP?
The Model Context Protocol (MCP) is an open standard developed by Anthropic that defines how AI models communicate with external tools and data sources. Before MCP, every team invented their own tool interface. MCP standardizes the contract: tools are defined with a name, description, and JSON Schema for inputs. The model discovers tools, decides when to use them, emits structured calls, and receives structured results.
The cycle is: reason, call, observe, reason. MCP provides the contract that makes it reliable.
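Stripped to its essentials, that contract looks something like this. The shapes below are an illustrative sketch of the idea, not MCP's actual wire format:

```typescript
// Illustrative sketch of the tool contract: a name, a description the model
// reads, a JSON Schema for inputs, and a handler the runtime dispatches to.
interface ToolDef {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  handler: (input: Record<string, unknown>) => unknown;
}

interface ToolResult {
  result?: unknown;
  error?: string;
  requested?: string;
}

const tools: ToolDef[] = [
  {
    name: "get_time",
    description: "Returns the current UTC time as an ISO 8601 string.",
    inputSchema: { type: "object", properties: {} },
    handler: () => new Date().toISOString(),
  },
];

// One turn of the cycle: the model emits a structured call ({ name, input });
// the runtime executes it and returns a structured result the model observes.
function executeToolCall(
  name: string,
  input: Record<string, unknown>
): ToolResult {
  const tool = tools.find((t) => t.name === name);
  if (!tool) {
    // Errors go back to the model as data it can act on
    return { error: "unknown_tool", requested: name };
  }
  return { result: tool.handler(input) };
}
```

The point of the standard is that this loop is identical regardless of what the tool does, so any MCP-aware model can use any MCP-exposed tool.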
Tool Categories
Tools fall into natural categories based on what they enable:
| Category | What It Enables | Examples |
|---|---|---|
| Data Access | Reading from and writing to persistent storage | query_visitors, save_memory, load_memory, get_health_scores |
| External APIs | Interacting with third-party services | search_web, fetch_url, enrich_lead, search_twitter |
| Communication | Taking action in the external world | send_email, publish_to_linkedin, notify_owner |
| Computation | Running deterministic calculations | run_skill_script, generate_conversion_plan |
| Coordination | Managing multi-agent workflows | emit_event, read_events, claim_work, check_contact |
At The AI University, our MCP server wraps 52 tools across these categories. Each of our 15 agents sees only the tools it is permitted to use, enforced by an allowlist system that prevents agents from accessing capabilities outside their scope.
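A minimal sketch of what such an allowlist might look like. The registry contents and function shape here are illustrative, not our actual implementation:

```typescript
// Illustrative allowlist enforcement: each agent config names the tools it
// may use; the registry filters the full tool set down before dispatch.
interface Tool {
  name: string;
  description: string;
}

const toolRegistry = new Map<string, Tool>([
  ["search_web", { name: "search_web", description: "Search the web." }],
  ["send_email", { name: "send_email", description: "Send an email." }],
  ["save_memory", { name: "save_memory", description: "Persist a memory entry." }],
]);

function getToolsForAgent(allowedTools: string[]): Tool[] {
  // Unknown names fail loudly rather than silently granting access to nothing
  for (const name of allowedTools) {
    if (!toolRegistry.has(name)) throw new Error(`Unknown tool: ${name}`);
  }
  return allowedTools.map((name) => toolRegistry.get(name)!);
}
```

The useful property is that the model never even sees tools outside its scope, so it cannot be prompted into calling them.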
Tool Design Principles
Good tool design is the single most underrated skill in agent engineering. The model never reads your source code — it reads the tool name and description you expose through MCP. If those are vague, the model will call tools incorrectly or not at all.
The five rules:
- The description is the API. Write it like a contract, not a comment.
- Use enums for constrained values. Never let the model guess from an open string.
- Return errors the model can act on. Include the error code, reason, and field that failed.
- Return structured JSON. Give the model data, not text to parse.
- Name tools like API endpoints. Verb-noun, specific, no abbreviations.
// The difference between a tool agents use correctly and one they do not

// Bad: vague name, thin description
{
  name: "search",
  description: "Searches for things.",
  parameters: { query: { type: "string" } }
}

// Good: specific name, rich description, typed parameters
{
  name: "search_web",
  description:
    "Search the web using DuckDuckGo. Returns titles, URLs, and snippets. " +
    "Use to find competitor news, reviews, customer mentions, feature announcements. " +
    "Returns max 15 results. For deep reading of a specific URL, use fetch_url instead.",
  parameters: {
    query: { type: "string", description: "Search query" },
    limit: { type: "number", description: "Max results (default 8, max 15)" }
  }
}
For the complete guide to tool design, see Designing Tools for AI Agents. For the full catalog of our 52 tools, see the Tools Overview.
Layer 4: Memory Systems
An agent without memory is a stateless function that costs more than it should. Every run starts from zero. It does not know what it did yesterday. It cannot learn from mistakes because it has no record of having made them.
Memory is what turns an agent from a tool into a system that compounds value over time.
The Four Types of Memory
Short-term memory is the conversation context within a single run. The LLM runtime manages this for you. Your job is managing it: context windows are finite, and as runs get longer, you either hit limits or pay for tokens that dilute rather than help. Techniques include sliding windows, summarization, and importance-based pruning.
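A sliding window can be sketched in a few lines. This is a minimal illustration; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```typescript
// Minimal sliding-window sketch: keep the critical head of the conversation,
// drop the oldest middle turns until the whole window fits the token budget.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic (~4 characters per token); use a real tokenizer in production
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function slidingWindow(turns: Turn[], budget: number, keepHead = 1): Turn[] {
  const head = turns.slice(0, keepHead);
  let tail = turns.slice(keepHead);
  const total = (ts: Turn[]) =>
    ts.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  // Drop the oldest non-head turns until the window fits the budget
  while (tail.length > 0 && total(head) + total(tail) > budget) {
    tail = tail.slice(1);
  }
  return [...head, ...tail];
}
```

Summarization and importance-based pruning follow the same shape: decide what survives, drop or compress the rest before it reaches the model.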
Long-term memory persists across runs. It lives in a storage system and is explicitly loaded at run start and saved at run end. This is the memory you have to build. An agent without save_memory and load_memory tools has no long-term memory, period.
Semantic memory uses embeddings to store and retrieve knowledge by meaning rather than exact key. When an agent needs to find memories relevant to its current task without knowing their exact keys, vector similarity search surfaces semantically related entries even when they share no keywords with the query.
Shared memory is long-term memory that multiple agents can read from and write to. It is the blackboard pattern applied to memory — Agent A saves a finding, Agent B reads it and builds on it. No direct agent-to-agent communication. They coordinate through the store.
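A blackboard-style shared store can be sketched as a topic-keyed log. This in-memory version is purely illustrative; a real system would back it with SQLite or Postgres:

```typescript
// Blackboard-style shared memory sketch: agents never call each other.
// They post findings under a topic and read what others have posted.
interface Finding {
  agentId: string;
  topic: string;
  content: string;
  at: number;
}

class SharedMemory {
  private findings: Finding[] = [];

  post(agentId: string, topic: string, content: string): void {
    this.findings.push({ agentId, topic, content, at: Date.now() });
  }

  read(topic: string): Finding[] {
    // Oldest first, so a reader sees how the finding evolved
    return this.findings.filter((f) => f.topic === topic);
  }
}
```

Agent A posts under a topic at the end of its run; Agent B reads that topic at the start of its own run and builds on it.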
Vector Databases
Semantic memory requires a vector store. The options in 2026:
| Database | Type | Strengths | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Fast, scalable, simple API | Production workloads, minimal ops overhead |
| Weaviate | Self-hosted or cloud | Rich filtering, hybrid search | Complex queries combining vector + keyword search |
| FAISS | Library (Meta) | Extremely fast, runs locally | Prototyping, embedded systems, no-ops scenarios |
| Qdrant | Self-hosted or cloud | Rust performance, rich payload filtering | High-throughput filtering + vector search |
| ChromaDB | Embedded | Simple API, Python-native | Quick prototyping, small datasets |
| pgvector | PostgreSQL extension | Uses existing Postgres, no new infrastructure | Teams already on PostgreSQL |
Start simple. At The AI University, we use JSON files per agent for individual memory and a shared SQLite database for cross-agent memory. Vector search is layered on using embeddings stored as columns in SQLite. This handles our volume without adding a managed service. Move to a dedicated vector database when semantic search is a real requirement and your store exceeds what in-memory similarity can handle.
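At small scale, semantic search is just a linear scan with cosine similarity. A minimal sketch, assuming embeddings are already stored alongside each entry:

```typescript
// In-memory semantic search sketch: rank stored entries by cosine
// similarity to the query embedding. Fine at small scale; move to a
// vector database when the store outgrows a linear scan.
interface MemoryEntry {
  key: string;
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function searchMemory(
  entries: MemoryEntry[],
  queryEmbedding: number[],
  topK: number
): MemoryEntry[] {
  return [...entries]
    .sort(
      (x, y) =>
        cosine(y.embedding, queryEmbedding) -
        cosine(x.embedding, queryEmbedding)
    )
    .slice(0, topK);
}
```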
For the complete implementation guide, see Agent Memory and Context Management.
Layer 5: RAG & Knowledge
Retrieval-Augmented Generation (RAG) is how you give an agent access to knowledge that is not in its training data — your company's documentation, your product catalog, your customer records, your proprietary research. Instead of fine-tuning the model on this data (expensive, slow, static), you retrieve relevant chunks at query time and inject them into the context.
RAG vs. Fine-Tuning
This is the most common architectural decision in knowledge-intensive agent systems. The answer is almost always RAG, with fine-tuning reserved for specific situations.
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data freshness | Real-time (re-index and it is live) | Stale (retrain on every update) |
| Cost | Retrieval cost per query | Training cost upfront, then inference |
| Transparency | You can see exactly what was retrieved | Opaque — knowledge baked into weights |
| Data volume | Scales to millions of documents | Limited by training data size and budget |
| Accuracy | High when retrieval is good | High when training data is clean |
| Setup time | Hours to days | Days to weeks |
| Best for | Dynamic knowledge, documents, FAQs, catalogs | Consistent tone, domain-specific language, behavioral patterns |
Use RAG when the knowledge changes, when you need to know what informed the answer, and when you need to scale to large document sets. Use fine-tuning when you need to change how the model behaves (tone, style, domain vocabulary) rather than what it knows.
The RAG Pipeline
A RAG system has four stages:
1. Chunking. Split documents into pieces small enough to fit in a context window and cohesive enough to be meaningful. Common approaches: fixed-size chunks (512-1024 tokens), semantic chunks (split at paragraph or section boundaries), and recursive splitting (try larger chunks first, split further only if they exceed the limit). Overlap between chunks (50-100 tokens) prevents information from being lost at boundaries.
2. Embedding. Convert each chunk into a vector that captures its semantic meaning. Use an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or open-source alternatives like BGE or E5) to produce vectors. Store the vectors alongside the original text and any metadata (source document, page number, creation date).
3. Retrieval. When a query arrives, embed it using the same model and find the most similar chunks in the vector store. Top-k retrieval (typically k=5 to k=20) returns the chunks most likely to contain relevant information. Hybrid search — combining vector similarity with keyword matching — often outperforms either approach alone.
4. Generation. Inject the retrieved chunks into the model's context along with the original query. The model reasons over both to produce a grounded answer. The key is formatting the retrieved context clearly so the model knows what it is working with and can cite or reference specific chunks.
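Step 1, fixed-size chunking with overlap, can be sketched at the character level for clarity (production chunkers count tokens, not characters):

```typescript
// Fixed-size chunking with overlap: each chunk starts (chunkSize - overlap)
// characters after the previous one, so boundary content appears in both.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```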
// Simplified RAG retrieval + generation
async function ragQuery(question: string): Promise<string> {
  // Step 1: Embed the question
  const queryEmbedding = await embedText(question);

  // Step 2: Retrieve relevant chunks
  const chunks = await vectorStore.search(queryEmbedding, { topK: 10 });

  // Step 3: Generate answer with retrieved context
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    system: `Answer the user's question using only the provided context.
If the context does not contain enough information, say so.
Cite the source document for each claim.`,
    messages: [
      {
        role: "user",
        content: `Context:\n${chunks.map((c) =>
          `[Source: ${c.metadata.source}]\n${c.text}`
        ).join("\n\n")}\n\nQuestion: ${question}`,
      },
    ],
  });
  return extractText(response);
}
Knowledge Graphs
For domains with structured relationships — product catalogs, organizational hierarchies, regulatory frameworks — knowledge graphs complement vector search. A knowledge graph encodes entities and their relationships explicitly, enabling queries like "find all products related to this customer's industry that were updated in the last 90 days." Vector search finds semantically similar content. Knowledge graphs navigate known relationships. The most sophisticated RAG systems use both.
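A knowledge-graph query like the one above reduces to following typed edges. A toy in-memory sketch, with entities and relation names that are purely illustrative:

```typescript
// Toy knowledge graph: entities are strings, relationships are typed edges.
interface Edge {
  from: string;
  relation: string;
  to: string;
}

const edges: Edge[] = [
  { from: "acme_corp", relation: "in_industry", to: "logistics" },
  { from: "route_optimizer", relation: "serves_industry", to: "logistics" },
  { from: "fleet_tracker", relation: "serves_industry", to: "logistics" },
  { from: "hr_suite", relation: "serves_industry", to: "retail" },
];

// "Products related to this customer's industry": follow in_industry from
// the customer, then serves_industry edges back to products.
function productsForCustomer(customer: string): string[] {
  const industries = edges
    .filter((e) => e.from === customer && e.relation === "in_industry")
    .map((e) => e.to);
  return edges
    .filter((e) => e.relation === "serves_industry" && industries.includes(e.to))
    .map((e) => e.from);
}
```

Note that no similarity computation happens here: the relationships are explicit, which is exactly what vector search cannot give you.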
Layer 6: Computational Skills
Language models are reasoning engines, not calculators. When an agent needs to score a lead, run a statistical test, forecast a metric, or optimize a schedule, asking it to reason through the math produces unreliable results. The right approach is to move computations that require precision, consistency, or speed out of the model and into deterministic code.
This is what computational skills are: Python scripts that an agent can invoke to perform a computation it cannot do reliably through language alone.
The Skills Pattern
A skill has three components:
- SKILL.md — a meta prompt the agent reads to understand the skill's purpose, inputs, and outputs
- scripts/ — Python scripts that perform the computation (pure stdlib, JSON in, JSON out)
- catalog.json — a registry entry the MCP server reads to make the skill discoverable
The agent never writes Python. It reads the SKILL.md, then calls the run_skill_script MCP tool with the skill name and input data. The tool executor runs the script as a subprocess and returns the result as structured JSON.
// What a skill invocation looks like from the agent's perspective
await runSkillScript({
  skill_name: "lead-scoring-engine",
  input: {
    contact: {
      email: "cto@acme.com",
      company_size: 500,
      title: "Chief Technology Officer",
      website_visits_30d: 12,
      email_opens_30d: 4,
    },
  },
});
// Returns: { score: 82, tier: "hot", signals: [...], confidence: "high" }
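On the executor side, the job is small: pipe JSON to the script's stdin and parse the JSON it writes to stdout. A hypothetical sketch (the timeout value and error handling are simplified, and `runScript` is an illustrative name, not our actual executor):

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical subprocess executor behind run_skill_script: JSON in via
// stdin, JSON out via stdout, with a timeout to kill runaway scripts.
function runScript(command: string, args: string[], input: unknown): unknown {
  const stdout = execFileSync(command, args, {
    input: JSON.stringify(input),
    encoding: "utf8",
    timeout: 30_000, // illustrative cap on script runtime
  });
  return JSON.parse(stdout);
}

// e.g. runScript("python3",
//   [".claude/skills/lead-scoring-engine/scripts/score_lead.py"],
//   { contact: { email: "cto@acme.com", company_size: 500 } })
```

Because the scripts are deterministic and side-effect free, the same executor works for every skill in the catalog.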
The AI University Skills Library
Our system ships with 16 computational skills covering the most common agent workloads:
| Skill | What It Computes |
|---|---|
| lead-scoring-engine | 0-100 lead scores from firmographic and behavioral signals |
| churn-predictor | Churn probability from activity decay and engagement patterns |
| a-b-test-analyzer | Statistical significance, confidence intervals, winner determination |
| send-timing-optimizer | Optimal send times from historical engagement data |
| trend-detection | Statistically significant trends and anomalies in time-series data |
| cross-agent-intelligence | Cross-session patterns from multi-agent signal aggregation |
| adaptive-feedback-loop | Behavioral parameter adjustments from outcome feedback |
| data-pruner | Stale, duplicate, or low-quality record identification |
| linkedin-prospector | LinkedIn profile scoring against ideal customer profiles |
| cohort-analyzer | Retention, engagement, and value metrics per user cohort |
| revenue-attribution | Multi-touch revenue attribution (first-touch, last-touch, linear) |
| persona-classifier | Buyer persona segmentation from title, company, and behavioral data |
| email-health-scorer | Email list health from bounce rates and engagement decay |
| content-performance-ranker | Predicted content performance from historical engagement |
| sequence-optimizer | Optimal outreach step order and timing from conversion data |
| forecast-modeler | Forecast projections with confidence bands from historical data |
Every script follows the same pattern: pure Python standard library, JSON input via stdin, JSON output via stdout, deterministic behavior, no side effects. This makes them fast, portable, and independently testable.
# Test any skill from the command line
echo '{"contact": {"email": "test@example.com", "company_size": 200}}' \
| python3 .claude/skills/lead-scoring-engine/scripts/score_lead.py
For the full Skills Library documentation and instructions for building your own skills, see the Skills Library Overview.
Layer 7: Deployment & Observability
This is where most agent projects die. The model works. The tools work. The prototype demos well. Then someone tries to run it in production and discovers that none of the operational infrastructure exists.
80% of AI agent projects never make it to production. The failure is almost never the model. It is error handling, cost controls, monitoring, and the other operational concerns that feel like overhead when you are building but become load-bearing the moment you deploy.
Infrastructure Requirements
Containerization. Docker is non-negotiable for production agents. It provides isolation between agents, reproducible environments, and portability across hosting providers. Each agent should be its own container with its own resource limits.
Orchestration. For systems running more than a few agents, Kubernetes provides scheduling, scaling, restart policies, and resource management. For smaller deployments, Docker Compose or a single VPS with cron scheduling is sufficient. We run our 15-agent system on a single DigitalOcean droplet. You do not need Kubernetes on day one.
Scheduling. Agents run on schedules and event triggers. Use system cron, or a dedicated job scheduler like Inngest or Trigger.dev. Do not rely on a web framework's background job system — it restarts unpredictably and jobs get lost.
Monitoring and Logging
Every agent in production must log:
- Every tool call: name, input, output or error, duration, attempt count
- Every LLM call: model, token counts, stop reason
- Every decision point and its reasoning
- Every error with full context
- Run start and end with total token usage
Log as structured JSON. Structured logs can be queried and aggregated. Ship them somewhere searchable: Axiom, Datadog, CloudWatch, or a self-hosted Loki/Grafana stack. You cannot debug what you cannot search.
function logEvent(event: Record<string, unknown>) {
  console.log(
    JSON.stringify({
      ...event,
      timestamp: new Date().toISOString(),
      agentId: process.env.AGENT_ID,
      runId: process.env.RUN_ID,
    })
  );
}
Cost Tracking
Token costs compound. A daily cap that feels conservative will feel essential the first time an agent runs pathologically. Set per-agent daily budgets, alert at 70% and 90% thresholds, and hard-stop at 100%.
If you are on the API (not Max subscription), track token usage per run and per day. Our system logs input tokens, output tokens, and estimated cost for every LLM call. A weekly cost review in the first month of deployment catches structural inefficiencies before they become budget problems.
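A per-agent daily budget with those thresholds can be sketched as a small accumulator. The alert hook and dollar-based accounting are assumptions, not our actual implementation:

```typescript
// Daily budget sketch: alert when spend crosses 70% and 90% of the cap,
// hard-stop at 100%. The alert callback is an injected assumption.
class DailyBudget {
  private spent = 0;

  constructor(
    private capUsd: number,
    private alert: (msg: string) => void
  ) {}

  record(costUsd: number): "ok" | "stop" {
    const before = this.spent / this.capUsd;
    this.spent += costUsd;
    const after = this.spent / this.capUsd;
    // Fire each alert exactly once, when its threshold is first crossed
    for (const threshold of [0.7, 0.9]) {
      if (before < threshold && after >= threshold) {
        this.alert(`Budget at ${Math.round(threshold * 100)}%`);
      }
    }
    return after >= 1 ? "stop" : "ok"; // hard stop at 100%
  }
}
```

The agent runtime checks the returned status after every LLM call and refuses to dispatch further work once it sees "stop".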
The Deployment Checklist
Before shipping any agent, score it against these ten requirements:
- Error handling and retry logic — all tool calls wrapped with exponential backoff
- Token budget management — max_tokens set on every call, usage tracked per run
- Rate limiting — API limits plus self-imposed per-agent limits
- Guardrails — input validation, output validation, tool access control
- Logging and observability — structured logs for all tool calls, decisions, errors
- Memory persistence — database with atomic writes, not in-memory
- Graceful degradation — defined fallback behavior per tool
- Cost controls — hard daily caps with alerts
- Security — keys in env vars, prompt injection defense, tool sandboxing
- Human escalation — first-class escalation tool with defined triggers
Score each item 0 to 1. Ship nothing that scores below 8 out of 10. The items you skip are exactly the ones that cause incidents.
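The first item, retry with exponential backoff, can be sketched like this. The base delay, cap, and attempt count are illustrative defaults:

```typescript
// Exponential backoff: delay doubles per attempt, capped so a long outage
// does not produce hour-long waits.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry wrapper with an injectable sleep so tests can stub the waiting
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt)); // wait longer after each failure
      }
    }
  }
  throw lastError;
}
```

Wrap every tool call in something like `withRetry(() => callTool(...))` so transient API failures never surface as agent failures.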
For the complete deployment walkthrough with code for each requirement, see Deployment Checklist: From Prototype to Production. For post-deployment operations, see Monitoring and Debugging.
The Full-Stack AI Engineer Roadmap
You cannot learn all seven layers at once. Here is the order that builds competence fastest, with each level opening the door to the next.
Beginner: Weeks 1-4
Goal: Call a model, get useful output, understand the basics.
- Learn prompt engineering. Understand system prompts, few-shot examples, chain-of-thought reasoning, and structured output. This is the foundation. Every layer above depends on your ability to communicate with models effectively. Start with Claude and the Anthropic SDK.
- Build a single agent. One agent, one job, a few tools. A support bot that answers questions using a search tool. A research agent that takes a topic and returns a summary. Keep it simple. Get the agent loop working: reason, call tool, observe result, reason again.
- Understand tool calling. Define a tool, give it to the model, and see what happens. Break things on purpose. Pass bad descriptions and watch the model misuse the tool. This teaches you more about tool design than any tutorial.
// Your first agent: research assistant with one tool
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 2048,
  system: "You are a research assistant. Use the search tool to find information, then synthesize a clear answer.",
  messages: [{ role: "user", content: "What are the latest developments in AI agent frameworks?" }],
  tools: [searchWebTool],
});
Intermediate: Weeks 5-12
Goal: Build multi-agent systems with memory, proper error handling, and basic RAG.
- Add memory. Give your agent save_memory and load_memory tools. Run it twice and verify it remembers what happened the first time. This is the moment an agent becomes a system instead of a script.
- Build a multi-agent system. Start with the supervisor pattern: one orchestrator that delegates to two specialists. Implement tool allowlists so each agent only accesses what it needs. Handle the coordination problems that arise when agents need to share context.
- Implement RAG. Chunk a document set, embed the chunks, store them in a vector database (start with ChromaDB or FAISS), and build a retrieval pipeline. Integrate it as a tool your agent can call when it needs knowledge.
- Add error handling and cost controls. Wrap every tool call with retry logic. Set token budgets. Add structured logging. This is the work that makes the difference between a demo and a deployable system.
Advanced: Weeks 13-24
Goal: Production-grade systems with computational skills, observability, and real deployment.
- Build computational skills. Write Python scripts for computations your agents need: scoring, classification, statistical analysis. Integrate them through the run_skill_script pattern. Test them independently from the command line before wiring them into agents.
- Deploy to production. Containerize your agents with Docker. Set up scheduling with cron or a job scheduler. Configure monitoring and alerting. Ship it. Stay close for the first two weeks.
- Implement observability. Build dashboards showing agent success rates, token costs, latency, and error patterns over time. Set up alerts for anomalies. Use the data to optimize: which agents run too frequently? Which tools fail most? Where is context being wasted?
- Design for scale. Implement model routing to use the right model tier for each task. Add knowledge graphs for structured data alongside vector search for unstructured data. Build adaptive feedback loops that use outcome data to improve agent behavior over time.
The Career Math
The roadmap above takes roughly six months of focused work. The market it prepares you for:
- AI engineer average total compensation: $206K (2026)
- AI skills wage premium over non-AI roles: 56%
- Enterprises implementing AI: 88%
- Enterprises that have successfully scaled AI agents: less than 10%
That last number is the opportunity. The supply of engineers who can build production AI agent systems is dramatically smaller than the demand. The seven layers on this page are what separates the 10% that ship from the 90% that prototype.
Where to Go From Here
This page is the map. The territory is in the detailed guides for each layer. Start with whichever layer is most relevant to where you are right now:
- New to agents? Start with What Are AI Agents? and the Quickstart.
- Building your first multi-agent system? Read Architecture Patterns for the six patterns with full TypeScript code.
- Designing tools? See Tool Design for the principles that make tools reliable, and Tools Overview for our full 52-tool catalog.
- Adding memory? Memory and Context covers the save/load pattern, vector search, time decay, and shared memory.
- Building computational skills? The Skills Library has the complete pattern with all 16 built-in skills.
- Shipping to production? The Deployment Checklist has the ten-point scoring system and code for every requirement.
- Optimizing costs? The Model Selection Guide covers model routing and the real cost math.
The stack is learnable. The market is ready. The gap between the people who understand these seven layers and the people who do not is the defining career opportunity in software engineering right now. Start at Layer 1. Build upward. Ship something real.