The Agent Builder's Stack: Every Layer You Need to Master
A comprehensive map of every technology layer in the modern AI agent stack — LLMs, orchestration frameworks, tools and MCP, memory systems, RAG, vector databases, deployment, and observability.
AI engineers command a 56% wage premium over their non-AI counterparts and average $206K in total compensation in 2026. Those numbers attract a lot of people. They also mislead a lot of people into thinking that the job is "talk to a model and ship it." The job is building systems. Systems that reason, act, remember, learn, and survive production for longer than a week.
Building a production AI agent is not a single skill. It is a stack of interlocking disciplines, each one load-bearing. Miss the foundation models layer and your agent reasons poorly. Miss the tools layer and it cannot act. Miss memory and it forgets everything between runs. Miss deployment and it never leaves your laptop.
This page maps the complete agent builder's stack as it exists in 2026 — seven layers, from the LLM at the bottom to observability at the top. Each layer is a chapter you need to understand. Together they form the curriculum for becoming the kind of engineer that the 88% of enterprises now implementing AI are trying to hire.
Less than 10% of those enterprises have successfully scaled AI agents to production. The gap is not model quality. It is everything around the model. This is the guide to everything around the model.
The Stack at a Glance
+-------------------------------------------------------+
| Layer 7: Deployment & Observability                   |
| Docker, K8s, monitoring, logging, cost tracking       |
+-------------------------------------------------------+
| Layer 6: Computational Skills                         |
| Python scripts, lead scoring, A/B testing, forecasts  |
+-------------------------------------------------------+
| Layer 5: RAG & Knowledge                              |
| Chunking, embedding, retrieval, knowledge graphs      |
+-------------------------------------------------------+
| Layer 4: Memory Systems                               |
| Short-term, long-term, semantic, shared multi-agent   |
+-------------------------------------------------------+
| Layer 3: Tools & MCP                                  |
| Tool definitions, MCP protocol, API access, compute   |
+-------------------------------------------------------+
| Layer 2: Orchestration Frameworks                     |
| LangChain, LangGraph, CrewAI, Anthropic SDK, custom   |
+-------------------------------------------------------+
| Layer 1: Foundation Models (LLMs)                     |
| Claude, GPT-4, Gemini, Llama, Mistral                 |
+-------------------------------------------------------+
Every layer depends on the ones below it. You cannot build reliable tools without understanding the model that calls them. You cannot build memory systems without understanding the tools that read and write them. You cannot deploy what you cannot observe.
Learn the stack bottom-up. Build it top-down. That is the paradox: you need to understand the foundation first, but in practice you start from the deployment target and work your way down to the model.
Layer 1: Foundation Models (LLMs)
The foundation model is the reasoning engine at the core of every agent. It reads instructions, reasons over context, decides which tools to call, interprets results, and produces outputs. Everything else in the stack serves this layer or depends on it.
The Model Landscape in 2026
The market has consolidated around four serious players for agent workloads, plus a growing open-source tier that handles specific use cases well.
| Provider | Models | Strengths | Context Window | Best For |
|---|---|---|---|---|
| Anthropic | Claude Opus, Sonnet, Haiku | Instruction following, safety, tool use, long context | Up to 200K tokens | Production agents, complex reasoning, agentic workflows |
| OpenAI | GPT-4o, GPT-4 Turbo, o1/o3 | Broad capability, ecosystem, vision | Up to 128K tokens | General-purpose agents, multimodal tasks |
| Google | Gemini 2.0 Pro, Flash | Multimodal, massive context, speed | Up to 2M tokens | Long-document processing, multimodal workflows |
| Meta | Llama 3.3 (70B, 405B) | Open weights, self-hostable, no API costs | Up to 128K tokens | Cost-sensitive workloads, on-premise requirements |
| Mistral | Mistral Large, Medium | European hosting, competitive reasoning | Up to 128K tokens | EU data residency, cost-efficient classification |
The Claude Model Tiers
At The AI University, we run our entire 15-agent system on Claude. Understanding the three tiers is essential to building cost-effective agents.
Claude Opus is the reasoning heavyweight. Use it when the task requires multi-step analysis, synthesis of conflicting information, nuanced judgment, or when errors are expensive. Our growth orchestrator and research agents run on Opus because the decisions they make cascade through the entire system. A bad routing decision from the orchestrator corrupts every downstream agent's work.
Claude Sonnet is the workhorse. It handles the majority of agent tasks — structured output generation, tool-calling sequences, moderate reasoning, content drafting — at significantly lower cost and latency. Most of our 15 agents run on Sonnet for their primary workloads. It is the right default choice until you hit a task it cannot handle.
Claude Haiku is the classifier. Fast, cheap, and effective for tasks that require pattern matching rather than deep reasoning: routing decisions, sentiment classification, entity extraction, data validation, and schema conformance checks. Anywhere you need a quick binary or categorical answer, Haiku is the right choice. Our router layer uses Haiku to classify incoming signals before dispatching to more capable agents.
Model Selection Is Architecture
The choice of model is not a configuration flag you set once. It is an architectural decision that affects cost, latency, reliability, and output quality at every layer above.
A 15-agent system where every agent runs on Opus will produce excellent output and cost ten times more than it needs to. A system where every agent runs on Haiku will be fast and cheap and produce unreliable output on anything requiring real reasoning. The engineering skill is matching model capability to task complexity.
// Model routing based on task complexity
interface AgentTask {
  complexity: "high" | "medium" | "low";
  requiresReasoning: boolean;
}

function selectModel(task: AgentTask): string {
  // High-stakes decisions, multi-source synthesis, strategic planning
  if (task.complexity === "high" || task.requiresReasoning) {
    return "claude-opus-4-6";
  }
  // Standard agent work: tool calling, content generation, analysis
  if (task.complexity === "medium") {
    return "claude-sonnet-4-20250514";
  }
  // Classification, routing, extraction, validation
  return "claude-haiku-4-20250514";
}
For a deeper dive into model routing and the cost math behind these decisions, see our Model Selection Guide.
How We Run Models
Our agents run via the claude -p CLI with a Claude Max subscription. This is a critical architectural decision: Max subscription means no per-token API costs. The agents can run as frequently as needed without the cost anxiety that comes with per-call pricing. This changes the economics of agent design fundamentally — you can afford to have agents reason more thoroughly, retry more aggressively, and run more often.
If you are building on the API directly, every design decision has a cost dimension. With Max, the constraint shifts from cost to throughput and rate limits. Design accordingly.
Layer 2: Orchestration Frameworks
The orchestration layer is the control plane that coordinates agents, manages execution flow, handles errors, and stitches individual agent capabilities into a coherent system.
The Framework Landscape
| Framework | Approach | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| LangChain | Chain-based composition | Massive ecosystem, many integrations | Abstraction-heavy, can obscure what happens | Rapid prototyping, integration-heavy workflows |
| LangGraph | Graph-based state machines | Explicit state management, cycles, branching | Steeper learning curve than LangChain | Complex multi-step workflows with conditional logic |
| CrewAI | Role-based multi-agent | Intuitive mental model, agent personas | Less control over execution details | Team-based agent collaboration, delegation |
| AutoGen (Microsoft) | Conversational multi-agent | Agents converse to solve problems | Can be unpredictable, hard to constrain | Research, open-ended problem solving |
| Anthropic SDK (direct) | Raw API with custom orchestration | Full control, no abstraction overhead | You build everything yourself | Production systems where you need to own every layer |
When to Use a Framework vs. Going Direct
Use a framework when you are prototyping, exploring integrations, or building a system where the framework's abstractions match your workflow pattern. LangGraph is excellent for state machines. CrewAI is excellent for role-based delegation. Both save you significant time if your problem matches their model.
Go direct when you need full control, when framework abstractions get in the way, or when you are building something the framework was not designed for. The cost is writing more code. The benefit is understanding every line of your system.
What We Use
At The AI University, we use the Anthropic SDK directly with a custom orchestrator. Our orchestrator lives at src/lib/agent-sdk/orchestrator.ts and handles agent dispatch, tool routing, memory management, error handling, and cross-agent coordination.
We chose this approach because we needed precise control over how agents interact, what tools they can access, how memory flows between them, and how errors propagate. No existing framework gave us that level of control without fighting its abstractions.
// Simplified orchestrator dispatch
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface AgentConfig {
  id: string;
  model: string;
  systemPrompt: string;
  allowedTools: string[];
  maxTokens: number;
}

async function dispatchAgent(
  config: AgentConfig,
  input: string
): Promise<string> {
  const tools = getToolsForAgent(config.allowedTools);
  const response = await client.messages.create({
    model: config.model,
    max_tokens: config.maxTokens,
    system: config.systemPrompt,
    messages: [{ role: "user", content: input }],
    tools,
  });
  return handleAgentResponse(response, config);
}
The important lesson: you do not need a framework to build production agents. You need to understand the patterns (supervisor, router, handoff, pipeline, blackboard) and implement the one that fits. See Architecture Patterns for the complete pattern catalog with code.
Layer 3: Tools & MCP
Tools are what separate an LLM from an agent. Without tools, a model can only produce text. With tools, it can search the web, send emails, query databases, write files, call APIs, and take action in the world. Tools are the hands of the agent.
What Is MCP?
The Model Context Protocol (MCP) is an open standard developed by Anthropic that defines how AI models communicate with external tools and data sources. Before MCP, every team invented their own tool interface. MCP standardizes the contract: tools are defined with a name, description, and JSON Schema for inputs. The model discovers tools, decides when to use them, emits structured calls, and receives structured results.
The cycle is: reason, call, observe, reason. MCP provides the contract that makes it reliable.
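Stripped to its essentials, that contract looks something like this. The shapes below are an illustrative sketch of the idea, not MCP's actual wire format:

```typescript
// Illustrative sketch of the tool contract: a name, a description the model
// reads, a JSON Schema for inputs, and a handler the runtime dispatches to.
interface ToolDef {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  handler: (input: Record<string, unknown>) => unknown;
}

interface ToolResult {
  result?: unknown;
  error?: string;
  requested?: string;
}

const tools: ToolDef[] = [
  {
    name: "get_time",
    description: "Returns the current UTC time as an ISO 8601 string.",
    inputSchema: { type: "object", properties: {} },
    handler: () => new Date().toISOString(),
  },
];

// One turn of the cycle: the model emits a structured call ({ name, input });
// the runtime executes it and returns a structured result the model observes.
function executeToolCall(
  name: string,
  input: Record<string, unknown>
): ToolResult {
  const tool = tools.find((t) => t.name === name);
  if (!tool) {
    // Errors go back to the model as data it can act on
    return { error: "unknown_tool", requested: name };
  }
  return { result: tool.handler(input) };
}
```

The point of the standard is that this loop is identical regardless of what the tool does, so any MCP-aware model can use any MCP-exposed tool.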
Tool Categories
Tools fall into natural categories based on what they enable:
| Category | What It Enables | Examples |
|---|---|---|
| Data Access | Reading from and writing to persistent storage | query_visitors, save_memory, load_memory, get_health_scores |
| External APIs | Interacting with third-party services | search_web, fetch_url, enrich_lead, search_twitter |
| Communication | Taking action in the external world | send_email, publish_to_linkedin, notify_owner |
| Computation | Running deterministic calculations | run_skill_script, generate_conversion_plan |
| Coordination | Managing multi-agent workflows | emit_event, read_events, claim_work, check_contact |
At The AI University, our MCP server wraps 52 tools across these categories. Each of our 15 agents sees only the tools it is permitted to use, enforced by an allowlist system that prevents agents from accessing capabilities outside their scope.
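A minimal sketch of what such an allowlist might look like. The registry contents and function shape here are illustrative, not our actual implementation:

```typescript
// Illustrative allowlist enforcement: each agent config names the tools it
// may use; the registry filters the full tool set down before dispatch.
interface Tool {
  name: string;
  description: string;
}

const toolRegistry = new Map<string, Tool>([
  ["search_web", { name: "search_web", description: "Search the web." }],
  ["send_email", { name: "send_email", description: "Send an email." }],
  ["save_memory", { name: "save_memory", description: "Persist a memory entry." }],
]);

function getToolsForAgent(allowedTools: string[]): Tool[] {
  // Unknown names fail loudly rather than silently granting access to nothing
  for (const name of allowedTools) {
    if (!toolRegistry.has(name)) throw new Error(`Unknown tool: ${name}`);
  }
  return allowedTools.map((name) => toolRegistry.get(name)!);
}
```

The useful property is that the model never even sees tools outside its scope, so it cannot be prompted into calling them.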
Tool Design Principles
Good tool design is the single most underrated skill in agent engineering. The model never reads your source code — it reads the tool name and description you expose through MCP. If those are vague, the model will call tools incorrectly or not at all.
The five rules:
- The description is the API. Write it like a contract, not a comment.
- Use enums for constrained values. Never let the model guess from an open string.
- Return errors the model can act on. Include the error code, reason, and field that failed.
- Return structured JSON. Give the model data, not text to parse.
- Name tools like API endpoints. Verb-noun, specific, no abbreviations.
// The difference between a tool agents use correctly and one they do not

// Bad: vague name, thin description
{
  name: "search",
  description: "Searches for things.",
  parameters: { query: { type: "string" } }
}

// Good: specific name, rich description, typed parameters
{
  name: "search_web",
  description:
    "Search the web using DuckDuckGo. Returns titles, URLs, and snippets. " +
    "Use to find competitor news, reviews, customer mentions, feature announcements. " +
    "Returns max 15 results. For deep reading of a specific URL, use fetch_url instead.",
  parameters: {
    query: { type: "string", description: "Search query" },
    limit: { type: "number", description: "Max results (default 8, max 15)" }
  }
}
For the complete guide to tool design, see Designing Tools for AI Agents. For the full catalog of our 52 tools, see the Tools Overview.
Layer 4: Memory Systems
An agent without memory is a stateless function that costs more than it should. Every run starts from zero. It does not know what it did yesterday. It cannot learn from mistakes because it has no record of having made them.
Memory is what turns an agent from a tool into a system that compounds value over time.
The Four Types of Memory
Short-term memory is the conversation context within a single run. The LLM runtime manages this for you. Your job is managing it: context windows are finite, and as runs get longer, you either hit limits or pay for tokens that dilute rather than help. Techniques include sliding windows, summarization, and importance-based pruning.
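A sliding window can be sketched in a few lines. This is a minimal illustration; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```typescript
// Minimal sliding-window sketch: keep the critical head of the conversation,
// drop the oldest middle turns until the whole window fits the token budget.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic (~4 characters per token); use a real tokenizer in production
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function slidingWindow(turns: Turn[], budget: number, keepHead = 1): Turn[] {
  const head = turns.slice(0, keepHead);
  let tail = turns.slice(keepHead);
  const total = (ts: Turn[]) =>
    ts.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  // Drop the oldest non-head turns until the window fits the budget
  while (tail.length > 0 && total(head) + total(tail) > budget) {
    tail = tail.slice(1);
  }
  return [...head, ...tail];
}
```

Summarization and importance-based pruning follow the same shape: decide what survives, drop or compress the rest before it reaches the model.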
Long-term memory persists across runs. It lives in a storage system and is explicitly loaded at run start and saved at run end. This is the memory you have to build. An agent without save_memory and load_memory tools has no long-term memory, period.
Semantic memory uses embeddings to store and retrieve knowledge by meaning rather than exact key. When an agent needs to find memories relevant to its current task without knowing their exact keys, vector similarity search surfaces semantically related entries even when they share no keywords with the query.
Shared memory is long-term memory that multiple agents can read from and write to. It is the blackboard pattern applied to memory — Agent A saves a finding, Agent B reads it and builds on it. No direct agent-to-agent communication. They coordinate through the store.
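A blackboard-style shared store can be sketched as a topic-keyed log. This in-memory version is purely illustrative; a real system would back it with SQLite or Postgres:

```typescript
// Blackboard-style shared memory sketch: agents never call each other.
// They post findings under a topic and read what others have posted.
interface Finding {
  agentId: string;
  topic: string;
  content: string;
  at: number;
}

class SharedMemory {
  private findings: Finding[] = [];

  post(agentId: string, topic: string, content: string): void {
    this.findings.push({ agentId, topic, content, at: Date.now() });
  }

  read(topic: string): Finding[] {
    // Oldest first, so a reader sees how the finding evolved
    return this.findings.filter((f) => f.topic === topic);
  }
}
```

Agent A posts under a topic at the end of its run; Agent B reads that topic at the start of its own run and builds on it.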
Vector Databases
Semantic memory requires a vector store. The options in 2026:
| Database | Type | Strengths | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Fast, scalable, simple API | Production workloads, minimal ops overhead |
| Weaviate | Self-hosted or cloud | Rich filtering, hybrid search | Complex queries combining vector + keyword search |
| FAISS | Library (Meta) | Extremely fast, runs locally | Prototyping, embedded systems, no-ops scenarios |
| Qdrant | Self-hosted or cloud | Rust performance, rich payload filtering | High-throughput filtering + vector search |
| ChromaDB | Embedded | Simple API, Python-native | Quick prototyping, small datasets |
| pgvector | PostgreSQL extension | Uses existing Postgres, no new infrastructure | Teams already on PostgreSQL |
Start simple. At The AI University, we use JSON files per agent for individual memory and a shared SQLite database for cross-agent memory. Vector search is layered on using embeddings stored as columns in SQLite. This handles our volume without adding a managed service. Move to a dedicated vector database when semantic search is a real requirement and your store exceeds what in-memory similarity can handle.
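At small scale, semantic search is just a linear scan with cosine similarity. A minimal sketch, assuming embeddings are already stored alongside each entry:

```typescript
// In-memory semantic search sketch: rank stored entries by cosine
// similarity to the query embedding. Fine at small scale; move to a
// vector database when the store outgrows a linear scan.
interface MemoryEntry {
  key: string;
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function searchMemory(
  entries: MemoryEntry[],
  queryEmbedding: number[],
  topK: number
): MemoryEntry[] {
  return [...entries]
    .sort(
      (x, y) =>
        cosine(y.embedding, queryEmbedding) -
        cosine(x.embedding, queryEmbedding)
    )
    .slice(0, topK);
}
```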
For the complete implementation guide, see Agent Memory and Context Management.
Layer 5: RAG & Knowledge
Retrieval-Augmented Generation (RAG) is how you give an agent access to knowledge that is not in its training data — your company's documentation, your product catalog, your customer records, your proprietary research. Instead of fine-tuning the model on this data (expensive, slow, static), you retrieve relevant chunks at query time and inject them into the context.
RAG vs. Fine-Tuning
This is the most common architectural decision in knowledge-intensive agent systems. The answer is almost always RAG, with fine-tuning reserved for specific situations.
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data freshness | Real-time (re-index and it is live) | Stale (retrain on every update) |
| Cost | Retrieval cost per query | Training cost upfront, then inference |
| Transparency | You can see exactly what was retrieved | Opaque — knowledge baked into weights |
| Data volume | Scales to millions of documents | Limited by training data size and budget |
| Accuracy | High when retrieval is good | High when training data is clean |
| Setup time | Hours to days | Days to weeks |
| Best for | Dynamic knowledge, documents, FAQs, catalogs | Consistent tone, domain-specific language, behavioral patterns |
Use RAG when the knowledge changes, when you need to know what informed the answer, and when you need to scale to large document sets. Use fine-tuning when you need to change how the model behaves (tone, style, domain vocabulary) rather than what it knows.
The RAG Pipeline
A RAG system has four stages:
1. Chunking. Split documents into pieces small enough to fit in a context window and cohesive enough to be meaningful. Common approaches: fixed-size chunks (512-1024 tokens), semantic chunks (split at paragraph or section boundaries), and recursive splitting (try larger chunks first, split further only if they exceed the limit). Overlap between chunks (50-100 tokens) prevents information from being lost at boundaries.
2. Embedding. Convert each chunk into a vector that captures its semantic meaning. Use an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or open-source alternatives like BGE or E5) to produce vectors. Store the vectors alongside the original text and any metadata (source document, page number, creation date).
3. Retrieval. When a query arrives, embed it using the same model and find the most similar chunks in the vector store. Top-k retrieval (typically k=5 to k=20) returns the chunks most likely to contain relevant information. Hybrid search — combining vector similarity with keyword matching — often outperforms either approach alone.
4. Generation. Inject the retrieved chunks into the model's context along with the original query. The model reasons over both to produce a grounded answer. The key is formatting the retrieved context clearly so the model knows what it is working with and can cite or reference specific chunks.
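Step 1, fixed-size chunking with overlap, can be sketched at the character level for clarity (production chunkers count tokens, not characters):

```typescript
// Fixed-size chunking with overlap: each chunk starts (chunkSize - overlap)
// characters after the previous one, so boundary content appears in both.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```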
// Simplified RAG retrieval + generation
async function ragQuery(question: string): Promise<string> {
  // Step 1: Embed the question
  const queryEmbedding = await embedText(question);

  // Step 2: Retrieve relevant chunks
  const chunks = await vectorStore.search(queryEmbedding, { topK: 10 });

  // Step 3: Generate answer with retrieved context
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    system: `Answer the user's question using only the provided context.
If the context does not contain enough information, say so.
Cite the source document for each claim.`,
    messages: [
      {
        role: "user",
        content: `Context:\n${chunks.map((c) =>
          `[Source: ${c.metadata.source}]\n${c.text}`
        ).join("\n\n")}\n\nQuestion: ${question}`,
      },
    ],
  });
  return extractText(response);
}
Knowledge Graphs
For domains with structured relationships — product catalogs, organizational hierarchies, regulatory frameworks — knowledge graphs complement vector search. A knowledge graph encodes entities and their relationships explicitly, enabling queries like "find all products related to this customer's industry that were updated in the last 90 days." Vector search finds semantically similar content. Knowledge graphs navigate known relationships. The most sophisticated RAG systems use both.
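A knowledge-graph query like the one above reduces to following typed edges. A toy in-memory sketch, with entities and relation names that are purely illustrative:

```typescript
// Toy knowledge graph: entities are strings, relationships are typed edges.
interface Edge {
  from: string;
  relation: string;
  to: string;
}

const edges: Edge[] = [
  { from: "acme_corp", relation: "in_industry", to: "logistics" },
  { from: "route_optimizer", relation: "serves_industry", to: "logistics" },
  { from: "fleet_tracker", relation: "serves_industry", to: "logistics" },
  { from: "hr_suite", relation: "serves_industry", to: "retail" },
];

// "Products related to this customer's industry": follow in_industry from
// the customer, then serves_industry edges back to products.
function productsForCustomer(customer: string): string[] {
  const industries = edges
    .filter((e) => e.from === customer && e.relation === "in_industry")
    .map((e) => e.to);
  return edges
    .filter((e) => e.relation === "serves_industry" && industries.includes(e.to))
    .map((e) => e.from);
}
```

Note that no similarity computation happens here: the relationships are explicit, which is exactly what vector search cannot give you.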
Layer 6: Computational Skills
Language models are reasoning engines, not calculators. When an agent needs to score a lead, run a statistical test, forecast a metric, or optimize a schedule, asking it to reason through the math produces unreliable results. The right approach is to move computations that require precision, consistency, or speed out of the model and into deterministic code.
This is what computational skills are: Python scripts that an agent can invoke to perform a computation it cannot do reliably through language alone.
The Skills Pattern
A skill has three components:
- SKILL.md — a meta prompt the agent reads to understand the skill's purpose, inputs, and outputs
- scripts/ — Python scripts that perform the computation (pure stdlib, JSON in, JSON out)
- catalog.json — a registry entry the MCP server reads to make the skill discoverable
The agent never writes Python. It reads the SKILL.md, then calls the run_skill_script MCP tool with the skill name and input data. The tool executor runs the script as a subprocess and returns the result as structured JSON.
// What a skill invocation looks like from the agent's perspective
await runSkillScript({
  skill_name: "lead-scoring-engine",
  input: {
    contact: {
      email: "cto@acme.com",
      company_size: 500,
      title: "Chief Technology Officer",
      website_visits_30d: 12,
      email_opens_30d: 4,
    },
  },
});
// Returns: { score: 82, tier: "hot", signals: [...], confidence: "high" }
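On the executor side, the job is small: pipe JSON to the script's stdin and parse the JSON it writes to stdout. A hypothetical sketch (the timeout value and error handling are simplified, and `runScript` is an illustrative name, not our actual executor):

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical subprocess executor behind run_skill_script: JSON in via
// stdin, JSON out via stdout, with a timeout to kill runaway scripts.
function runScript(command: string, args: string[], input: unknown): unknown {
  const stdout = execFileSync(command, args, {
    input: JSON.stringify(input),
    encoding: "utf8",
    timeout: 30_000, // illustrative cap on script runtime
  });
  return JSON.parse(stdout);
}

// e.g. runScript("python3",
//   [".claude/skills/lead-scoring-engine/scripts/score_lead.py"],
//   { contact: { email: "cto@acme.com", company_size: 500 } })
```

Because the scripts are deterministic and side-effect free, the same executor works for every skill in the catalog.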
The AI University Skills Library
Our system ships with 16 computational skills covering the most common agent workloads:
| Skill | What It Computes |
|---|---|
| lead-scoring-engine | 0-100 lead scores from firmographic and behavioral signals |
| churn-predictor | Churn probability from activity decay and engagement patterns |
| a-b-test-analyzer | Statistical significance, confidence intervals, winner determination |
| send-timing-optimizer | Optimal send times from historical engagement data |
| trend-detection | Statistically significant trends and anomalies in time-series data |
| cross-agent-intelligence | Cross-session patterns from multi-agent signal aggregation |
| adaptive-feedback-loop | Behavioral parameter adjustments from outcome feedback |
| data-pruner | Stale, duplicate, or low-quality record identification |
| linkedin-prospector | LinkedIn profile scoring against ideal customer profiles |
| cohort-analyzer | Retention, engagement, and value metrics per user cohort |
| revenue-attribution | Multi-touch revenue attribution (first-touch, last-touch, linear) |
| persona-classifier | Buyer persona segmentation from title, company, and behavioral data |
| email-health-scorer | Email list health from bounce rates and engagement decay |
| content-performance-ranker | Predicted content performance from historical engagement |
| sequence-optimizer | Optimal outreach step order and timing from conversion data |
| forecast-modeler | Forecast projections with confidence bands from historical data |
Every script follows the same pattern: pure Python standard library, JSON input via stdin, JSON output via stdout, deterministic behavior, no side effects. This makes them fast, portable, and independently testable.
# Test any skill from the command line
echo '{"contact": {"email": "test@example.com", "company_size": 200}}' \
| python3 .claude/skills/lead-scoring-engine/scripts/score_lead.py
For the full Skills Library documentation and instructions for building your own skills, see the Skills Library Overview.
Layer 7: Deployment & Observability
This is where most agent projects die. The model works. The tools work. The prototype demos well. Then someone tries to run it in production and discovers that none of the operational infrastructure exists.
80% of AI agent projects never make it to production. The failure is almost never the model. It is error handling, cost controls, monitoring, and the other operational concerns that feel like overhead when you are building but become load-bearing the moment you deploy.
Infrastructure Requirements
Containerization. Docker is non-negotiable for production agents. It provides isolation between agents, reproducible environments, and portability across hosting providers. Each agent should be its own container with its own resource limits.
Orchestration. For systems running more than a few agents, Kubernetes provides scheduling, scaling, restart policies, and resource management. For smaller deployments, Docker Compose or a single VPS with cron scheduling is sufficient. We run our 15-agent system on a single DigitalOcean droplet. You do not need Kubernetes on day one.
Scheduling. Agents run on schedules and event triggers. Use system cron, or a dedicated job scheduler like Inngest or Trigger.dev. Do not rely on a web framework's background job system — it restarts unpredictably and jobs get lost.
Monitoring and Logging
Every agent in production must log:
- Every tool call: name, input, output or error, duration, attempt count
- Every LLM call: model, token counts, stop reason
- Every decision point and its reasoning
- Every error with full context
- Run start and end with total token usage
Log as structured JSON. Structured logs can be queried and aggregated. Ship them somewhere searchable: Axiom, Datadog, CloudWatch, or a self-hosted Loki/Grafana stack. You cannot debug what you cannot search.
function logEvent(event: Record<string, unknown>) {
  console.log(
    JSON.stringify({
      ...event,
      timestamp: new Date().toISOString(),
      agentId: process.env.AGENT_ID,
      runId: process.env.RUN_ID,
    })
  );
}
Cost Tracking
Token costs compound. A daily cap that feels conservative will feel essential the first time an agent runs pathologically. Set per-agent daily budgets, alert at 70% and 90% thresholds, and hard-stop at 100%.
If you are on the API (not Max subscription), track token usage per run and per day. Our system logs input tokens, output tokens, and estimated cost for every LLM call. A weekly cost review in the first month of deployment catches structural inefficiencies before they become budget problems.
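A per-agent daily budget with those thresholds can be sketched as a small accumulator. The alert hook and dollar-based accounting are assumptions, not our actual implementation:

```typescript
// Daily budget sketch: alert when spend crosses 70% and 90% of the cap,
// hard-stop at 100%. The alert callback is an injected assumption.
class DailyBudget {
  private spent = 0;

  constructor(
    private capUsd: number,
    private alert: (msg: string) => void
  ) {}

  record(costUsd: number): "ok" | "stop" {
    const before = this.spent / this.capUsd;
    this.spent += costUsd;
    const after = this.spent / this.capUsd;
    // Fire each alert exactly once, when its threshold is first crossed
    for (const threshold of [0.7, 0.9]) {
      if (before < threshold && after >= threshold) {
        this.alert(`Budget at ${Math.round(threshold * 100)}%`);
      }
    }
    return after >= 1 ? "stop" : "ok"; // hard stop at 100%
  }
}
```

The agent runtime checks the returned status after every LLM call and refuses to dispatch further work once it sees "stop".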
The Deployment Checklist
Before shipping any agent, score it against these ten requirements:
- Error handling and retry logic — all tool calls wrapped with exponential backoff
- Token budget management — max_tokens set on every call, usage tracked per run
- Rate limiting — API limits plus self-imposed per-agent limits
- Guardrails — input validation, output validation, tool access control
- Logging and observability — structured logs for all tool calls, decisions, errors
- Memory persistence — database with atomic writes, not in-memory
- Graceful degradation — defined fallback behavior per tool
- Cost controls — hard daily caps with alerts
- Security — keys in env vars, prompt injection defense, tool sandboxing
- Human escalation — first-class escalation tool with defined triggers
Score each item 0 to 1. Ship nothing that scores below 8 out of 10. The items you skip are exactly the ones that cause incidents.
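The first item, retry with exponential backoff, can be sketched like this. The base delay, cap, and attempt count are illustrative defaults:

```typescript
// Exponential backoff: delay doubles per attempt, capped so a long outage
// does not produce hour-long waits.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry wrapper with an injectable sleep so tests can stub the waiting
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt)); // wait longer after each failure
      }
    }
  }
  throw lastError;
}
```

Wrap every tool call in something like `withRetry(() => callTool(...))` so transient API failures never surface as agent failures.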
For the complete deployment walkthrough with code for each requirement, see Deployment Checklist: From Prototype to Production. For post-deployment operations, see Monitoring and Debugging.
The Full-Stack AI Engineer Roadmap
You cannot learn all seven layers at once. Here is the order that builds competence fastest, with each level opening the door to the next.
Beginner: Weeks 1-4
Goal: Call a model, get useful output, understand the basics.
- Learn prompt engineering. Understand system prompts, few-shot examples, chain-of-thought reasoning, and structured output. This is the foundation. Every layer above depends on your ability to communicate with models effectively. Start with Claude and the Anthropic SDK.
- Build a single agent. One agent, one job, a few tools. A support bot that answers questions using a search tool. A research agent that takes a topic and returns a summary. Keep it simple. Get the agent loop working: reason, call tool, observe result, reason again.
- Understand tool calling. Define a tool, give it to the model, and see what happens. Break things on purpose. Pass bad descriptions and watch the model misuse the tool. This teaches you more about tool design than any tutorial.
// Your first agent: research assistant with one tool
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 2048,
  system: "You are a research assistant. Use the search tool to find information, then synthesize a clear answer.",
  messages: [{ role: "user", content: "What are the latest developments in AI agent frameworks?" }],
  tools: [searchWebTool],
});
Intermediate: Weeks 5-12
Goal: Build multi-agent systems with memory, proper error handling, and basic RAG.
- Add memory. Give your agent save_memory and load_memory tools. Run it twice and verify it remembers what happened the first time. This is the moment an agent becomes a system instead of a script.
- Build a multi-agent system. Start with the supervisor pattern: one orchestrator that delegates to two specialists. Implement tool allowlists so each agent only accesses what it needs. Handle the coordination problems that arise when agents need to share context.
- Implement RAG. Chunk a document set, embed the chunks, store them in a vector database (start with ChromaDB or FAISS), and build a retrieval pipeline. Integrate it as a tool your agent can call when it needs knowledge.
- Add error handling and cost controls. Wrap every tool call with retry logic. Set token budgets. Add structured logging. This is the work that makes the difference between a demo and a deployable system.
Advanced: Weeks 13-24
Goal: Production-grade systems with computational skills, observability, and real deployment.
- Build computational skills. Write Python scripts for computations your agents need: scoring, classification, statistical analysis. Integrate them through the run_skill_script pattern. Test them independently from the command line before wiring them into agents.
- Deploy to production. Containerize your agents with Docker. Set up scheduling with cron or a job scheduler. Configure monitoring and alerting. Ship it. Stay close for the first two weeks.
- Implement observability. Build dashboards showing agent success rates, token costs, latency, and error patterns over time. Set up alerts for anomalies. Use the data to optimize: which agents run too frequently? Which tools fail most? Where is context being wasted?
- Design for scale. Implement model routing to use the right model tier for each task. Add knowledge graphs for structured data alongside vector search for unstructured data. Build adaptive feedback loops that use outcome data to improve agent behavior over time.
The Career Math
The roadmap above takes roughly six months of focused work. The market it prepares you for:
- AI engineer average total compensation: $206K (2026)
- AI skills wage premium over non-AI roles: 56%
- Enterprises implementing AI: 88%
- Enterprises that have successfully scaled AI agents: less than 10%
That last number is the opportunity. The supply of engineers who can build production AI agent systems is dramatically smaller than the demand. The seven layers on this page are what separates the 10% that ship from the 90% that prototype.
Where to Go From Here
This page is the map. The territory is in the detailed guides for each layer. Start with whichever layer is most relevant to where you are right now:
- New to agents? Start with What Are AI Agents? and the Quickstart.
- Building your first multi-agent system? Read Architecture Patterns for the six patterns with full TypeScript code.
- Designing tools? See Tool Design for the principles that make tools reliable, and Tools Overview for our full 52-tool catalog.
- Adding memory? Memory and Context covers the save/load pattern, vector search, time decay, and shared memory.
- Building computational skills? The Skills Library has the complete pattern with all 16 built-in skills.
- Shipping to production? The Deployment Checklist has the ten-point scoring system and code for every requirement.
- Optimizing costs? The Model Selection Guide covers model routing and the real cost math.
The stack is learnable. The market is ready. The gap between the people who understand these seven layers and the people who do not is the defining career opportunity in software engineering right now. Start at Layer 1. Build upward. Ship something real.